Balancing Precision and Recall with Selective Search

Mon Shih Chuang and Anagha Kulkarni
Department of Computer Science, San Francisco State University
1600 Holloway Ave, San Francisco, CA, USA, 94132
mchuang@mail.sfsu.edu, ak@sfsu.edu

Abstract

This work revisits the age-old problem of balancing search precision and recall using the promising new approach of Selective Search, which partitions the document collection into topic-based shards and searches a select few shards for any query. In prior work Selective Search has demonstrated strong search precision; however, this improvement has come at the cost of search recall. In this work, we test the hypothesis that improving the effectiveness of shard selection can better balance search precision and recall. Toward this goal we investigate two new shard selection approaches, and conduct a series of experiments that lead to three new findings: 1. big-document shard selection approaches can substantially outperform small-document approaches when provided with a richer query representation, 2. applying the Learning-to-Rank approach to shard ranking provides the most effective Selective Search setup, and 3. if the relevant documents for a query are spread across less than 10% of the shards, then Selective Search can successfully balance precision and recall.

1 Introduction

The problem of balancing search precision and recall is not new to the IR (Information Retrieval) community. Improving one almost always results in degrading the other. The most prominent application of IR, Web search, makes it seem that search precision is more important than recall. However, there are many other use cases of IR where search recall is equally, if not more, important than precision. Medical or health information retrieval, prior-art search, and legal information retrieval are a few examples.

Motivated by these observations, we take a fresh look at the problem of balancing search precision and recall through the lens of a promising new approach, Selective Search (Kulkarni and Callan, 2015). Selective Search is a distributed query processing approach that has been shown to improve search efficiency tremendously while sustaining effectiveness. To accomplish this, at indexing time Selective Search partitions the document collection into shards based on document similarity. The resulting shards are topically homogeneous, that is, documents about the same or related topics are in the same shard. At query time, Selective Search exploits this topical organization by restricting query processing to a select few shards. This is contrary to the traditional distributed query processing approach (Exhaustive Search), where the query is processed at all the shards. Exhaustive Search must do so because it uses random shards, where documents are allocated to shards at random; as a result, the relevant documents for a query may be spread across many (or all) shards. For topical shards, however, the relevant documents for a query are likely to be concentrated in a few (or one) shards because these documents are typically similar to each other (Xu and Croft, 1999). The shards that are likely to contain relevant documents for the query are identified using shard ranking algorithms.

We believe that the topical organization of the document collection, along with the selective nature of this search approach, can support a search environment that balances precision and recall.
We test this hypothesis in this paper through a series of experiments in which we apply well-established shard ranking approaches, and propose improvements to these algorithms that leverage the topic-based organization of the documents.

Selective Search has consistently demonstrated good performance on precision-oriented metrics. The reason for this trend is the purity of the search space: the few topical shards that are selected and searched for a query contain much less noise, that is, far fewer false-positive documents, than the complete collection. A purer search space reduces the chances of non-relevant documents being included in the search results, which directly improves precision. In order to improve recall, however, Selective Search needs to identify all the shards containing relevant documents. Thus, when optimizing for recall, more than a few shards need to be searched, and the accuracy of the shard ranking algorithm at deeper ranks also becomes critical. This observation suggests that an effective shard ranking algorithm should be able to improve search recall without degrading precision. We test this hypothesis thoroughly using empirical evaluation at early and deeper ranks. Finally, we introduce a novel shard ranking approach, Learning to Rank Shards (LeToR-S), that is based on the successful Learning-to-Rank document ranking approach (Qin et al., 2010).

This paper is organized as follows. The prior work that has informed and influenced our work is described in the next section. The proposed shard ranking approaches are described in Sections 3 and 4. The experimental setup used for evaluation is described in Section 5, followed by Results and Analysis in Section 6. The conclusions we draw from this work are provided in Section 7.

2 Related Work

Selective Search falls under the subfield of distributed information retrieval (Callan, 2002; Shokouhi et al., 2011), where the goal is to search across multiple existing resources and to aggregate the results. One of the important research problems in distributed IR is that of ranking the resources (shards) based on the number of relevant documents they contain for the query. A long line of research has studied this problem. The proposed algorithms can be roughly categorized into two groups: big-document approaches, which are based on term frequency statistics of the resources, and small-document approaches, which are based on a small sample of documents from each resource. Algorithms from both these categories are described next.

2.1 Big-document approaches

In cooperative environments, the providers of the target resources are willing to share the term statistics and metadata of the document collections. Methods that rank resources based on these term statistics are called big-document approaches, since a document retrieval model can be extended to resource ranking by treating each resource as a single entity. GlOSS (Gravano et al., 1994) and its extended versions gGlOSS (Gravano and Garcia-Molina, 1999) and vGlOSS (Gravano et al., 1999) provide a solution to the text database discovery problem. These algorithms use the ratio of term frequency to database size, along with other metadata such as field information (title, body, links), for choosing the candidate resources.

The CORI algorithm (Callan et al., 1995; Callan, 2002) keeps the document frequency (df) and shard frequency (sf) of each term, and computes the score of every shard using a variant of the tf.idf formula:

    T = \frac{df_i}{df_i + 50 + 150 \cdot sw_i / avg\_sw}    (1)

    I = \frac{\log((S + 0.5) / sf)}{\log(S + 1.0)}    (2)

    Score(t_k \mid S_i) = b + (1 - b) \cdot T \cdot I    (3)

where
df_i: the document frequency of the term t_k in shard i,
sf: the shard frequency of the term t_k (the number of shards containing t_k),
sw_i: the total number of words in shard i,
avg_sw: the average number of words per shard,
S: the number of shards,
t_k: the k-th term in the user query,
b: the minimum belief component, set to 0.4.
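To make the scoring concrete, the following Python sketch implements Equations 1 through 3 and ranks shards by summing the per-term beliefs. The function names and the shard_stats layout are illustrative choices of ours, and the additive combination of per-term beliefs is one simple instantiation; an actual CORI implementation may combine beliefs differently.

```python
import math

def cori_term_score(df_i, sf, sw_i, avg_sw, S, b=0.4):
    """Belief that shard i satisfies query term t_k (Equations 1-3)."""
    T = df_i / (df_i + 50 + 150 * sw_i / avg_sw)        # Equation 1
    I = math.log((S + 0.5) / sf) / math.log(S + 1.0)    # Equation 2
    return b + (1 - b) * T * I                          # Equation 3

def cori_rank(query_terms, shard_stats, avg_sw):
    """Rank shards by accumulating the per-term beliefs.

    shard_stats: shard id -> (dict of term -> df in that shard,
                              total word count of the shard)
    """
    S = len(shard_stats)
    # Shard frequency: number of shards in which the term occurs.
    sf = {t: sum(1 for dfs, _ in shard_stats.values() if dfs.get(t, 0) > 0)
          for t in query_terms}
    scores = {}
    for shard, (dfs, sw_i) in shard_stats.items():
        scores[shard] = sum(
            cori_term_score(dfs.get(t, 0), sf[t], sw_i, avg_sw, S)
            for t in query_terms if sf[t] > 0)
    return sorted(scores, key=scores.get, reverse=True)
```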
CORI inherited its query operators from the INQUERY document retrieval system (Callan et al., 1992), which is based on a Bayesian inference network model. The operator set used by INQUERY [sum, wsum, and, or, not] works unchanged for ranking both documents and databases.

Taily (Aly et al., 2013) is another big-document approach. According to the work of Kanoulas et al. (2010), the term-frequency-based document scores across the whole collection can be modeled by a gamma distribution. Taily therefore pre-computes the two parameters of the gamma distribution, the scale θ and the shape K, to fit the document scores of every single-term query against every shard. By storing the score distribution of each single-term query against every shard, it can estimate the score distribution of a user query with multiple terms.

2.2 Small-document approaches

In uncooperative environments, important statistics such as term frequencies or collection sizes cannot be obtained. Therefore, big-document approaches are not able to compute the shard scores. Small-document algorithms solve this issue by approximating the document distribution inside a resource from a small sampled subset, called the centralized sample database, or centralized sample index (CSI).

The ReDDE algorithm (Si and Callan, 2003) runs the user query against the CSI and assumes that the top n retrieved documents are relevant. The original version of ReDDE computes a score for each shard with the following equation:

    Score(S_i^q) = Count(n, S_i^q) \cdot \frac{|S_i|}{|S_i^{CSI}|}    (4)

where Count(n, S_i^q) is the number of documents from shard i among the top n documents retrieved from the CSI, |S_i| is the size of the shard, and |S_i^{CSI}| is the size of its sample. The shard scores are then normalized to obtain a valid probability distribution that is used to rank the shards.

CRCS (Shokouhi, 2007) runs the user query against the CSI and computes the score of each resource from the ranks of the returned documents. Two versions of CRCS are introduced, differing in the function of rank: CRCS(l) uses a simple linearly decreasing model, and CRCS(e) uses an exponentially decaying model for the document score.

SUSHI (Thomas and Shokouhi, 2009) runs the user query against the CSI and uses the returned document scores to estimate the score distribution. For each shard, SUSHI fits one of three types of curves, linear, logarithmic, or exponential, to the scores of the documents returned from the CSI.
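Of these small-document methods, ReDDE is the one used in this work, and its estimator is compact enough to sketch in full. In the Python sketch below, the data layout (shard_of, shard_size, sample_size) and the default cutoff n are illustrative assumptions of ours, not details of the original implementation.

```python
from collections import Counter

def redde_scores(csi_ranking, shard_of, shard_size, sample_size, n=100):
    """ReDDE shard scores (Equation 4), normalized to a distribution.

    csi_ranking : CSI document ids for the query, best first
    shard_of    : dict mapping a CSI document id to its source shard
    shard_size  : dict, shard id -> |S_i|
    sample_size : dict, shard id -> |S_i^CSI|
    n           : number of top CSI documents assumed relevant
    """
    counts = Counter(shard_of[doc] for doc in csi_ranking[:n])
    raw = {s: c * shard_size[s] / sample_size[s] for s, c in counts.items()}
    total = sum(raw.values())
    # Normalize so the scores form a valid probability distribution.
    return {s: score / total for s, score in raw.items()}
```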
3 CORI Uni+Bi & ReDDE Uni+Bi

The scenario in which Selective Search operates is cooperative; that is, CORI has access to the complete term statistics for each shard, while ReDDE estimates the shard ranking from a small sample of each shard. As such, one would expect CORI to perform at least on par with, if not outperform, ReDDE. However, the term statistics that CORI uses, although complete, are highly misleading. This is because traditional CORI uses a unigram model in which each query term is treated as an individual entity, which leads to a very poor representation of the query. For example, the query arizona fish and game, when decomposed into unigrams, loses the central topic fishing in arizona. CORI cannot distinguish between shards that are on topic, and shards that contain documents about arizona in some other context, or about fishing in some other state.

These observations motivated the following investigation, where a richer representation of the query is used by both CORI and ReDDE. Given a query with n terms, n-1 bigrams are generated by enumerating all pairs of consecutive terms. Bigrams containing stopwords are discarded, and the remaining bigrams are added to the original unigram query (see the sketch at the end of this section). As an example, for the query obama family tree, this approach generates the following query representation in the Indri Query Language: #combine(obama family tree #uw2(obama family) #uw2(family tree)), where the #combine operator coalesces the scores from all the elements of the query, and the #uwX operator specifies an unordered phrase of length X.

ReDDE Uni+Bi runs the query generated by the above procedure against the CSI, and the rest of the search process is the same as before. In the case of CORI Uni+Bi, frequency statistics for bigrams, in addition to unigrams, are used in order to evaluate the richer query representation. Since the bigram statistics can be precomputed offline, the response times of shard ranking and query evaluation are not affected. Single-term queries, and queries containing no phrases because of stopwords (e.g. "to be or not to be"), remain unchanged. Higher-order n-grams, such as trigrams, were not included for two reasons: 1. the cost of computing the statistics for trigrams is substantially higher, and 2. the benefits of a trigram representation are expected to be marginal because most queries are short, two or fewer terms. In search scenarios where the user queries are longer (legal or medical retrieval), a trigram query representation could be worth the additional cost. This is part of future work.
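The query-generation procedure described above is simple enough to state as code. The Python sketch below produces the Indri query string; the stopword list shown is a stand-in for illustration, not the list used in our experiments.

```python
STOPWORDS = {"a", "and", "be", "in", "not", "of", "or", "the", "to"}  # illustrative

def uni_bi_query(query):
    """Expand a query with its consecutive-term bigrams (Section 3).

    Bigrams containing a stopword are discarded; single-term queries and
    queries whose bigrams are all discarded remain effectively unchanged.
    """
    terms = query.lower().split()
    bigrams = [(a, b) for a, b in zip(terms, terms[1:])
               if a not in STOPWORDS and b not in STOPWORDS]
    unigrams = " ".join(terms)
    if not bigrams:
        return "#combine(%s)" % unigrams
    phrases = " ".join("#uw2(%s %s)" % pair for pair in bigrams)
    return "#combine(%s %s)" % (unigrams, phrases)

# uni_bi_query("obama family tree") returns
# '#combine(obama family tree #uw2(obama family) #uw2(family tree))'
```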
4 Learning to Rank Shards (LeToR-S)

Learning-to-Rank approaches have been used successfully to improve the document ranking task (Freund et al., 2003; Xu and Li, 2007; Burges, 2010). We wished to investigate whether a ranking model can be learned for the shard ranking task as well. To test this intuition we started by defining the set of features for the ranking model, and the target variable. The number of relevant documents in the shard for the query is used as the target; compared to integer rank values, this target captures more information. The shards are ranked in descending order of the predicted target value.

LeToR-S Features

The different fields or sections of the document inform the features: the title, body, heading, url, and whole document each generate a separate set of features. Several variants of CORI scores for the query are evaluated against the document fields and used as features. More specifically, CORI_SUM, CORI_MIN, and CORI_VAR are the three variants of CORI, as defined in Equations 5 through 8. Each of these scores is computed for two different query representations, unigram and phrasal, against each of the document fields.

    CORI\_SUM(Q \mid S_i) = \sum_{t_k \in Q} Score(t_k \mid S_i)    (5)

    CORI\_MIN(Q \mid S_i) = \min_{t_k \in Q} Score(t_k \mid S_i)    (6)

    \mu = \frac{1}{n} \sum_{t_k \in Q} Score(t_k \mid S_i)    (7)

    CORI\_VAR(Q \mid S_i) = \sum_{t_k \in Q} (Score(t_k \mid S_i) - \mu)^2    (8)

where Score(t_k | S_i) is the same as in Equation 3, n is the number of query terms, and μ (Equation 7) is the mean per-term score.

For each query and shard pair, the above feature vector is compiled in order to learn the ranking model, or to predict the ranking, using the RandomForest model (Breiman, 2001) implemented in RankLib (Dang, 2013) from the Lemur Project (Croft and Callan, 2000).
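For concreteness, the small Python sketch below shows how the three CORI-based features of Equations 5 through 8 are computed for one (query, shard) pair; in the full setup this computation is repeated for every document field and for both query representations. The function signature is an illustrative assumption of ours.

```python
def cori_feature_triple(query_terms, term_score):
    """CORI_SUM, CORI_MIN, and CORI_VAR (Equations 5-8) for one
    (query, shard) pair.

    term_score(t) should return Score(t|S_i) from Equation 3 for the
    field under consideration; the same computation is repeated per
    document field (title, body, heading, url, whole document) and per
    query representation (unigram, phrasal).
    """
    scores = [term_score(t) for t in query_terms]
    mu = sum(scores) / len(scores)                       # Equation 7
    return {
        "CORI_SUM": sum(scores),                         # Equation 5
        "CORI_MIN": min(scores),                         # Equation 6
        "CORI_VAR": sum((s - mu) ** 2 for s in scores),  # Equation 8
    }
```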
5 Experimental Setup

The experimental setup used for the empirical evaluation is described next. We use the CategoryB dataset of ClueWeb09, which contains 50,220,423 documents. The 92 topical shards created for the CategoryB dataset by Kulkarni and Callan (Kulkarni et al., 2012) are used in this work. The evaluation queries are from the TREC Web Track 2009-2012. Out of the 200 queries, 6 queries have no relevant documents in this dataset and thus are discarded. The remaining 194 queries are divided into 10 folds for the LeToR-S experiments, to facilitate 10-fold cross-validation. For the small-document approach, ReDDE, we construct the CSI by randomly sampling 0.5% of the documents from every shard. The same CSI was employed for all the ReDDE experiments, to minimize any spurious effects caused by sampling. The search engine used in our experiments is Indri 5.9 (Strohman et al., 2005) from the Lemur Project. For all the Selective Search experiments reported in Sections 6.1 through 6.5, the top 10 shards were searched for each query. This corresponds to a search space of about 5.5 million documents, an order of magnitude smaller than the search space of Exhaustive Search (50+ million documents).
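For orientation, the sketch below outlines the selective query processing loop used throughout these experiments: rank the shards, search the top T, and merge the per-shard result lists. Merging by raw document score is a simplifying assumption of this sketch; the exact result merging performed by Indri in our setup is not spelled out here.

```python
def selective_search(query, rank_shards, search_shard, T=10, k=100):
    """One selective-search query: rank shards, search the top T, merge.

    rank_shards(query)         -> list of shard ids, best first
                                  (e.g., CORI, ReDDE, or LeToR-S)
    search_shard(shard, query) -> list of (doc_id, score) pairs from
                                  that shard's index
    """
    merged = []
    for shard in rank_shards(query)[:T]:
        merged.extend(search_shard(shard, query))
    # Simplifying assumption: document scores are globally comparable,
    # so the per-shard lists can be merged by score directly.
    merged.sort(key=lambda doc_score: doc_score[1], reverse=True)
    return merged[:k]
```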
6 Results and Analysis

This section describes the experiments that were conducted to test the various hypotheses about improving search effectiveness by leveraging topical shards. Table 1 presents a comprehensive set of results for the four different investigations that we undertook. We describe these in the following subsections, and analyze the results in Table 1.

#    #Qrys        Run            P@30      P@100    map      R@30     R@100    ndcg
1    194          Exh            0.254     0.189    0.181    0.174    0.374    0.429
2    194          CORI           0.255     0.178    0.164    0.158    0.329    0.370
3    194          ReDDE          0.256     0.182    0.172    0.175    0.360†   0.400†
4    52 (len=1)   CORI           0.260     0.189    0.163    0.132    0.296    0.391
5    52 (len=1)   ReDDE          0.250     0.184    0.155    0.131    0.295    0.387
6    142 (len>1)  CORI           0.254     0.173    0.164    0.167    0.341    0.362
7    142 (len>1)  ReDDE          0.258     0.181    0.178    0.191†   0.384†   0.404†
8    194          CORI Uni+Bi    0.271§‡   0.188§   0.181§   0.180§   0.363§   0.408§
9    194          ReDDE Uni+Bi   0.263     0.184    0.175    0.182    0.363    0.398
10   194          LeToR-S        0.270‡    0.194¶   0.186¶   0.179    0.377¶   0.417¶

Table 1: Search effectiveness for Exhaustive Search, and for Selective Search with CORI and ReDDE under various configurations. † indicates statistically significant improvement when comparing ReDDE with CORI. § indicates statistically significant improvement of a method's Uni+Bi variant over its unigram version. ‡ indicates statistically significant improvement over Exhaustive Search. ¶ indicates statistically significant improvement of LeToR-S over the Uni+Bi methods. Underline indicates significantly worse values when compared to Exhaustive Search. Statistical significance testing was performed using a paired t-test at p < 0.05.

6.1 Big-Document versus Small-Document Shard Ranking Approaches

Traditionally, the choice of a small- or big-document approach was dictated by the type of search environment: uncooperative or cooperative, respectively. Although our scenario is cooperative, we choose to also experiment with small-document approaches because the conventional belief has been that small-document approaches provide superior search effectiveness to big-document approaches. This first experiment empirically tests this belief, specifically in the context of topical shards. We also compare Exhaustive Search with Selective Search using the big-document and small-document approaches. The first three rows in Table 1 are the focus of this analysis.

As compared to Exhaustive Search, CORI and ReDDE both struggle at deeper ranks. CORI, the big-document approach, is consistently inferior to ReDDE, the small-document approach, across all the metrics. In fact, at deeper ranks (R@100 and ndcg) the improvements of ReDDE over CORI are statistically significant. These results confirm that the conventional unigram language model based shard ranking approach adopted by CORI struggles to differentiate the relevant shards from the non-relevant shards. This is so even for topical shards, where the distribution of relevant documents across shards is highly skewed. Also note that CORI has access to the vocabulary of the complete collection, whereas ReDDE uses only a 0.5% subset of the collection for estimating the shard ranking. CORI's inferior performance is thus especially surprising. These observations motivate the experiment described next.

6.2 Effect of Query Length

The hypothesis that we test in this section is that for multi-term queries the unigram language model used by CORI severely misinforms the shard ranking estimation. This is especially true for multi-term queries that contain phrases. For example, in the query obama family tree, it is critically important to treat the terms family and tree as a phrase and not as unigrams. The results in rows 4 through 7 of Table 1 provide evidence in support of this hypothesis. Rows 4 and 5 are results for the 52 queries that are singleton queries. Across all the metrics, the search effectiveness with CORI is higher in magnitude than that with ReDDE, which is exactly the opposite of the trend seen with multi-term queries. These results establish that CORI's subpar performance is restricted to multi-term queries. ReDDE, on the other hand, struggles more with singleton queries because estimation errors due to under-sampling are more likely when there is only one term in the query to inform the shard ranking.

6.3 Effect of Richer Query Representation

CORI's inferior performance with multi-term queries motivates the investigation in this section. The results with CORI and ReDDE when using the richer query representation of Section 3 are given in rows 8 and 9 of Table 1. These results show the opposite trend from the unigram query representation (rows 2 and 3). The big-document approach (CORI Uni+Bi) performs better, although not significantly so, than the small-document approach (ReDDE Uni+Bi). CORI clearly benefits more from the richer query representation than ReDDE: CORI Uni+Bi results are significantly better than those with CORI, which is not the case for ReDDE Uni+Bi. At early ranks, CORI Uni+Bi is significantly better than even Exhaustive Search. This indicates a substantial reduction in false positives among the retrieved documents at early ranks, which is facilitated by two factors: topic-based shards reduce the noise (false-positive matches) in each shard, and CORI Uni+Bi selects the shards such that the resulting search space is purer than that used by Exhaustive Search.

Although not shown in the table, the results for multi-term queries follow the earlier trends: for all the metrics, the magnitude of search effectiveness with CORI Uni+Bi is higher than with ReDDE Uni+Bi, and CORI Uni+Bi is significantly better than CORI, whereas for ReDDE that is not the case. For singleton queries the results do not change, because the richer query representation does not yield a different query. In summary, CORI Uni+Bi provides the best search effectiveness so far. In the next section we investigate whether we can improve the performance further.

6.4 Learning to Rank Shards

The last row in Table 1 reports the results of the Learning to Rank Shards approach (LeToR-S). The first obvious trend in these results is that LeToR-S significantly outperforms the current best, CORI Uni+Bi, at deeper ranks. At early ranks, however, the two approaches provide comparable precision and recall. To understand these results better, we analyze a few queries in detail.

For one of the queries, getting organized, CORI Uni+Bi ranks the shard with the largest number of relevant documents at the 11th position, while LeToR-S ranks it 3rd. This is a difficult query because both query terms are common terms; even when the terms are treated as a phrase, it is still not a focused query. This is reflected in the low scores assigned to the relevant shards by CORI Uni+Bi. LeToR-S, however, uses metadata in addition to the document contents for defining its features. One particular metadata feature, the url field, proves to be especially valuable for this query of common terms. Documents that contain getting organized in their url field are relevant to the query; in turn, shards that contain such documents should be ranked higher too. In short, LeToR-S benefits from having the field score features, while CORI Uni+Bi suffers because it uses only document contents for shard ranking. A few more example queries that highlight the value of the field score features are battles in the civil war and kansas city mo. For both queries, CORI Uni+Bi ranks the most relevant shard at a much deeper rank than LeToR-S.

Another feature category that helps LeToR-S outperform CORI Uni+Bi is the CORI minimum score feature. Recall that the CORI minimum score is the lowest CORI score computed for the individual query terms against a shard. This feature models the intuition that all of the query terms should have a high CORI score for a relevant shard; a low CORI score, even for only one of the query terms, indicates a lower likelihood of shard relevance. For the query pacific northwest laboratory, only one shard contains all the relevant documents; LeToR-S ranks this shard in 8th place, while CORI Uni+Bi ranks it at 11. Through the CORI minimum score feature, several false-positive shards are eliminated from the shard ranking by LeToR-S. These false-positive shards have a high overall CORI score because some of the query terms have high CORI scores that dominate the cumulative score. The CORI minimum score, however, captures the fact that some query terms have low CORI scores for these false-positive shards, and thus pushes them down in the shard ranking.

The results in Table 1 also indicate that at early ranks LeToR-S performs significantly better than Exhaustive Search. This improvement often, but not always, comes from single-term queries that may have more than one meaning or aspect associated with them (euclid, avp, iron, unc). The topic-based partitioning of the collection organizes the documents with a similar meaning or aspect into the same shard. Often, one of the meanings is more dominant than the others in the collection, and that is also often the relevant meaning for the query. Shards with the dominant meaning have a higher document frequency (df) than shards with the rare meaning, and thus only the documents with the dominant meaning are searched. This removes false-positive documents (documents with the rare meaning) from the results, and thus improves search precision.
6.5 Effect of the Distribution of Relevant Documents

When comparing the best performing Selective Search approach, LeToR-S, with Exhaustive Search, we see in Table 1 that at early ranks LeToR-S performs at par with or better than Exhaustive Search in both precision and recall. However, at deeper ranks, LeToR-S struggles more on recall than on precision, as indicated by the significantly lower ndcg value. Our hypothesis for the reason behind this trend is that LeToR-S is unable to retrieve all the shards containing relevant documents. In order to test this hypothesis we conduct the following experiment.

The effectiveness of a shard ranking algorithm is dependent on the distribution of the relevant documents across shards. If all the relevant documents are concentrated in a few shards, then the shard ranking task is straightforward; however, if the relevant documents are spread across many shards, then the task is much more challenging for any shard ranker. Figure 1 provides the histogram of the spread of relevant documents for the 194 queries.

[Figure 1: Histogram of the number of shards containing relevant documents for the query.]

For a large fraction of the queries (27), the spread of the relevant documents is restricted to 4 shards, and 70% of the queries have a spread of 7 or less. For the remaining 30% of the queries the relevant documents can be spread across as many as 31 shards, indicating that the topic-based sharding approach failed to concentrate the relevant documents into a few shards for these queries. We are, however, interested in the 70% of the queries for which the spread of relevant documents is restricted to 7 shards. We believe that LeToR-S should be able to provide an effective shard ranking for this subset of queries, which in turn should help improve both precision and recall. To test this intuition we separate the 194 queries into two groups based on a spread cutoff of ≤ 7. This gives us one group of 133 queries for which the relevant documents are spread across 7 or fewer shards; the other group contains the remaining 61 queries.
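The grouping itself is straightforward to reproduce given the relevance judgments and the document-to-shard assignment. The sketch below, with a data layout of our choosing, computes the per-query spread that underlies Figure 1 and this split.

```python
from collections import defaultdict

def shard_spread(qrels, shard_of):
    """Per-query spread: number of distinct shards that hold at least
    one relevant document.

    qrels    : iterable of (query_id, doc_id, relevance) judgments
    shard_of : dict mapping doc_id -> shard id in the topical partition
    """
    shards = defaultdict(set)
    for qid, doc_id, rel in qrels:
        if rel > 0 and doc_id in shard_of:
            shards[qid].add(shard_of[doc_id])
    return {qid: len(ids) for qid, ids in shards.items()}

# spread = shard_spread(qrels, shard_of)
# group_small = [q for q, s in spread.items() if s <= 7]   # 133 queries
# group_large = [q for q, s in spread.items() if s > 7]    # 61 queries
```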
The results for the two query groups with the various search approaches are given in Tables 2 and 3. For the first group of queries (Table 2), LeToR-S (and also CORI Uni+Bi) provides statistically significant improvements over Exhaustive Search in precision at both early and deep ranks. Furthermore, these improvements in precision do not come at the cost of recall: the recall (at all ranks) with LeToR-S stays comparable to that of Exhaustive Search. Even the ndcg with LeToR-S is not statistically different from that of Exhaustive Search. This is a rare phenomenon: a search approach being able to balance precision and recall.

Search Approach   P@30     P@100    map      R@30     R@100    ndcg
Exh               0.218    0.153    0.166    0.198    0.399    0.405
CORI              0.229    0.154    0.165    0.178    0.361    0.369
ReDDE             0.225    0.154    0.169    0.202    0.397    0.391
CORI Uni+Bi       0.241‡   0.163    0.179‡   0.205    0.401    0.400
ReDDE Uni+Bi      0.233‡   0.156    0.171    0.209    0.400    0.385
LeToR-S           0.239‡   0.164‡   0.178‡   0.204    0.411    0.401

Table 2: Results on the 133 queries with number of relevant shards ≤ 7. ‡ indicates statistically significant improvement over Exhaustive Search. Underline indicates significantly worse values when compared to Exhaustive Search. Statistical significance testing was performed using a paired t-test at p < 0.05.

Search Approach   P@30     P@100    map      R@30     R@100    ndcg
Exh               0.332    0.265    0.215    0.124    0.320    0.482
CORI              0.312    0.228    0.162    0.114    0.258    0.372
ReDDE             0.322    0.243    0.179    0.118    0.279    0.418
CORI Uni+Bi       0.336    0.243    0.187    0.125    0.282    0.424
ReDDE Uni+Bi      0.331    0.246    0.185    0.122    0.283    0.424
LeToR-S           0.340    0.261¶   0.204¶   0.124    0.303¶   0.452¶

Table 3: Results on the 61 queries with number of relevant shards > 7. ¶ indicates statistically significant improvement of LeToR-S over the Uni+Bi methods. Underline indicates significantly worse values when compared to Exhaustive Search. Statistical significance testing was performed using a paired t-test at p < 0.05.

The results in Table 3 tell a different story: at deeper ranks, both precision and recall suffer with all the Selective Search approaches. These results clearly establish the importance of concentrating the relevant documents into a few shards. Doing so not only reduces the search cost but also substantially improves search precision without degrading recall.

6.6 Effect of the Number of Shards Searched

For all the experiments until now we have held the shard cutoff parameter (T) constant at 10. That is, for all the Selective Search experiments, the top 10 shards, out of 92 shards, were searched for each query. Changing this parameter directly affects the cost of Selective Search, and it also influences search effectiveness. The influence of this parameter in the general distributed search setup has been extensively investigated by Markov and Crestani (2014). In this section we study the effect of the parameter T on the two best performing Selective Search approaches, CORI Uni+Bi and LeToR-S, and compare them to Exhaustive Search.

T    A       Search Approach   P@30     P@100    map      R@30     R@100    ndcg
92   50.22   Exh               0.254    0.189    0.181    0.174    0.374    0.429
1    0.561   CORI Uni+Bi       0.215    0.124    0.120    0.133    0.223    0.247
     0.560   LeToR-S           0.230    0.139¶   0.124    0.139¶   0.246    0.266
3    1.672   CORI Uni+Bi       0.267    0.174    0.166    0.174    0.327    0.353
     1.666   LeToR-S           0.281‡   0.183    0.174    0.180    0.342    0.380¶
5    2.773   CORI Uni+Bi       0.272    0.182    0.174    0.177    0.342    0.375
     2.768   LeToR-S           0.275‡   0.185    0.176    0.180    0.351    0.396¶
7    3.874   CORI Uni+Bi       0.275‡   0.187    0.178    0.171    0.350    0.393
     3.869   LeToR-S           0.274‡   0.190    0.181    0.177    0.365    0.406
10   5.513   CORI Uni+Bi       0.271‡   0.188    0.181    0.180    0.363    0.408
     5.510   LeToR-S           0.270‡   0.194¶   0.186    0.179    0.377¶   0.417¶

Table 4: Results for Exhaustive Search, and for Selective Search with CORI Uni+Bi and with LeToR-S at various shard cutoffs (T). A: the average search space size per query, in millions of documents. ¶ indicates statistically significant improvement when comparing LeToR-S with CORI Uni+Bi. ‡ indicates statistically significant improvement when comparing LeToR-S or CORI Uni+Bi with Exhaustive Search. Underline indicates significantly worse values when compared to Exhaustive Search. Statistical significance testing was performed using a paired t-test at p < 0.05.

Table 4 provides the results for this analysis. At early ranks, LeToR-S performs on par with Exhaustive Search while searching just the top three shards. The corresponding search cost, approximated by A, is orders of magnitude lower for LeToR-S than for Exhaustive Search. When comparing LeToR-S with CORI Uni+Bi, the former consistently outperforms the latter at all the shard cutoff values. Even the search costs for LeToR-S are marginally lower than those of CORI Uni+Bi, indicating a bias toward smaller shards in the case of LeToR-S. As more shards are searched, the recall at deeper ranks with LeToR-S becomes on par with Exhaustive Search. The analysis in the previous section demonstrated that LeToR-S is comparable to Exhaustive Search even on the ndcg metric if the spread of the relevant documents is restricted. The corresponding search cost of the Selective Search approaches is at least an order of magnitude lower than that of Exhaustive Search.

7 Conclusion

Our goal for this work was to investigate ways to balance search precision and recall. Selective Search proved to be an effective search environment for this investigation, and focusing on the shard ranking problem to achieve this goal also proved to be the correct choice. We revived an old shard selection approach, CORI, which supported competitive search performance when provided with a richer query representation.
We also introduced a novel shard ranking algorithm based on the well-established Learning-to-Rank approach, which provided the best search precision while also sustaining recall. A thorough analysis of the results showed that simply searching more shards does not necessarily increase search effectiveness. Instead, the two factors that are critically important for Selective Search to successfully balance precision and recall are: 1. partitioning the collection such that the relevant documents for the query are spread across less than 10% of the shards, and 2. employing an effective shard ranking approach, like the ones proposed in this work.

References

Robin Aly, Djoerd Hiemstra, and Thomas Demeester. 2013. Taily: Shard selection using the tail of score distributions. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 673-682. ACM.

Leo Breiman. 2001. Random forests. Machine Learning 45(1):5-32.

Christopher J. C. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An overview. Learning 11(23-581):81.

James P. Callan, W. Bruce Croft, and Stephen M. Harding. 1992. The INQUERY retrieval system. In Database and Expert Systems Applications, pages 78-83. Springer.

James P. Callan, Zhihong Lu, and W. Bruce Croft. 1995. Searching distributed collections with inference networks. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21-28. ACM.

Jamie Callan. 2002. Distributed information retrieval. In Advances in Information Retrieval, pages 127-150. Springer.

Bruce Croft and Jamie Callan. 2000. Lemur Project. https://www.lemurproject.org/.

Van Dang. 2013. Lemur Project components: RankLib. https://www.lemurproject.org/ranklib.php.

Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. 2003. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research 4(Nov):933-969.

Luis Gravano and Hector Garcia-Molina. 1999. Generalizing GlOSS to vector-space databases and broker hierarchies. Technical report, Stanford InfoLab.

Luis Gravano, Hector Garcia-Molina, and Anthony Tomasic. 1994. The effectiveness of GlOSS for the text database discovery problem. In ACM SIGMOD Record, volume 23, pages 126-137. ACM.

Luis Gravano, Héctor García-Molina, and Anthony Tomasic. 1999. GlOSS: Text-source discovery over the internet. ACM Transactions on Database Systems (TODS) 24(2):229-264.
Evangelos Kanoulas, Keshi Dai, Virgil Pavlu, and Javed A. Aslam. 2010. Score distribution models: Assumptions, intuition, and robustness to score manipulation. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 242-249. ACM.

Anagha Kulkarni and Jamie Callan. 2015. Selective search: Efficient and effective search of large textual collections. ACM Transactions on Information Systems (TOIS) 33(4):17.

Anagha Kulkarni, Almer S. Tigelaar, Djoerd Hiemstra, and Jamie Callan. 2012. Shard ranking and cutoff estimation for topically partitioned collections. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 555-564. ACM.

Ilya Markov and Fabio Crestani. 2014. Theoretical, qualitative, and quantitative analyses of small-document approaches to resource selection. ACM Transactions on Information Systems (TOIS) 32(2):9.

Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. 2010. LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval 13(4):346-374.

Milad Shokouhi. 2007. Central-rank-based collection selection in uncooperative distributed information retrieval. In European Conference on Information Retrieval, pages 160-172. Springer.

Milad Shokouhi, Luo Si, et al. 2011. Federated search. Foundations and Trends in Information Retrieval 5(1):1-102.

Luo Si and Jamie Callan. 2003. Relevant document distribution estimation method for resource selection. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 298-305. ACM.

Trevor Strohman, Donald Metzler, Howard Turtle, and W. Bruce Croft. 2005. Indri: A language model-based search engine for complex queries. In Proceedings of the International Conference on Intelligent Analysis, volume 2, pages 2-6. Citeseer.

Paul Thomas and Milad Shokouhi. 2009. SUSHI: Scoring scaled samples for server selection. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 419-426. ACM.

Jinxi Xu and W. Bruce Croft. 1999. Cluster-based language models for distributed retrieval. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 254-261. ACM.

Jun Xu and Hang Li. 2007. AdaRank: A boosting algorithm for information retrieval. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 391-398. ACM.