Balancing Precision and Recall with Selective Search

Mon Shih Chuang and Anagha Kulkarni
Department of Computer Science, San Francisco State University
1600 Holloway Ave, San Francisco, CA, USA, 94132
mchuang@mail.sfsu.edu, ak@sfsu.edu

Abstract

This work revisits the age-old problem of balancing search precision and recall using the promising new approach of Selective Search, which partitions the document collection into topic-based shards and searches a select few shards for any query. In prior work Selective Search has demonstrated strong search precision; however, this improvement has come at the cost of search recall. In this work, we test the hypothesis that improving the effectiveness of shard selection can better balance search precision and recall. Toward this goal we investigate two new shard selection approaches, and conduct a series of experiments that lead to three new findings: 1. big-document shard selection approaches can substantially outperform small-document approaches when provided with a richer query representation, 2. applying the Learning-to-Rank approach to shard ranking provides the most effective Selective Search setup, and 3. if the relevant documents for a query are spread across less than 10% of the shards, then Selective Search can successfully balance precision and recall.

1 Introduction

The problem of balancing search precision and recall is not new to the IR (Information Retrieval) community. Improving one almost always results in degrading the other. The most prominent application of IR, Web search, makes it seem that search precision is more important than recall. However, there are many other use cases of IR where search recall is equally, if not more, important than precision. Medical or health information retrieval, prior-art search, and legal information retrieval are a few examples.

Motivated by these observations, we take a fresh look at the problem of balancing search precision and recall through the lens of a promising new approach, Selective Search (Kulkarni and Callan, 2015). Selective Search is a distributed query processing approach that has been shown to improve search efficiency tremendously while sustaining effectiveness. To accomplish this, at indexing time Selective Search partitions the document collection into shards based on document similarity. The resulting shards are topically homogeneous, that is, documents about the same or related topics are in the same shard. At query time, Selective Search exploits this topical organization by restricting query processing to a select few shards. This is contrary to the traditional distributed query processing approach (Exhaustive Search), where the query is processed at all the shards. Exhaustive Search must do so because it uses random shards, where documents are allocated to shards at random; as a result, the relevant documents for a query may be spread across many (or all) shards. For topical shards, however, the relevant documents for a query are likely to be concentrated in a few (or one) shards because these documents are typically similar to each other (Xu and Croft, 1999). The shards that are likely to contain relevant documents for the query are identified using shard ranking algorithms.

We believe that the topical organization of the document collection, along with the selective nature of this search approach, can support a search environment that balances precision and recall.
We test this hypothesis in this paper through a series of experiments in which we apply well-established shard ranking approaches, and propose improvements to these algorithms that leverage the topic-based organization of the documents.

Selective Search has consistently demonstrated good performance on precision-oriented metrics. The reason for this trend is the purity of the search space: the few topical shards that are selected and searched for a query contain much less noise, that is, far fewer false-positive documents, than the complete collection. A purer search space reduces the chances of non-relevant documents being included in the search results, which directly improves precision. In order to improve recall, however, Selective Search needs to identify all the shards containing relevant documents. Thus, when optimizing for recall, more than a few shards need to be searched, and the accuracy of the shard ranking algorithm at deeper ranks also becomes critical. This observation suggests that an effective shard ranking algorithm should be able to improve search recall without degrading precision. We test this hypothesis thoroughly using empirical evaluation at early and deeper ranks. Finally, we introduce a novel shard ranking approach, Learning to Rank Shards (LeToR-S), that is based on the successful Learning-to-Rank document ranking approach (Qin et al., 2010).

This paper is organized as follows. The prior work that has informed and influenced our work is described in the next section. The proposed shard ranking approaches are described in Sections 3 and 4. The experimental setup used for evaluation is described in Section 5, followed by Results and Analysis in Section 6. The conclusions we draw from this work are provided in Section 7.

2 Related Work

Selective Search falls under the subfield of distributed information retrieval (Callan, 2002; Shokouhi et al., 2011), where the goal is to search across multiple existing resources and to aggregate the results. One of the important research problems in distributed IR is that of ranking the resources (shards) based on the number of relevant documents they contain for the query. A long line of research has studied this problem. The proposed algorithms can be roughly categorized into two groups: big-document approaches, which are based on term frequency statistics of the resources, and small-document approaches, which are based on a small sample of documents from each resource. Algorithms from both these categories are described next.

2.1 Big-document approaches

In cooperative environments, the providers of the target resources are willing to share the term statistics and metadata of the document collections. Methods that rank resources based on these term statistics are called big-document approaches, since a document retrieval model can be extended to resource ranking by treating each resource as a single entity. GlOSS (Gravano et al., 1994) and its extended versions gGlOSS (Gravano and Garcia-Molina, 1999) and vGlOSS (Gravano et al., 1999) provide a solution to the text database discovery problem. These algorithms use the ratio of term frequency to database size, along with other metadata such as field information (title, body, links), for choosing the candidate resources.

The CORI algorithm (Callan et al., 1995; Callan, 2002) keeps the document frequency (df) and shard frequency (sf) of each term, and computes the score of every shard using a variant of the tf.idf formula:

    T = \frac{df_i}{df_i + 50 + 150 \cdot sw_i / avg\_sw}    (1)

    I = \frac{\log((S + 0.5) / sf)}{\log(S + 1.0)}    (2)

    Score(t_k \mid S_i) = b + (1 - b) \cdot T \cdot I    (3)

where
df_i: the document frequency of the term t_k in shard i,
sf: the shard frequency of the term t_k (the number of shards containing t_k),
sw_i: the total number of words in shard i,
avg_sw: the average number of words per shard,
S: the number of shards,
t_k: the k-th term in the user query,
b: the minimum belief component, set to 0.4.
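To make the scoring concrete, the following Python sketch implements Equations 1 through 3 and ranks shards by summing the per-term beliefs. The function names and the shard_stats layout are illustrative choices of ours, and the additive combination of per-term beliefs is one simple instantiation; an actual CORI implementation may combine beliefs differently.

```python
import math

def cori_term_score(df_i, sf, sw_i, avg_sw, S, b=0.4):
    """Belief that shard i satisfies query term t_k (Equations 1-3)."""
    T = df_i / (df_i + 50 + 150 * sw_i / avg_sw)        # Equation 1
    I = math.log((S + 0.5) / sf) / math.log(S + 1.0)    # Equation 2
    return b + (1 - b) * T * I                          # Equation 3

def cori_rank(query_terms, shard_stats, avg_sw):
    """Rank shards by accumulating the per-term beliefs.

    shard_stats: shard id -> (dict of term -> df in that shard,
                              total word count of the shard)
    """
    S = len(shard_stats)
    # Shard frequency: number of shards in which the term occurs.
    sf = {t: sum(1 for dfs, _ in shard_stats.values() if dfs.get(t, 0) > 0)
          for t in query_terms}
    scores = {}
    for shard, (dfs, sw_i) in shard_stats.items():
        scores[shard] = sum(
            cori_term_score(dfs.get(t, 0), sf[t], sw_i, avg_sw, S)
            for t in query_terms if sf[t] > 0)
    return sorted(scores, key=scores.get, reverse=True)
```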
CORI inherited its query operators from the INQUERY document retrieval system (Callan et al., 1992), which is based on a Bayesian inference network model. The operator set used by INQUERY [sum, wsum, and, or, not] works unchanged for ranking both documents and databases.

Taily (Aly et al., 2013) is another big-document approach. According to the work of Kanoulas et al. (2010), the term-frequency-based document scores across the whole collection can be modeled by a gamma distribution. Taily therefore pre-computes the two parameters of the gamma distribution, the scale θ and the shape K, to fit the document scores of every single-term query against every shard. By storing the score distribution of each single-term query against every shard, it can estimate the score distribution of a user query with multiple terms.

2.2 Small-document approaches

In uncooperative environments, important statistics such as term frequencies or collection sizes cannot be obtained. Therefore, big-document approaches are not able to compute the shard scores. Small-document algorithms solve this issue by approximating the document distribution inside a resource from a small sampled subset, called the centralized sample database, or centralized sample index (CSI).

The ReDDE algorithm (Si and Callan, 2003) runs the user query against the CSI and assumes that the top n retrieved documents are relevant. The original version of ReDDE computes a score for each shard with the following equation:

    Score(S_i^q) = Count(n, S_i^q) \cdot \frac{|S_i|}{|S_i^{CSI}|}    (4)

where Count(n, S_i^q) is the number of documents from shard i among the top n documents retrieved from the CSI, |S_i| is the size of the shard, and |S_i^{CSI}| is the size of its sample. The shard scores are then normalized to obtain a valid probability distribution that is used to rank the shards.

CRCS (Shokouhi, 2007) runs the user query against the CSI and computes the score of each resource from the ranks of the returned documents. Two versions of CRCS are introduced, differing in the function of rank: CRCS(l) uses a simple linearly decreasing model, and CRCS(e) uses an exponentially decaying model for the document score.

SUSHI (Thomas and Shokouhi, 2009) runs the user query against the CSI and uses the returned document scores to estimate the score distribution. For each shard, SUSHI fits one of three types of curves, linear, logarithmic, or exponential, to the scores of the documents returned from the CSI.
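Of these small-document methods, ReDDE is the one used in this work, and its estimator is compact enough to sketch in full. In the Python sketch below, the data layout (shard_of, shard_size, sample_size) and the default cutoff n are illustrative assumptions of ours, not details of the original implementation.

```python
from collections import Counter

def redde_scores(csi_ranking, shard_of, shard_size, sample_size, n=100):
    """ReDDE shard scores (Equation 4), normalized to a distribution.

    csi_ranking : CSI document ids for the query, best first
    shard_of    : dict mapping a CSI document id to its source shard
    shard_size  : dict, shard id -> |S_i|
    sample_size : dict, shard id -> |S_i^CSI|
    n           : number of top CSI documents assumed relevant
    """
    counts = Counter(shard_of[doc] for doc in csi_ranking[:n])
    raw = {s: c * shard_size[s] / sample_size[s] for s, c in counts.items()}
    total = sum(raw.values())
    # Normalize so the scores form a valid probability distribution.
    return {s: score / total for s, score in raw.items()}
```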
3 CORI Uni+Bi & ReDDE Uni+Bi

The scenario in which Selective Search operates is cooperative; that is, CORI has access to the complete term statistics for each shard, while ReDDE estimates the shard ranking from a small sample of each shard. As such, one would expect CORI to perform at least on par with, if not outperform, ReDDE. However, the term statistics that CORI uses, although complete, are highly misleading. This is because traditional CORI uses a unigram model in which each query term is treated as an individual entity, which leads to a very poor representation of the query. For example, the query arizona fish and game, when decomposed into unigrams, loses the central topic fishing in arizona. CORI cannot distinguish between shards that are on topic, and shards that contain documents about arizona in some other context, or about fishing in some other state.

These observations motivated the following investigation, where a richer representation of the query is used by both CORI and ReDDE. Given a query with n terms, n-1 bigrams are generated by enumerating all pairs of consecutive terms. Bigrams containing stopwords are discarded, and the remaining bigrams are added to the original unigram query (see the sketch at the end of this section). As an example, for the query obama family tree, this approach generates the following query representation in the Indri Query Language: #combine(obama family tree #uw2(obama family) #uw2(family tree)), where the #combine operator coalesces the scores from all the elements of the query, and the #uwX operator specifies an unordered phrase of length X.

ReDDE Uni+Bi runs the query generated by the above procedure against the CSI, and the rest of the search process is the same as before. In the case of CORI Uni+Bi, frequency statistics for bigrams, in addition to unigrams, are used in order to evaluate the richer query representation. Since the bigram statistics can be precomputed offline, the response times of shard ranking and query evaluation are not affected. Single-term queries, and queries containing no phrases because of stopwords (e.g. "to be or not to be"), remain unchanged. Higher-order n-grams, such as trigrams, were not included for two reasons: 1. the cost of computing the statistics for trigrams is substantially higher, and 2. the benefits of a trigram representation are expected to be marginal because most queries are short, two or fewer terms. In search scenarios where the user queries are longer (legal or medical retrieval), a trigram query representation could be worth the additional cost. This is part of future work.
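The query-generation procedure described above is simple enough to state as code. The Python sketch below produces the Indri query string; the stopword list shown is a stand-in for illustration, not the list used in our experiments.

```python
STOPWORDS = {"a", "and", "be", "in", "not", "of", "or", "the", "to"}  # illustrative

def uni_bi_query(query):
    """Expand a query with its consecutive-term bigrams (Section 3).

    Bigrams containing a stopword are discarded; single-term queries and
    queries whose bigrams are all discarded remain effectively unchanged.
    """
    terms = query.lower().split()
    bigrams = [(a, b) for a, b in zip(terms, terms[1:])
               if a not in STOPWORDS and b not in STOPWORDS]
    unigrams = " ".join(terms)
    if not bigrams:
        return "#combine(%s)" % unigrams
    phrases = " ".join("#uw2(%s %s)" % pair for pair in bigrams)
    return "#combine(%s %s)" % (unigrams, phrases)

# uni_bi_query("obama family tree") returns
# '#combine(obama family tree #uw2(obama family) #uw2(family tree))'
```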
4 Learning to Rank Shards (LeToR-S)

Learning-to-Rank approaches have been used successfully to improve the document ranking task (Freund et al., 2003; Xu and Li, 2007; Burges, 2010). We wished to investigate whether a ranking model can be learned for the shard ranking task as well. To test this intuition we started by defining the set of features for the ranking model, and the target variable. The number of relevant documents in the shard for the query is used as the target; compared to integer rank values, this target captures more information. The shards are ranked in descending order of the predicted target value.

LeToR-S Features

The different fields or sections of the document inform the features: the title, body, heading, url, and whole document each generate a separate set of features. Several variants of CORI scores for the query are evaluated against the document fields and used as features. More specifically, CORI_SUM, CORI_MIN, and CORI_VAR are the three variants of CORI, as defined in Equations 5 through 8. Each of these scores is computed for two different query representations, unigram and phrasal, against each of the document fields.

    CORI\_SUM(Q \mid S_i) = \sum_{t_k \in Q} Score(t_k \mid S_i)    (5)

    CORI\_MIN(Q \mid S_i) = \min_{t_k \in Q} Score(t_k \mid S_i)    (6)

    \mu = \frac{1}{n} \sum_{t_k \in Q} Score(t_k \mid S_i)    (7)

    CORI\_VAR(Q \mid S_i) = \sum_{t_k \in Q} (Score(t_k \mid S_i) - \mu)^2    (8)

where Score(t_k | S_i) is the same as in Equation 3, n is the number of query terms, and μ (Equation 7) is the mean per-term score.

For each query and shard pair, the above feature vector is compiled in order to learn the ranking model, or to predict the ranking, using the RandomForest model (Breiman, 2001) implemented in RankLib (Dang, 2013) from the Lemur Project (Croft and Callan, 2000).
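For concreteness, the small Python sketch below shows how the three CORI-based features of Equations 5 through 8 are computed for one (query, shard) pair; in the full setup this computation is repeated for every document field and for both query representations. The function signature is an illustrative assumption of ours.

```python
def cori_feature_triple(query_terms, term_score):
    """CORI_SUM, CORI_MIN, and CORI_VAR (Equations 5-8) for one
    (query, shard) pair.

    term_score(t) should return Score(t|S_i) from Equation 3 for the
    field under consideration; the same computation is repeated per
    document field (title, body, heading, url, whole document) and per
    query representation (unigram, phrasal).
    """
    scores = [term_score(t) for t in query_terms]
    mu = sum(scores) / len(scores)                       # Equation 7
    return {
        "CORI_SUM": sum(scores),                         # Equation 5
        "CORI_MIN": min(scores),                         # Equation 6
        "CORI_VAR": sum((s - mu) ** 2 for s in scores),  # Equation 8
    }
```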
5 Experimental Setup

The experimental setup used for the empirical evaluation is described next. We use the CategoryB dataset of ClueWeb09, which contains 50,220,423 documents. The 92 topical shards created for the CategoryB dataset by Kulkarni and Callan (Kulkarni et al., 2012) are used in this work. The evaluation queries are from the TREC Web Track 2009-2012. Out of the 200 queries, 6 queries have no relevant documents in this dataset and thus are discarded. The remaining 194 queries are divided into 10 folds for the LeToR-S experiments, to facilitate 10-fold cross-validation. For the small-document approach, ReDDE, we construct the CSI by randomly sampling 0.5% of the documents from every shard. The same CSI was employed for all the ReDDE experiments, to minimize any spurious effects caused by sampling. The search engine used in our experiments is Indri 5.9 (Strohman et al., 2005) from the Lemur Project. For all the Selective Search experiments reported in Sections 6.1 through 6.5, the top 10 shards were searched for each query. This corresponds to a search space of about 5.5 million documents, an order of magnitude smaller than the search space of Exhaustive Search (50+ million documents).
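For orientation, the sketch below outlines the selective query processing loop used throughout these experiments: rank the shards, search the top T, and merge the per-shard result lists. Merging by raw document score is a simplifying assumption of this sketch; the exact result merging performed by Indri in our setup is not spelled out here.

```python
def selective_search(query, rank_shards, search_shard, T=10, k=100):
    """One selective-search query: rank shards, search the top T, merge.

    rank_shards(query)         -> list of shard ids, best first
                                  (e.g., CORI, ReDDE, or LeToR-S)
    search_shard(shard, query) -> list of (doc_id, score) pairs from
                                  that shard's index
    """
    merged = []
    for shard in rank_shards(query)[:T]:
        merged.extend(search_shard(shard, query))
    # Simplifying assumption: document scores are globally comparable,
    # so the per-shard lists can be merged by score directly.
    merged.sort(key=lambda doc_score: doc_score[1], reverse=True)
    return merged[:k]
```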
6 Results and Analysis

This section describes the experiments that were conducted to test the various hypotheses about improving search effectiveness by leveraging topical shards. Table 1 presents a comprehensive set of results for the four different investigations that we undertook. We describe these in the following subsections, and analyze the results in Table 1.

#    #Qrys        Run            P@30      P@100    map      R@30     R@100    ndcg
1    194          Exh            0.254     0.189    0.181    0.174    0.374    0.429
2    194          CORI           0.255     0.178    0.164    0.158    0.329    0.370
3    194          ReDDE          0.256     0.182    0.172    0.175    0.360†   0.400†
4    52 (len=1)   CORI           0.260     0.189    0.163    0.132    0.296    0.391
5    52 (len=1)   ReDDE          0.250     0.184    0.155    0.131    0.295    0.387
6    142 (len>1)  CORI           0.254     0.173    0.164    0.167    0.341    0.362
7    142 (len>1)  ReDDE          0.258     0.181    0.178    0.191†   0.384†   0.404†
8    194          CORI Uni+Bi    0.271§‡   0.188§   0.181§   0.180§   0.363§   0.408§
9    194          ReDDE Uni+Bi   0.263     0.184    0.175    0.182    0.363    0.398
10   194          LeToR-S        0.270‡    0.194¶   0.186¶   0.179    0.377¶   0.417¶

Table 1: Search effectiveness for Exhaustive Search, and for Selective Search with CORI and ReDDE under various configurations. † indicates statistically significant improvement when comparing ReDDE with CORI. § indicates statistically significant improvement of a method's Uni+Bi variant over its unigram version. ‡ indicates statistically significant improvement over Exhaustive Search. ¶ indicates statistically significant improvement of LeToR-S over the Uni+Bi methods. Underline indicates significantly worse values when compared to Exhaustive Search. Statistical significance testing was performed using a paired t-test at p < 0.05.

6.1 Big-Document versus Small-Document Shard Ranking Approaches

Traditionally, the choice of a small- or big-document approach was dictated by the type of search environment: uncooperative or cooperative, respectively. Although our scenario is cooperative, we choose to also experiment with small-document approaches because the conventional belief has been that small-document approaches provide superior search effectiveness to big-document approaches. This first experiment empirically tests this belief, specifically in the context of topical shards. We also compare Exhaustive Search with Selective Search using the big-document and small-document approaches. The first three rows in Table 1 are the focus of this analysis.

As compared to Exhaustive Search, CORI and ReDDE both struggle at deeper ranks. CORI, the big-document approach, is consistently inferior to ReDDE, the small-document approach, across all the metrics. In fact, at deeper ranks (R@100 and ndcg) the improvements of ReDDE over CORI are statistically significant. These results confirm that the conventional unigram language model based shard ranking approach adopted by CORI struggles to differentiate the relevant shards from the non-relevant shards. This is so even for topical shards, where the distribution of relevant documents across shards is highly skewed. Also note that CORI has access to the vocabulary of the complete collection, whereas ReDDE uses only a 0.5% subset of the collection for estimating the shard ranking. CORI's inferior performance is thus especially surprising. These observations motivate the experiment described next.

6.2 Effect of Query Length

The hypothesis that we test in this section is that for multi-term queries the unigram language model used by CORI severely misinforms the shard ranking estimation. This is especially true for multi-term queries that contain phrases. For example, in the query obama family tree, it is critically important to treat the terms family and tree as a phrase and not as unigrams. The results in rows 4 through 7 of Table 1 provide evidence in support of this hypothesis. Rows 4 and 5 are results for the 52 queries that are singleton queries. Across all the metrics, the search effectiveness with CORI is higher in magnitude than that with ReDDE, which is exactly the opposite of the trend seen with multi-term queries. These results establish that CORI's subpar performance is restricted to multi-term queries. ReDDE, on the other hand, struggles more with singleton queries because estimation errors due to under-sampling are more likely when there is only one term in the query to inform the shard ranking.

6.3 Effect of Richer Query Representation

CORI's inferior performance with multi-term queries motivates the investigation in this section. The results with CORI and ReDDE when using the richer query representation of Section 3 are given in rows 8 and 9 of Table 1. These results show the opposite trend from the unigram query representation (rows 2 and 3). The big-document approach (CORI Uni+Bi) performs better, although not significantly so, than the small-document approach (ReDDE Uni+Bi). CORI clearly benefits more from the richer query representation than ReDDE: CORI Uni+Bi results are significantly better than those with CORI, which is not the case for ReDDE Uni+Bi. At early ranks, CORI Uni+Bi is significantly better than even Exhaustive Search. This indicates a substantial reduction in false positives among the retrieved documents at early ranks, which is facilitated by two factors: topic-based shards reduce the noise (false-positive matches) in each shard, and CORI Uni+Bi selects the shards such that the resulting search space is purer than that used by Exhaustive Search.

Although not shown in the table, the results for multi-term queries follow the earlier trends: for all the metrics, the magnitude of search effectiveness with CORI Uni+Bi is higher than with ReDDE Uni+Bi, and CORI Uni+Bi is significantly better than CORI, whereas for ReDDE that is not the case. For singleton queries the results do not change, because the richer query representation does not yield a different query. In summary, CORI Uni+Bi provides the best search effectiveness so far. In the next section we investigate whether we can improve the performance further.

6.4 Learning to Rank Shards

The last row in Table 1 reports the results of the Learning to Rank Shards approach (LeToR-S). The first obvious trend in these results is that LeToR-S significantly outperforms the current best, CORI Uni+Bi, at deeper ranks. At early ranks, however, the two approaches provide comparable precision and recall. To understand these results better, we analyze a few queries in detail.

For one of the queries, getting organized, CORI Uni+Bi ranks the shard with the largest number of relevant documents at the 11th position, while LeToR-S ranks it 3rd. This is a difficult query because both query terms are common terms; even when the terms are treated as a phrase, it is still not a focused query. This is reflected in the low scores assigned to the relevant shards by CORI Uni+Bi. LeToR-S, however, uses metadata in addition to the document contents for defining its features. One particular metadata feature, the url field, proves to be especially valuable for this query of common terms. Documents that contain getting organized in their url field are relevant to the query; in turn, shards that contain such documents should be ranked higher too. In short, LeToR-S benefits from having the field score features, while CORI Uni+Bi suffers because it uses only document contents for shard ranking. A few more example queries that highlight the value of the field score features are battles in the civil war and kansas city mo. For both queries, CORI Uni+Bi ranks the most relevant shard at a much deeper rank than LeToR-S.

Another feature category that helps LeToR-S outperform CORI Uni+Bi is the CORI minimum score feature. Recall that the CORI minimum score is the lowest CORI score computed for the individual query terms against a shard. This feature models the intuition that all of the query terms should have a high CORI score for a relevant shard; a low CORI score, even for only one of the query terms, indicates a lower likelihood of shard relevance. For the query pacific northwest laboratory, only one shard contains all the relevant documents; LeToR-S ranks this shard in 8th place, while CORI Uni+Bi ranks it at 11. Through the CORI minimum score feature, several false-positive shards are eliminated from the shard ranking by LeToR-S. These false-positive shards have a high overall CORI score because some of the query terms have high CORI scores that dominate the cumulative score. The CORI minimum score, however, captures the fact that some query terms have low CORI scores for these false-positive shards, and thus pushes them down in the shard ranking.

The results in Table 1 also indicate that at early ranks LeToR-S performs significantly better than Exhaustive Search. This improvement often, but not always, comes from single-term queries that may have more than one meaning or aspect associated with them (euclid, avp, iron, unc). The topic-based partitioning of the collection organizes the documents with a similar meaning or aspect into the same shard. Often, one of the meanings is more dominant than the others in the collection, and that is also often the relevant meaning for the query. Shards with the dominant meaning have a higher document frequency (df) than shards with the rare meaning, and thus only the documents with the dominant meaning are searched. This removes false-positive documents (documents with the rare meaning) from the results, and thus improves search precision.
6.5 Effect of the Distribution of Relevant Documents

When comparing the best performing Selective Search approach, LeToR-S, with Exhaustive Search, we see in Table 1 that at early ranks LeToR-S performs at par with or better than Exhaustive Search in both precision and recall. However, at deeper ranks, LeToR-S struggles more on recall than on precision, as indicated by the significantly lower ndcg value. Our hypothesis for the reason behind this trend is that LeToR-S is unable to retrieve all the shards containing relevant documents. In order to test this hypothesis we conduct the following experiment.

The effectiveness of a shard ranking algorithm is dependent on the distribution of the relevant documents across shards. If all the relevant documents are concentrated in a few shards, then the shard ranking task is straightforward; however, if the relevant documents are spread across many shards, then the task is much more challenging for any shard ranker. Figure 1 provides the histogram of the spread of relevant documents for the 194 queries.

[Figure 1: Histogram of the number of shards containing relevant documents for the query.]

For a large fraction of the queries (27), the spread of the relevant documents is restricted to 4 shards, and 70% of the queries have a spread of 7 or less. For the remaining 30% of the queries the relevant documents can be spread across as many as 31 shards, indicating that the topic-based sharding approach failed to concentrate the relevant documents into a few shards for these queries. We are, however, interested in the 70% of the queries for which the spread of relevant documents is restricted to 7 shards. We believe that LeToR-S should be able to provide an effective shard ranking for this subset of queries, which in turn should help improve both precision and recall. To test this intuition we separate the 194 queries into two groups based on a spread cutoff of ≤ 7. This gives us one group of 133 queries for which the relevant documents are spread across 7 or fewer shards; the other group contains the remaining 61 queries.
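The grouping itself is straightforward to reproduce given the relevance judgments and the document-to-shard assignment. The sketch below, with a data layout of our choosing, computes the per-query spread that underlies Figure 1 and this split.

```python
from collections import defaultdict

def shard_spread(qrels, shard_of):
    """Per-query spread: number of distinct shards that hold at least
    one relevant document.

    qrels    : iterable of (query_id, doc_id, relevance) judgments
    shard_of : dict mapping doc_id -> shard id in the topical partition
    """
    shards = defaultdict(set)
    for qid, doc_id, rel in qrels:
        if rel > 0 and doc_id in shard_of:
            shards[qid].add(shard_of[doc_id])
    return {qid: len(ids) for qid, ids in shards.items()}

# spread = shard_spread(qrels, shard_of)
# group_small = [q for q, s in spread.items() if s <= 7]   # 133 queries
# group_large = [q for q, s in spread.items() if s > 7]    # 61 queries
```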
The results for the two query groups with the various search approaches are given in Tables 2 and 3. For the first group of queries (Table 2), LeToR-S (and also CORI Uni+Bi) provides statistically significant improvements over Exhaustive Search in precision at both early and deep ranks. Furthermore, these improvements in precision do not come at the cost of recall: the recall (at all ranks) with LeToR-S stays comparable to that of Exhaustive Search. Even the ndcg with LeToR-S is not statistically different from that of Exhaustive Search. This is a rare phenomenon: a search approach being able to balance precision and recall.

Search Approach   P@30     P@100    map      R@30     R@100    ndcg
Exh               0.218    0.153    0.166    0.198    0.399    0.405
CORI              0.229    0.154    0.165    0.178    0.361    0.369
ReDDE             0.225    0.154    0.169    0.202    0.397    0.391
CORI Uni+Bi       0.241‡   0.163    0.179‡   0.205    0.401    0.400
ReDDE Uni+Bi      0.233‡   0.156    0.171    0.209    0.400    0.385
LeToR-S           0.239‡   0.164‡   0.178‡   0.204    0.411    0.401

Table 2: Results on the 133 queries with number of relevant shards ≤ 7. ‡ indicates statistically significant improvement over Exhaustive Search. Underline indicates significantly worse values when compared to Exhaustive Search. Statistical significance testing was performed using a paired t-test at p < 0.05.

Search Approach   P@30     P@100    map      R@30     R@100    ndcg
Exh               0.332    0.265    0.215    0.124    0.320    0.482
CORI              0.312    0.228    0.162    0.114    0.258    0.372
ReDDE             0.322    0.243    0.179    0.118    0.279    0.418
CORI Uni+Bi       0.336    0.243    0.187    0.125    0.282    0.424
ReDDE Uni+Bi      0.331    0.246    0.185    0.122    0.283    0.424
LeToR-S           0.340    0.261¶   0.204¶   0.124    0.303¶   0.452¶

Table 3: Results on the 61 queries with number of relevant shards > 7. ¶ indicates statistically significant improvement of LeToR-S over the Uni+Bi methods. Underline indicates significantly worse values when compared to Exhaustive Search. Statistical significance testing was performed using a paired t-test at p < 0.05.

The results in Table 3 tell a different story: at deeper ranks, both precision and recall suffer with all the Selective Search approaches. These results clearly establish the importance of concentrating the relevant documents into a few shards. Doing so not only reduces the search cost but also substantially improves search precision without degrading recall.

6.6 Effect of the Number of Shards Searched

For all the experiments until now we have held the shard cutoff parameter (T) constant at 10. That is, for all the Selective Search experiments, the top 10 shards, out of 92 shards, were searched for each query. Changing this parameter directly affects the cost of Selective Search, and it also influences search effectiveness. The influence of this parameter in the general distributed search setup has been extensively investigated by Markov and Crestani (2014). In this section we study the effect of the parameter T on the two best performing Selective Search approaches, CORI Uni+Bi and LeToR-S, and compare them to Exhaustive Search.

T    A       Search Approach   P@30     P@100    map      R@30     R@100    ndcg
92   50.22   Exh               0.254    0.189    0.181    0.174    0.374    0.429
1    0.561   CORI Uni+Bi       0.215    0.124    0.120    0.133    0.223    0.247
     0.560   LeToR-S           0.230    0.139¶   0.124    0.139¶   0.246    0.266
3    1.672   CORI Uni+Bi       0.267    0.174    0.166    0.174    0.327    0.353
     1.666   LeToR-S           0.281‡   0.183    0.174    0.180    0.342    0.380¶
5    2.773   CORI Uni+Bi       0.272    0.182    0.174    0.177    0.342    0.375
     2.768   LeToR-S           0.275‡   0.185    0.176    0.180    0.351    0.396¶
7    3.874   CORI Uni+Bi       0.275‡   0.187    0.178    0.171    0.350    0.393
     3.869   LeToR-S           0.274‡   0.190    0.181    0.177    0.365    0.406
10   5.513   CORI Uni+Bi       0.271‡   0.188    0.181    0.180    0.363    0.408
     5.510   LeToR-S           0.270‡   0.194¶   0.186    0.179    0.377¶   0.417¶

Table 4: Results for Exhaustive Search, and for Selective Search with CORI Uni+Bi and with LeToR-S at various shard cutoffs (T). A: the average search space size per query, in millions of documents. ¶ indicates statistically significant improvement when comparing LeToR-S with CORI Uni+Bi. ‡ indicates statistically significant improvement when comparing LeToR-S or CORI Uni+Bi with Exhaustive Search. Underline indicates significantly worse values when compared to Exhaustive Search. Statistical significance testing was performed using a paired t-test at p < 0.05.

Table 4 provides the results for this analysis. At early ranks, LeToR-S performs on par with Exhaustive Search while searching just the top three shards. The corresponding search cost, approximated by A, is orders of magnitude lower for LeToR-S than for Exhaustive Search. When comparing LeToR-S with CORI Uni+Bi, the former consistently outperforms the latter at all the shard cutoff values. Even the search costs for LeToR-S are marginally lower than those of CORI Uni+Bi, indicating a bias toward smaller shards in the case of LeToR-S. As more shards are searched, the recall at deeper ranks with LeToR-S becomes on par with Exhaustive Search. The analysis in the previous section demonstrated that LeToR-S is comparable to Exhaustive Search even on the ndcg metric if the spread of the relevant documents is restricted. The corresponding search cost of the Selective Search approaches is at least an order of magnitude lower than that of Exhaustive Search.

7 Conclusion

Our goal for this work was to investigate ways to balance search precision and recall. Selective Search proved to be an effective search environment for this investigation, and focusing on the shard ranking problem to achieve this goal also proved to be the correct choice. We revived an old shard selection approach, CORI, which supported competitive search performance when provided with a richer query representation.
We also introduced a novel shard ranking algorithm based on the well-established Learning-to-Rank approach, which provided the best search precision while also sustaining recall. A thorough analysis of the results showed that simply searching more shards does not necessarily increase search effectiveness. Instead, the two factors that are critically important for Selective Search to successfully balance precision and recall are: 1. partitioning the collection such that the relevant documents for the query are spread across less than 10% of the shards, and 2. employing an effective shard ranking approach, like the ones proposed in this work.

References

Robin Aly, Djoerd Hiemstra, and Thomas Demeester. 2013. Taily: Shard selection using the tail of score distributions. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 673-682. ACM.

Leo Breiman. 2001. Random forests. Machine Learning 45(1):5-32.

Christopher J. C. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An overview. Learning 11(23-581):81.

James P. Callan, W. Bruce Croft, and Stephen M. Harding. 1992. The INQUERY retrieval system. In Database and Expert Systems Applications, pages 78-83. Springer.

James P. Callan, Zhihong Lu, and W. Bruce Croft. 1995. Searching distributed collections with inference networks. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21-28. ACM.

Jamie Callan. 2002. Distributed information retrieval. In Advances in Information Retrieval, pages 127-150. Springer.

Bruce Croft and Jamie Callan. 2000. Lemur Project. https://www.lemurproject.org/.

Van Dang. 2013. Lemur Project components: RankLib. https://www.lemurproject.org/ranklib.php.

Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. 2003. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research 4(Nov):933-969.

Luis Gravano and Hector Garcia-Molina. 1999. Generalizing GlOSS to vector-space databases and broker hierarchies. Technical report, Stanford InfoLab.

Luis Gravano, Hector Garcia-Molina, and Anthony Tomasic. 1994. The effectiveness of GlOSS for the text database discovery problem. In ACM SIGMOD Record, volume 23, pages 126-137. ACM.

Luis Gravano, Héctor García-Molina, and Anthony Tomasic. 1999. GlOSS: Text-source discovery over the internet. ACM Transactions on Database Systems (TODS) 24(2):229-264.
Evangelos Kanoulas, Keshi Dai, Virgil Pavlu, and Javed A. Aslam. 2010. Score distribution models: Assumptions, intuition, and robustness to score manipulation. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 242-249. ACM.

Anagha Kulkarni and Jamie Callan. 2015. Selective search: Efficient and effective search of large textual collections. ACM Transactions on Information Systems (TOIS) 33(4):17.

Anagha Kulkarni, Almer S. Tigelaar, Djoerd Hiemstra, and Jamie Callan. 2012. Shard ranking and cutoff estimation for topically partitioned collections. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 555-564. ACM.

Ilya Markov and Fabio Crestani. 2014. Theoretical, qualitative, and quantitative analyses of small-document approaches to resource selection. ACM Transactions on Information Systems (TOIS) 32(2):9.

Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. 2010. LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval 13(4):346-374.

Milad Shokouhi. 2007. Central-rank-based collection selection in uncooperative distributed information retrieval. In European Conference on Information Retrieval, pages 160-172. Springer.

Milad Shokouhi, Luo Si, et al. 2011. Federated search. Foundations and Trends in Information Retrieval 5(1):1-102.

Luo Si and Jamie Callan. 2003. Relevant document distribution estimation method for resource selection. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 298-305. ACM.

Trevor Strohman, Donald Metzler, Howard Turtle, and W. Bruce Croft. 2005. Indri: A language model-based search engine for complex queries. In Proceedings of the International Conference on Intelligent Analysis, volume 2, pages 2-6. Citeseer.

Paul Thomas and Milad Shokouhi. 2009. SUSHI: Scoring scaled samples for server selection. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 419-426. ACM.

Jinxi Xu and W. Bruce Croft. 1999. Cluster-based language models for distributed retrieval. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 254-261. ACM.

Jun Xu and Hang Li. 2007. AdaRank: A boosting algorithm for information retrieval. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 391-398. ACM.