Collection Selection with Highly Discriminative Keys

Sander Bockting                          Djoerd Hiemstra
Avanade Netherlands B.V.                 University of Twente
Versterkerstraat 6                       P.O. Box 217
1322 AP, Almere, Netherlands             7500 AE, Enschede, Netherlands
sander.bockting@avanade.com              d.hiemstra@utwente.nl

ABSTRACT
The centralized web search paradigm introduces several problems, such as large data traffic requirements for crawling, index freshness problems and problems to index everything. In this study, we look at collection selection using highly discriminative keys and query-driven indexing as part of a distributed web search system. The approach is evaluated on different splits of the TREC WT10g corpus. Experimental results show that the approach outperforms a Dirichlet smoothing language modeling approach for collection selection, if we assume that web servers index their local content.

1. INTRODUCTION
The web search approach of major search engines, like Google, Yahoo! and Bing, amounts to crawling, indexing and searching. We call this approach centralized search, because all operations are controlled by the search engines themselves, be it from a relatively limited number of locations on large clusters of thousands of machines. The centralized web search paradigm poses several problems.

The amount of web data is estimated to grow exponentially [34]. The changing and growing data requires frequent visits by crawlers, just to keep the index fresh. Crawling should be done often, but generates a huge amount of traffic, making it impossible to do frequent crawls of all pages. With an estimated four-week update interval, updates are performed relatively slowly [25, 35]. Also, it is impossible to index everything, as the search engine accessible visible web is only a fraction of the total number of web pages [16].

Callan [5] identified the distributed information retrieval problem set, consisting of resource description, resource selection and result merging. We believe that distributing the search effort may be an approach to solve the problems described above. This research focuses on resource description and resource selection [10, 22]. Resource description, i.e. indexing of the peers, is distributed; resource selection, or collection selection, is centralized in our research.

[Figure 1: Peers are accessible via a broker to answer queries from clients]

Figure 1 shows three types of entities: peers, brokers and clients. We assume that peers are collaborative. Every peer runs a search engine enabled web server. The search engine indexes the local website(s), but it may also index other websites. In this scenario, there can be many millions of peers. When a user has an information need, he can pose a query at the client. The client sends the query to the broker. In response, the broker tries to identify the most promising peers to answer the query. This has to be a small number of peers, e.g. five to ten, so that not a lot of traffic is generated. The query is routed to those peers and the results are returned to the broker. The broker merges the results and sends the merged list to the client.

Peers and brokers cooperate to enable brokers to identify the most promising peers. To this end, every peer sends a small part of its index to the broker. This part cannot be too small, to still allow for proper judgments about the peers' ability to satisfactorily answer queries. The index part cannot be too large either, due to index data traffic scalability. Techniques have been proposed to manage the index size.

Podnar et al. [20] used the concept of highly discriminative keys (HDKs) for document retrieval in distributed information retrieval. An HDK is a set of terms that is highly discriminative, i.e., that only matches a few documents in the collection. Because the terms are pre-coordinated (they are combined at index time, not at search time) and because only a few documents match all terms in a pre-coordinated set, the HDK approach is able to retrieve the top documents for a query very efficiently. Although searching can be efficient, the HDK indexing process, described in more detail in Section 3.1, has the negative side effect that a huge number of highly discriminative keys is generated. To reduce the number of keys, Skobeltsyn [32] proposed a query-driven indexing strategy that uses caching techniques to adapt to the changing querying behavior of users. The combination of HDKs with query-driven indexing allows for completely distributed document retrieval that in theory grows to web-scale proportions [31].

This paper contributes to the field of distributed information retrieval by applying HDKs and query-driven indexing to select collections, instead of documents. Such an approach would in theory scale the distributed search scenario described above to millions of peers: the broker lists for every HDK a small number of peers to send the query to, and the peers retrieve the documents; possibly many. Unlike a traditional inverted file index, which typically consists of huge posting lists and, in comparison, a tiny dictionary [36], our HDK index consists of a huge dictionary and, in comparison, tiny posting lists. The system is fitted into the previously sketched scenario, which allows for control at the broker. This control can for example be used to prevent misuse or to allow for domain-specific search.

This paper is organized as follows: the next section discusses earlier collection selection methods, Section 3 introduces our collection selection system and Section 4 describes the evaluation. The paper concludes with results and conclusions.

Copyright (c) 2009 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. Re-publication of material from this volume requires permission by the copyright owners. This volume is published by its editors. LSDS-IR Workshop. July 2009. Boston, USA.

2. EARLIER WORK ON COLLECTION SELECTION
Collection selection systems have been developed to select collections containing documents that are relevant to a user's query. The generalized Glossary-Of-Servers Server (gGlOSS) is such a system [13]. It uses a vector space model, representing index items (document collections) and user queries as weight vectors in a high-dimensional Euclidean space, to calculate the distance (or similarity) between document collections and queries [24].

Another well-known approach is CVV, which exploits the variation in cue validity to select collections [38]. The cue validity CV_i,j of query term t_j for collection c_i measures the extent to which t_j discriminates c_i from the other collections, by comparing the ratio of documents in c_i containing t_j to the ratios of documents in other collections containing t_j. The larger the variation in cue validities for collections with respect to a term, the better the term is for selecting collections.

This section will describe two collection selection methods in more detail: inference networks and language modeling.

2.1 Inference networks
cori [7] is a collection ranking algorithm for the inquery retrieval system [6], and uses an inference network to rank collections. A simple document inference network has leaves d representing the document collections. The terms that occur in those collections are represented by representation nodes r. Flowing along the arcs between the leaves and nodes are probabilities based on document collection statistics. As opposed to tf.idf, the probabilities are calculated using document frequencies df and inverse collection frequencies icf (df.icf). The inverse collection frequency is based on the number of collections that contain the term. An inference network with these properties is called a collection retrieval inference network (cori net).

Given query q with terms r_k, the network is used to obtain a ranked list of collections by calculating the belief p(r_k|c_i) in collection c_i due to the observation of r_k. The collection ranking score of c_i for query q is the sum of all beliefs p(r_k|c_i), where r_k ∈ q. The belief is calculated using Formula 1. In this formula, b and l are constants, cw_i is the number of words in c_i, mean_cw is the mean number of words in the collections and |C| is the number of collections. df and cf respectively are the number of documents and collections that contain r_k. Finally, d_t and d_b respectively are the minimum term frequency component and minimum belief component when r_k occurs in c_i.

  L = l · ((1 − b) + b · cw_i / mean_cw)
  T = d_t + (1 − d_t) · log(df) / log(df + L)
  I = log((|C| + 0.5) / cf) / log(|C| + 1.0)
  p(r_k|c_i) = d_b + (1 − d_b) · T · I                    (1)

To improve retrieval, the component L is used to scale the document frequency in the calculation of T [6, 23]. L is made sensitive to the number of documents that contain term r_k (using b) and is made large using l. L should be large, because df is generally large. Proper tuning of these two parameters is required for every data set, but deemed impossible as the correct settings are highly sensitive to data set variations [12]; the value of l should be varied largely even when keeping the data set constant while varying the query type.

Further research showed that it is not justified to use standard cori as a collection selection benchmark. D'Souza et al. showed that HighSim outperformed cori in 15 of 21 cases [11]. Si and Callan [28] found limitations with different collection sizes: large collections are not often ranked high by cori, although they often are the most promising collections.

2.2 Language modeling
A language model is a statistical model to assign a probability to a sequence of words (e.g. a query) being generated by a particular document or document collection. Language models can be used for collection selection in distributed search, by creating a language model for each collection [29, 37]. They have also been used for collection selection in other tasks, for instance blog search [2, 26].

indri is an improved version of the inquery retrieval system [33], as it is capable of dealing with larger collections, allows for more complex queries due to new query constructs, and uses formal probabilistic document representations based on language models. The combined model of inference networks and language modeling is capable of achieving more favorable retrieval results than inquery [18]. Due to these differences, term representation beliefs are calculated in another way than with cori (as described in Section 2.1):

  P(r|D) = (tf_r,D + α_r) / (|D| + α_r + β_r)

The belief is calculated for representation concept r of document node D (in document collection C). Examples of representation concepts are a term in the body or title of a document. D and r are nodes in the belief network. The term frequency of representation node r in D is denoted by tf_r,D. α_r and β_r are smoothing parameters. Smoothing is used to make the maximum likelihood estimates of a language model more accurate; without it, the estimates would suffer from data sparseness, because not every term occurs in every document [39]. Smoothing ensures that terms that are unseen in the document are not assigned zero probability. The default smoothing model for indri is Dirichlet smoothing, which affects the smoothing parameter choices [17]. In setting the smoothing parameters, it was assumed that the likelihood of observing representation concept r is equal to the probability of observing it in collection C, given by P(r|C). This means that α_r/(α_r + β_r) = P(r|C). The following parameter values were chosen:

  α_r = µ · P(r|C)
  β_r = µ · (1 − P(r|C))

This results in the following term representation belief, where the free parameter µ has a default value of 2500:

  P(r|D) = (tf_r,D + µ · P(r|C)) / (|D| + µ)

2.3 Discussion
The language modeling approach of indri has a better formal probabilistic document representation than cori, and indri is an improved version of inquery (which is the foundation of cori). We will use the language model on document collections as implemented by indri as our baseline collection selection system. Si et al. [29] showed that a language modeling approach for collection selection outperforms cori. Furthermore, cori outperforms algorithms like cvv and gGlOSS in several studies [9, 21].

3. SOPHOS
Sophos is a collection selection prototype that uses HDKs to index collections. The keys are used to assign scores to the collections. Using a scoring function, collection scores can be calculated to rank collections for a particular query. A general overview is depicted in Figure 2. This section describes how the broker index is created, explains index size reduction using a query-driven indexing approach, identifies query result parameters, and concludes with a collection ranking formula (ScoreFunction in Figure 2).

[Figure 2: General overview of the indexing and collection selection system Sophos]

3.1 Highly discriminative keys
Building the index is done in two phases. First, every peer builds an index of its document collection and sends that index to the broker. Second, the broker constructs a broker index from all peer indices.

3.1.1 Peer indexing
Peer indexing starts with the generation of term set statistics. First, single term statistics are generated by counting term frequencies of every word in the collection, without looking at document boundaries in the collection. A term set is called frequent when it occurs more often than a term set frequency maximum tf_max. Every infrequent single term is added to the peer index together with its frequency count. The frequent keys are stored for further analysis.

The next step consists of obtaining double term set statistics. For every frequent term in the collection, frequency statistics are created for term combinations that consist of the frequent term and a term that appears after that term within a window size ws. The result is a list of double terms with corresponding frequency counts. Once again, the term set frequency maximum defines which term sets are frequent and will be used for further analysis, and which term sets are infrequent and will be stored in the peer index. This procedure can be repeated as long as the generated term sets do not contain more than ws terms, or until a predefined maximum number of terms in a term set, h_max, has been reached.

Summarizing, the peer index consists of tuples with term sets and corresponding frequency counts.

3.1.2 Broker indexing
The infrequent term sets from the peer indices are sent to the broker. The broker index contains term sets with associated sets of collection identifier counters. A collection identifier counter is a tuple of a collection identifier and a term set frequency counter. A collection identifier is a short representation of a collection where the term set occurs. When a term set is inserted into the broker index, it is called a highly discriminative key (HDK). The broker index will contain a maximum number of collections per HDK, denoted by the collection maximum cm. As soon as the maximum number of collections is reached for a particular HDK, the cm collections with the largest term set frequencies will be stored in the broker index.
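The peer-side procedure of Section 3.1.1 can be sketched in a few lines of Python. This is an illustrative reconstruction under our reading of the text, not the authors' implementation; all names are our own, and we assume a window spans the head term plus the ws − 1 terms that follow it:

```python
from collections import Counter
from itertools import combinations

def build_peer_index(tokens, tf_max=250, ws=6, h_max=3):
    """Sketch of HDK peer indexing: term sets occurring more than
    tf_max times are 'frequent' and are extended with co-occurring
    terms; infrequent sets are stored in the peer index."""
    # Phase 1: single-term statistics, ignoring document boundaries.
    counts = Counter((t,) for t in tokens)
    peer_index = {}   # infrequent term sets -> frequency count
    frequent = set()  # frequent term sets, to be extended further
    for key, count in counts.items():
        if count <= tf_max:
            peer_index[key] = count
        else:
            frequent.add(key)

    # Phases 2..h_max: extend frequent term sets with terms that
    # appear after the head term within the window.
    for size in range(2, h_max + 1):
        counts = Counter()
        for i, head in enumerate(tokens):
            for rest in combinations(tokens[i + 1 : i + ws], size - 1):
                if head in rest:
                    continue  # skip degenerate sets with repeated terms
                key = (head,) + rest
                if key[:size - 1] in frequent:
                    counts[key] += 1
        frequent = set()
        for key, count in counts.items():
            if count <= tf_max:
                peer_index[key] = count
            else:
                frequent.add(key)
    return peer_index
```

On the broker side, the infrequent tuples returned here would be shipped to the broker, which keeps at most cm collection identifier counters per HDK (Section 3.1.2).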
3.2 Query-driven indexing
A list of popular keys can be created by extracting all unique queries from a query log. Every key that is infrequent at a peer, and which is present in the unique query list, will be sent to the broker; the other keys are filtered out to reduce the broker index size and to reduce traffic.

This index pruning strategy was used before by Shokouhi et al. [27]. It is not the most desirable strategy for query-driven indexing, because it is unable to deal with unseen query terms. However, it will give a good idea of the possible index size reduction and the loss of retrieval performance.

3.3 Identifying query result parameters
Once the broker index has been built, the system is ready to be queried. The broker index contains HDKs with h terms, where h varies from 1 to h_max. In Sophos, h_max is set to 3. Every key has an associated posting list, which contains tuples of collection identifiers and counters. A counter indicates the number of occurrences of a certain key within the corresponding collection. The counter value cannot exceed the term set frequency maximum, tf_max, as the key would otherwise have been locally frequent, and new HDKs would have been generated from it as long as the maximum number of terms, h_max, was not yet reached.

When a user poses a query with q terms, e.g. abcde with q = 5, the query is first decomposed into query term combinations of length h_max (i.e. abc, abd, ..., bde, cde). This results in C(q, h_max) combinations. Each combination is looked up in the broker index. Note that this introduces additional load on the broker, but these lookups do not require network access to the peers. The results of each of those smaller queries are merged by summing the number of occurrences per collection.

An important detail to mention about querying and ranking is that collection X can be found after looking for HDKs with length h. When the same collection X is found again after looking for HDKs with length h − 1, those results are discarded, as the collection was already found using larger HDKs. Counts c are only compared with other counts that are retrieved after looking for HDKs of the same length. The motivation for this behavior is that smaller HDKs tend to have higher counts.

  h       #terms in an HDK
  h_max   maximum #terms that an HDK can consist of
  q       #terms in query q
  n       #distinct query terms found in a collection
  c       sum of query term set occurrences within one collection
          (grouped by sets with h terms)
  tf_max  maximum frequency of a term set in a collection

Table 1: Parameter definitions for query result handling

ScoreFunction calculates the score s of a collection for a given query:

  s = log10( (h − 1) + (n − 1)/q
        + (c · α^(q−n)) / ((h_max + 1 − h) · C(n, h) · tf_max · q) ) / h_max     (2)

It consists of three components, one component per desired ranking property. The properties, and corresponding components, are listed below in order of importance.

1. Collections that contain longer HDKs should be ranked higher. Component 1: h − 1.

2. A collection should be ranked higher if it contains more distinct query terms. Component 2: (n − 1)/q.

3. More occurrences of query term sets within a collection should result in a higher collection ranking. Component 3: (c · α^(q−n)) / ((h_max + 1 − h) · C(n, h) · tf_max · q).

The general intuition behind the third component is that a collection is more important when it contains more query terms. This is controlled by a damping factor α. The term set counts have less impact when they get closer to tf_max, because longer keys would be generated for a term set in a collection when its frequency exceeds tf_max. We refer to earlier work for a more detailed explanation of ScoreFunction [4].

Each of the three components has a share in the collection score. The component share of a less desired property never exceeds the smallest possible share of a more desired property's component value.
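As an illustration only, the score of Formula 2 can be computed as follows. The parameter names follow Table 1 and the defaults for h_max and tf_max mirror Section 4; the exact placement of the divisions is our reconstruction from the component descriptions, and the value of the damping factor α is an arbitrary placeholder, as the paper does not state it here:

```python
from math import comb, log10

def score(h, n, c, q, h_max=3, tf_max=250, alpha=0.5):
    """Sketch of ScoreFunction (Formula 2) for one collection and one
    HDK length h; assumes n >= h and c >= 1."""
    # Component 1: collections containing longer HDKs rank higher.
    component1 = h - 1
    # Component 2: more distinct query terms found in the collection.
    component2 = (n - 1) / q
    # Component 3: term set occurrence counts, damped by alpha when
    # not all q query terms are found (n < q).
    component3 = (c * alpha ** (q - n)) / (
        (h_max + 1 - h) * comb(n, h) * tf_max * q)
    return log10(component1 + component2 + component3) / h_max
```

Under this reading, the intended ordering properties hold: a longer matching HDK or an additional matched query term always outweighs any difference in raw occurrence counts.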
The sum of all counters, c, has a potential maximum of C(q, h) · tf_max. It may happen that this procedure results in few or no collections where an HDK occurs. The procedure is then repeated for smaller term sets: first, term sets of two terms will be looked up in the index. When even this gives too few results, single terms will be looked up in the index. In the case that one collection contains two different combinations, e.g. both abc and bce occur in collection X, the numbers of occurrences are summed (this is the c that was just introduced). However, it is also interesting to note that 4 out of 5 query terms can be found in collection X. The number of query terms that can be found using HDKs of a particular length is indicated by n. The different parameters are displayed in Table 1.

3.4 ScoreFunction: ranking query results
ScoreFunction, given in Formula 2, is a ranking formula that uses the query result parameters to enforce a collection ranking conforming to our desired ranking properties. It calculates a score s for a collection for a given query.

4. EXPERIMENT SETUP
This section describes the corpus, query set and query logs that were used in the evaluation process, and continues to describe how the collection selection effectiveness of Sophos was measured.

4.1 Data collections

4.1.1 WT10g corpus splits
The Web Track 10GB corpus (WT10g) was developed for the Text REtrieval Conference(1) and consists of 10GB of web pages (1,692,096 documents on 11,680 servers). Compared to regular TREC corpora, WT10g should be more suited for distributed information retrieval experiments, due to the existence of hyperlinks, differences in topics, variation in quality and the presence of duplicates [3, 8].

(1) http://trec.nist.gov/

Four splits of the WT10g document corpus were made to look at the effect of document corpora on collection selection. Every split is a set of collections; every collection is a set of documents. The numbers 100 and 11512 indicate the number of collections in the corpus split.

IP Split: Documents are put in collections based on the IP address of the site where a document was residing. This results in 11,512 collections.

IP Merge 100: A cluster is created by grouping up to 116 collections, which results in 100 collections. Grouping is done in order of IP address. This split simulates the scenario of indexing search engines that each index many servers.

Random 100 and Random 11512: Two random splits, with 100 collections and with 11,512 collections, were created. Documents were randomly assigned to a collection. The number of 11,512 collections was chosen to be able to compare a random split with the IP Split.

The number of documents in random splits is relatively constant, but varies in IP based collections from fewer than 10 up to more than 1,000 documents. This is typical for the sizes of sites on the Internet; the number of documents per server follows a Zipf distribution on the Internet [1]. The IP based splits show signs of conformance to a Zipf distribution [4]. This means that the IP based splits can be compared to the Internet in terms of the distribution of the number of documents and the sizes of the collections.

The merit of a collection is the number of relevant documents in a collection for a particular query. The IP based corpus splits have a larger deviation in merit among the collections. This contrasts with random splits, which by approximation have equal merit for each collection [4].

4.1.2 WT10g retrieval tasks
Fifty WT10g informational 'ad-hoc' queries were used for evaluation (query numbers 501–550). The queries have a query number, title, description and a narrative description of the result that is considered relevant. The title is a small set of words which was used as the query text. The narrative descriptions were used by humans to assign relevance judgments to documents. The relevance judgments can be used to count the number of relevant documents in the collections, which in turn can be used to measure collection selection effectiveness.

There are three types of relevance judgments: not relevant (0), relevant (1) and authoritative (2). There can be multiple authoritative documents in the document corpus for a query, but for some queries there are no authoritative documents. All authoritative judgments are converted to 1, so documents are either relevant or not relevant. This allows for evaluation of the collected merit.

4.1.3 Query logs
AOL published a query log with 21,011,340 queries [19]. The log has been anonymized and consists of several data fields: the actual query issued, the query submission date and time, and an anonymous user ID number. The release of the anonymized data set was controversial at the time, because it was proven possible to link an ID to a real person. To respect the anonymity of the users, we used a stripped version of the query log that only contains the actual queries issued in random order (and none of the other metadata).

We also used two query logs that were published by Excite(2) and that were stripped in the same way. One query log from 1997 contains 1,025,907 queries and another query log from 1999 contains 1,777,251 queries.

(2) http://www.excite.com/

Finally, a fourth query log with 3,512,320 unique queries was created by removing all queries from the AOL query log that were issued only once. This query log will be referred to as AOL2. The other logs are called AOL, Excite97 and Excite99.

4.2 Method
To evaluate the performance of our collection selection system, we adopted the precision and recall metrics for collection selection as defined by Gravano et al. [13]. We start by obtaining the retrieval system ranking (S) for a query, which contains up to 1,000 collections. We also create the best ranking (B), which is the best possible ranking for a particular query; collections are ranked by their amount of merit with respect to the query.

Given query q and collection c_i, the merit within a collection can be expressed using merit(q, c_i). The merit of the i-th ranked collection in rankings S and B is given by S_i = merit(q, c_s_i) and B_i = merit(q, c_b_i).

The obtained recall after selecting n collections can be calculated by dividing the merit selected by the merit of the best possible retrieval system:

  R_n = (sum_{i=1}^{n} S_i) / (sum_{i=1}^{n} B_i)                    (3)

Precision P_n is the fraction of the top n collections that have non-zero merit:

  P_n = |{s_c ∈ Top_n(S) | merit(q, s_c) > 0}| / |Top_n(S)|          (4)

The precision and recall obtained by Sophos are compared to the collection selection results of a baseline of Language Modeling with Dirichlet Smoothing (lmds) as implemented by indri. The baseline system has one parameter µ, which is set to 2500. Krovetz word stemming is applied to the collections.

Section 3 introduced Sophos and its parameters. There are three parameters for peers: the maximum key length h_max, the maximum key frequency before calling a key frequent (tf_max) and the window size ws. Based on numbers used by Luu et al. [15], we use the following settings: tf_max = {250, 500, 750} and ws = {6, 12, 18}. The average query length on the Internet is 2.3 [14, 30]. We use this observation to set h_max to 3. Setting it smaller would require many intersections of term sets (with associated collections) to answer queries. Setting it larger would result in many term sets that are rarely queried for. For the broker, the collection maximum cm is tested for the values 5, 10, 20 and 50.

To evaluate Sophos, the collections are processed by removing stop words – drastically reducing the number of invaluable keys and speeding up term set generation – and applying Porter word stemming.

Finally, we will look at the precision and recall with query-driven indexing using four different query logs, pruning from the peer indices the keys that do not occur in a query log. At the same time, we will look at the number of term sets (keys) in the broker index to get an idea of its size.
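Formulas 3 and 4 translate directly into code. The following is a small illustrative sketch with hypothetical helper names of our own; system_merits holds the values S_i for the system ranking S and best_merits holds B_i for the best ranking B:

```python
def recall_at_n(system_merits, best_merits, n):
    # R_n (Formula 3): merit gathered by the top-n selected
    # collections, relative to the best possible top-n selection.
    return sum(system_merits[:n]) / sum(best_merits[:n])

def precision_at_n(system_merits, n):
    # P_n (Formula 4): fraction of the top-n selected collections
    # that have non-zero merit.
    top = system_merits[:n]
    return sum(1 for merit in top if merit > 0) / len(top)
```

In practice, best_merits is simply the list of all collection merits for the query, sorted in descending order.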
5. RESULTS

5.1 Index size
Figure 3 shows the number of collection identifier counters within the broker index for different indexing settings on the IP Split. Spread over single, double and triple term set collection identifier counters, the number of counters is a good indication of the actual broker index size. The figure shows that the number of counters decreases when the term set frequency maximum is increased.

[Figure 3: Number of collection ID counters with IP Split]

[Figure 5: Number of collection identifier counters stored per QDI strategy for IP Split]

5.2 Collection selection performance
Table 2 shows the average precision and recall over 50 queries, calculated with Formula 3 and Formula 4. The numbers are calculated for the four different corpus splits with which the baseline (lmds) and Sophos were tested. Sophos was used with the following settings: query-driven indexing with the AOL query log, tf_max = 250, cm = 20 and ws = 6. Due to memory constraints, we were unable to run Sophos with a window size larger than 6. The table shows that the baseline outperforms Sophos on the Random 11512 corpus split, but Sophos outperforms the baseline on the IP Split.

This is displayed in more detail in Figure 6, which shows the average recall of Sophos and the baseline after selecting n collections.
Sophos was tested using different query logs, tf_max = 250 and cm = 50. Interestingly, pruning with the smallest Excite query logs results in the best recall figures, possibly because the queries were logged around the same time as when the corpus was crawled.

A more substantial reduction of collection identifier counters – of roughly 70% – can be achieved by using query-driven indexing, as shown in Figure 4. The figure shows the number of collection identifier counters after indexing the Random 11512 corpus split with and without query-driven indexing. Figure 5 depicts the index size reduction obtainable by using different query logs. The Excite query logs contain significantly fewer query term sets than the AOL query log, so more keys are pruned from the peer indices, resulting in a smaller broker index: using the Excite query logs instead of the standard AOL query log can result in roughly 75% fewer collection identifier counters.

[Figure 4: Number of collection identifier counters stored per QDI strategy for Random 11512 corpus split]

[Figure 6: Recall of different collection selection methods on IP Split]

6. CONCLUSIONS
We introduced the collection selection system Sophos, which uses highly discriminative keys in peer indices to construct a broker index. The broker index contains keys that are good discriminators to select collections (or peers).
Corpus split    Collection selection method   P@1   P@10  P@20  P@50  R@1   R@10  R@20  R@50
Random 100      lmds                          0.290 0.251 0.249 0.217 0.314 0.435 0.526 0.683
                Sophos QDI AOL tf250 cm20     0.330 0.254 0.237 0.208 0.379 0.436 0.493 0.644
Random 11512    lmds                          0.140 0.096 0.084 0.069 0.233 0.196 0.186 0.198
                Sophos QDI AOL tf250 cm20     0.040 0.036 0.037 0.026 0.067 0.073 0.082 0.073
IP Merge 100    lmds                          0.280 0.202 0.188 0.154 0.489 0.489 0.567 0.755
                Sophos QDI AOL tf250 cm20     0.300 0.254 0.214 0.160 0.211 0.485 0.626 0.825
IP Split        lmds                          0.170 0.110 0.083 0.056 0.070 0.149 0.183 0.289
                Sophos QDI AOL tf250 cm20     0.170 0.147 0.121 0.091 0.280 0.466 0.466 0.548

Table 2: Average precision and recall over 50 queries after selecting n collections, high scores shown in bold

To limit the number of keys transferred to the broker, and to reduce the broker index size, we employed query-driven indexing to only store the keys that are queried for by users. Similar studies were performed for document retrieval [15], but to the best of our knowledge, we are the first to apply this approach to collection selection.

Precision and recall were measured using 50 queries on the WT10g TREC test corpus and compared to a state-of-the-art baseline that uses language modeling with Dirichlet smoothing (lmds). The results showed that our system outperformed the baseline with the IP Split as test corpus. This is promising, because the IP based splits most closely resemble the information structure on the Internet. lmds was better capable of selecting information in random based splits, because it stores all available information about the collections. In random based splits, relevant documents (and their corresponding terms) are mixed over many collections, making it hard for our approach to select highly discriminative keys that can discriminate collections with relevant documents.

Query-driven indexing is required to keep the broker index size manageable; a 70% index size reduction can be obtained by pruning keys using the AOL query log, and another 75% reduction is possible by using a smaller query log. Our results on the IP Split showed that pruning using the smaller Excite query logs resulted in higher precision and recall than with the AOL query logs. The use of any query log resulted in higher precision and recall than the baseline results. This motivates further research using more advanced query-driven indexing strategies, such as described by Skobeltsyn [32], to further reduce the index size while improving the performance. It would also be interesting to make tf_max dependent on the collection size.

Acknowledgements
We are grateful to the Yahoo Research Faculty Grant programme and to the Netherlands Organisation for Scientific Research (NWO, Grant 639.022.809) for supporting this work. We would like to thank Berkant Barla Cambazoglu and the anonymous reviewers for their valuable comments that improved this paper.

7. REFERENCES
[1] L. A. Adamic and B. A. Huberman. Zipf's law and the internet. Glottometrics, 3:143–150, 2002.
[2] J. Arguello, J. L. Elsas, J. Callan, and J. G. Carbonell. Document representation and query expansion models for blog recommendation. In Proc. of the 2nd Intl. Conf. on Weblogs and Social Media (ICWSM), 2008.
[3] P. Bailey, N. Craswell, and D. Hawking. Engineering a multi-purpose test collection for web retrieval experiments. Inf. Process. Manage., 39(6):853–871, 2003.
[4] S. Bockting. Collection selection for distributed web search. Master's thesis, University of Twente, Feb. 2009.
[5] J. Callan. Distributed information retrieval. Advances in Information Retrieval, pages 127–150, 2000.
[6] J. Callan, W. B. Croft, and S. M. Harding. The inquery retrieval system. In Proc. of the 3rd International Conference on Database and Expert Systems Applications, pages 78–83, Valencia, Spain, 1992. Springer-Verlag.
[7] J. Callan, Z. Lu, and W. B. Croft. Searching distributed collections with inference networks. In SIGIR '95: Proc. of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, pages 21–28, New York, NY, USA, 1995. ACM.
[8] N. Craswell, P. Bailey, and D. Hawking. Is it fair to evaluate web systems using trec ad hoc methods? In Workshop on Web Evaluation, SIGIR '99, 1999.
[9] N. Craswell, P. Bailey, and D. Hawking. Server selection on the world wide web. In DL '00: Proc. of the 5th ACM conference on Digital libraries, pages 37–46, New York, NY, USA, 2000. ACM.
[10] D. D'Souza, J. Thom, and J. Zobel. A comparison of techniques for selecting text collections. In ADC '00: Proc. of the Australasian Database Conference, page 28, Washington, DC, USA, 2000. IEEE Computer Society.
[11] D. D'Souza, J. A. Thom, and J. Zobel. Collection selection for managed distributed document databases. Information Processing & Management, 40(3):527–546, May 2004.
[12] D. D'Souza, J. Zobel, and J. A. Thom. Is cori effective for collection selection? An exploration of parameters, queries, and data. In Proc. of the 9th Australasian Document Computing Symposium, pages 41–46, Dec. 2004.
[13] L. Gravano and H. Garcia-Molina. Generalizing GlOSS to vector-space databases and broker hierarchies. In International Conference on Very Large Databases, VLDB, pages 78–89, 1995.
[14] T. Lau and E. Horvitz. Patterns of search: analyzing and modeling web query refinement. In UM '99: Proc. of the 7th international conference on User modeling, pages 119–128, Secaucus, NJ, USA, 1999. Springer-Verlag New York, Inc.
[15] T. Luu, F. Klemm, M. Rajman, and K. Aberer. Using highly discriminative keys for indexing in a
M. Moricz. Analysis of a very large web search engine query log. SIGIR Forum, 33(1):6–12, 1999.
[31] G. Skobeltsyn.
Query-driven indexing in large-scale peer-to-peer full-text retrieval system. Technical distributed systems. PhD thesis, EPFL, Lausanne, report, TR: 2005041, EPFL Lausanne, 2005. 2009. [16] P. Lyman and H. R. Varian. How much information, [32] G. Skobeltsyn, T. Luu, I. Podnar Zarko, M. Rajman, 2003. http://www.sims.berkeley.edu/ and K. Aberer. Query-driven indexing for scalable how-much-info-2003, retrieved on April 23, 2008. peer-to-peer text retrieval. Future Generation [17] D. Metzler. Indri retrieval model overview, July 2005. Computer Systems, 25(1):89–99, June 2009. http://ciir.cs.umass.edu/~metzler/ [33] T. Strohman, D. Metzler, H. Turtle, and W. B. Croft. indriretmodel.html, retrieved on January 20, 2008. Indri: A language model-based search engine for [18] D. Metzler and W. B. Croft. Combining the language complex queries. In Proc. of the International model and inference network approaches to retrieval. Conference on Intelligence Analysis, 2004. Information Processing and Management, [34] C. Tang, Z. Xu, and S. Dwarkadas. Peer-to-peer 40(5):735–750, 2004. information retrieval using self-organizing semantic [19] G. Pass, A. Chowdhury, and C. Torgeson. A picture of overlay networks. Proc. of the 2003 conference on search. In InfoScale ’06: Proc. of the 1st international Applications, technologies, architectures, and protocols conference on Scalable information systems, page 1, for computer communications, pages 175–186, 2003. New York, NY, USA, 2006. ACM. [35] Y. Wang and D. J. DeWitt. Computing pagerank in a [20] I. Podnar Zarko, M. Rajman, T. Luu, F. Klemm, and distributed internet search system. In Proc. of the K. Aberer. Scalable peer-to-peer web retrieval with International Conference on Very Large Databases highly discriminative keys. IEEE 23rd International (VLDB), Aug. 2004. Conference on Data Engineering (ICDE), 2007. [36] I. H. Witten, A. Moffat, and T. C. Bell. Managing [21] A. L. Powell and J. C. French. 
Comparing the Gigabytes: Compressing and Indexing Documents and performance of collection selection algorithms. ACM Images. Morgan Kaufmann, 1999. Trans. Inf. Syst., 21(4):412–456, 2003. [37] J. Xu and W. B. Croft. Cluster-based language models [22] D. Puppin, F. Silvestri, and D. Laforenza. for distributed retrieval. Proc. of the 22nd annual Query-driven document partitioning and collection international ACM SIGIR conference on Research and selection. In InfoScale ’06: Proc. of the 1st development in information retrieval, pages 254–261, international conference on Scalable information 1999. systems, page 34, New York, NY, USA, 2006. ACM. [38] B. Yuwono and D. L. Lee. Server ranking for [23] S. E. Robertson, S. Walker, S. Jones, M. M. distributed text retrieval systems on the internet. In Hancock-beaulieu, and M. Gatford. Okapi at trec-3. In Proc. of the 5th International Conference on Database TREC-3 Proc., pages 109–126, 1995. Systems for Advanced Applications (DASFAA), pages [24] G. Salton and M. J. McGill. Introduction to Modern 41–50. World Scientific Press, 1997. Information Retrieval. McGraw-Hill, Inc., New York, [39] C. Zhai and J. Lafferty. A study of smoothing NY, USA, 1986. methods for language models applied to ad hoc [25] N. Sato, M. Udagawa, M. Uehara, Y. Sakai, and information retrieval. In SIGIR ’01: Proc. of the 24th H. Mori. Query based site selection for distributed annual international ACM SIGIR conference on search engines. Distributed Computing Systems Research and development in information retrieval, Workshops, 2003. Proc.. 23rd International pages 334–342, New York, NY, USA, 2001. ACM. Conference on, pages 556–561, 2003. [26] J. Seo and W. B. Croft. Umass at trec 2007 blog distillation task. In Proc. of the 2008 Text REtrieval Conference. NIST, 2007. [27] M. Shokouhi, J. Zobel, S. Tahaghoghi, and F. Scholer. Using query logs to establish vocabularies in distributed information retrieval. 
Information Processing and Management, 43(1):169–180, 2007. [28] L. Si and J. Callan. Relevant document distribution estimation method for resource selection. In SIGIR ’03: Proc. of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 298–305, New York, NY, USA, 2003. ACM. [29] L. Si, R. Jin, J. Callan, and P. Ogilvie. A language modeling framework for resource selection and results merging. Proc. of the 11th international conference on Information and knowledge management, pages 391–397, 2002. [30] C. Silverstein, H. Marais, M. Henzinger, and