<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Collection Selection with Highly Discriminative Keys</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sander Bockting</string-name>
          <email>sander.bockting@avanade.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Djoerd Hiemstra</string-name>
          <email>d.hiemstra@utwente.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Avanade Netherlands B.V.</institution>
          ,
          <addr-line>Versterkerstraat 6, 1322 AP, Almere</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Twente</institution>
          ,
          <addr-line>P.O. Box 217, 7500 AE, Enschede</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The centralized web search paradigm introduces several problems, such as large data traffic requirements for crawling, index freshness problems and the problem of indexing everything. In this study, we look at collection selection using highly discriminative keys and query-driven indexing as part of a distributed web search system. The approach is evaluated on different splits of the TREC WT10g corpus. Experimental results show that the approach outperforms a Dirichlet smoothing language modeling approach for collection selection, if we assume that web servers index their local content.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The web search approach of major search engines, like
Google, Yahoo! and Bing, amounts to crawling, indexing
and searching. We call this approach centralized search,
because all operations are controlled by the search engines
themselves, albeit from a relatively limited number of
locations, on large clusters of thousands of machines. The
centralized web search paradigm poses several problems.</p>
      <p>
        The amount of web data is estimated to grow
exponentially [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ]. The changing and growing data require frequent
visits by crawlers, just to keep the index fresh. Crawling
should be done often, but it generates a huge amount of
traffic, making it impossible to crawl all pages frequently.
With an estimated update interval of four weeks, updates are
performed relatively slowly [
        <xref ref-type="bibr" rid="ref25 ref35">25, 35</xref>
        ]. Also, it is impossible to
index everything, as the visible web accessible to search engines
is only a fraction of the total number of web pages [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        Callan [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] identified the distributed information retrieval
problem set, consisting of resource description, resource
selection and result merging. We believe that distributing the
search efforts may be an approach to solve the problems
described above. This research focuses on resource description
and resource selection [
        <xref ref-type="bibr" rid="ref10 ref22">10, 22</xref>
        ]. Resource description, i.e.,
indexing of the peers, is distributed; resource selection, or
collection selection, is centralized in our research.
      </p>
      <p>Copyright © 2009 for the individual papers by the papers' authors.
Copying permitted for private and academic purposes. Re-publication of material
from this volume requires permission by the copyright owners. This volume
is published by its editors.</p>
      <p>LSDS-IR Workshop. July 2009. Boston, USA.</p>
      <sec id="sec-1-5">
        <title>Distributed search scenario</title>
        <p>[Figure 1: a client sends a query to a broker; the broker routes the query to peers 1 to n and returns the merged result set to the client.]</p>
        <sec id="sec-1-5-5">
          <title>Entities</title>
          <p>Figure 1 shows three types of entities: peers, brokers and
clients. We assume that peers are collaborative. Every peer
runs a search engine enabled web server. The search engine
indexes the local website(s), but it may also index other
websites. In this scenario, there can be many millions of
peers. When a user has an information need, he can pose
a query at the client. The client sends the query to the
broker. In response, the broker tries to identify the most
promising peers to answer the query. This has to be a small
number of peers, e.g. five to ten, so that not a lot of traffic
is generated. The query is routed to those peers and the
results are returned to the broker. The broker merges the
results and sends the merged list to the client.</p>
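          <p>The client/broker/peer flow described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the peer-selection policy and the result format are hypothetical placeholders.</p>

```python
# Sketch of the client/broker/peer interaction described above.
# select_peers and the peers' local search functions are stand-ins.

def broker_search(query, peers, select_peers, k=5):
    """Route a query to the k most promising peers and merge results.

    peers: dict peer_id -> callable(query) returning [(doc_id, score), ...]
    select_peers: callable(query, peer_ids, k) -> list of chosen peer ids
    """
    chosen = select_peers(query, list(peers), k)      # broker picks few peers
    results = []
    for pid in chosen:                                # query routed to peers
        results.extend(peers[pid](query))             # peers return local hits
    # merge: sort all returned documents by score, best first
    return sorted(results, key=lambda r: r[1], reverse=True)

# toy peers, each indexing its own "site"
peers = {
    "peer1": lambda q: [("p1/doc1", 0.9)] if "web" in q else [],
    "peer2": lambda q: [("p2/doc7", 0.4)] if "web" in q else [],
}
merged = broker_search("web search", peers, lambda q, ids, k: ids[:k])
print(merged)  # [('p1/doc1', 0.9), ('p2/doc7', 0.4)]
```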
          <p>Peers and brokers cooperate to enable brokers to identify
the most promising peers. To this end, every peer sends a small
part of its index to the broker. This part cannot be too
small, to still allow for a proper judgment of a peer's
ability to satisfactorily answer queries. The index part cannot be
too large either, as the index data traffic must scale.</p>
          <p>
            Techniques have been proposed to manage the index size.
Podnar et al. [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ] used the concept of highly discriminative
keys (HDKs) for document retrieval in distributed
information retrieval. An HDK is a set of terms that is highly
discriminative, i.e., that matches only a few documents in the
collection. Because the terms are pre-coordinated (they are
combined at index time, not at search time) and because
only a few documents match all terms in a pre-coordinated
set, the HDK approach can very efficiently retrieve the
top documents for a query. Although searching can be
efficient, the HDK indexing process, described in more detail in
Section 3.1, has the negative side-effect that a huge number
of highly discriminative keys is generated. To reduce the
number of keys, Skobeltsyn [
            <xref ref-type="bibr" rid="ref32">32</xref>
            ] proposed a query-driven
indexing strategy that uses caching techniques to adapt to
the changing querying behavior of users. The combination
of HDK with query-driven indexing allows for completely
distributed document retrieval that in theory grows to web
scale proportions [
            <xref ref-type="bibr" rid="ref31">31</xref>
            ].
          </p>
          <p>
            This paper contributes to the field of distributed
information retrieval by applying HDKs and query-driven
indexing to select collections, instead of documents. Such an
approach would in theory scale the distributed search
scenario described above to millions of peers: the broker lists
for every HDK a small number of peers to send the query to,
and the peers retrieve the documents, possibly many.
Unlike a traditional inverted file index, which typically consists
of huge posting lists and a comparatively tiny dictionary
[
            <xref ref-type="bibr" rid="ref36">36</xref>
            ], our HDK index consists of a huge dictionary and
comparatively tiny posting lists. The system is fitted into the
previously sketched scenario, which allows for control at the
broker. This control can for example be used to prevent
misuse or to allow for domain-specific search.
          </p>
          <p>This paper is organized as follows: the next section
discusses earlier collection selection methods, Section 3
introduces our collection selection system and Section 4 describes
the evaluation. The paper concludes with results and
conclusions.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. EARLIER WORK ON COLLECTION SELECTION</title>
      <p>
        Collection selection systems have been developed to
select collections containing documents that are relevant to
a user's query. The generalized Glossary-Of-Servers Server
(gGlOSS) is such a system [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. It uses a vector space model
representing index items (document collections) and user
queries as weight vectors in a high dimensional Euclidean
space to calculate the distance (or similarity) between
document collections and queries [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
      </p>
      <p>
        Another well-known approach is CVV, which exploits the
variation in cue validity to select collections [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ]. The cue
validity CV_{i,j} of query term t_j for collection c_i measures
the extent to which t_j discriminates c_i from the other
collections, by comparing the ratio of documents in c_i containing
t_j to the ratios of documents in other collections containing
t_j. The larger the variation in cue validities for collections
with respect to a term, the better the term is for selecting
collections.
      </p>
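      <p>As an illustration, the cue validity comparison described above can be sketched as follows. This follows one common formulation of CVV and may differ in detail from the cited definition; the toy data is invented.</p>

```python
# Hedged sketch of cue validity: compare the ratio of documents in
# collection c_i containing a term with the ratio in the other
# collections, then measure the variation of that value across
# collections (CVV).

def cue_validity(df, sizes, i, term):
    """df[c][term] = #docs in collection c containing term; sizes[c] = #docs."""
    ratio_i = df[i].get(term, 0) / sizes[i]
    other_df = sum(df[k].get(term, 0) for k in df if k != i)
    other_docs = sum(sizes[k] for k in sizes if k != i)
    ratio_rest = other_df / other_docs
    if ratio_i + ratio_rest == 0:
        return 0.0
    return ratio_i / (ratio_i + ratio_rest)

def cue_validity_variance(df, sizes, term):
    cvs = [cue_validity(df, sizes, i, term) for i in df]
    mean = sum(cvs) / len(cvs)
    return sum((cv - mean) ** 2 for cv in cvs) / len(cvs)

df = {"c1": {"apple": 8}, "c2": {"apple": 1}, "c3": {}}
sizes = {"c1": 10, "c2": 10, "c3": 10}
# "apple" discriminates c1 well, so its cue validity varies across collections
print(cue_validity_variance(df, sizes, "apple") > 0)  # True
```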
      <p>This section will describe two collection selection methods
in more detail: inference networks and language modeling.
</p>
      <p><bold>2.1 Inference networks</bold></p>
      <p>
        cori [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is a collection ranking algorithm for the inquery
retrieval system [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and uses an inference network to rank
collections. A simple document inference network has leafs d
representing the document collections. The terms that
occur in those collections are represented by representation
nodes r. Flowing along the arcs between the leaves and
nodes are probabilities based on document collection
statistics. Opposed to tf.idf, the probabilities are calculated
using document frequencies df and inverse collection
frequencies icf (df.icf). The inverse collection frequency is the
number of collections that contain the term. An inference
network with these properties is called a collection retrieval
inference network (cori net).
      </p>
      <p>Given query q with terms r_k, the network is used to obtain
a ranked list of collections by calculating the belief p(r_k|c_i)
in collection c_i due to the observation of r_k. The
collection ranking score of c_i for query q is the sum of all beliefs
p(r_k|c_i), where r_k ∈ q. The belief is calculated using
Formula 1. In this formula, b and l are constants, cw_i is the
number of words in c_i, cw_avg is the mean number of words
in the collections and |C| is the number of collections. df
and cf respectively are the numbers of documents and
collections that contain r_k. Finally, d_t and d_b respectively are
the minimum term frequency component and minimum
belief component when r_k occurs in c_i.</p>
      <p>
        To improve retrieval, the component L is used to scale
the document frequency in the calculation of T [
        <xref ref-type="bibr" rid="ref23 ref6">6, 23</xref>
        ]. L
is made sensitive to the number of documents that contain
term r_k (using b) and is made large using l. L should be
large, because df is generally large. Proper tuning of these
two parameters is required for every data set, but is deemed
impossible, as the correct settings are highly sensitive to data
set variations [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]; the value of l has to be varied widely even
when the data set is kept constant while the query
type varies.
      </p>
      <p>
        Further research showed that it is not justified to use
standard cori as a collection selection benchmark. D'Souza et
al. showed that HighSim outperformed cori in 15 of 21
cases [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Si and Callan [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] found limitations with
different collection sizes. Large collections are not often ranked
high by cori, although they often are the most promising
collections.
      </p>
      <p>
L = l · ((1 - b) + b · cw_i / cw_avg)
T = d_t + (1 - d_t) · log(df) / log(df + L)
I = log((|C| + 0.5) / cf) / log(|C| + 1.0)
p(r_k|c_i) = d_b + (1 - d_b) · T · I
(1)
      </p>
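      <p>Formula 1 can be transcribed directly. The constant values below (b, l, d_t, d_b) are illustrative defaults, not values prescribed by the paper.</p>

```python
import math

# Direct transcription of the cori belief p(r_k|c_i) from Formula 1.
# The default constants are common inquery-style choices, assumed here.

def cori_belief(df, cf, cw_i, cw_avg, num_collections,
                b=0.75, l=150, d_t=0.4, d_b=0.4):
    """Belief p(r_k|c_i) that collection c_i satisfies query term r_k."""
    L = l * ((1 - b) + b * cw_i / cw_avg)               # length scaling
    T = d_t + (1 - d_t) * math.log(df) / math.log(df + L)
    I = (math.log((num_collections + 0.5) / cf)
         / math.log(num_collections + 1.0))             # icf component
    return d_b + (1 - d_b) * T * I

# term occurs in 50 docs of c_i; 3 of 100 collections contain it
belief = cori_belief(df=50, cf=3, cw_i=10_000, cw_avg=12_000,
                     num_collections=100)
print(0.0 < belief < 1.0)  # True
```

Note that a rarer term (smaller cf) yields a larger I component and thus a higher belief, mirroring the df.icf intuition in the text.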
    </sec>
    <sec id="sec-4">
      <title>2.2 Language modeling</title>
      <p>
        A language model is a statistical model to assign a
probability to a sequence of words (e.g. a query) being generated
by a particular document or document collection. Language
models can be used for collection selection in distributed
search, by creating a language model for each collection [
        <xref ref-type="bibr" rid="ref29 ref37">29,
37</xref>
        ]. They have also been used for collection selection for
other tasks, for instance for blog search [
        <xref ref-type="bibr" rid="ref2 ref26">2, 26</xref>
        ].
      </p>
      <p>
        indri is an improved version of the inquery retrieval
system [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ], as it is capable of dealing with larger collections,
allows for more complex queries due to new query constructs
and uses formal probabilistic document representations that
use language models. The combined model of inference
networks and language modeling is capable of achieving more
favorable retrieval results than inquery [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Due to these
differences, term representation beliefs are calculated in
another way than with cori (as described in Section 2.1):
P(r|D) = (tf_{r,D} + α_r) / (|D| + α_r + β_r)
      </p>
      <p>The belief is calculated for representation concept r of
document node D (in document collection C). Examples of
representation concepts are a term in the body or title of a
document. D and r are nodes in the belief network. The term
frequency of representation node r in D is denoted by tf_{r,D}.</p>
      <p>
        α_r and β_r are smoothing parameters. Smoothing is used
to make the maximum likelihood estimations of a language
model more accurate, which would otherwise suffer from
data sparseness, because not every term occurs in every
document [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ]. Smoothing ensures that terms that are
unseen in the document are not assigned zero probability. The
default smoothing model for Indri is Dirichlet smoothing,
which affects the smoothing parameter choices [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. In
setting the smoothing parameters, it was assumed that the
likelihood of observing representation concept r is equal to the
probability of observing it in collection C, given by P(r|C).
This means that α_r / (α_r + β_r) = P(r|C). The following
parameter values were chosen:
α_r = μ · P(r|C)
β_r = μ · (1 - P(r|C))
      </p>
      <p>This results in the following term representation belief, where
the free parameter μ has a default value of 2500:
P(r|D) = (tf_{r,D} + μ · P(r|C)) / (|D| + μ)</p>
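      <p>A minimal sketch of the Dirichlet-smoothed belief, applied per collection as in the lmds baseline used later in this paper (each collection scored like one large document). The toy statistics below are invented for illustration.</p>

```python
import math

# Dirichlet-smoothed term belief, applied to whole collections:
# P(r|D) = (tf_{r,D} + mu * P(r|C)) / (|D| + mu), with mu = 2500.

MU = 2500  # default free parameter

def dirichlet_belief(tf_r_D, doc_len, p_r_C, mu=MU):
    return (tf_r_D + mu * p_r_C) / (doc_len + mu)

def collection_score(query_terms, coll_tf, coll_len, corpus_prob):
    """Sum of log beliefs of the query terms for one collection."""
    return sum(math.log(dirichlet_belief(coll_tf.get(t, 0),
                                         coll_len,
                                         corpus_prob[t]))
               for t in query_terms)

corpus_prob = {"web": 0.01, "search": 0.005}   # P(r|C) over the whole corpus
c1 = ({"web": 40, "search": 20}, 5000)         # (term counts, collection length)
c2 = ({"web": 1}, 5000)
s1 = collection_score(["web", "search"], *c1, corpus_prob)
s2 = collection_score(["web", "search"], *c2, corpus_prob)
print(s1 > s2)  # True: c1 matches the query better
```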
    </sec>
    <sec id="sec-5">
      <title>2.3 Discussion</title>
      <p>
        The language modeling approach of indri has a better
formal probabilistic document representation than cori and
indri is an improved version of inquery (which is the
foundation of cori). We will use the language model on
document collections as implemented by indri as our baseline
collection selection system. Si et al. [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] showed that a
language modeling approach for collection selection
outperforms cori. Furthermore, cori outperforms algorithms like
cvv and gGlOSS in several studies [
        <xref ref-type="bibr" rid="ref21 ref9">9, 21</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>3. SOPHOS</title>
      <p>Sophos is a collection selection prototype that uses HDKs
to index collections. The keys are used to assign scores to
the collections. Using a scoring function, collection scores
can be calculated to rank collections for a particular query.
A general overview is depicted in Figure 2. This section
describes how the broker index is created, explains index size
reduction using a query-driven indexing approach,
identifies query result parameters, and concludes with a collection
ranking formula (ScoreFunction in Figure 2).</p>
    </sec>
    <sec id="sec-7">
      <title>3.1 Highly discriminative keys</title>
      <p>Building the index is done in two phases. First, every
peer builds an index of its document collection and sends
that index to the broker. Second, the broker constructs a
broker index from all peer indices.</p>
      <p><bold>3.1.1 Peer indexing</bold></p>
      <sec id="sec-7-8">
        <title>Overview of Sophos (Figure 2)</title>
        <p>[Figure 2: document collections; generated index keys; prune keys with query log (query-driven indexing); stored index keys; broker index; query results; ScoreFunction; ranked list of collections]</p>
        <p>Peer indexing starts with the generation of term set
statistics. First, single term statistics are generated by counting
term frequencies of every word in the collection, without
looking at document boundaries in the collection. A term
set is called frequent when it occurs more often than the term
set frequency maximum tf_max. Every infrequent single term
is added to the peer index together with its frequency count.
The frequent keys are stored for further analysis.</p>
        <p>The next step consists of obtaining double term set
statistics. For every frequent term in the collection, frequency
statistics are created for term combinations that consist of
the frequent term and a term that appears after that term
within a window size ws. The result is a list of double terms
with corresponding frequency counts. Once again, the term
set frequency maximum defines which term sets are frequent
and will be used for further analysis, and which term sets
are infrequent and will be stored in the peer index. This
procedure can be repeated as long as the generated term
sets do not contain more than ws terms and the
predefined maximum number of terms in a term set, h_max, has not
been reached.</p>
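        <p>The peer indexing loop above can be sketched as follows, simplified to single and double term sets (the same frequent/infrequent split repeats for longer sets up to h_max terms). The tokenization and data structures are illustrative assumptions.</p>

```python
from collections import Counter

# Sketch of peer indexing: infrequent term sets go to the peer index,
# frequent ones are extended with a term that appears after them within
# a window of ws positions. tf_max is the term set frequency maximum.

def peer_index(tokens, tf_max, ws):
    index = {}

    # level 1: single term counts, ignoring document boundaries
    singles = Counter(tokens)
    frequent = {t for t, c in singles.items() if c > tf_max}
    index.update({(t,): c for t, c in singles.items() if c <= tf_max})

    # level 2: frequent term + a term appearing after it within the window
    doubles = Counter()
    for i, t in enumerate(tokens):
        if t in frequent:
            for u in tokens[i + 1 : i + ws]:
                doubles[(t, u)] += 1
    index.update({k: c for k, c in doubles.items() if c <= tf_max})
    return index

tokens = ["web"] * 3 + ["rare"] + ["web", "scale"]
idx = peer_index(tokens, tf_max=2, ws=3)
print(("rare",) in idx, ("web",) in idx)  # True False  ("web" is frequent)
```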
        <p>Summarizing, the peer index consists of tuples of term
sets and corresponding frequency counts.</p>
        <p><bold>3.1.2 Broker indexing</bold></p>
        <p>The infrequent term sets from the peer indices are sent
to the broker. The broker index contains term sets with
associated sets of collection identifier counters. A collection
identifier counter is a tuple of a collection identifier and a
term set frequency counter. A collection identifier is a short
representation of a collection where the term set occurs.</p>
        <p>When a term set is inserted into the broker index, it is
called a highly discriminative key (HDK). The broker index
will contain a maximum number of collections per HDK,
denoted by the collection maximum cm. As soon as the
maximum number of collections is reached for a particular HDK,
only the cm collections with the largest term set frequencies will
be stored in the broker index.</p>
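        <p>A minimal sketch of broker-side insertion with the collection maximum cm; the data structures are illustrative assumptions, not the paper's implementation.</p>

```python
# Each HDK keeps at most cm collection identifier counters, retaining
# the collections with the largest term set frequencies.

def broker_insert(broker_index, key, collection_id, freq, cm):
    postings = broker_index.setdefault(key, {})
    postings[collection_id] = freq
    if len(postings) > cm:
        # keep only the cm collections with the largest frequencies
        top = sorted(postings.items(), key=lambda kv: kv[1], reverse=True)[:cm]
        broker_index[key] = dict(top)

broker = {}
for cid, freq in [("c1", 5), ("c2", 9), ("c3", 7)]:
    broker_insert(broker, ("rare", "term"), cid, freq, cm=2)
print(sorted(broker[("rare", "term")]))  # ['c2', 'c3']
```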
      </sec>
    </sec>
    <sec id="sec-8">
      <title>3.2 Query-driven indexing</title>
      <p>A list of popular keys can be created by extracting all
unique queries from a query log. Every key that is infrequent
at a peer, and which is present in the unique query list, will
be sent to the broker; the other keys are filtered out to reduce
the broker index size and to reduce traffic.</p>
      <p>Table 1: Query result parameters.
c: sum of query term set occurrences within one collection (grouped by sets with h terms);
h: #terms in an HDK;
h_max: maximum #terms that an HDK can consist of;
q: #terms in the query;
n: #distinct query terms found in a collection;
tf_max: maximum frequency of a term set in a collection.</p>
      <p>
        This index pruning strategy was used before by Shokouhi
et al. [
        <xref ref-type="bibr" rid="ref27">27</xref>
            ]. It is not the most desirable strategy for
query-driven indexing, because it is unable to deal with unseen
query terms. However, it gives a good idea of the
possible index size reduction and the loss of retrieval
performance.
      </p>
    </sec>
    <sec id="sec-9">
      <title>3.3 Identifying query result parameters</title>
      <p>Once the broker index has been built, the system is ready
to be queried. The broker index contains HDKs with h
terms, where h varies from 1 to h_max. In Sophos, h_max is set
to 3. Every key has an associated posting list, which
contains tuples of collection identifiers and counters. A counter
indicates the number of occurrences of a certain key within
the corresponding collection. The counter value cannot
exceed the term set frequency maximum, tf_max, as the key would
otherwise have been locally frequent, and new HDKs would
have been generated from it as long as the maximum number of terms,
h_max, had not yet been reached.</p>
      <p>When a user poses a query with q terms, e.g. abcde with
q = 5, the query is first decomposed into query term
combinations of length h_max (i.e. abc, abd, . . . , bde, cde). This
results in C(q, h_max) combinations. Each combination is looked up
in the broker index. Note that this introduces additional
load on the broker, but these lookups do not require
network access to the peers. The results of each of those smaller
queries are merged by summing the number of occurrences
per collection. The sum of all counters, c, has a potential
maximum of C(q, h) · tf_max. It may happen that this procedure
results in few or no collections where an HDK occurs. The
procedure is then repeated for smaller term sets; first, term
sets of two terms will be looked up in the index. When even
this gives too few results, single terms will be looked up in
the index. In the case that one collection contains two different
combinations, e.g. both abc and bce occur in collection
X, the numbers of occurrences are summed (this is the c that was
just introduced). However, it is also interesting to note that
4 out of 5 query terms can be found in collection X. The
number of query terms that can be found using HDKs of a
particular length is indicated by n. The different parameters
are displayed in Table 1.</p>
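      <p>The decomposition and fallback procedure can be sketched as follows; the min_hits threshold is a hypothetical stand-in for the "too few results" condition.</p>

```python
from itertools import combinations

# Sketch of query decomposition and lookup: try combinations of h_max
# terms first, fall back to shorter ones when too few collections are
# found. broker_index maps term tuples to {collection_id: count}.

def select_collections(query_terms, broker_index, h_max=3, min_hits=1):
    for h in range(min(h_max, len(query_terms)), 0, -1):
        scores = {}
        for combo in combinations(query_terms, h):   # e.g. abc, abd, ...
            for cid, count in broker_index.get(combo, {}).items():
                scores[cid] = scores.get(cid, 0) + count  # sum counters c
        if len(scores) >= min_hits:                  # enough collections found
            return h, scores
    return 0, {}

broker_index = {("a", "b"): {"X": 4}, ("b", "c"): {"X": 2, "Y": 1}}
h, scores = select_collections(["a", "b", "c", "d", "e"], broker_index)
print(h, scores["X"])  # 2 6  (no 3-term HDK matched, fell back to pairs)
```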
    </sec>
    <sec id="sec-10">
      <title>3.4 ScoreFunction: ranking query results</title>
      <p>ScoreFunction, given in Formula 2, is a ranking formula
that uses the query result parameters to enforce a collection
ranking conforming to our desired ranking properties. It
calculates a score s for a collection for a given query.</p>
      <p>It consists of three components; one component per
desired ranking property. The properties, and corresponding
components, are listed below in order of importance.
1. Collections that contain longer HDKs should be ranked
higher. Component 1: h - 1.
2. A collection should be ranked higher if it contains more
distinct query terms. Component 2: (n - 1) / q.
3. More occurrences of query term sets within a collection
should result in a higher collection ranking.
Component 3: (c · γ^(q-n)) / ((h_max + 1 - h) · C(n, h) · tf_max).</p>
      <p>
        The general intuition behind this component is that
a collection is more important when it contains more
query terms. This is controlled by a damping factor γ.
The term set counts have less impact when they get
closer to tf_max, because longer keys would have been generated
for a term set in a collection if its frequency exceeded
tf_max. We refer to earlier work for a more detailed
explanation about ScoreFunction [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
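      <p>A sketch of ScoreFunction with the three components listed above; the exact normalisation of component 3 (the binomial C(n, h), the damping γ^(q-n)) and the constants chosen here are our reading of Formula 2, not a verbatim transcription.</p>

```python
from math import comb

# Sketch of ScoreFunction: component 1 dominates (longer HDKs),
# component 2 rewards more distinct query terms, component 3 adds a
# small, damped contribution from the summed counters c.

def score(h, n, q, c, h_max=3, tf_max=250, gamma=0.5):
    comp1 = h - 1                      # longer HDKs rank higher
    comp2 = (n - 1) / q                # more distinct query terms
    comp3 = (c * gamma ** (q - n)) / ((h_max + 1 - h) * comb(n, h) * tf_max)
    return comp1 + comp2 + comp3

# a collection matched via 2-term HDKs, 4 of 5 query terms, counter sum 30
s_long = score(h=2, n=4, q=5, c=30)
s_short = score(h=1, n=4, q=5, c=30)
print(s_long > s_short)  # True: longer HDKs dominate the ranking
```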
      <p>An important detail to mention about querying and
ranking is that collection X can be found after looking for HDKs
of length h. When the same collection X is found
after looking for HDKs of length h - 1, those results are
discarded, as the collection was already found using longer
HDKs. Counts c are only compared with other counts that
are retrieved after looking for HDKs of the same length. The
motivation for this behavior is that shorter HDKs tend to
have higher counts.</p>
      <p>Each of the three components has a share in the
collection score. The component share of a less desired property
never exceeds the smallest possible share of a more desired
property's component value.</p>
    </sec>
    <sec id="sec-11">
      <title>4. EXPERIMENT SETUP</title>
      <p>This section describes the corpus, query set and query logs
that were used in the evaluation process, and continues to
describe how the collection selection effectiveness of Sophos
was measured.</p>
    </sec>
    <sec id="sec-12">
      <title>4.1 Data collections</title>
      <p><bold>4.1.1 WT10g corpus splits</bold></p>
      <p>
        The Web Track 10GB corpus (WT10g) was developed for
the Text REtrieval Conference (TREC, http://trec.nist.gov/) and consists of 10GB of web
pages (1,692,096 documents on 11,680 servers). Compared
to regular TREC corpora, WT10g should be more suited
for distributed information retrieval experiments, due to the
existence of hyperlinks, differences in topics, variation in
quality and presence of duplicates [
        <xref ref-type="bibr" rid="ref3 ref8">3, 8</xref>
        ].
      </p>
      <p>We also used two query logs published by
Excite (http://www.excite.com/) that were stripped in the same way. One query log
from 1997 contains 1,025,907 queries and another query log
from 1999 contains 1,777,251 queries.</p>
      <p>Finally, a fourth query log with 3,512,320 unique queries
was created by removing all queries from the AOL query log
that were issued only once. This query log will be referred
to as AOL2. The other logs are called AOL, Excite97 and
Excite99.</p>
    </sec>
    <sec id="sec-13">
      <title>4.2 Method</title>
      <p>
        To evaluate the performance of our collection selection
system, we adopted the precision and recall metrics for
collection selection as defined by Gravano et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. We start
by obtaining the retrieval system ranking (S) for a query,
which contains up to 1,000 collections. We also create the
best ranking (B), which is the best possible ranking for a
particular query; collections are ranked by their amount of
merit with respect to a query.
      </p>
      <p>Given query q and collection c_i, the merit within a
collection can be expressed as merit(q, c_i). The merit of
the ith ranked collection in rankings S and B is given by
S_i = merit(q, c_{s_i}) and B_i = merit(q, c_{b_i}).</p>
      <p>The obtained recall after selecting n collections can be
calculated by dividing the merit selected by the best possible
retrieval system:
R_n = (Σ_{i=1..n} S_i) / (Σ_{i=1..n} B_i)
(3)</p>
      <p>Precision P_n is the fraction of top n collections that have
non-zero merit:
P_n = |{s_c ∈ Top_n(S) | merit(q, s_c) &gt; 0}| / |Top_n(S)|
(4)</p>
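      <p>The two metrics can be computed directly from the per-rank merits of the system ranking S and the best ranking B; the toy merit values below are invented for illustration.</p>

```python
# Gravano-style collection selection metrics: S holds the merit of the
# system-ranked collections per rank, B the merit of the ideal ranking.

def recall_at(n, S, B):
    """R_n = merit of the top-n selected / best possible top-n merit."""
    return sum(S[:n]) / sum(B[:n])

def precision_at(n, S):
    """P_n = fraction of the top n selected collections with merit > 0."""
    top = S[:n]
    return sum(1 for m in top if m > 0) / len(top)

S = [5, 0, 3, 2]          # merit of system-ranked collections
B = [6, 5, 3, 2]          # merit of the ideal ranking
print(recall_at(2, S, B))    # 5 / 11
print(precision_at(2, S))    # 1 of top 2 has merit
```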
      <p>Four splits of the WT10G document corpus were made to
look at the effect of document corpora on collection
selection. Every split is a set of collections; every collection is a
set of documents. The numbers 100 and 11512 indicate the
number of collections in the corpus split.</p>
      <p>IP Split: Documents are put in collections based on the IP
addresses of the site where a document was residing.</p>
      <p>This results in 11,512 collections.</p>
      <p>IP Merge 100: A cluster is created by grouping up to 116
collections, which results in 100 collections. Grouping
is done in order of IP address. This split simulates
the scenario of indexing search engines that each index
many servers.</p>
      <p>Random 100 and Random 11512: Two random splits
with 100 collections and with 11,512 collections were
created. Documents were randomly assigned to a
collection. The number of 11,512 collections was chosen
to be able to compare a random split with the IP Split.</p>
      <p>
        The number of documents in random splits is relatively
constant, but varies in IP based collections from less than 10
up to more than 1000 documents. This is typical for the size
of sites on the Internet; the number of documents per server
follows a Zipf distribution [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The IP based
splits show signs of conformance to a Zipf distribution [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
This means that the IP based splits can be compared to the
Internet in terms of distribution of the number of documents
and the sizes of the collections.
      </p>
      <p>
        The merit of a collection is the number of relevant
documents in a collection for a particular query. The IP based
corpus splits have a larger deviation in merit among the
collections. This contrasts with random splits, which by
approximation have equal merit for each collection [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p><bold>4.1.2 WT10g retrieval tasks</bold></p>
      <p>Fifty WT10g informational `ad-hoc' queries were used for
evaluation (query numbers 501–550). The queries have a
query number, title, description and a narrative description
of the result that is considered relevant. The title is a small
set of words which was used as the query text. The
narrative descriptions were used by humans to assign relevance
judgments to documents. The relevance judgments can be
used to count the number of relevant documents in the
collections, which in turn can be used to measure collection
selection effectiveness.</p>
      <p>There are three types of relevance judgments: not
relevant (0), relevant (1) and authoritative (2). There can be
multiple authoritative documents in the document corpus
for a query, but for some queries there are no authoritative
documents. All authoritative judgments are converted to 1,
so documents are either relevant or not relevant. This allows
for evaluation of the collected merit.</p>
      <p><bold>4.1.3 Query logs</bold></p>
      <p>
        AOL published a query log with 21,011,340 queries [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
The log has been anonymized and consists of several data
fields: the actual query issued, the query submission date
and time, and an anonymous user ID number. The release
of the anonymized data set was controversial at the time
because it was proven possible to link an ID to a real person.
To respect the anonymity of the users, we used a stripped
version of the query log that only contains the actual queries
issued in random order (and none of the other metadata).
      </p>
      <p>The precision and recall obtained by Sophos is compared
to the collection selection results from a baseline of Language
Modeling with Dirichlet Smoothing (lmds) as implemented
by indri. The baseline system has one parameter that
is set to 2500. Krovetz word stemming is applied to the
collections.</p>
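      <p>To make the baseline concrete, the following is a minimal sketch (our own illustration, not indri's actual implementation) of scoring a collection as a single large document under a Dirichlet-smoothed language model, P(t|C) = (tf(t,C) + mu * P(t|corpus)) / (|C| + mu), with mu = 2500:</p>

```python
import math

def lmds_score(query_terms, coll_tf, coll_len, corpus_tf, corpus_len, mu=2500):
    """Dirichlet-smoothed query log-likelihood of a collection
    modeled as a single large document."""
    score = 0.0
    for t in query_terms:
        p_corpus = corpus_tf.get(t, 0) / corpus_len       # background model
        p_t = (coll_tf.get(t, 0) + mu * p_corpus) / (coll_len + mu)
        score += math.log(p_t)                            # assumes p_t > 0
    return score

# Hypothetical counts: the collection mentioning "web" more often scores higher.
corpus_tf, corpus_len = {"web": 10, "cat": 5}, 100
a = lmds_score(["web"], {"web": 5}, 50, corpus_tf, corpus_len)
b = lmds_score(["web"], {}, 50, corpus_tf, corpus_len)
print(a > b)  # True
```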
      <p>
        Section 3 introduced Sophos and its parameters. There
are three parameters for peers: the maximum key length
hmax, the maximum key frequency before a key is called
frequent (tf_max), and the window size ws. Based on
numbers used by Luu et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], we use the following settings:
tf_max = {250, 500, 750} and ws = {6, 12, 18}. The average
query length on the Internet is 2.3 [
        <xref ref-type="bibr" rid="ref14 ref30">14, 30</xref>
        ]. We use this
observation to set hmax to 3. Setting it smaller would require
many intersections of term sets (with associated collections)
to answer queries. Setting it larger would result in many
term sets that are rarely queried for. For the broker, the
collection maximum cm is tested for values 5, 10, 20 and 50.
      </p>
      <p>To evaluate Sophos, the collections are processed by
removing stop words { drastically reducing the number of
invaluable keys and speeding up term set generation { and
applying Porter word stemming.</p>
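      <p>The peer parameters above can be sketched in code. The following simplified example (our own illustration; function names and details are assumptions, not Sophos source) generates candidate term-set keys of up to hmax terms co-occurring within a window of ws terms, and keeps only those occurring in at most tf_max documents:</p>

```python
from collections import Counter
from itertools import combinations

def candidate_keys(terms, hmax=3, ws=6):
    """Sorted term-set tuples of up to hmax distinct terms that
    co-occur within a sliding window of ws consecutive terms."""
    keys = set()
    for start in range(max(1, len(terms) - ws + 1)):
        window = sorted(set(terms[start:start + ws]))
        for size in range(1, hmax + 1):
            keys.update(combinations(window, size))
    return keys

def discriminative_keys(docs, tf_max=250, hmax=3, ws=6):
    """Keep only keys occurring in at most tf_max documents:
    rare keys are the highly discriminative ones."""
    freq = Counter()
    for terms in docs:
        for key in candidate_keys(terms, hmax, ws):
            freq[key] += 1
    return {k for k, f in freq.items() if f <= tf_max}
```

<p>With tf_max = 1, for example, a term occurring in two documents is dropped as too frequent, while a term pair unique to one document is kept as a key.</p>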
      <p>Finally, we will look at the precision and recall with
querydriven indexing using four di erent query logs. By pruning
the keys from the peer indices that do not occur in a query
2http://www.excite.com/
log. At the same time, we will look at the number of term
sets (keys) in the broker index to get an idea about its size.</p>
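      <p>Query-driven pruning itself is straightforward; a minimal sketch (our own illustration, assuming keys are stored as sorted term tuples) keeps only the keys that appear as a term set in some logged query:</p>

```python
from itertools import combinations

def prune_with_query_log(peer_keys, query_log, hmax=3):
    """Query-driven indexing: drop every key that never occurs as a
    term set (of up to hmax terms) in the query log."""
    queried = set()
    for query in query_log:
        terms = sorted(set(query.lower().split()))
        for size in range(1, hmax + 1):
            queried.update(combinations(terms, size))
    return {k for k in peer_keys if k in queried}

# Hypothetical peer index: the key that is never queried for is pruned.
peer_keys = {("web",), ("search", "web"), ("rare", "tuple")}
print(prune_with_query_log(peer_keys, ["Web search"]))
```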
    </sec>
    <sec id="sec-14">
      <title>5. RESULTS</title>
    </sec>
    <sec id="sec-15">
      <title>5.1 Index size</title>
      <p>Figure 3 shows the number of collection identi er
counters within the broker index for di erent indexing settings
of indexing the IP Split. Spread over single, double and
triple term set collection identi er counters, the number of
counters are a good indication for the actual broker index
size. The gure shows that the number of counters decreases
when the term set frequency maximum is increased.
Single terms
Double terms
Triple terms
107 tf250cm5tf250cm10tf250cm20tf250cm50tf500cm5tf500cm10tf500cm20tf500cm50tf750cm5tf750cm10tf750cm20tf750cm50</p>
      <p>A more substantial reduction of collection identi er
counters { of roughly 70% { can be achieved by using
querydriven indexing, as shown in Figure 4. The gure shows the
number of collection identi er counters after indexing the
Random 11,512 corpus split with or without query-driven
indexing. Figure 5 depicts the obtainable index size reduction
by using di erent query logs. The Excite query logs contain
signi cantly less query term sets than the AOL query log.
The gure shows that more keys are pruned from the peer
indices, resulting in a smaller broker index. The gure shows
that using Excite query logs instead of the standard AOL
query log can result in roughly 75% less collection identi er
counters.</p>
      <p>[Figure 4: number of collection identifier counters in the
global index (log scale), for Sophos with and without QDI (AOL),
split into single, double and triple term sets]</p>
      <p>[Figure 5: number of collection identifier counters in the
global index (0 to 4.5e+007), split into single, double and
triple term sets]</p>
    </sec>
    <sec id="sec-16">
      <title>5.2 Collection selection performance</title>
      <p>Table 2 shows the average precision and recall over 50
queries that were calculated with Formula 3 and Formula 4.
The numbers are calculated for four di erent corpus splits
with which the baseline (lmds) and Sophos were tested.
Sophos was used with the following settings: query-driven
indexing with the AOL query log, tf max = 250, cm = 20
and ws = 6. Due to memory constraints, we were unable
to run Sophos with a window size larger than 6. The table
shows that the baseline outperforms Sophos on the Random
11,512 corpus split, but Sophos outperforms the baseline on
the IP split.</p>
      <p>This is displayed in more detail in Figure 6, which shows
the average recall of Sophos and the baseline after selecting
n collections. Sophos was tested using di erent query logs,
tf max = 250 and cm = 50. Interestingly, pruning with the
smallest Excite query logs results in the best recall gures,
possibly because the queries were logged at the same time
as when the corpus was crawled.</p>
      <p>[Figure 6: average recall after selecting n collections
(n = 1 to 100, recall 0 to 1) for LMDS and for Sophos with QDI
using the AOL, AOL2, Excite 97 and Excite 99 query logs
(tf250, cm50, ws6)]</p>
    </sec>
    <sec id="sec-17">
      <title>6. CONCLUSIONS</title>
      <p>
        We introduced the collection selection system Sophos that
uses highly discriminative keys in peer indices to construct
a broker index. The broker index contains keys that are
good discriminators to select collections (or peers). To limit
the number of keys transferred to the broker, and to reduce
the broker index size, we employed query-driven indexing to
only store the keys that are queried for by users. Similar
studies were performed for document retrieval [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], but to
the best of our knowledge, we are the first to apply this
approach to collection selection.
      </p>
      <p>Precision and recall was measured using 50 queries on
the WT10g TREC test corpus and compared to a
stateof-the-art baseline that uses language modeling with
Dirichlet smoothing (lmds). The results showed that our system
outperformed the baseline with the IP split as test corpus.
This is promising, because the IP based splits most closely
resemble the information structure on the Internet. lmds
was better capable of selecting information in random based
splits, because it stores all available information about the
collections. In random based splits, relevant documents (and
their corresponding terms) are mixed over many collections,
making it hard for our approach to select highly
discriminative keys that can discriminate collections with relevant
documents.</p>
      <p>
        Query-driven indexing is required to keep the broker index
size manageable: a 70% index size reduction can be obtained
by pruning keys using the AOL query log, and another 75%
reduction is possible by using a smaller query log. Our results
on the IP split showed that pruning using the smaller Excite
query logs resulted in higher precision and recall than with
the AOL query logs. The use of any query log resulted in higher
precision and recall than the baseline results. This motivates
further research using more advanced query-driven indexing
strategies, such as those described by Skobeltsyn [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ], to further
reduce the index size while improving performance. It
would also be interesting to make tf_max depend on the
collection size.
      </p>
    </sec>
    <sec id="sec-18">
      <title>Acknowledgements</title>
      <p>We are grateful to the Yahoo Research Faculty Grant
programme and to the Netherlands Organisation for Scienti c
Research (NWO, Grant 639.022.809) for supporting this work.
We would like to thank Berkant Barla Cambazoglu and the
anonymous reviewers for their valuable comments that
improved this paper.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Adamic</surname>
          </string-name>
          and
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Huberman</surname>
          </string-name>
          .
          <article-title>Zipf's law and the internet</article-title>
          .
          <source>Glottometrics</source>
          ,
          <volume>3</volume>
          :
          <fpage>143</fpage>–<lpage>150</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Arguello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Elsas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          , and J. G. Carbonell.
          <article-title>Document representation and query expansion models for blog recommendation</article-title>
          .
          <source>In Proc. of the 2nd Intl. Conf. on Weblogs and Social Media (ICWSM)</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bailey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Hawking</surname>
          </string-name>
          .
          <article-title>Engineering a multi-purpose test collection for web retrieval experiments</article-title>
          .
          <source>Inf. Process. Manage.</source>
          ,
          <volume>39</volume>
          (
          <issue>6</issue>
          ):
          <fpage>853</fpage>–<lpage>871</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bockting</surname>
          </string-name>
          .
          <article-title>Collection selection for distributed web search</article-title>
          .
          <source>Master's thesis</source>
          , University of Twente, Feb.
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          .
          <article-title>Distributed information retrieval</article-title>
          .
          <source>Advances in Information Retrieval</source>
          , pages
          <fpage>127</fpage>–<lpage>150</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Harding</surname>
          </string-name>
          .
          <article-title>The inquery retrieval system</article-title>
          .
          <source>In Proc. of the 3rd International Conference on Database and Expert Systems Applications</source>
          , pages
          <fpage>78</fpage>–<lpage>83</lpage>
          , Valencia, Spain,
          <year>1992</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Searching distributed collections with inference networks</article-title>
          .
          <source>In SIGIR '95: Proc. of the 18th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>21</fpage>–<lpage>28</lpage>
          , New York, NY, USA,
          <year>1995</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bailey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Hawking</surname>
          </string-name>
          .
          <article-title>Is it fair to evaluate web systems using TREC ad hoc methods?</article-title>
          .
          <source>In Workshop on Web Evaluation. SIGIR'99</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bailey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Hawking</surname>
          </string-name>
          .
          <article-title>Server selection on the world wide web</article-title>
          .
          <source>In DL '00: Proc. of the 5th ACM conference on Digital libraries</source>
          , pages
          <fpage>37</fpage>–<lpage>46</lpage>
          , New York, NY, USA,
          <year>2000</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>D'Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thom</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          .
          <article-title>A comparison of techniques for selecting text collections</article-title>
          .
          <source>In ADC '00: Proceedings of the Australasian Database Conference, page 28</source>
          , Washington, DC, USA,
          <year>2000</year>
          . IEEE Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>D'Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Thom</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          .
          <article-title>Collection selection for managed distributed document databases</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>40</volume>
          (
          <issue>3</issue>
          ):
          <fpage>527</fpage>–<lpage>546</lpage>
          , May
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>D'Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Thom</surname>
          </string-name>
          .
          <article-title>Is CORI effective for collection selection? An exploration of parameters, queries, and data</article-title>
          .
          <source>In Proc. of the 9th Australasian Document Computing Symposium</source>
          , pages
          <fpage>41</fpage>–<lpage>46</lpage>
          , Dec.
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gravano</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          .
          <article-title>Generalizing GlOSS to vector-space databases and broker hierarchies</article-title>
          .
          <source>In International Conference on Very Large Databases, VLDB</source>
          , pages
          <fpage>78</fpage>–<lpage>89</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lau</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Horvitz</surname>
          </string-name>
          .
          <article-title>Patterns of search: analyzing and modeling web query refinement</article-title>
          .
          <source>In UM '99: Proc. of the 7th international conference on User modeling</source>
          , pages
          <fpage>119</fpage>–<lpage>128</lpage>
          , Secaucus, NJ, USA,
          <year>1999</year>
          . Springer-Verlag New York, Inc.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Luu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Klemm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rajman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Aberer</surname>
          </string-name>
          .
          <article-title>Using highly discriminative keys for indexing in a peer-to-peer full-text retrieval system</article-title>
          .
          <source>Technical report, TR: 2005041</source>
          ,
          EPFL, Lausanne
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lyman</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Varian</surname>
          </string-name>
          .
          <source>How much information?</source>
          <year>2003</year>
          . http://www.sims.berkeley.edu/how-much-info-2003, retrieved on April 23, 2008
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          .
          <article-title>Indri retrieval model overview</article-title>
          ,
          <year>July 2005</year>
          . http://ciir.cs.umass.edu/~metzler/indriretmodel.html, retrieved on January 20, 2008
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Combining the language model and inference network approaches to retrieval</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>40</volume>
          (
          <issue>5</issue>
          ):
          <fpage>735</fpage>–<lpage>750</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>G.</given-names>
            <surname>Pass</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Torgeson</surname>
          </string-name>
          .
          <article-title>A picture of search</article-title>
          .
          <source>In InfoScale '06: Proc. of the 1st international conference on Scalable information systems, page 1</source>
          , New York, NY, USA,
          <year>2006</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>I.</given-names>
            <surname>Podnar Zarko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rajman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Luu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Klemm</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Aberer</surname>
          </string-name>
          .
          <article-title>Scalable peer-to-peer web retrieval with highly discriminative keys</article-title>
          .
          <source>IEEE 23rd International Conference on Data Engineering (ICDE)</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Powell</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. C.</given-names>
            <surname>French</surname>
          </string-name>
          .
          <article-title>Comparing the performance of collection selection algorithms</article-title>
          .
          <source>ACM Trans. Inf</source>
          . Syst.,
          <volume>21</volume>
          (
          <issue>4</issue>
          ):
          <fpage>412</fpage>–<lpage>456</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>D.</given-names>
            <surname>Puppin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Laforenza</surname>
          </string-name>
          .
          <article-title>Query-driven document partitioning and collection selection</article-title>
          .
          <source>In InfoScale '06: Proc. of the 1st international conference on Scalable information systems, page 34</source>
          , New York, NY, USA,
          <year>2006</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Hancock-Beaulieu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Gatford</surname>
          </string-name>
          .
          <article-title>Okapi at TREC-3</article-title>
          .
          <source>In TREC-3 Proc.</source>
          , pages
          <fpage>109</fpage>–<lpage>126</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. J.</given-names>
            <surname>McGill</surname>
          </string-name>
          .
          <article-title>Introduction to Modern Information Retrieval</article-title>
          .
          McGraw-Hill, Inc., New York, NY, USA,
          <year>1986</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>N.</given-names>
            <surname>Sato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Udagawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Uehara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sakai</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Mori</surname>
          </string-name>
          .
          <article-title>Query based site selection for distributed search engines</article-title>
          .
          <source>In Proc. of the 23rd International Conference on Distributed Computing Systems Workshops</source>
          , pages
          <fpage>556</fpage>–<lpage>561</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Seo</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>UMass at TREC 2007 blog distillation task</article-title>
          .
          <source>In Proc. of the 2008 Text REtrieval Conference. NIST</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>M.</given-names>
            <surname>Shokouhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tahaghoghi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Scholer</surname>
          </string-name>
          .
          <article-title>Using query logs to establish vocabularies in distributed information retrieval</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>43</volume>
          (
          <issue>1</issue>
          ):
          <fpage>169</fpage>–<lpage>180</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>L.</given-names>
            <surname>Si</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          .
          <article-title>Relevant document distribution estimation method for resource selection</article-title>
          .
          <source>In SIGIR '03: Proc. of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval</source>
          , pages
          <fpage>298</fpage>–<lpage>305</lpage>
          , New York, NY, USA,
          <year>2003</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>L.</given-names>
            <surname>Si</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Ogilvie</surname>
          </string-name>
          .
          <article-title>A language modeling framework for resource selection and results merging</article-title>
          .
          <source>Proc. of the 11th international conference on Information and knowledge management</source>
          , pages
          <fpage>391</fpage>–<lpage>397</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>C.</given-names>
            <surname>Silverstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Marais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Henzinger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Moricz</surname>
          </string-name>
          .
          <article-title>Analysis of a very large web search engine query log</article-title>
          .
          <source>SIGIR Forum</source>
          ,
          <volume>33</volume>
          (
          <issue>1</issue>
          ):
          <fpage>6</fpage>
          -
          <lpage>12</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>G.</given-names>
            <surname>Skobeltsyn</surname>
          </string-name>
          .
          <article-title>Query-driven indexing in large-scale distributed systems</article-title>
          .
          <source>PhD thesis</source>
          , EPFL, Lausanne,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>G.</given-names>
            <surname>Skobeltsyn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Luu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Podnar Zarko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rajman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Aberer</surname>
          </string-name>
          .
          <article-title>Query-driven indexing for scalable peer-to-peer text retrieval</article-title>
          .
          <source>Future Generation Computer Systems</source>
          ,
          <volume>25</volume>
          (
          <issue>1</issue>
          ):
          <fpage>89</fpage>
          -
          <lpage>99</lpage>
          ,
          <year>June 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>T.</given-names>
            <surname>Strohman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Turtle</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Indri: A language model-based search engine for complex queries</article-title>
          .
          <source>In Proc. of the International Conference on Intelligence Analysis</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>C.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Dwarkadas</surname>
          </string-name>
          .
          <article-title>Peer-to-peer information retrieval using self-organizing semantic overlay networks</article-title>
          .
          <source>Proc. of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications</source>
          , pages
          <fpage>175</fpage>
          -
          <lpage>186</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. J.</given-names>
            <surname>DeWitt</surname>
          </string-name>
          .
          <article-title>Computing pagerank in a distributed internet search system</article-title>
          .
          <source>In Proc. of the International Conference on Very Large Databases (VLDB)</source>
          ,
          <year>Aug. 2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Bell</surname>
          </string-name>
          .
          <source>Managing Gigabytes: Compressing and Indexing Documents and Images</source>
          . Morgan Kaufmann,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Cluster-based language models for distributed retrieval</article-title>
          .
          <source>Proc. of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>254</fpage>
          -
          <lpage>261</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yuwono</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Server ranking for distributed text retrieval systems on the internet</article-title>
          .
          <source>In Proc. of the 5th International Conference on Database Systems for Advanced Applications (DASFAA)</source>
          , pages
          <fpage>41</fpage>
          -
          <lpage>50</lpage>
          . World Scientific Press,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          .
          <article-title>A study of smoothing methods for language models applied to ad hoc information retrieval</article-title>
          .
          <source>In SIGIR '01: Proc. of the 24th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <volume>334</volume>
          {
          <fpage>342</fpage>
          , New York, NY, USA,
          <year>2001</year>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>