<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Collection Selection with Highly Discriminative Keys</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Sander</forename><surname>Bockting</surname></persName>
							<email>sander.bockting@avanade.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Avanade B.V</orgName>
								<address>
								<addrLine>Versterkerstraat 6</addrLine>
									<postCode>1322 AP</postCode>
									<settlement>Almere</settlement>
									<country key="NL">Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Djoerd</forename><surname>Hiemstra</surname></persName>
							<email>d.hiemstra@utwente.nl</email>
							<affiliation key="aff1">
								<orgName type="institution">University of Twente</orgName>
								<address>
								<postBox>P.O. Box 217</postBox>
									<postCode>7500 AE</postCode>
									<settlement>Enschede</settlement>
									<country key="NL">Netherlands</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Collection Selection with Highly Discriminative Keys</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">ED427C99B78A4F144797DBADEEDD3B09</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T08:33+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The centralized web search paradigm introduces several problems, such as large data traffic requirements for crawling, index freshness problems and problems to index everything. In this study, we look at collection selection using highly discriminative keys and query-driven indexing as part of a distributed web search system. The approach is evaluated on different splits of the TREC WT10g corpus. Experimental results show that the approach outperforms a Dirichlet smoothing language modeling approach for collection selection, if we assume that web servers index their local content.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>The web search approach of major search engines, like Google, Yahoo! and Bing, amounts to crawling, indexing and searching. We call this approach centralized search, because all operations are controlled by the search engines themselves, albeit from a relatively limited number of locations on large clusters of thousands of machines. The centralized web search paradigm poses several problems.</p><p>The amount of web data is estimated to grow exponentially <ref type="bibr" target="#b34">[34]</ref>. The changing and growing data requires frequent visits by crawlers, just to keep the index fresh. Crawling should be done often, but it generates a huge amount of traffic, making it impossible to crawl all pages frequently. With an estimated update interval of four weeks, updates are performed relatively slowly <ref type="bibr" target="#b25">[25,</ref><ref type="bibr" target="#b35">35]</ref>. Also, it is impossible to index everything, as the visible web that is accessible to search engines is only a fraction of the total number of web pages <ref type="bibr" target="#b16">[16]</ref>.</p><p>Callan <ref type="bibr" target="#b5">[5]</ref> identified the distributed information retrieval problem set, consisting of resource description, resource selection and result merging. We believe that distributing the search effort may be an approach to solving the problems described above. This research focuses on resource description and resource selection <ref type="bibr" target="#b10">[10,</ref><ref type="bibr" target="#b22">22]</ref>. In our research, resource description, i.e. the indexing of the peers, is distributed; resource selection, or collection selection, is centralized.</p><p>Copyright © 2009 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. Re-publication of material from this volume requires permission by the copyright owners. This volume is published by its editors.</p><p>Figure <ref type="figure" target="#fig_0">1</ref> shows three types of entities: peers, brokers and clients. We assume that peers are collaborative. Every peer runs a search engine enabled web server. The search engine indexes the local website(s), but it may also index other websites. In this scenario, there can be many millions of peers. When a user has an information need, he can pose a query at the client. The client sends the query to the broker. In response, the broker tries to identify the most promising peers to answer the query. This must be a small number of peers, e.g. five to ten, so that little traffic is generated. The query is routed to those peers and the results are returned to the broker. The broker merges the results and sends the merged list to the client.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>LSDS-IR</head><p>Peers and brokers cooperate to enable brokers to identify the most promising peers. To this end, every peer sends a small part of its index to the broker. This part cannot be too small, as the broker must still be able to judge a peer's ability to answer queries satisfactorily. It cannot be too large either, as the index data traffic would not scale.</p><p>Techniques have been proposed to manage the index size. Podnar et al. <ref type="bibr" target="#b20">[20]</ref> used the concept of highly discriminative keys (HDKs) for document retrieval in distributed information retrieval. An HDK is a set of terms that is highly discriminative, i.e., that only matches a few documents in the collection. Because the terms are pre-coordinated (they are combined at index time, not at search time) and because only a few documents match all terms in a pre-coordinated set, the HDK approach is able to retrieve the top documents for a query very efficiently. Although searching can be efficient, the HDK indexing process, described in more detail in Section 3.1, has the negative side-effect that a huge number of highly discriminative keys is generated. To reduce the number of keys, Skobeltsyn <ref type="bibr" target="#b32">[32]</ref> proposed a query-driven indexing strategy that uses caching techniques to adapt to the changing querying behavior of users. The combination of HDKs with query-driven indexing allows for completely distributed document retrieval that in theory grows to web scale proportions <ref type="bibr" target="#b31">[31]</ref>.</p><p>This paper contributes to the field of distributed information retrieval by applying HDKs and query-driven indexing to select collections, instead of documents. 
Such an approach would, in theory, scale the distributed search scenario described above to millions of peers: the broker lists for every HDK a small number of peers to send the query to, and the peers retrieve the documents, possibly many. Unlike a traditional inverted file index, which typically consists of huge posting lists and a comparatively tiny dictionary <ref type="bibr" target="#b36">[36]</ref>, our HDK index consists of a huge dictionary and comparatively tiny posting lists. The system fits into the previously sketched scenario, which allows for control at the broker. This control can, for example, be used to prevent misuse or to allow for domain-specific search.</p><p>This paper is organized as follows: the next section discusses earlier collection selection methods, Section 3 introduces our collection selection system and Section 4 describes the evaluation. The paper concludes with results and conclusions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">EARLIER WORK ON COLLECTION SELECTION</head><p>Collection selection systems have been developed to select collections containing documents that are relevant to a user's query. The generalized Glossary-Of-Servers Server (gGlOSS) is such a system <ref type="bibr" target="#b13">[13]</ref>. It uses a vector space model, representing index items (document collections) and user queries as weight vectors in a high-dimensional Euclidean space, to calculate the distance (or similarity) between document collections and queries <ref type="bibr" target="#b24">[24]</ref>.</p><p>Another well-known approach is CVV, which exploits the variation in cue validity to select collections <ref type="bibr" target="#b38">[38]</ref>. The cue validity CVi,j of query term tj for collection ci measures the extent to which tj discriminates ci from the other collections, by comparing the ratio of documents in ci containing tj to the ratios of documents in other collections containing tj. The larger the variation in cue validities for collections with respect to a term, the better the term is for selecting collections.</p><p>This section describes two collection selection methods in more detail: inference networks and language modeling.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Inference networks</head><p>cori <ref type="bibr">[7]</ref> is a collection ranking algorithm for the inquery retrieval system <ref type="bibr" target="#b6">[6]</ref>, and uses an inference network to rank collections. A simple document inference network has leaves d representing the document collections. The terms that occur in those collections are represented by representation nodes r. Flowing along the arcs between the leaves and nodes are probabilities based on document collection statistics. As opposed to tf.idf, the probabilities are calculated using document frequencies df and inverse collection frequencies icf (df.icf). The inverse collection frequency is based on the number of collections that contain the term. An inference network with these properties is called a collection retrieval inference network (cori net).</p><p>Given query q with terms rk, the network is used to obtain a ranked list of collections by calculating the belief p(rk|ci) in collection ci due to the observation of rk. The collection ranking score of ci for query q is the sum of all beliefs p(rk|ci), where rk ∈ q. The belief is calculated using Formula 1. In this formula, b and l are constants, cwi is the number of words in ci, cw is the mean number of words in the collections and |C| is the number of collections. df and cf are, respectively, the number of documents and collections that contain rk. Finally, dt and db are, respectively, the minimum term frequency component and the minimum belief component when rk occurs in ci.</p><p>To improve retrieval, the component L is used to scale the document frequency in the calculation of T <ref type="bibr" target="#b6">[6,</ref><ref type="bibr" target="#b23">23]</ref>. L is made sensitive to the number of documents that contain term rk (using b) and is made large using l. L should be large, because df is generally large. 
Proper tuning of these two parameters is required for every data set, but is deemed impossible, as the correct settings are highly sensitive to data set variations <ref type="bibr" target="#b12">[12]</ref>; the value of l must be varied greatly even when the data set is kept constant and only the query type varies.</p><p>Further research showed that it is not justified to use standard cori as a collection selection benchmark. D'Souza et al. showed that HighSim outperformed cori in 15 of 21 cases <ref type="bibr" target="#b11">[11]</ref>. Si and Callan <ref type="bibr" target="#b28">[28]</ref> found limitations related to differing collection sizes: large collections are not often ranked high by cori, although they often are the most promising collections.</p><formula xml:id="formula_0">L = l • ((1 − b) + b • cwi / cw)
T = dt + (1 − dt) • log(df) / log(df + L)
I = log((|C| + 0.5) / cf) / log(|C| + 1.0)
p(rk|ci) = db + (1 − db) • T • I<label>(1)</label></formula></div>
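As a concrete illustration, the belief computation of Formula 1 can be sketched in Python. The function name, parameter names and the default constant values below are ours, chosen for readability; they are not prescribed by cori:

```python
import math

def cori_belief(df, cf, cw_i, cw_avg, num_collections,
                b=0.4, l=150, d_t=0.4, d_b=0.4):
    """Sketch of the cori belief p(r_k|c_i) from Formula 1.

    df  -- number of documents in collection c_i containing term r_k
    cf  -- number of collections containing r_k
    cw_i, cw_avg -- words in c_i and mean words per collection
    Constants b, l, d_t, d_b are illustrative defaults, not values
    taken from the paper.
    """
    # L scales df by the relative collection size
    L = l * ((1 - b) + b * cw_i / cw_avg)
    # T: term frequency component, I: inverse collection frequency component
    T = d_t + (1 - d_t) * math.log(df) / math.log(df + L)
    I = math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)
    return d_b + (1 - d_b) * T * I
```

The belief grows with df (more matching documents in the collection) and shrinks with cf (the term occurring in many collections), which matches the df.icf intuition described above.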
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Language modeling</head><p>A language model is a statistical model that assigns a probability to a sequence of words (e.g. a query) being generated by a particular document or document collection. Language models can be used for collection selection in distributed search, by creating a language model for each collection <ref type="bibr" target="#b29">[29,</ref><ref type="bibr" target="#b37">37]</ref>. They have also been used for collection selection in other tasks, for instance blog search <ref type="bibr" target="#b2">[2,</ref><ref type="bibr" target="#b26">26]</ref>.</p><p>indri is an improved version of the inquery retrieval system <ref type="bibr" target="#b33">[33]</ref>: it is capable of dealing with larger collections, allows for more complex queries due to new query constructs and uses formal probabilistic document representations based on language models. The combined model of inference networks and language modeling is capable of achieving more favorable retrieval results than inquery <ref type="bibr" target="#b18">[18]</ref>. Due to these differences, term representation beliefs are calculated in another way than with cori (as described in Section 2.1):</p><formula xml:id="formula_1">P(r|D) = (tf r,D + αr) / (|D| + αr + βr)</formula><p>The belief is calculated for representation concept r of document node D (in document collection C). Examples of representation concepts are a term in the body or title of a document. D and r are nodes in the belief network. The term frequency of representation node r in D is denoted by tf r,D . αr and βr are smoothing parameters. Smoothing is used to make the maximum likelihood estimates of a language model more accurate; they would otherwise be less accurate due to data sparseness, because not every term occurs in every document <ref type="bibr" target="#b39">[39]</ref>. Smoothing ensures that terms that are unseen in the document are not assigned zero probability. 
The default smoothing model for indri is Dirichlet smoothing, which determines the smoothing parameter choices <ref type="bibr" target="#b17">[17]</ref>. In setting the smoothing parameters, it was assumed that the likelihood of observing representation concept r is equal to the probability of observing it in collection C, given by P(r|C). This means that αr/(αr + βr) = P(r|C). The following parameter values were chosen:</p><formula xml:id="formula_2">αr = µP(r|C)
βr = µ(1 − P(r|C))</formula><p>This results in the following term representation belief, where the free parameter µ has a default value of 2500:</p><formula xml:id="formula_3">P(r|D) = (tf r,D + µP(r|C)) / (|D| + µ)</formula></div>
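The Dirichlet-smoothed belief is simple enough to sketch directly; function and parameter names are ours, with indri's default µ = 2500:

```python
def dirichlet_belief(tf_rD, doc_len, p_rC, mu=2500):
    """Sketch of the Dirichlet-smoothed belief
    P(r|D) = (tf_{r,D} + mu * P(r|C)) / (|D| + mu).

    tf_rD   -- frequency of representation concept r in D
    doc_len -- |D|, the number of terms in D
    p_rC    -- background probability P(r|C) of r in collection C
    mu      -- Dirichlet prior (indri's default is 2500)
    """
    return (tf_rD + mu * p_rC) / (doc_len + mu)
```

Note that a term unseen in D (tf_rD = 0) still receives a non-zero belief proportional to its background probability P(r|C), which is exactly the point of smoothing.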
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Discussion</head><p>The language modeling approach of indri has a better formal probabilistic document representation than cori, and indri is an improved version of inquery (which is the foundation of cori). We will use the language model on document collections as implemented by indri as our baseline collection selection system. Si et al. <ref type="bibr" target="#b29">[29]</ref> showed that a language modeling approach for collection selection outperforms cori. Furthermore, cori outperforms algorithms like cvv and gGlOSS in several studies <ref type="bibr" target="#b9">[9,</ref><ref type="bibr" target="#b21">21]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">SOPHOS</head><p>Sophos is a collection selection prototype that uses HDKs to index collections. The keys are used to assign scores to the collections. Using a scoring function, collection scores can be calculated to rank collections for a particular query. A general overview is depicted in Figure <ref type="figure" target="#fig_1">2</ref>. This section describes how the broker index is created, explains index size reduction using a query-driven indexing approach, identifies query result parameters, and concludes with a collection ranking formula (ScoreFunction in Figure <ref type="figure" target="#fig_1">2</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Highly discriminative keys</head><p>Building the index is done in two phases. First, every peer builds an index of its document collection and sends that index to the broker. Second, the broker constructs a broker index from all peer indices.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.1">Peer indexing</head><p>Peer indexing starts with the generation of term set statistics. First, single-term statistics are generated by counting the frequency of every word in the collection, without looking at document boundaries. A term set is called frequent when it occurs more often than a term set frequency maximum tf max . Every infrequent single term is added to the peer index together with its frequency count. The frequent terms are stored for further analysis.</p><p>The next step consists of obtaining double term set statistics. For every frequent term in the collection, frequency statistics are created for term combinations that consist of the frequent term and a term that appears after it within a window of size ws. The result is a list of double terms with corresponding frequency counts. Once again, the term set frequency maximum determines which term sets are frequent and will be used for further analysis, and which term sets are infrequent and will be stored in the peer index. This procedure is repeated until the generated term sets would contain more than ws terms, or until a predefined maximum number of terms per term set, hmax, is reached.</p><p>Summarizing, the peer index consists of tuples of term sets and corresponding frequency counts.</p></div>
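A minimal sketch of this peer indexing loop, under simplifying assumptions that are ours: the window is only applied after the first term of a frequent key, and the collection is a flat token stream (document boundaries ignored, as described above):

```python
from collections import Counter

def build_peer_index(terms, tf_max=250, ws=6, h_max=3):
    """Sketch of HDK peer indexing (Section 3.1.1).

    terms -- token stream of one collection
    Returns {term_set(tuple): frequency} holding the infrequent
    (discriminative) keys that would be sent to the broker.
    """
    index = {}
    # Level 1: single-term statistics.
    counts = Counter((t,) for t in terms)
    frequent = set()
    for key, freq in counts.items():
        if freq <= tf_max:
            index[key] = freq      # infrequent -> stored in peer index
        else:
            frequent.add(key)      # frequent -> expanded at the next level
    # Levels 2..h_max: extend each frequent key with a nearby term.
    for _h in range(2, h_max + 1):
        counts = Counter()
        for i, t in enumerate(terms):
            for key in frequent:
                if key[0] == t:    # simplification: anchor on the first term
                    for follower in terms[i + 1:i + ws]:
                        counts[key + (follower,)] += 1
        frequent = set()
        for key, freq in counts.items():
            if freq <= tf_max:
                index[key] = freq
            else:
                frequent.add(key)
    return index
```

A frequent single term thus never appears in the index by itself; only its infrequent extensions do, which is what makes the surviving keys discriminative.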
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.2">Broker indexing</head><p>The infrequent term sets from the peer indices are sent to the broker. The broker index contains term sets with associated sets of collection identifier counters. A collection identifier counter is a tuple of a collection identifier and a term set frequency counter. A collection identifier is a short representation of a collection in which the term set occurs.</p><p>When a term set is inserted into the broker index, it is called a highly discriminative key (HDK). The broker index contains a maximum number of collections per HDK, denoted by the collection maximum cm. As soon as the maximum number of collections is reached for a particular HDK, only the cm collections with the largest term set frequencies are kept in the broker index.</p></div>
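The cm cap on collection identifier counters can be sketched as follows; the function name and the dict-based index layout are our assumptions:

```python
def broker_insert(broker_index, key, collection_id, freq, cm=20):
    """Insert a (key, collection, frequency) tuple into the broker index,
    keeping at most cm collection identifier counters per HDK and
    retaining the cm collections with the largest term set frequencies
    (Section 3.1.2).

    broker_index -- {term_set: {collection_id: frequency}}
    """
    postings = broker_index.setdefault(key, {})
    postings[collection_id] = freq
    if len(postings) > cm:
        # evict the collection with the smallest counter
        weakest = min(postings, key=postings.get)
        del postings[weakest]
```

The resulting structure mirrors the paper's observation that the HDK index has a huge dictionary (many keys) but tiny posting lists (at most cm entries each).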
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Query-driven indexing</head><p>A list of popular keys can be created by extracting all unique queries from a query log. Every key that is infrequent at a peer, and which is present in the unique query list, will be sent to the broker; the other keys are filtered out to reduce the broker index size and to reduce traffic.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1: Parameter definitions for query result handling</head><p>c: sum of query term set occurrences within one collection (grouped by sets with h terms); h: #terms in an HDK; hmax: maximum #terms that an HDK can consist of; q: #terms in the query; n: #distinct query terms found in a collection; tf max: maximum frequency of a term set in a collection</p><p>This index pruning strategy was used before by Shokouhi et al. <ref type="bibr" target="#b27">[27]</ref>. It is not the most desirable strategy for query-driven indexing, because it is unable to deal with unseen query terms. However, it gives a good idea of the possible index size reduction and the loss of retrieval performance.</p></div>
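The pruning step of Section 3.2 amounts to a set intersection between the peer index and the keys derivable from the query log; a minimal sketch (function names ours):

```python
def prune_peer_index(peer_index, query_log_keys):
    """Query-driven pruning: keep only the peer index keys that also
    appear in the set of term sets extracted from unique logged queries.

    peer_index     -- {term_set(tuple): frequency}
    query_log_keys -- set of term sets from the unique query list
    """
    return {k: f for k, f in peer_index.items() if k in query_log_keys}
```

Only the surviving keys are sent to the broker, which is what reduces both the broker index size and the index traffic.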
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Identifying query result parameters</head><p>Once the broker index has been built, the system is ready to be queried. The broker index contains HDKs with h terms, where h varies from 1 to hmax. In Sophos, hmax is set to 3. Every key has an associated posting list, which contains tuples of collection identifiers and counters. A counter indicates the number of occurrences of a certain key within the corresponding collection. The counter value cannot exceed the term set frequency maximum, tf max , as the key would otherwise have been locally frequent, and new, longer HDKs would have been generated from it as long as the maximum number of terms, hmax, had not yet been reached.</p><p>When a user poses a query with q terms, e.g. abcde with q = 5, the query is first decomposed into query term combinations of length hmax (i.e. abc, abd, . . . , bde, cde). This results in (q choose h) combinations. Each combination is looked up in the broker index. Note that this introduces additional load on the broker, but these lookups do not require network access to the peers. The results of each of those smaller queries are merged by summing the number of occurrences per collection. The sum of all counters, c, has a potential maximum of (q choose h) • tf max . It may happen that this procedure results in few or no collections in which an HDK occurs. The procedure is then repeated for smaller term sets: first, term sets of two terms are looked up in the index. When even this gives too few results, single terms are looked up in the index. In the case that one collection contains two different combinations, e.g. both abc and bce occur in collection X, the numbers of occurrences are summed (this is the c that was just introduced). However, it is also interesting to note that 4 out of 5 query terms can be found in collection X. The number of query terms that can be found using HDKs of a particular length is indicated by n. The different parameters are displayed in Table <ref type="table">1</ref>.</p></div>
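The decomposition and merging step can be sketched as follows; the function name and the dict-based broker index layout are our assumptions:

```python
from itertools import combinations

def lookup(broker_index, query_terms, h):
    """Decompose a query into its length-h term combinations and merge
    the broker's posting lists by summing counters per collection
    (Section 3.3).

    broker_index -- {term_set(tuple): {collection_id: count}}
    Returns {collection_id: (c, n)} where c sums key occurrences and
    n counts the distinct query terms matched in that collection.
    """
    c = {}        # summed counters per collection
    matched = {}  # distinct matched query terms per collection
    for combo in combinations(sorted(set(query_terms)), h):
        for coll, count in broker_index.get(combo, {}).items():
            c[coll] = c.get(coll, 0) + count
            matched.setdefault(coll, set()).update(combo)
    return {coll: (c[coll], len(matched[coll])) for coll in c}
```

The broker would first call this with h = hmax and fall back to h − 1 (and eventually h = 1) when too few collections are found, as described above.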
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">ScoreFunction: ranking query results</head><p>ScoreFunction, given in Formula 2, is a ranking formula that uses the query result parameters to enforce a collection ranking that conforms to our desired ranking properties. It calculates a score s for a collection for a given query:</p><formula xml:id="formula_4">s = log10((h − 1) + (n − 1)/q + c / ((hmax + 1 − h) • C(n, h) • tf max) • α^((q−n)/q)) / hmax<label>(2)</label></formula><p>It consists of three components, one per desired ranking property. The properties, and corresponding components, are listed below in order of importance.</p><p>1. Collections that contain longer HDKs should be ranked higher. Component 1: h − 1.</p><p>2. A collection should be ranked higher if it contains more distinct query terms. Component 2: (n − 1)/q.</p><p>3. More occurrences of query term sets within a collection should result in a higher collection ranking. Component 3:</p><formula xml:id="formula_5">c / ((hmax + 1 − h) • C(n, h) • tf max) • α^((q−n)/q)</formula><p>Here C(n, h) is the binomial coefficient (n choose h). The general intuition behind this component is that a collection is more important when it contains more query terms. This is controlled by a damping factor α.</p><p>The term set counts have less impact when they get closer to tf max , because longer keys would have been generated for a term set in a collection if its frequency had exceeded tf max . We refer to earlier work for a more detailed explanation of ScoreFunction <ref type="bibr" target="#b4">[4]</ref>.</p><p>An important detail to mention about querying and ranking is that collection X can be found after looking for HDKs of length h. When the same collection X is found after looking for HDKs of length h − 1, those results are discarded, as the collection was already found using larger HDKs. Counts c are only compared with other counts that are retrieved after looking for HDKs of the same length. The motivation for this behavior is that smaller HDKs tend to have higher counts.</p><p>Each of the three components has a share in the collection score. The component share of a less desired property never exceeds the smallest possible share of a more desired property's component.</p></div>
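The three components can be combined in code as follows. This is a sketch under one reading of the extracted layout of Formula 2 (the log10 applied to the sum of the three components, divided by hmax); the function name and the damping value α = 0.5 are our assumptions, not values from the paper:

```python
from math import comb, log10

def score(h, n, q, c, h_max=3, tf_max=250, alpha=0.5):
    """Sketch of ScoreFunction (Formula 2):
    s = log10((h - 1) + (n - 1)/q + count_comp) / h_max,
    where count_comp = c / ((h_max + 1 - h) * C(n, h) * tf_max)
                       * alpha ** ((q - n) / q).
    Assumes n >= h (otherwise C(n, h) would be zero).
    """
    count_comp = (c / ((h_max + 1 - h) * comb(n, h) * tf_max)
                  * alpha ** ((q - n) / q))
    return log10((h - 1) + (n - 1) / q + count_comp) / h_max
```

Under this reading the three desired ranking properties hold: a longer matched HDK (larger h) dominates, then the number of distinct matched query terms n, and only then the damped occurrence count c.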
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">EXPERIMENT SETUP</head><p>This section describes the corpus, query set and query logs that were used in the evaluation process, and continues to describe how the collection selection effectiveness of Sophos was measured.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Data collections</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1">WT10G corpus splits</head><p>The Web Track 10GB corpus (WT10g) was developed for the Text REtrieval Conference<ref type="foot" target="#foot_0">1</ref> and consists of 10GB of web pages (1,692,096 documents on 11,680 servers). Compared to regular TREC corpora, WT10g should be better suited for distributed information retrieval experiments, due to the existence of hyperlinks, differences in topics, variation in quality and the presence of duplicates <ref type="bibr" target="#b3">[3,</ref><ref type="bibr" target="#b8">8]</ref>.</p><p>Four splits of the WT10g document corpus were made to study the effect of document corpora on collection selection. Every split is a set of collections; every collection is a set of documents. The numbers 100 and 11512 indicate the number of collections in the corpus split.</p><p>IP Split: Documents are put in collections based on the IP address of the site on which a document resided. This results in 11,512 collections.</p><p>IP Merge 100: A cluster is created by grouping up to 116 collections, which results in 100 collections. Grouping is done in order of IP address. This split simulates the scenario of indexing search engines that each index many servers.</p><p>Random 100 and Random 11512: Two random splits, with 100 collections and with 11,512 collections, were created. Documents were randomly assigned to a collection. The number of 11,512 collections was chosen to be able to compare a random split with the IP Split.</p><p>The number of documents per collection is relatively constant in the random splits, but varies in the IP based collections from fewer than 10 to more than 1000 documents. This is typical for the size of sites on the Internet, where the number of documents per server follows a Zipf distribution <ref type="bibr" target="#b1">[1]</ref>. The IP based splits show signs of conformance to a Zipf distribution <ref type="bibr" target="#b4">[4]</ref>. 
This means that the IP based splits can be compared to the Internet in terms of distribution of the number of documents and the sizes of the collections.</p><p>The merit of a collection is the number of relevant documents in a collection for a particular query. The IP based corpus splits have a larger deviation in merit among the collections. This contrasts with random splits, which by approximation have equal merit for each collection <ref type="bibr" target="#b4">[4]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.2">WT10g retrieval tasks</head><p>Fifty WT10g informational 'ad-hoc' queries were used for evaluation (query numbers 501-550). The queries have a query number, title, description and a narrative description of the result that is considered relevant. The title is a small set of words, which was used as the query text. The narrative descriptions were used by humans to assign relevance judgments to documents. The relevance judgments can be used to count the number of relevant documents in the collections, which in turn can be used to measure collection selection effectiveness.</p><p>There are three types of relevance judgments: not relevant (0), relevant (1) and authoritative (2). There can be multiple authoritative documents in the document corpus for a query, but for some queries there are no authoritative documents. All authoritative judgments are converted to 1, so documents are either relevant or not relevant. This allows for evaluation of the collected merit.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.3">Query logs</head><p>AOL published a query log with 21,011,340 queries <ref type="bibr" target="#b19">[19]</ref>. The log has been anonymized and consists of several data fields: the actual query issued, the query submission date and time, and an anonymous user ID number. The release of the anonymized data set was controversial at the time, because it proved possible to link an ID to a real person. To respect the anonymity of the users, we used a stripped version of the query log that only contains the actual queries issued, in random order (and none of the other metadata).</p><p>We also used two query logs published by Excite<ref type="foot" target="#foot_1">2</ref> that were stripped in the same way. One query log, from 1997, contains 1,025,907 queries; another, from 1999, contains 1,777,251 queries.</p><p>Finally, a fourth query log, with 3,512,320 unique queries, was created by removing all queries from the AOL query log that were issued only once. This query log will be referred to as AOL2. The other logs are called AOL, Excite97 and Excite99.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Method</head><p>To evaluate the performance of our collection selection system, we adopted the precision and recall metrics for collection selection as defined by Gravano et al. <ref type="bibr" target="#b13">[13]</ref>. We start by obtaining the retrieval system ranking (S) for a query, which contains up to 1,000 collections. We also create the best ranking (B), which is the best possible ranking for a particular query; collections are ranked by their amount of merit with respect to the query.</p><p>Given query q and collection ci, the merit within a collection can be expressed as merit(q, ci). The merit of the i-th ranked collection in rankings S and B is given by Si = merit(q, csi) and Bi = merit(q, cbi).</p><p>The recall obtained after selecting n collections can be calculated by dividing the merit gathered by the retrieval system by that of the best possible ranking:</p><formula xml:id="formula_6">Rn = (Σ i=1..n Si) / (Σ i=1..n Bi)<label>(3)</label></formula><p>Precision Pn is the fraction of the top n collections that have non-zero merit:</p><formula xml:id="formula_7">Pn = |{sc ∈ Top n (S) | merit(q, sc) &gt; 0}| / |Top n (S)|<label>(4)</label></formula><p>The precision and recall obtained by Sophos are compared to the collection selection results from a baseline of Language Modeling with Dirichlet Smoothing (lmds) as implemented by indri. The baseline system has one parameter, µ, which is set to 2500. Krovetz word stemming is applied to the collections.</p><p>Section 3 introduced Sophos and its parameters. There are three parameters for peers: the maximum key length hmax, the maximum frequency before a key is called frequent (tf max ) and the window size ws. Based on numbers used by Luu et al. <ref type="bibr" target="#b15">[15]</ref>, we use the following settings: tf max = {250, 500, 750} and ws = {6, 12, 18}. The average query length on the Internet is 2.3 <ref type="bibr" target="#b14">[14,</ref><ref type="bibr" target="#b30">30]</ref>. 
We use this observation to set hmax to 3. Setting it smaller would require many intersections of term sets (with associated collections) to answer queries. Setting it larger would result in many term sets that are rarely queried for. For the broker, the collection maximum cm is tested with the values 5, 10, 20 and 50.</p><p>To evaluate Sophos, the collections are processed by removing stop words, which drastically reduces the number of worthless keys and speeds up term set generation, and by applying Porter word stemming.</p><p>Finally, we will look at the precision and recall with query-driven indexing using four different query logs, by pruning from the peer indices the keys that do not occur in a query log. At the same time, we will look at the number of term sets (keys) in the broker index to get an idea of its size.</p></div>
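The evaluation metrics of Formulas 3 and 4 can be sketched directly; function names are ours, and the inputs are assumed to be merit values in ranked order:

```python
def recall_at_n(system_merit, best_merit, n):
    """R_n (Formula 3): merit gathered in the top-n collections of the
    system ranking S, relative to the best possible ranking B.

    system_merit, best_merit -- merit values in ranked order
    """
    return sum(system_merit[:n]) / sum(best_merit[:n])

def precision_at_n(system_merit, n):
    """P_n (Formula 4): fraction of the top-n selected collections
    that have non-zero merit."""
    top = system_merit[:n]
    return sum(1 for m in top if m > 0) / len(top)
```

For example, if a system's top-2 collections hold merits [3, 0] while the best possible top-2 hold [5, 3], then R_2 = 3/8, regardless of how much merit lies further down the ranking.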
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Index size</head><p>Figure <ref type="figure" target="#fig_2">3</ref> shows the number of collection identifier counters in the broker index for different settings when indexing the IP split. Spread over single, double and triple term set counters, the number of counters is a good indication of the actual broker index size. The figure shows that the number of counters decreases as the term set frequency maximum is increased. A more substantial reduction of collection identifier counters - roughly 70% - can be achieved with query-driven indexing, as shown in Figure <ref type="figure" target="#fig_4">4</ref>, which plots the number of collection identifier counters after indexing the Random 11,512 corpus split with and without query-driven indexing. Figure <ref type="figure" target="#fig_5">5</ref> depicts the index size reduction obtainable with different query logs. The Excite query logs contain significantly fewer query term sets than the AOL query log, so more keys are pruned from the peer indices, resulting in a smaller broker index: using the Excite query logs instead of the standard AOL query log can result in roughly 75% fewer collection identifier counters.</p></div>
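The pruning step behind these reductions can be sketched as follows. This is a simplified illustration under assumed data structures (a dict mapping keys, i.e. term sets, to collection postings), not the Sophos implementation:

```python
from itertools import combinations

def prune_peer_index(peer_index, query_log, h_max=3):
    # Query-driven indexing: keep only keys (term sets of up to h_max
    # terms) that occur as a term subset of some logged query; all other
    # keys are pruned before being sent to the broker.
    queried = set()
    for query in query_log:
        terms = sorted(set(query.lower().split()))
        for k in range(1, h_max + 1):
            for combo in combinations(terms, k):
                queried.add(frozenset(combo))
    return {key: postings for key, postings in peer_index.items()
            if frozenset(key) in queried}

# Hypothetical peer index: keys are term tuples, values are collection IDs.
index = {("apple",): [1, 4], ("apple", "pie"): [4], ("zebra",): [7]}
pruned = prune_peer_index(index, ["apple pie recipe"])
print(sorted(pruned))   # [('apple',), ('apple', 'pie')] - 'zebra' is pruned
```

A smaller query log admits fewer term sets into `queried`, which is why the Excite logs yield a smaller broker index than the AOL log in the figures above.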
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Collection selection performance</head><p>Table <ref type="table" target="#tab_1">2</ref> shows the average precision and recall over 50 queries, calculated with Formulas 3 and 4. The numbers are given for the four corpus splits on which the baseline (lmds) and Sophos were tested. Sophos was run with the following settings: query-driven indexing with the AOL query log, tf_max = 250, cm = 20 and ws = 6. Due to memory constraints, we were unable to run Sophos with a window size larger than 6. The table shows that the baseline outperforms Sophos on the Random 11,512 corpus split, whereas Sophos outperforms the baseline on the IP split. This is shown in more detail in Figure <ref type="figure" target="#fig_7">6</ref>, which plots the average recall of Sophos and the baseline after selecting n collections; here Sophos was tested with different query logs, tf_max = 250 and cm = 50. Interestingly, pruning with the smallest Excite query logs yields the best recall, possibly because those queries were logged around the time the corpus was crawled.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">CONCLUSIONS</head><p>We introduced the collection selection system Sophos, which uses highly discriminative keys in peer indices to construct a broker index. The broker index contains keys that are good discriminators for selecting collections (or peers). To limit the number of keys transferred to the broker, and to reduce the broker index size, we employed query-driven indexing to store only the keys that users actually query for. Similar studies were performed for document retrieval <ref type="bibr" target="#b15">[15]</ref>, but to the best of our knowledge, we are the first to apply this approach to collection selection.</p><p>Precision and recall were measured using 50 queries on the WT10g TREC test corpus and compared to a state-of-the-art baseline that uses language modeling with Dirichlet smoothing (lmds). The results showed that our system outperformed the baseline on the IP split. This is promising, because the IP-based splits most closely resemble the information structure of the Internet. lmds was better able to select information in the random splits, because it stores all available information about the collections. In random splits, relevant documents (and their corresponding terms) are spread over many collections, making it hard for our approach to select highly discriminative keys that can single out collections with relevant documents.</p><p>Query-driven indexing is required to keep the broker index size manageable; a 70% index size reduction can be obtained by pruning keys with the AOL query log, and a further 75% reduction is possible with a smaller query log. Our results on the IP split showed that pruning with the smaller Excite query logs resulted in higher precision and recall than with the AOL query logs. The use of any query log resulted in higher precision and recall than the baseline. This motivates further research into more advanced query-driven indexing strategies, such as those described by Skobeltsyn <ref type="bibr" target="#b32">[32]</ref>, to further reduce the index size while improving performance. 
It would also be interesting to make tf_max dependent on the collection size.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Peers are accessible via a broker to answer queries from clients</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: General overview of the indexing and collection selection system Sophos</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Number of collection ID counters with IP Split.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Number of collection identifier counters stored per QDI strategy for Random 11512 corpus split</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Number of collection identifier counters stored per QDI strategy for IP Split</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head></head><label></label><figDesc>Legend of Figure 6: LMDS; Sophos with QDI AOL / AOL2 / Excite 97 / Excite 99 (tf250, cm50, ws6); x-axis: number of selected collections</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Recall of different collections selection methods on IP Split</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>Workshop. July 2009. Boston, USA.</figDesc><table><row><cell></cell><cell>1. Query</cell><cell>Client</cell><cell cols="2">4. Merged result set</cell></row><row><cell></cell><cell>3. Result</cell><cell></cell><cell>3. Result</cell><cell></cell></row><row><cell></cell><cell>2. Query</cell><cell></cell><cell>2. Query</cell><cell></cell></row><row><cell></cell><cell></cell><cell>Broker</cell><cell></cell><cell></cell></row><row><cell>Peer 1</cell><cell>Peer 2</cell><cell>Peer i</cell><cell>Peer n-1</cell><cell>Peer n</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Average precision and recall over 50 queries after selecting n collections, high scores shown in bold</figDesc><table><row><cell>Corpus split</cell><cell>Collection selection method</cell><cell>P@1</cell><cell>P@10</cell><cell>P@20</cell><cell>P@50</cell><cell>R@1</cell><cell>R@10</cell><cell>R@20</cell><cell>R@50</cell></row><row><cell>Random 100</cell><cell>lmds</cell><cell>0.290</cell><cell>0.251</cell><cell>0.249</cell><cell>0.217</cell><cell>0.314</cell><cell>0.435</cell><cell>0.526</cell><cell>0.683</cell></row><row><cell></cell><cell>Sophos QDI AOL tf250 cm20</cell><cell>0.330</cell><cell>0.254</cell><cell>0.237</cell><cell>0.208</cell><cell>0.379</cell><cell>0.436</cell><cell>0.493</cell><cell>0.644</cell></row><row><cell>Random 11512</cell><cell>lmds</cell><cell>0.140</cell><cell>0.096</cell><cell>0.084</cell><cell>0.069</cell><cell>0.233</cell><cell>0.196</cell><cell>0.186</cell><cell>0.198</cell></row><row><cell></cell><cell>Sophos QDI AOL tf250 cm20</cell><cell>0.040</cell><cell>0.036</cell><cell>0.037</cell><cell>0.026</cell><cell>0.067</cell><cell>0.073</cell><cell>0.082</cell><cell>0.073</cell></row><row><cell>IP Merge 100</cell><cell>lmds</cell><cell>0.280</cell><cell>0.202</cell><cell>0.188</cell><cell>0.154</cell><cell>0.489</cell><cell>0.489</cell><cell>0.567</cell><cell>0.755</cell></row><row><cell></cell><cell>Sophos QDI AOL tf250 cm20</cell><cell>0.300</cell><cell>0.254</cell><cell>0.214</cell><cell>0.160</cell><cell>0.211</cell><cell>0.485</cell><cell>0.626</cell><cell>0.825</cell></row><row><cell>IP Split</cell><cell>lmds</cell><cell>0.170</cell><cell>0.110</cell><cell>0.083</cell><cell>0.056</cell><cell>0.070</cell><cell>0.149</cell><cell>0.183</cell><cell>0.289</cell></row><row><cell></cell><cell>Sophos QDI AOL tf250 cm20</cell><cell>0.170</cell><cell>0.147</cell><cell>0.121</cell><cell>0.091</cell><cell>0.280</cell><cell>0.466</cell><cell>0.466</cell><cell>0.548</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://trec.nist.gov/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://www.excite.com/</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>We are grateful to the Yahoo Research Faculty Grant programme and to the Netherlands Organisation for Scientific Research (NWO, Grant 639.022.809) for supporting this work. We would like to thank Berkant Barla Cambazoglu and the anonymous reviewers for their valuable comments that improved this paper.</p></div>
			</div>

			<div type="references">

				<listBibl>


<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Zipf&apos;s law and the internet</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">A</forename><surname>Adamic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">A</forename><surname>Huberman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Glottometrics</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="143" to="150" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Document representation and query expansion models for blog recommendation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Arguello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Elsas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Callan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">G</forename><surname>Carbonell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 2nd Intl. Conf. on Weblogs and Social Media (ICWSM)</title>
				<meeting>of the 2nd Intl. Conf. on Weblogs and Social Media (ICWSM)</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Engineering a multi-purpose test collection for web retrieval experiments</title>
		<author>
			<persName><forename type="first">P</forename><surname>Bailey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Craswell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hawking</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Inf. Process. Manage</title>
		<imprint>
			<biblScope unit="volume">39</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="853" to="871" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Collection selection for distributed web search</title>
		<author>
			<persName><forename type="first">S</forename><surname>Bockting</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009-02">Feb. 2009</date>
		</imprint>
		<respStmt>
			<orgName>University of Twente</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Master&apos;s thesis</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Distributed information retrieval</title>
		<author>
			<persName><forename type="first">J</forename><surname>Callan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Information Retrieval</title>
				<imprint>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page" from="127" to="150" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">The INQUERY retrieval system</title>
		<author>
			<persName><forename type="first">J</forename><surname>Callan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Harding</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 3rd International Conference on Database and Expert Systems Applications</title>
				<meeting>of the 3rd International Conference on Database and Expert Systems Applications<address><addrLine>Valencia, Spain</addrLine></address></meeting>
		<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="1992">1992</date>
			<biblScope unit="page" from="78" to="83" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Searching distributed collections with inference networks</title>
		<author>
			<persName><forename type="first">J</forename><surname>Callan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGIR &apos;95: Proc. of the 18th annual international ACM SIGIR conference on Research and development in information retrieval</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="1995">1995</date>
			<biblScope unit="page" from="21" to="28" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Is it fair to evaluate web systems using TREC ad hoc methods?</title>
		<author>
			<persName><forename type="first">N</forename><surname>Craswell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bailey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hawking</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workshop on Web Evaluation. SIGIR&apos;99</title>
				<imprint>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Server selection on the world wide web</title>
		<author>
			<persName><forename type="first">N</forename><surname>Craswell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bailey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hawking</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 5th ACM conference on Digital libraries</title>
				<meeting>of the 5th ACM conference on Digital libraries<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page" from="37" to="46" />
		</imprint>
	</monogr>
	<note>DL &apos;00</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">A comparison of techniques for selecting text collections</title>
		<author>
			<persName><forename type="first">D</forename><surname>D&apos;Souza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Thom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zobel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ADC &apos;00: Proceedings of the Australasian Database Conference</title>
				<meeting><address><addrLine>Washington, DC, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Collection selection for managed distributed document databases</title>
		<author>
			<persName><forename type="first">D</forename><surname>D&apos;Souza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Thom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zobel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Processing &amp; Management</title>
		<imprint>
			<biblScope unit="volume">40</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="527" to="546" />
			<date type="published" when="2004-05">May 2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Is CORI effective for collection selection? An exploration of parameters, queries, and data</title>
		<author>
			<persName><forename type="first">D</forename><surname>D&apos;Souza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zobel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Thom</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 9th Australasian Document Computing Symposium</title>
				<meeting>of the 9th Australasian Document Computing Symposium</meeting>
		<imprint>
			<date type="published" when="2004-12">Dec. 2004</date>
			<biblScope unit="page" from="41" to="46" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Generalizing GlOSS to vector-space databases and broker hierarchies</title>
		<author>
			<persName><forename type="first">L</forename><surname>Gravano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Garcia-Molina</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Very Large Databases</title>
				<imprint>
			<publisher>VLDB</publisher>
			<date type="published" when="1995">1995</date>
			<biblScope unit="page" from="78" to="89" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Patterns of search: analyzing and modeling web query refinement</title>
		<author>
			<persName><forename type="first">T</forename><surname>Lau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Horvitz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">UM &apos;99: Proc. of the 7th international conference on User modeling</title>
				<meeting><address><addrLine>Secaucus, NJ, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="119" to="128" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Using highly discriminative keys for indexing in a peer-to-peer full-text retrieval system</title>
		<author>
			<persName><forename type="first">T</forename><surname>Luu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Klemm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rajman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Aberer</surname></persName>
		</author>
		<idno>TR: 2005041</idno>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
		<respStmt>
			<orgName>EPFL Lausanne</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical report</note>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">How much information</title>
		<author>
			<persName><forename type="first">P</forename><surname>Lyman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">R</forename><surname>Varian</surname></persName>
		</author>
		<ptr target="http://www.sims.berkeley.edu/how-much-info-2003" />
		<imprint>
			<date type="published" when="2003-04-23">2003. April 23, 2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Indri retrieval model overview</title>
		<author>
			<persName><forename type="first">D</forename><surname>Metzler</surname></persName>
		</author>
		<ptr target="http://ciir.cs.umass.edu/~metzler/indriretmodel.html" />
		<imprint>
			<date type="published" when="2005-07">July 2005. January 20, 2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Combining the language model and inference network approaches to retrieval</title>
		<author>
			<persName><forename type="first">D</forename><surname>Metzler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Processing and Management</title>
		<imprint>
			<biblScope unit="volume">40</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="735" to="750" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">A picture of search</title>
		<author>
			<persName><forename type="first">G</forename><surname>Pass</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chowdhury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Torgeson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">InfoScale &apos;06: Proc. of the 1st international conference on Scalable information systems</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Scalable peer-to-peer web retrieval with highly discriminative keys</title>
		<author>
			<persName><forename type="first">I</forename><surname>Podnar Zarko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rajman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Luu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Klemm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Aberer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE 23rd International Conference on Data Engineering (ICDE)</title>
				<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Comparing the performance of collection selection algorithms</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">L</forename><surname>Powell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>French</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Trans. Inf. Syst</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="412" to="456" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Query-driven document partitioning and collection selection</title>
		<author>
			<persName><forename type="first">D</forename><surname>Puppin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Silvestri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Laforenza</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">InfoScale &apos;06: Proc. of the 1st international conference on Scalable information systems</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Okapi at TREC-3</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Robertson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Walker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">M</forename><surname>Hancock-Beaulieu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gatford</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">TREC-3 Proc</title>
				<imprint>
			<date type="published" when="1995">1995</date>
			<biblScope unit="page" from="109" to="126" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">Introduction to Modern Information Retrieval</title>
		<author>
			<persName><forename type="first">G</forename><surname>Salton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Mcgill</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1986">1986</date>
			<publisher>McGraw-Hill, Inc</publisher>
			<pubPlace>New York, NY, USA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Query based site selection for distributed search engines</title>
		<author>
			<persName><forename type="first">N</forename><surname>Sato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Udagawa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Uehara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sakai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Mori</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Distributed Computing Systems Workshops</title>
				<imprint>
			<date type="published" when="2003">2003. 2003</date>
			<biblScope unit="page" from="556" to="561" />
		</imprint>
	</monogr>
	<note>Proc. of the 23rd International Conference</note>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">UMass at TREC 2007 blog distillation task</title>
		<author>
			<persName><forename type="first">J</forename><surname>Seo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 2008 Text REtrieval Conference</title>
				<meeting>of the 2008 Text REtrieval Conference</meeting>
		<imprint>
			<publisher>NIST</publisher>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Using query logs to establish vocabularies in distributed information retrieval</title>
		<author>
			<persName><forename type="first">M</forename><surname>Shokouhi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zobel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tahaghoghi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Scholer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Processing and Management</title>
		<imprint>
			<biblScope unit="volume">43</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="169" to="180" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Relevant document distribution estimation method for resource selection</title>
		<author>
			<persName><forename type="first">L</forename><surname>Si</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Callan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGIR &apos;03: Proc. of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="298" to="305" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">A language modeling framework for resource selection and results merging</title>
		<author>
			<persName><forename type="first">L</forename><surname>Si</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Callan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ogilvie</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 11th international conference on Information and knowledge management</title>
				<meeting>of the 11th international conference on Information and knowledge management</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="391" to="397" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Analysis of a very large web search engine query log</title>
		<author>
			<persName><forename type="first">C</forename><surname>Silverstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Marais</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Henzinger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Moricz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIGIR Forum</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="6" to="12" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<title level="m" type="main">Query-driven indexing in large-scale distributed systems</title>
		<author>
			<persName><forename type="first">G</forename><surname>Skobeltsyn</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
			<pubPlace>Lausanne</pubPlace>
		</imprint>
		<respStmt>
			<orgName>EPFL</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">PhD thesis</note>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Query-driven indexing for scalable peer-to-peer text retrieval</title>
		<author>
			<persName><forename type="first">G</forename><surname>Skobeltsyn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Luu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Podnar Zarko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rajman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Aberer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Future Generation Computer Systems</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="89" to="99" />
			<date type="published" when="2009-06">June 2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Indri: A language model-based search engine for complex queries</title>
		<author>
			<persName><forename type="first">T</forename><surname>Strohman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Metzler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Turtle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the International Conference on Intelligence Analysis</title>
		<meeting>of the International Conference on Intelligence Analysis</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Peer-to-peer information retrieval using self-organizing semantic overlay networks</title>
		<author>
			<persName><forename type="first">C</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dwarkadas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications</title>
		<meeting>of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications</meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="175" to="186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Computing PageRank in a distributed internet search system</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>DeWitt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the International Conference on Very Large Databases (VLDB)</title>
		<meeting>of the International Conference on Very Large Databases (VLDB)</meeting>
		<imprint>
			<date type="published" when="2004-08">Aug. 2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<monogr>
		<title level="m" type="main">Managing Gigabytes: Compressing and Indexing Documents and Images</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">H</forename><surname>Witten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Moffat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">C</forename><surname>Bell</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1999">1999</date>
			<publisher>Morgan Kaufmann</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">Cluster-based language models for distributed retrieval</title>
		<author>
			<persName><forename type="first">J</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval</title>
		<meeting>of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval</meeting>
		<imprint>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="254" to="261" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">Server ranking for distributed text retrieval systems on the internet</title>
		<author>
			<persName><forename type="first">B</forename><surname>Yuwono</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">L</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 5th International Conference on Database Systems for Advanced Applications (DASFAA)</title>
		<meeting>of the 5th International Conference on Database Systems for Advanced Applications (DASFAA)</meeting>
		<imprint>
			<publisher>World Scientific Press</publisher>
			<date type="published" when="1997">1997</date>
			<biblScope unit="page" from="41" to="50" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">A study of smoothing methods for language models applied to ad hoc information retrieval</title>
		<author>
			<persName><forename type="first">C</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lafferty</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGIR &apos;01: Proc. of the 24th annual international ACM SIGIR conference on Research and development in information retrieval</title>
		<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="334" to="342" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
