8th Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS-IR'10)

Query-Based Sampling using Snippets

Almer S. Tigelaar
Database Group, University of Twente, Enschede, The Netherlands
a.s.tigelaar@cs.utwente.nl

Djoerd Hiemstra
Database Group, University of Twente, Enschede, The Netherlands
hiemstra@cs.utwente.nl

ABSTRACT
Query-based sampling is a commonly used approach to model the content of servers. Conventionally, queries are sent to a server and the documents in the returned search results are downloaded in full as a representation of the server's content. We present an approach that instead uses the document snippets in the search results as samples, avoiding downloading the entire documents. We show that this yields equal or better modelling performance for the same bandwidth consumption, depending on collection characteristics such as document length distribution and homogeneity. Query-based sampling using snippets is a useful approach for real-world systems, since it requires no extra operations beyond exchanging queries and search results.

[Figure 1: Example snippets. From top to bottom: each snippet consists of an underlined title, a two-line summary and a link.]

1. INTRODUCTION
Query-based sampling is a technique for obtaining a resource description of a search server. This description is based on the downloaded content of a small subset of documents the server returns in response to queries [8]. We present an approach that requires no additional downloading beyond the returned results, but instead relies solely on information returned as part of the results: the snippets.

Knowing which server offers which content allows a central server to forward queries to the most suitable server for handling a query. This task is commonly referred to as resource selection [6]. Selection is based on a representation of the content of a server: a resource description. Most servers on the web are uncooperative and do not provide such a description; query-based sampling therefore exploits only the native search functionality provided by such servers.

In conventional query-based sampling, the first step is sending a query to a server. The server returns a ranked list of results, of which the top N most relevant documents are downloaded and used to build a resource description. Queries are chosen randomly, the first from an external resource and subsequent queries from the description built so far. This repeats until a stopping criterion is reached [7, 8].

Disadvantages of downloading entire documents are that it consumes more bandwidth, is impossible if servers do not return full documents, and does not work when the full documents themselves are non-text: multimedia with short summary descriptions. In contrast, some data always comes along 'for free' in the returned search results: the snippets. A snippet is a short piece of text consisting of a document title, a short summary and a link, as shown in Figure 1. A summary can be either dynamically generated in response to a query or statically defined [16, p. 157]. We postulate that these snippets can also be used for query-based sampling to build a language model. This way we can avoid downloading entire documents and thus reduce bandwidth usage, and we can cope with servers that return only search results or that contain multimedia content. However, since snippets are small, we need to see many of them, which means sending more queries than in the full-document approach. While this increases the query load on the remote servers, it is an advantage for live systems that need to sample from document collections that change over time, since it allows continuously updating the language model based on the results of live queries.
Whether the documents returned in response to random queries are a truly random part of the underlying collection is doubtful. Servers have a propensity to return documents that users indicate as important, and the number of in-links correlates substantially with this importance [1]. This may not be a problem, as it is preferable to know only the language model represented by these important documents, since users are likely to look for those [3]. Recent work [5] focuses on obtaining uniform random samples from large search engines in order to estimate their size and overlap. Others [20] have evaluated this in the context of obtaining resource descriptions and found that it does not consistently work well across collections.

The foundational work on acquiring resource descriptions via query-based sampling was done by Callan et al. [7, 8]. They show that a small sample of several hundred documents can be used to obtain a good-quality resource description of large collections consisting of hundreds of thousands of documents. The test collection used in their research, TREC123, is not a web data collection. While this initially casts doubt on the applicability of query-based sampling to the web, Monroe et al. [18] show that it also works very well for web data.

The approach we take has some similarities with prior research by Paltoglou et al. [19]. They show that downloading only a part of a document can also yield good modelling performance. However, they download the first two to three kilobytes of each document in the result list, whereas we use small snippets and thus avoid any extra downloading beyond the search results.

Our main research question is:

"How does query-based sampling using only snippets compare to downloading full documents in terms of the learned language model?"

We show that query-based sampling using snippets offers performance similar to using full documents, while consuming less bandwidth and enabling constant updating of the resource description at no extra cost.
Additionally, we introduce a new metric for comparing language models in the context of resource descriptions and a method to establish the homogeneity of a corpus.

We describe our experimental setup in Section 2. Section 3 presents the results. Finally, the paper concludes with Sections 4 and 5.

2. METHODOLOGY
In our experimental set-up we have one remote server whose content we wish to estimate by sampling. This server can only take queries and return search results. For each document a title, snippet and download link is returned. These results are used to locally build a resource description in the form of a vocabulary with frequency information, also called a language model [7]. The act of submitting a query to the remote server, obtaining search results, updating the local language model and calculating values for the evaluation metrics is called an iteration. An iteration consists of the following steps (a brief sketch of this loop follows the list):

1. Pick a one-term query.
   (a) In the first iteration our local language model is empty and has no terms. In this case we pick a random term from an external resource as query.
   (b) In subsequent iterations we pick a random term from our local language model that we have not previously submitted as query.

2. Send the query to the remote server, requesting a maximum number of results (n = 10). In our set-up, the maximum length of the document summaries may be no more than 2 fragments of 90 characters each (s ≤ 2 · 90).

3. Update the resource description using the returned results (1 ≤ n ≤ 10).
   (a) For the full-document strategy: download all the returned documents and use all their content to update the local language model.
   (b) For the snippet strategy: use the snippet of each document in the search results to update the local language model. If a document appears multiple times in search results, use its snippet only if it differs from previously seen snippets of that document.

4. Evaluate the iteration by comparing the unstemmed language model of the remote server with the local model (see the metrics described in Section 2.2).

5. Terminate if a stopping criterion has been reached, otherwise go to step 1.
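To make the iteration above concrete, here is a minimal Python sketch of the sampling loop. It is not the implementation used in our experiments: the search(term, n) callable, which stands in for the remote server's search interface and is assumed to return (document id, snippet, full text) tuples, the bootstrap_terms list and the whitespace tokenisation are all hypothetical simplifications.

```python
import random
from collections import Counter

def sample_language_model(search, bootstrap_terms, use_snippets=True,
                          max_iterations=1000, results_per_query=10):
    """Build a local language model of a remote server by query-based sampling.

    `search(term, n)` is assumed to return up to `n` (doc_id, snippet, full_text)
    tuples, mimicking a result list; it is a placeholder for the remote server.
    """
    model = Counter()       # local language model: term -> frequency
    submitted = set()       # queries already sent (step 1b)
    seen_snippets = {}      # doc_id -> snippets already used (step 3b)

    for _ in range(max_iterations):
        # Step 1: pick a one-term query (1a: bootstrap term on the first
        # iteration, 1b: afterwards an unsubmitted term from the local model).
        unseen = [t for t in model if t not in submitted]
        query = random.choice(unseen if unseen else list(bootstrap_terms))
        submitted.add(query)

        # Step 2: send the query, requesting at most `results_per_query` results.
        results = search(query, results_per_query)

        # Step 3: update the resource description.
        for doc_id, snippet, full_text in results:
            if use_snippets:
                # Step 3b: only count a snippet not yet seen for this document.
                if snippet in seen_snippets.setdefault(doc_id, set()):
                    continue
                seen_snippets[doc_id].add(snippet)
                model.update(snippet.lower().split())   # simplistic tokenisation
            else:
                # Step 3a: full-document strategy.
                model.update(full_text.lower().split())

        # Steps 4 and 5 (evaluation and stopping criterion) are omitted here.
    return model
```

The only difference between the two strategies is which text feeds the model in step 3; this is why the snippet variant needs no connections or downloads beyond the search results themselves.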
Since the snippet approach uses the title and summary of each document returned in the search results, the way in which the summary is generated affects performance. Our simulation environment uses Apache Lucene, which generates keyword-in-context document summaries [16, p. 158]. These summaries are constructed from the words surrounding a query term in a document, without taking sentence boundaries into account. For all experiments the summaries consisted of two keyword-in-context segments of at most ninety characters. This length boundary is similar to the one modern web search engines use to generate their summaries. One might be tempted to believe that snippets are biased because they commonly also contain the query terms. However, in full-document sampling the returned documents also contain the query and have a similar bias, although mitigated by document length.

2.1 Data sets
We used the following data sets to conduct our tests:

OANC-1.1: The Open American National Corpus: a heterogeneous collection. We use it exclusively for selecting bootstrap terms [14].

TREC123: A heterogeneous collection consisting of TREC Volumes 1–3. Contains short newspaper and magazine articles, scientific abstracts, and government documents [12]. Used in previous experiments by Callan et al. [7].

WT2G: Web Track 2G: a small subset of the Very Large Corpus web crawl conducted in 1997 [13].

WIKIL: The large Memory Alpha Wiki. http://memory-alpha.org

WIKIM: The medium-sized Fallout Wiki. http://fallout.wikia.com

The OANC is used as external resource to select a bootstrap term on the first iteration: we pick a random term out of the top 25 most frequent terms (excluding stop words). TREC123 is included for comparison with Callan's work [7]. WT2G is a representative subset of the web. It has some deficiencies, such as missing inter-server links [2], but since we use only the page data, this is not a major problem for this experiment.

Our experiment is part of a scenario in which many sites offer searchable content. With this in mind, using larger monolithic collections, like ClueWeb, offers little extra insight. After all, there are relatively few websites that provide gigabytes or terabytes of information, whereas there is a long tail that offers smaller amounts. For this reason we have included two Wiki collections in our tests: WIKIL and WIKIM. All Wiki collections were obtained from Wikia on October 5th, 2009. Wikis contain many pages in addition to normal content pages. However, we index only content pages, which is the reason the raw sizes of these corpora are bigger than the indices.

Table 1 shows some properties of the data sets. We have also included Figure 2, which shows a kernel density plot of the size distributions of the collections [21]. We see that WT2G has a more gradual distribution of document lengths, whereas TREC123 shows a sharper decline near two kilobytes. Both collections consist primarily of many small documents. This is also true for the Wiki collections. Especially the WIKIL collection has many very small documents.

Table 1: Properties of the data sets used.

Name      Raw    Index   #Docs       #Terms         #Unique
OANC      97M    117M    8,824       14,567,719     176,691
TREC123   2.6G   3.5G    1,078,166   432,134,562    969,061
WT2G      1.6G   2.1G    247,413     247,833,426    1,545,707
WIKIL     163M   84M     30,006      9,507,759      108,712
WIKIM     58M    25M     6,821       3,003,418      56,330

[Figure 2: Kernel density plot of document lengths up to 10 kilobytes for each collection.]

2.2 Metrics
Evaluation is done by comparing the complete remote language model with the subset local language model each iteration. We discard stop words and compare terms unstemmed. Various metrics exist to conduct this comparison. For comparability with earlier work we use two existing metrics and introduce one new metric in this context: the Jensen-Shannon Divergence (JSD), which we believe is a better choice than the others for the reasons outlined below.

We first discuss the Collection Term Frequency (CTF) ratio. This metric expresses the coverage of the terms of the locally learned language model as a ratio of the terms of the actual remote model. It is defined as follows [8]:

  CTF_{ratio}(T, \hat{T}) = \frac{1}{\alpha} \sum_{t \in \hat{T}} CTF(t, T)    (1)

where T is the actual model and \hat{T} the learned model. The CTF function returns the number of times a term t occurs in the given model. The symbol \alpha represents the sum of the CTF of all terms in the actual model T, which is simply the number of tokens in T. The higher the CTF ratio, the more of the important terms have been found.

The Kullback-Leibler Divergence (KLD) gives an indication of the extent to which two probability models, in this case our local and remote language models, will produce the same predictions. The output is the number of additional bits it would take to encode one model into the other. It is defined as follows [16, p. 231]:

  KLD(T \| \hat{T}) = \sum_{t \in T} P(t \mid T) \cdot \log \frac{P(t \mid T)}{P(t \mid \hat{T})}    (2)

where \hat{T} is the learned model and T the actual model. KLD has several disadvantages. Firstly, if a term occurs in one model but not in the other, it produces zero or infinite values. We therefore apply Laplace smoothing, which simply adds one to all counts of the learned model \hat{T}. This ensures that each term in the remote model exists at least once in the local model, thereby avoiding divisions by zero [3]. Secondly, the KLD is asymmetric, which is expressed using the double bar notation. Manning [17, p. 304] argues that using the Jensen-Shannon Divergence (JSD) solves both problems. It is defined in terms of the KLD as [9]:

  JSD(T, \hat{T}) = KLD(T \| \frac{T + \hat{T}}{2}) + KLD(\hat{T} \| \frac{T + \hat{T}}{2})    (3)

The Jensen-Shannon Divergence expresses how much information is lost if we describe two distributions with their average distribution. This average distribution is formed by summing the counts for each term that occurs in either model and dividing by two. Using the average is a form of smoothing which, in contrast with the KLD, avoids changing the original counts. Other differences with the KLD are that the JSD is symmetric and finite. Conveniently, when using a logarithm of base 2 in the underlying KLD, the JSD ranges from 0.0 for identical distributions to 2.0 for maximally different distributions.
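To illustrate Equations (1)–(3), the following sketch computes the three metrics from two term-frequency models represented as dictionaries mapping terms to counts. It mirrors the definitions above, including add-one (Laplace) smoothing of the learned model for the KLD and base-2 logarithms so that the JSD ranges from 0.0 to 2.0, but it is only an illustrative sketch, not the evaluation code used for the experiments.

```python
import math
from collections import Counter

def ctf_ratio(actual, learned):
    """Equation (1): fraction of the actual model's tokens covered by the
    terms present in the learned model."""
    alpha = sum(actual.values())                       # total tokens in T
    covered = sum(actual[t] for t in learned if t in actual)
    return covered / alpha

def kld(actual, learned, smooth=True, base=2.0):
    """Equation (2): KLD(T || T-hat), optionally with add-one smoothing of the
    learned model so that no term of T has zero probability."""
    vocab = set(actual) | set(learned)
    learned_total = sum(learned.values()) + (len(vocab) if smooth else 0)
    actual_total = sum(actual.values())
    divergence = 0.0
    for t, count in actual.items():
        if count <= 0:
            continue
        p = count / actual_total
        q = (learned.get(t, 0) + (1 if smooth else 0)) / learned_total
        divergence += p * math.log(p / q, base)        # q > 0 by smoothing
    return divergence

def jsd(actual, learned, base=2.0):
    """Equation (3): symmetric, finite divergence via the average distribution;
    with base-2 logarithms it ranges from 0.0 (identical) to 2.0 (disjoint)."""
    average = Counter()
    for t in set(actual) | set(learned):
        average[t] = (actual.get(t, 0) + learned.get(t, 0)) / 2.0
    # The average model covers every term of both inputs, so no smoothing is
    # needed inside the two KLD terms.
    return (kld(actual, average, smooth=False, base=base) +
            kld(learned, average, smooth=False, base=base))
```

For example, ctf_ratio(Counter(a=5, b=3), Counter(a=2)) yields 5/8 = 0.625: the learned model has found the term that accounts for five of the eight tokens in the actual model.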
3. RESULTS
In this section we report the results of our experiments. Because the queries are chosen randomly, we repeated the experiment 30 times.

Figure 3 shows our results on TREC123 in the conventional way for query-based sampling: a metric against the number of iterations on the horizontal axis [7]. We have omitted the graphs for WT2G and the Wikia collections as they are highly similar in shape.

[Figure 3: Results for TREC123. Shows CTF, KLD, JSD and bandwidth usage, plotted against the number of iterations, for both the full-document and the snippet-based approach.]

As the bottom right graph shows, the amount of bandwidth consumed when using full documents is much larger than when using snippets. The full-document approach downloads each of the ten documents in the search results, which can be potentially large. Downloading all these documents also uses many connections to the server: one for the search results plus ten for the documents, whereas the snippet approach uses only one connection for transferring the search results and performs no additional downloads.

The fact that the full-document approach downloads a lot of extra information results in it outperforming the snippet approach on the defined metrics, as shown in the other graphs of Figure 3. However, comparing this way is unfair. Full-document sampling performs better simply because it acquires more data in fewer iterations. A more interesting question is: how effectively do the approaches use bandwidth?

3.1 Bandwidth
Figures 4 and 5 show the metrics plotted against bandwidth usage. The graphs are 41-point interpolated plots based on the experiment data. These plots are generated in a similar way to recall-precision graphs, but they contain more points: 41 instead of 11, one every 25 kilobytes. Additionally, recall-precision graphs, as frequently used in TREC, take the maximum value at each point [11]. We use linear interpolation instead, which uses averages.

[Figure 4: Interpolated plots for all metrics against bandwidth usage up to 1000 KB, for TREC123 (left) and WT2G (right).]
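The following is a minimal sketch of how such interpolated curves can be produced, assuming each repeated run is available as a list of (bandwidth in kilobytes, metric value) pairs sorted by bandwidth; numpy.interp performs the per-run piecewise-linear interpolation onto the 41-point grid, and the grid values are then averaged across runs. This reflects our reading of the plotting procedure rather than the original plotting scripts.

```python
import numpy as np

def interpolated_curve(runs, max_kb=1000, step_kb=25):
    """Average metric-versus-bandwidth curves over repeated runs.

    `runs` is a list of runs, each a list of (bandwidth_kb, metric_value)
    pairs in increasing bandwidth order. Every run is linearly interpolated
    onto a fixed grid (41 points for 0..1000 KB in steps of 25 KB) and the
    grid values are averaged, analogous to interpolated recall-precision
    plots but using averages instead of maxima.
    """
    grid = np.arange(0, max_kb + step_kb, step_kb)     # 0, 25, ..., 1000
    curves = []
    for run in runs:
        kb, values = zip(*run)
        curves.append(np.interp(grid, kb, values))     # piecewise-linear
    return grid, np.mean(curves, axis=0)
```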
Figure 4 shows that snippets outperform the full-document approach for all metrics. This seems to be more pronounced for WT2G. The underlying data reveals that snippets yield much more stable performance increments per unit of bandwidth. Partially, this is due to a larger quantity of queries. The poorer performance of full documents is caused by variations in document length and quality. Downloading a long document that poorly represents the underlying collection is heavily penalised. The snippet approach never makes very large 'mistakes' like this, because its document length is bounded by the maximum summary size.

TREC123 and WT2G are very large heterogeneous test collections, as we will show later. The WIKI collections are more homogeneous and have different document length distribution characteristics. In Figure 5 we see that the performance of snippets on the WIKIL corpus is worse for the JSD, but undecided for the other metrics. For WIKIM, performance measured with CTF is slightly better and undecided for the other metrics. Why this difference? We conducted tests on several other large Wiki collections to verify our results. The results suggest that there is some relation between the distribution of document lengths and the performance of query-based sampling using snippets. In Figure 2 we see a peak at the low end of document lengths for WIKIL. Collections that exhibit this type of peak all showed performance similar to WIKIL: snippets performing slightly worse, especially for the JSD. In contrast, collections that have a distribution like WIKIM also show similar performance: slightly better for CTF. Collections that have a less pronounced peak at higher document lengths, or a more gradual distribution, appear to perform at least as well or better using snippets compared to full documents.

[Figure 5: Interpolated plots for all metrics against bandwidth usage up to 1000 KB, for WIKIL (left) and WIKIM (right).]

The reason for this is that as the document size decreases and approaches the snippet summary size, the full-document strategy is less heavily penalised for mistakes. It can no longer download very large unrepresentative documents, only small ones. However, this advantage is offset if the document sizes equal the summary size. In that case the full-document approach would actually use double the bandwidth with no advantage: once to obtain the search results, with summaries, and once again to download the entire documents, which are the same as the summaries in the search results.

3.2 Homogeneity
While WIKIM has a fairly smooth document length distribution, the performance increase of snippets over full documents with regard to the JSD and KLD metrics is not the same as that obtained with TREC123 and WT2G. This is likely caused by the homogeneous nature of the collection. Consider that if a collection is highly homogeneous, only a few samples are needed to obtain a good representation. Every additional sample can only slightly improve such a model. In contrast, for a heterogeneous collection each new sample can improve the model significantly.

So, how homogeneous are the collections that we used? We adopt the approach of Kilgarriff and Rose [15] of splitting the corpus into parts and comparing those, with some slight adjustments. As metric we use the Jensen-Shannon Divergence (JSD) explained in Section 2.2, also used by Eiron and McCurley [10] for the same task. The exact procedure we used is as follows (a brief sketch follows the list):

1. Select a random sample S of 5000 documents from a collection.

2. Randomly divide the documents in the sample S into ten bins: s1 . . . s10. Each bin contains approximately 500 documents.

3. For each bin si, calculate the Jensen-Shannon Divergence (JSD) between the bigram language model defined by the documents in bin si and the language model defined by the documents in the remaining nine bins. Meaning: the language model of the documents in s1 would be compared to that of those in s2 . . . s10, et cetera. This is known as a leave-one-out test.

4. Average the ten JSD scores obtained in step 3. The outcome represents the homogeneity. The lower the number, the more self-similarity within the corpus, and thus the more homogeneous the corpus is.
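The sketch below mirrors these four steps, reusing the jsd function from the metrics sketch in Section 2.2. Representing documents as token lists and the simple bigram extraction are simplifying assumptions; the procedure itself is only specified at the level of the list above.

```python
import random
from collections import Counter

def bigram_model(documents):
    """Build a bigram count model from an iterable of tokenised documents."""
    model = Counter()
    for tokens in documents:
        model.update(zip(tokens, tokens[1:]))
    return model

def homogeneity(collection, sample_size=5000, bins=10):
    """Leave-one-out homogeneity score: a lower average JSD means a more
    homogeneous corpus.

    `collection` is a list of tokenised documents (lists of tokens).
    Steps: sample, split into bins, compare each bin's bigram model against
    the model of the other bins, and average the JSD scores (`jsd` is the
    function from the metrics sketch in Section 2.2).
    """
    sample = random.sample(collection, min(sample_size, len(collection)))
    bin_docs = [sample[i::bins] for i in range(bins)]   # ~500 docs per bin
    scores = []
    for i in range(bins):
        held_out = bigram_model(bin_docs[i])
        rest = bigram_model(doc for j in range(bins) if j != i
                            for doc in bin_docs[j])
        scores.append(jsd(held_out, rest))
    return sum(scores) / bins
```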
Because we select documents from the collection randomly in step 1, we repeated the experiment ten times for each collection. The results are shown in Table 2.

Table 2: Collection homogeneity expressed as Jensen-Shannon Divergence (JSD): lower scores indicate more homogeneity (n = 100, σ = 0.01).

Collection name    µ JSD
TREC123            1.11
WT2G               1.04
WIKIL              0.97
WIKIM              0.85

Table 2 shows that the large collections we used, TREC123 and WT2G, are more heterogeneous than the smaller collections WIKIL and WIKIM. It appears that WIKIL is more heterogeneous than WIKIM, yet snippet-based sampling performs better on WIKIM. We conjecture that this is caused by the difference in document length distributions discussed earlier (see Figure 2). Overall, it appears that query-based sampling using snippets is better suited to heterogeneous collections with a smooth distribution of document lengths.

4. CONCLUSION
We have shown that query-based sampling using snippets is a viable alternative to conventional query-based sampling using entire documents. This opens the way for distributed search systems that do not need to download documents at all, but instead operate solely by exchanging queries and search results. Few adjustments are needed to existing operational distributed information retrieval systems that use a central server, as the remote search engines and the central server already exchange snippets. Our research implies that the significant overhead incurred by downloading documents in today's prototype distributed information retrieval systems can be completely eliminated. This also enables modelling of servers from which full documents cannot be obtained and of those which index multimedia content. Furthermore, the central server can continuously use the search result data, the snippets, to keep its resource descriptions up to date without imposing additional overhead, naturally coping with changes in document collections that occur over time. This also provides the extra iterations that snippet query-based sampling requires without extra latency.
Compared to the conventional query-based sampling approach, our snippet approach shows equal or better performance per unit of bandwidth consumed for most of the test collections. The performance also appears to be more stable per unit of bandwidth consumed. Factors influencing the performance are the document length distribution and the homogeneity of the data. Snippet query-based sampling performs best when document lengths are smoothly distributed, without a large peak at the low end of document sizes, and when the data is heterogeneous.

Even though the performance of snippet query-based sampling depends on the underlying collection, the information that is used always comes along 'for free' with search results. No extra bandwidth, connections or operations are required beyond simply sending a query and obtaining a list of search results. Herein lies the strength of the approach.

5. FUTURE WORK
We believe that the performance gains seen in the various metrics lead to improved selection and merging performance. However, this is something that could be further explored. A measure of how representative the resource descriptions obtained by sampling are for real-world usage would be very useful. This remains an open problem, also for full-document sampling, even though some attempts have been made to solve it [4].

Another research direction is the snippets themselves. Firstly, how snippet generation affects modelling performance. Secondly, how a query can be generated from the snippets seen so far in more sophisticated ways, for example by attaching a different priority to different words in a snippet. Finally, the influence of the ratio of snippet to document size could be further investigated.
6. ACKNOWLEDGEMENTS
We thank the USI Lugano Information Retrieval group for their comments, notably Mark Carman and Cyrus Hall. We also thank Dolf Trieschnigg, Kien Tjin-Kam-Jet and Jan Flokstra. This paper, and the experiments, were created using only Free and Open Source Software. Finally, we gratefully acknowledge the support of the Netherlands Organisation for Scientific Research (NWO) under project DIRKA (NWO-Vidi), Number 639.022.809.

7. REFERENCES
[1] Azzopardi, L., de Rijke, M., and Balog, K. Building simulated queries for known-item topics: An analysis using six European languages. In Proceedings of SIGIR (New York, NY, US, July 2007), ACM, pp. 455–462.
[2] Bailey, P., Craswell, N., and Hawking, D. Engineering a multi-purpose test collection for web retrieval experiments. Information Processing & Management 39, 6 (2003), 853–871.
[3] Baillie, M., Azzopardi, L., and Crestani, F. Adaptive Query-Based Sampling of Distributed Collections, vol. 4209 of Lecture Notes in Computer Science. Springer, 2006, pp. 316–328.
[4] Baillie, M., Carman, M. J., and Crestani, F. A topic-based measure of resource description quality for distributed information retrieval. In Proceedings of ECIR (Apr. 2009), vol. 5478 of Lecture Notes in Computer Science, Springer, pp. 485–497.
[5] Bar-Yossef, Z., and Gurevich, M. Random sampling from a search engine's index. Journal of the ACM 55, 5 (2008), 1–74.
[6] Callan, J. Distributed Information Retrieval. In Advances in Information Retrieval. Kluwer Academic Publishers, 2000, ch. 5.
[7] Callan, J., and Connell, M. Query-based sampling of text databases. ACM Transactions on Information Systems 19, 2 (2001), 97–130.
[8] Callan, J., Connell, M., and Du, A. Automatic discovery of language models for text databases. In Proceedings of SIGMOD (June 1999), ACM Press, pp. 479–490.
[9] Dagan, I., Lee, L., and Pereira, F. Similarity-based methods for word sense disambiguation. In Proceedings of ACL (Morristown, NJ, US, Aug. 1997), Association for Computational Linguistics, pp. 56–63.
[10] Eiron, N., and McCurley, K. S. Analysis of anchor text for web search. In Proceedings of SIGIR (New York, NY, US, July 2003), ACM, pp. 459–460.
[11] Harman, D. Overview of the first TREC conference. In Proceedings of SIGIR (New York, NY, US, June 1993), ACM, pp. 36–47.
[12] Harman, D. K. Overview of the Third Text Retrieval Conference (TREC-3). National Institute of Standards and Technology, 1995.
[13] Hawking, D., Voorhees, E., Craswell, N., and Bailey, P. Overview of the TREC-8 Web Track. Tech. rep., National Institute of Standards and Technology, Gaithersburg, MD, US, 2000.
[14] Ide, N., and Suderman, K. The Open American National Corpus, 2007.
[15] Kilgarriff, A., and Rose, T. Measures for corpus similarity and homogeneity. In Proceedings of EMNLP (Morristown, NJ, US, June 1998), ACL-SIGDAT, pp. 46–52.
[16] Manning, C. D., Raghavan, P., and Schütze, H. Introduction to Information Retrieval. Cambridge University Press, New York, NY, US, 2008.
[17] Manning, C. D., and Schütze, H. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, US, June 1999.
[18] Monroe, G., French, J. C., and Powell, A. L. Obtaining language models of web collections using query-based sampling techniques. In Proceedings of HICSS (Washington, DC, US, Jan. 2002), vol. 3, IEEE Computer Society, p. 67.
[19] Paltoglou, G., Salampasis, M., and Satratzemi, M. Hybrid results merging. In Proceedings of CIKM (New York, NY, US, Nov. 2007), ACM, pp. 321–330.
[20] Thomas, P., and Hawking, D. Evaluating sampling methods for uncooperative collections. In Proceedings of SIGIR (New York, NY, US, July 2007), ACM, pp. 503–510.
[21] Venables, W. N., and Smith, D. M. An Introduction to R, Aug. 2009.