8th Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS-IR'10)

Query-Based Sampling using Snippets

Almer S. Tigelaar
Database Group, University of Twente, Enschede, The Netherlands
a.s.tigelaar@cs.utwente.nl

Djoerd Hiemstra
Database Group, University of Twente, Enschede, The Netherlands
hiemstra@cs.utwente.nl

ABSTRACT
Query-based sampling is a commonly used approach to model the content of servers. Conventionally, queries are sent to a server and the documents in the returned search results are downloaded in full as a representation of the server's content. We present an approach that instead uses the document snippets in the search results as samples, avoiding downloading the entire documents. We show that this yields equal or better modelling performance for the same bandwidth consumption, depending on collection characteristics such as document length distribution and homogeneity. Query-based sampling using snippets is a useful approach for real-world systems, since it requires no extra operations beyond exchanging queries and search results.

[Figure 1: Example snippets. From top to bottom: each snippet consists of an underlined title, a two-line summary and a link.]

1. INTRODUCTION
Query-based sampling is a technique for obtaining a resource description of a search server. This description is based on the downloaded content of a small subset of documents the server returns in response to queries [8]. We present an approach that requires no additional downloading beyond the returned results, but instead relies solely on information returned as part of the results: the snippets.

Knowing which server offers which content allows a central server to forward queries to the most suitable server for handling a query. This task is commonly referred to as resource selection [6]. Selection is based on a representation of the content of a server: a resource description. Most servers on the web are uncooperative and do not provide such a description; query-based sampling therefore exploits only the native search functionality provided by such servers.

In conventional query-based sampling, the first step is sending a query to a server. The server returns a ranked list of results, of which the top N most relevant documents are downloaded and used to build a resource description. Queries are chosen randomly, the first from an external resource and subsequent queries from the description built so far. This repeats until a stopping criterion is reached [7, 8].

Disadvantages of downloading entire documents are that it consumes more bandwidth, is impossible if servers do not return full documents, and does not work when the full documents themselves are non-text: multimedia with short summary descriptions. In contrast, some data always comes along 'for free' in the returned search results: the snippets. A snippet is a short piece of text consisting of a document title, a short summary and a link, as shown in Figure 1. A summary can be either dynamically generated in response to a query or statically defined [16, p. 157]. We postulate that these snippets can also be used for query-based sampling to build a language model. This way we can avoid downloading entire documents and thus reduce bandwidth usage, and we can cope with servers that return only search results or that contain multimedia content. However, since snippets are small, we need to see many of them, which means sending more queries than in the full-document approach. While this increases the query load on the remote servers, it is an advantage for live systems that need to sample from document collections that change over time, since it allows continuously updating the language model based on the results of live queries.
Whether the documents returned in response to random queries are a truly random part of the underlying collection is doubtful. Servers have a propensity to return documents that users indicate as important, and the number of in-links correlates substantially with this importance [1]. This may not be a problem, as it is preferable to know only the language model represented by these important documents, since users are likely to look for those [3]. Recent work [5] focuses on obtaining uniform random samples from large search engines in order to estimate their size and overlap. Others [20] have evaluated this in the context of obtaining resource descriptions and found that it does not consistently work well across collections.

The foundational work on acquiring resource descriptions via query-based sampling was done by Callan et al. [7, 8]. They show that a small sample of several hundred documents can be used to obtain a good-quality resource description of large collections consisting of hundreds of thousands of documents. The test collection used in their research, TREC123, is not a web data collection. While this initially casts doubt on the applicability of query-based sampling to the web, Monroe et al. [18] show that it also works very well for web data.

The approach we take has some similarities with prior research by Paltoglou et al. [19]. They show that downloading only a part of a document can also yield good modelling performance. However, they download the first two to three kilobytes of each document in the result list, whereas we use small snippets and thus avoid any extra downloading beyond the search results.

Our main research question is:

"How does query-based sampling using only snippets compare to downloading full documents in terms of the learned language model?"

We show that query-based sampling using snippets offers performance similar to using full documents, while consuming less bandwidth and enabling constant updating of the resource description at no extra cost.
Additionally, we introduce a new metric for comparing language models in the context of resource descriptions and a method to establish the homogeneity of a corpus.

We describe our experimental setup in Section 2. Section 3 presents the results. Finally, the paper concludes with Sections 4 and 5.

2. METHODOLOGY
In our experimental set-up we have one remote server whose content we wish to estimate by sampling. This server can only take queries and return search results. For each document a title, snippet and download link is returned. These results are used to locally build a resource description in the form of a vocabulary with frequency information, also called a language model [7]. The act of submitting a query to the remote server, obtaining search results, updating the local language model and calculating values for the evaluation metrics is called an iteration. An iteration consists of the following steps (a brief sketch of this loop follows the list):

1. Pick a one-term query.
   (a) In the first iteration our local language model is empty and has no terms. In this case we pick a random term from an external resource as query.
   (b) In subsequent iterations we pick a random term from our local language model that we have not previously submitted as query.

2. Send the query to the remote server, requesting a maximum number of results (n = 10). In our set-up, the maximum length of the document summaries may be no more than 2 fragments of 90 characters each (s ≤ 2 · 90).

3. Update the resource description using the returned results (1 ≤ n ≤ 10).
   (a) For the full-document strategy: download all the returned documents and use all their content to update the local language model.
   (b) For the snippet strategy: use the snippet of each document in the search results to update the local language model. If a document appears multiple times in search results, use its snippet only if it differs from previously seen snippets of that document.

4. Evaluate the iteration by comparing the unstemmed language model of the remote server with the local model (see the metrics described in Section 2.2).

5. Terminate if a stopping criterion has been reached, otherwise go to step 1.
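To make the iteration above concrete, here is a minimal Python sketch of the sampling loop. It is not the implementation used in our experiments: the search(term, n) callable, which stands in for the remote server's search interface and is assumed to return (document id, snippet, full text) tuples, the bootstrap_terms list and the whitespace tokenisation are all hypothetical simplifications.

```python
import random
from collections import Counter

def sample_language_model(search, bootstrap_terms, use_snippets=True,
                          max_iterations=1000, results_per_query=10):
    """Build a local language model of a remote server by query-based sampling.

    `search(term, n)` is assumed to return up to `n` (doc_id, snippet, full_text)
    tuples, mimicking a result list; it is a placeholder for the remote server.
    """
    model = Counter()       # local language model: term -> frequency
    submitted = set()       # queries already sent (step 1b)
    seen_snippets = {}      # doc_id -> snippets already used (step 3b)

    for _ in range(max_iterations):
        # Step 1: pick a one-term query (1a: bootstrap term on the first
        # iteration, 1b: afterwards an unsubmitted term from the local model).
        unseen = [t for t in model if t not in submitted]
        query = random.choice(unseen if unseen else list(bootstrap_terms))
        submitted.add(query)

        # Step 2: send the query, requesting at most `results_per_query` results.
        results = search(query, results_per_query)

        # Step 3: update the resource description.
        for doc_id, snippet, full_text in results:
            if use_snippets:
                # Step 3b: only count a snippet not yet seen for this document.
                if snippet in seen_snippets.setdefault(doc_id, set()):
                    continue
                seen_snippets[doc_id].add(snippet)
                model.update(snippet.lower().split())   # simplistic tokenisation
            else:
                # Step 3a: full-document strategy.
                model.update(full_text.lower().split())

        # Steps 4 and 5 (evaluation and stopping criterion) are omitted here.
    return model
```

The only difference between the two strategies is which text feeds the model in step 3; this is why the snippet variant needs no connections or downloads beyond the search results themselves.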
Since the snippet approach uses the title and summary of each document returned in the search results, the way in which the summary is generated affects performance. Our simulation environment uses Apache Lucene, which generates keyword-in-context document summaries [16, p. 158]. These summaries are constructed from the words surrounding a query term in a document, without taking sentence boundaries into account. For all experiments the summaries consisted of two keyword-in-context segments of at most ninety characters. This length boundary is similar to the one modern web search engines use to generate their summaries. One might be tempted to believe that snippets are biased because they commonly also contain the query terms. However, in full-document sampling the returned documents also contain the query and have a similar bias, although mitigated by document length.

2.1 Data sets
We used the following data sets to conduct our tests:

OANC-1.1: The Open American National Corpus: a heterogeneous collection. We use it exclusively for selecting bootstrap terms [14].

TREC123: A heterogeneous collection consisting of TREC Volumes 1–3. Contains short newspaper and magazine articles, scientific abstracts, and government documents [12]. Used in previous experiments by Callan et al. [7].

WT2G: Web Track 2G: a small subset of the Very Large Corpus web crawl conducted in 1997 [13].

WIKIL: The large Memory Alpha Wiki. http://memory-alpha.org

WIKIM: The medium-sized Fallout Wiki. http://fallout.wikia.com

The OANC is used as external resource to select a bootstrap term on the first iteration: we pick a random term out of the top 25 most frequent terms (excluding stop words). TREC123 is included for comparison with Callan's work [7]. WT2G is a representative subset of the web. It has some deficiencies, such as missing inter-server links [2], but since we use only the page data, this is not a major problem for this experiment.

Our experiment is part of a scenario in which many sites offer searchable content. With this in mind, using larger monolithic collections, like ClueWeb, offers little extra insight. After all, there are relatively few websites that provide gigabytes or terabytes of information, whereas there is a long tail that offers smaller amounts. For this reason we have included two Wiki collections in our tests: WIKIL and WIKIM. All Wiki collections were obtained from Wikia on October 5th, 2009. Wikis contain many pages in addition to normal content pages. However, we index only content pages, which is the reason the raw sizes of these corpora are bigger than the indices.

Table 1 shows some properties of the data sets. We have also included Figure 2, which shows a kernel density plot of the size distributions of the collections [21]. We see that WT2G has a more gradual distribution of document lengths, whereas TREC123 shows a sharper decline near two kilobytes. Both collections consist primarily of many small documents. This is also true for the Wiki collections. Especially the WIKIL collection has many very small documents.

Table 1: Properties of the data sets used.

Name      Raw    Index   #Docs       #Terms         #Unique
OANC      97M    117M    8,824       14,567,719     176,691
TREC123   2.6G   3.5G    1,078,166   432,134,562    969,061
WT2G      1.6G   2.1G    247,413     247,833,426    1,545,707
WIKIL     163M   84M     30,006      9,507,759      108,712
WIKIM     58M    25M     6,821       3,003,418      56,330

[Figure 2: Kernel density plot of document lengths up to 10 kilobytes for each collection.]

2.2 Metrics
Evaluation is done by comparing the complete remote language model with the subset local language model each iteration. We discard stop words and compare terms unstemmed. Various metrics exist to conduct this comparison. For comparability with earlier work we use two existing metrics and introduce one new metric in this context: the Jensen-Shannon Divergence (JSD), which we believe is a better choice than the others for the reasons outlined below.

We first discuss the Collection Term Frequency (CTF) ratio. This metric expresses the coverage of the terms of the locally learned language model as a ratio of the terms of the actual remote model. It is defined as follows [8]:

  CTF_{ratio}(T, \hat{T}) = \frac{1}{\alpha} \sum_{t \in \hat{T}} CTF(t, T)    (1)

where T is the actual model and \hat{T} the learned model. The CTF function returns the number of times a term t occurs in the given model. The symbol \alpha represents the sum of the CTF of all terms in the actual model T, which is simply the number of tokens in T. The higher the CTF ratio, the more of the important terms have been found.

The Kullback-Leibler Divergence (KLD) gives an indication of the extent to which two probability models, in this case our local and remote language models, will produce the same predictions. The output is the number of additional bits it would take to encode one model into the other. It is defined as follows [16, p. 231]:

  KLD(T \| \hat{T}) = \sum_{t \in T} P(t \mid T) \cdot \log \frac{P(t \mid T)}{P(t \mid \hat{T})}    (2)

where \hat{T} is the learned model and T the actual model. KLD has several disadvantages. Firstly, if a term occurs in one model but not in the other, it produces zero or infinite values. We therefore apply Laplace smoothing, which simply adds one to all counts of the learned model \hat{T}. This ensures that each term in the remote model exists at least once in the local model, thereby avoiding divisions by zero [3]. Secondly, the KLD is asymmetric, which is expressed using the double bar notation. Manning [17, p. 304] argues that using the Jensen-Shannon Divergence (JSD) solves both problems. It is defined in terms of the KLD as [9]:

  JSD(T, \hat{T}) = KLD(T \| \frac{T + \hat{T}}{2}) + KLD(\hat{T} \| \frac{T + \hat{T}}{2})    (3)

The Jensen-Shannon Divergence expresses how much information is lost if we describe two distributions with their average distribution. This average distribution is formed by summing the counts for each term that occurs in either model and dividing by two. Using the average is a form of smoothing which, in contrast with the KLD, avoids changing the original counts. Other differences with the KLD are that the JSD is symmetric and finite. Conveniently, when using a logarithm of base 2 in the underlying KLD, the JSD ranges from 0.0 for identical distributions to 2.0 for maximally different distributions.
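To illustrate Equations (1)–(3), the following sketch computes the three metrics from two term-frequency models represented as dictionaries mapping terms to counts. It mirrors the definitions above, including add-one (Laplace) smoothing of the learned model for the KLD and base-2 logarithms so that the JSD ranges from 0.0 to 2.0, but it is only an illustrative sketch, not the evaluation code used for the experiments.

```python
import math
from collections import Counter

def ctf_ratio(actual, learned):
    """Equation (1): fraction of the actual model's tokens covered by the
    terms present in the learned model."""
    alpha = sum(actual.values())                       # total tokens in T
    covered = sum(actual[t] for t in learned if t in actual)
    return covered / alpha

def kld(actual, learned, smooth=True, base=2.0):
    """Equation (2): KLD(T || T-hat), optionally with add-one smoothing of the
    learned model so that no term of T has zero probability."""
    vocab = set(actual) | set(learned)
    learned_total = sum(learned.values()) + (len(vocab) if smooth else 0)
    actual_total = sum(actual.values())
    divergence = 0.0
    for t, count in actual.items():
        if count <= 0:
            continue
        p = count / actual_total
        q = (learned.get(t, 0) + (1 if smooth else 0)) / learned_total
        divergence += p * math.log(p / q, base)        # q > 0 by smoothing
    return divergence

def jsd(actual, learned, base=2.0):
    """Equation (3): symmetric, finite divergence via the average distribution;
    with base-2 logarithms it ranges from 0.0 (identical) to 2.0 (disjoint)."""
    average = Counter()
    for t in set(actual) | set(learned):
        average[t] = (actual.get(t, 0) + learned.get(t, 0)) / 2.0
    # The average model covers every term of both inputs, so no smoothing is
    # needed inside the two KLD terms.
    return (kld(actual, average, smooth=False, base=base) +
            kld(learned, average, smooth=False, base=base))
```

For example, ctf_ratio(Counter(a=5, b=3), Counter(a=2)) yields 5/8 = 0.625: the learned model has found the term that accounts for five of the eight tokens in the actual model.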
3. RESULTS
In this section we report the results of our experiments. Because the queries are chosen randomly, we repeated the experiment 30 times.

Figure 3 shows our results on TREC123 in the conventional way for query-based sampling: a metric against the number of iterations on the horizontal axis [7]. We have omitted the graphs for WT2G and the Wikia collections as they are highly similar in shape.

[Figure 3: Results for TREC123. Shows CTF, KLD, JSD and bandwidth usage, plotted against the number of iterations, for both the full-document and the snippet-based approach.]

As the bottom right graph shows, the amount of bandwidth consumed when using full documents is much larger than when using snippets. The full-document approach downloads each of the ten documents in the search results, which can be potentially large. Downloading all these documents also uses many connections to the server: one for the search results plus ten for the documents, whereas the snippet approach uses only one connection for transferring the search results and performs no additional downloads.

The fact that the full-document approach downloads a lot of extra information results in it outperforming the snippet approach on the defined metrics, as shown in the other graphs of Figure 3. However, comparing this way is unfair. Full-document sampling performs better simply because it acquires more data in fewer iterations. A more interesting question is: how effectively do the approaches use bandwidth?

3.1 Bandwidth
Figures 4 and 5 show the metrics plotted against bandwidth usage. The graphs are 41-point interpolated plots based on the experiment data. These plots are generated in a similar way to recall-precision graphs, but they contain more points: 41 instead of 11, one every 25 kilobytes. Additionally, recall-precision graphs, as frequently used in TREC, take the maximum value at each point [11]. We use linear interpolation instead, which uses averages.

[Figure 4: Interpolated plots for all metrics against bandwidth usage up to 1000 KB, for TREC123 (left) and WT2G (right).]
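The following is a minimal sketch of how such interpolated curves can be produced, assuming each repeated run is available as a list of (bandwidth in kilobytes, metric value) pairs sorted by bandwidth; numpy.interp performs the per-run piecewise-linear interpolation onto the 41-point grid, and the grid values are then averaged across runs. This reflects our reading of the plotting procedure rather than the original plotting scripts.

```python
import numpy as np

def interpolated_curve(runs, max_kb=1000, step_kb=25):
    """Average metric-versus-bandwidth curves over repeated runs.

    `runs` is a list of runs, each a list of (bandwidth_kb, metric_value)
    pairs in increasing bandwidth order. Every run is linearly interpolated
    onto a fixed grid (41 points for 0..1000 KB in steps of 25 KB) and the
    grid values are averaged, analogous to interpolated recall-precision
    plots but using averages instead of maxima.
    """
    grid = np.arange(0, max_kb + step_kb, step_kb)     # 0, 25, ..., 1000
    curves = []
    for run in runs:
        kb, values = zip(*run)
        curves.append(np.interp(grid, kb, values))     # piecewise-linear
    return grid, np.mean(curves, axis=0)
```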
Figure 4 shows that snippets outperform the full-document approach for all metrics. This seems to be more pronounced for WT2G. The underlying data reveals that snippets yield much more stable performance increments per unit of bandwidth. Partially, this is due to a larger quantity of queries. The poorer performance of full documents is caused by variations in document length and quality. Downloading a long document that poorly represents the underlying collection is heavily penalised. The snippet approach never makes very large 'mistakes' like this, because its document length is bounded by the maximum summary size.

TREC123 and WT2G are very large heterogeneous test collections, as we will show later. The WIKI collections are more homogeneous and have different document length distribution characteristics. In Figure 5 we see that the performance of snippets on the WIKIL corpus is worse for the JSD, but undecided for the other metrics. For WIKIM, performance measured with CTF is slightly better and undecided for the other metrics. Why this difference? We conducted tests on several other large Wiki collections to verify our results. The results suggest that there is some relation between the distribution of document lengths and the performance of query-based sampling using snippets. In Figure 2 we see a peak at the low end of document lengths for WIKIL. Collections that exhibit this type of peak all showed performance similar to WIKIL: snippets performing slightly worse, especially for the JSD. In contrast, collections that have a distribution like WIKIM also show similar performance: slightly better for CTF. Collections that have a less pronounced peak at higher document lengths, or a more gradual distribution, appear to perform at least as well or better using snippets compared to full documents.

[Figure 5: Interpolated plots for all metrics against bandwidth usage up to 1000 KB, for WIKIL (left) and WIKIM (right).]

The reason for this is that as the document size decreases and approaches the snippet summary size, the full-document strategy is less heavily penalised for mistakes. It can no longer download very large unrepresentative documents, only small ones. However, this advantage is offset if the document sizes equal the summary size. In that case the full-document approach would actually use double the bandwidth with no advantage: once to obtain the search results, with summaries, and once again to download the entire documents, which are the same as the summaries in the search results.

3.2 Homogeneity
While WIKIM has a fairly smooth document length distribution, the performance increase of snippets over full documents with regard to the JSD and KLD metrics is not the same as that obtained with TREC123 and WT2G. This is likely caused by the homogeneous nature of the collection. Consider that if a collection is highly homogeneous, only a few samples are needed to obtain a good representation. Every additional sample can only slightly improve such a model. In contrast, for a heterogeneous collection each new sample can improve the model significantly.

So, how homogeneous are the collections that we used? We adopt the approach of Kilgarriff and Rose [15] of splitting the corpus into parts and comparing those, with some slight adjustments. As metric we use the Jensen-Shannon Divergence (JSD) explained in Section 2.2, also used by Eiron and McCurley [10] for the same task. The exact procedure we used is as follows (a brief sketch follows the list):

1. Select a random sample S of 5000 documents from a collection.

2. Randomly divide the documents in the sample S into ten bins: s1 . . . s10. Each bin contains approximately 500 documents.

3. For each bin si, calculate the Jensen-Shannon Divergence (JSD) between the bigram language model defined by the documents in bin si and the language model defined by the documents in the remaining nine bins. Meaning: the language model of the documents in s1 would be compared to that of those in s2 . . . s10, et cetera. This is known as a leave-one-out test.

4. Average the ten JSD scores obtained in step 3. The outcome represents the homogeneity. The lower the number, the more self-similarity within the corpus, and thus the more homogeneous the corpus is.
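The sketch below mirrors these four steps, reusing the jsd function from the metrics sketch in Section 2.2. Representing documents as token lists and the simple bigram extraction are simplifying assumptions; the procedure itself is only specified at the level of the list above.

```python
import random
from collections import Counter

def bigram_model(documents):
    """Build a bigram count model from an iterable of tokenised documents."""
    model = Counter()
    for tokens in documents:
        model.update(zip(tokens, tokens[1:]))
    return model

def homogeneity(collection, sample_size=5000, bins=10):
    """Leave-one-out homogeneity score: a lower average JSD means a more
    homogeneous corpus.

    `collection` is a list of tokenised documents (lists of tokens).
    Steps: sample, split into bins, compare each bin's bigram model against
    the model of the other bins, and average the JSD scores (`jsd` is the
    function from the metrics sketch in Section 2.2).
    """
    sample = random.sample(collection, min(sample_size, len(collection)))
    bin_docs = [sample[i::bins] for i in range(bins)]   # ~500 docs per bin
    scores = []
    for i in range(bins):
        held_out = bigram_model(bin_docs[i])
        rest = bigram_model(doc for j in range(bins) if j != i
                            for doc in bin_docs[j])
        scores.append(jsd(held_out, rest))
    return sum(scores) / bins
```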
Because we select documents from the collection randomly in step 1, we repeated the experiment ten times for each collection. The results are shown in Table 2.

Table 2: Collection homogeneity expressed as Jensen-Shannon Divergence (JSD): lower scores indicate more homogeneity (n = 100, σ = 0.01).

Collection name    µ JSD
TREC123            1.11
WT2G               1.04
WIKIL              0.97
WIKIM              0.85

Table 2 shows that the large collections we used, TREC123 and WT2G, are more heterogeneous than the smaller collections WIKIL and WIKIM. It appears that WIKIL is more heterogeneous than WIKIM, yet snippet-based sampling performs better on WIKIM. We conjecture that this is caused by the difference in document length distributions discussed earlier (see Figure 2). Overall, it appears that query-based sampling using snippets is better suited to heterogeneous collections with a smooth distribution of document lengths.

4. CONCLUSION
We have shown that query-based sampling using snippets is a viable alternative to conventional query-based sampling using entire documents. This opens the way for distributed search systems that do not need to download documents at all, but instead operate solely by exchanging queries and search results. Few adjustments are needed to existing operational distributed information retrieval systems that use a central server, as the remote search engines and the central server already exchange snippets. Our research implies that the significant overhead incurred by downloading documents in today's prototype distributed information retrieval systems can be completely eliminated. This also enables modelling of servers from which full documents cannot be obtained and of those which index multimedia content. Furthermore, the central server can continuously use the search result data, the snippets, to keep its resource descriptions up to date without imposing additional overhead, naturally coping with changes in document collections that occur over time. This also provides the extra iterations that snippet query-based sampling requires without extra latency.
Compared to the conventional query-based sampling approach, our snippet approach shows equal or better performance per unit of bandwidth consumed for most of the test collections. The performance also appears to be more stable per unit of bandwidth consumed. Factors influencing the performance are the document length distribution and the homogeneity of the data. Snippet query-based sampling performs best when document lengths are smoothly distributed, without a large peak at the low end of document sizes, and when the data is heterogeneous.

Even though the performance of snippet query-based sampling depends on the underlying collection, the information that is used always comes along 'for free' with search results. No extra bandwidth, connections or operations are required beyond simply sending a query and obtaining a list of search results. Herein lies the strength of the approach.

5. FUTURE WORK
We believe that the performance gains seen in the various metrics lead to improved selection and merging performance. However, this is something that could be further explored. A measure of how representative the resource descriptions obtained by sampling are for real-world usage would be very useful. This remains an open problem, also for full-document sampling, even though some attempts have been made to solve it [4].

Another research direction is the snippets themselves. Firstly, how snippet generation affects modelling performance. Secondly, how a query can be generated from the snippets seen so far in more sophisticated ways, for example by attaching a different priority to different words in a snippet. Finally, the influence of the ratio of snippet to document size could be further investigated.
6. ACKNOWLEDGEMENTS
We thank the USI Lugano Information Retrieval group for their comments, notably Mark Carman and Cyrus Hall. We also thank Dolf Trieschnigg, Kien Tjin-Kam-Jet and Jan Flokstra. This paper, and the experiments, were created using only Free and Open Source Software. Finally, we gratefully acknowledge the support of the Netherlands Organisation for Scientific Research (NWO) under project DIRKA (NWO-Vidi), Number 639.022.809.

7. REFERENCES
[1] Azzopardi, L., de Rijke, M., and Balog, K. Building simulated queries for known-item topics: An analysis using six European languages. In Proceedings of SIGIR (New York, NY, US, July 2007), ACM, pp. 455–462.
[2] Bailey, P., Craswell, N., and Hawking, D. Engineering a multi-purpose test collection for web retrieval experiments. Information Processing & Management 39, 6 (2003), 853–871.
[3] Baillie, M., Azzopardi, L., and Crestani, F. Adaptive Query-Based Sampling of Distributed Collections, vol. 4209 of Lecture Notes in Computer Science. Springer, 2006, pp. 316–328.
[4] Baillie, M., Carman, M. J., and Crestani, F. A topic-based measure of resource description quality for distributed information retrieval. In Proceedings of ECIR (Apr. 2009), vol. 5478 of Lecture Notes in Computer Science, Springer, pp. 485–497.
[5] Bar-Yossef, Z., and Gurevich, M. Random sampling from a search engine's index. Journal of the ACM 55, 5 (2008), 1–74.
[6] Callan, J. Distributed Information Retrieval. In Advances in Information Retrieval. Kluwer Academic Publishers, 2000, ch. 5.
[7] Callan, J., and Connell, M. Query-based sampling of text databases. ACM Transactions on Information Systems 19, 2 (2001), 97–130.
[8] Callan, J., Connell, M., and Du, A. Automatic discovery of language models for text databases. In Proceedings of SIGMOD (June 1999), ACM Press, pp. 479–490.
[9] Dagan, I., Lee, L., and Pereira, F. Similarity-based methods for word sense disambiguation. In Proceedings of ACL (Morristown, NJ, US, Aug. 1997), Association for Computational Linguistics, pp. 56–63.
[10] Eiron, N., and McCurley, K. S. Analysis of anchor text for web search. In Proceedings of SIGIR (New York, NY, US, July 2003), ACM, pp. 459–460.
[11] Harman, D. Overview of the first TREC conference. In Proceedings of SIGIR (New York, NY, US, June 1993), ACM, pp. 36–47.
[12] Harman, D. K. Overview of the Third Text Retrieval Conference (TREC-3). National Institute of Standards and Technology, 1995.
[13] Hawking, D., Voorhees, E., Craswell, N., and Bailey, P. Overview of the TREC-8 Web Track. Tech. rep., National Institute of Standards and Technology, Gaithersburg, MD, US, 2000.
[14] Ide, N., and Suderman, K. The Open American National Corpus, 2007.
[15] Kilgarriff, A., and Rose, T. Measures for corpus similarity and homogeneity. In Proceedings of EMNLP (Morristown, NJ, US, June 1998), ACL-SIGDAT, pp. 46–52.
[16] Manning, C. D., Raghavan, P., and Schütze, H. Introduction to Information Retrieval. Cambridge University Press, New York, NY, US, 2008.
[17] Manning, C. D., and Schütze, H. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, US, June 1999.
[18] Monroe, G., French, J. C., and Powell, A. L. Obtaining language models of web collections using query-based sampling techniques. In Proceedings of HICSS (Washington, DC, US, Jan. 2002), vol. 3, IEEE Computer Society, p. 67.
[19] Paltoglou, G., Salampasis, M., and Satratzemi, M. Hybrid results merging. In Proceedings of CIKM (New York, NY, US, Nov. 2007), ACM, pp. 321–330.
[20] Thomas, P., and Hawking, D. Evaluating sampling methods for uncooperative collections. In Proceedings of SIGIR (New York, NY, US, July 2007), ACM, pp. 503–510.
[21] Venables, W. N., and Smith, D. M. An Introduction to R, Aug. 2009.