Document Reordering is Good, Especially for e-Commerce

Vishnusaran Ramaswamy, eBay Search, visramaswamy@ebay.com
Roberto Konow, eBay Search, rkonow@ebay.com
Andrew Trotman, University of Otago, andrew@cs.otago.ac.nz
Jon Degenhardt, eBay Search, jdegenhardt@ebay.com
Nick Whyte, eBay Search, nwhyte@ebay.com

ABSTRACT
Document id reordering is a well-known technique in web search for improving index compressibility and reducing query processing time. We explore and evaluate the benefits of document id reordering for large-scale e-Commerce search. We observe that many e-Commerce sites organize offerings according to an ontology (i.e. product category). We then present a simple extension to document-id reordering: ordering based on item category. Our results show that we not only achieve the expected index size reduction, but also achieve a latency reduction of over 20% (on average) per query.

KEYWORDS
E-Commerce search, Document id Re-ordering, Compression

ACM Reference format:
Vishnusaran Ramaswamy, Roberto Konow, Andrew Trotman, Jon Degenhardt, and Nick Whyte. 2017. Document Reordering is Good, Especially for e-Commerce. In Proceedings of ACM SIGIR Workshop on eCommerce, Tokyo, Japan, August 2017 (SIGIR 2017 eCom), 6 pages.

Copyright © 2017 by the paper’s authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, S. Kallumadi, M. de Rijke, L. Si, A. Trotman, Y. Xu (eds.): Proceedings of the SIGIR 2017 eCom workshop, August 2017, Tokyo, Japan, published at http://ceur-ws.org

1 INTRODUCTION
The search engine plays an essential role in e-Commerce: it connects the user’s need with a set of relevant items based on a query. This is not a simple task: millions of queries per second need to be processed over possibly billions of items. Moreover, it is expected that every query will be executed in just a few hundred milliseconds.

In order to solve this problem, search engines are implemented as large distributed systems where there is a limited budget in CPU and memory that every machine can use. Any improvement in efficiency could potentially be translated into a reduction in hardware, networking, and operating costs. Tremendous research and engineering effort has gone into addressing performance challenges: reduction of memory requirements by improving data compression, reduction of CPU use by implementing early termination techniques, and massively parallel execution engines are just a few of the techniques that have been extensively studied in the past. In this paper, we focus on one technique originally designed to improve data compression and reduce the size of the data that is loaded into main memory: document id reordering [3].

The inverted index is an old and simple, yet very efficient, data structure that is at the heart of most search engines and is used to support various search tasks. From a collection of documents, an inverted index stores for each unique term (or word) a list of postings. Each posting stores pairs ⟨d, w(t, d)⟩, where d is a unique document identifier (doc_id) and w(t, d) is a relevance measure of the term t with respect to document d (often the number of times term t occurs in document d). These posting lists can be extended to store additional information such as the positions of the term within the document. Posting lists are usually stored sorted by doc_id and processed a document at a time. To help with compression, doc_ids are often stored difference encoded: the value stored in the list is the difference (or d-gap) between this id and the preceding id. These d-gaps are further compressed using a variable-width integer codec such as variable byte encoding, PForDelta [15], QMX [13], or similar.
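As a small worked example (an illustrative list, not drawn from our data), the doc_id list ⟨3, 7, 11, 23⟩ is stored as the d-gaps ⟨3, 4, 4, 12⟩, and the original ids are recovered by a running sum during decoding.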
The final compression ratio depends on the number of bits required to represent the d-gaps, which depends on how the doc_ids are assigned to documents. If the d-gaps are small they compress better than if they are large. That is, if all the documents containing a given term have document ids that are close to each other then the index is smaller than if they are randomly distributed throughout the collection, simply because the d-gaps are smaller and so compress better.

This has motivated several authors in the past [2, 3, 7, 14] to study the doc_id assignment process in such a way as to optimize compression. In practice, search engines can assign doc_ids in a number of different ways: at random, based on document similarity, in the order documents are indexed (collection order), based on a global measure of quality such as PageRank, or, for web documents, in URL order. Others have noted [7] that document reordering not only reduces index size, but can also decrease query processing time.

A popular e-Commerce search technique to improve precision is to constrain a query to a particular set of categories in a category tree. This can be done automatically by a trained classifier, or manually by the user. For example, the results of the query iphone case can be constrained so that all the resulting items belong to the category “Cell Phones & Accessories Cases, Covers & Skins”. These categories also form natural partitions, clustering items according to popular search dimensions.

In this paper we explore a document id reordering technique for structured and categorized data that both improves compression and decreases query latency. Our results show that document id ordering on category substantially reduces the size of the index. It also reduces mean latency per query by 27% and 99th percentile latency by 45%. Latency improvements are seen both with and without category constraints applied.
                                                                                        and searched linearly.
2 BASIC CONCEPTS & RELATED WORK
2.1 Inverted Index
Given a collection D = {d_1, d_2, . . . , d_D} of text documents, a document d_i can be considered to be a sequence of terms (or words), the number of words, and a unique document identifier (doc_id) ∈ [1, D]. The number of words in a document is represented by |d_i|, and the total number of words in the collection is then Σ_{i=1}^{D} |d_i| = n.

An inverted index maintains the set of distinct terms of the collection (the vocabulary), which in most cases is small compared to the total number of words contained in the collection. More precisely, it is of size O(n^β), for 0 < β < 1 depending on the text type (β is normally small).

For every distinct word, the inverted index stores a list of postings. Each posting stores the document identifier (doc_id), the weight of the term in the document w(t, d) and, if needed, the positions of the term within the document. The weight w(t, d) of term t in d is a utility function that represents the importance of that word inside the document. That utility function is often just the number of times the term occurs in the document – the term frequency.

There are many different ways to encode posting lists, including term-frequency ordered, impact ordered, and so on, but the most common way appears to be ordering on increasing document id. Either way, in order to improve efficiency the inverted index is typically loaded into main memory when the search engine starts up, and parts of it are compressed in order to reduce the memory footprint.

The weights are hard to compress and usually small, so they are often stored uncompressed in a fixed number of bytes (often 1). The document ids, however, have been the subject of much research.

The list of doc_ids ⟨d_1, d_2, d_3, . . . , d_n⟩ is a strictly monotonically increasing sequence. These integers can be decreased in size by calculating the differences between each document id and the preceding document id, resulting in a list of d-gaps ⟨d_1, d_2 − d_1, d_3 − d_2, . . . , d_n − d_{n−1}⟩. The list of d-gaps is then encoded using a variable-length encoding scheme.

Bit-aligned codes were used in the past, but proved to be inefficient when decoding. Byte-aligned codes [11] and word-aligned codes [14] are now preferred as decoding speed is of concern. A simple, yet efficient, byte-aligned technique is Variable Byte Encoding (VByte), while an alternative approach is to word-align blocks of integers using an encoding such as PForDelta. Integer compression techniques for monotonic integer sequences have been studied for decades. We recommend the reader see the work of Trotman [13] for a comprehensive study and comparison of techniques.
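To make the difference encoding concrete, the following is a minimal VByte sketch in C++ (the language our search engine is written in). It is written for clarity rather than speed and is not the production codec.

    #include <cstdint>
    #include <vector>

    // Illustrative sketch only.
    // Encode a strictly increasing doc_id list as VByte-compressed d-gaps.
    // Each gap is emitted 7 bits at a time, low bits first; the high bit
    // marks the final byte of a value.
    std::vector<uint8_t> vbyte_encode(const std::vector<uint32_t> &doc_ids) {
        std::vector<uint8_t> out;
        uint32_t previous = 0;
        for (uint32_t id : doc_ids) {
            uint32_t gap = id - previous;          // d-gap (first gap = first id)
            previous = id;
            while (gap >= 128) {
                out.push_back(gap & 0x7F);         // continuation byte
                gap >>= 7;
            }
            out.push_back(gap | 0x80);             // terminator byte
        }
        return out;
    }

    // Decode back into absolute doc_ids by prefix-summing the gaps.
    std::vector<uint32_t> vbyte_decode(const std::vector<uint8_t> &bytes) {
        std::vector<uint32_t> doc_ids;
        uint32_t current = 0, gap = 0, shift = 0;
        for (uint8_t b : bytes) {
            gap |= static_cast<uint32_t>(b & 0x7F) << shift;
            if (b & 0x80) {                        // last byte of this gap
                current += gap;
                doc_ids.push_back(current);
                gap = 0;
                shift = 0;
            } else {
                shift += 7;
            }
        }
        return doc_ids;
    }

Small d-gaps need a single byte each, which is why clustered doc_ids compress and decode faster than scattered ones.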
Merging lists can be done by traversing the lists from start to finish, but in order to support more complicated query processing techniques, random access to postings in a list is needed. A recent approach is postings list encoding using Elias-Fano, but a more common approach is the use of skip-lists.

A skip-list for a postings list is generated by dividing the postings list into blocks. Each block starts with a given doc_id and is at a given offset from the start of the postings list. These ⟨doc_id, offset⟩ tuples provide entry points into the postings [6, 10].

To implement random access to a posting, a binary search is performed on the skip list, then the appropriate block is decompressed and searched linearly.
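As an illustration of that lookup (the structure and function names here are our own inventions, not our production layout), a sketch of finding the first posting with doc_id ≥ target might look as follows:

    #include <algorithm>
    #include <cstdint>
    #include <optional>
    #include <vector>

    // Illustrative sketch only.
    struct SkipEntry {
        uint32_t first_doc_id;   // first doc_id stored in the block
        size_t   byte_offset;    // where the block starts in the encoded postings
    };

    // Assumed helper: decompresses one block (e.g. with vbyte_decode above)
    // into absolute doc_ids.
    std::vector<uint32_t> decode_block(const std::vector<uint8_t> &postings,
                                       size_t byte_offset);

    // Return the first doc_id >= target, or std::nullopt if there is none.
    std::optional<uint32_t> seek(const std::vector<SkipEntry> &skips,
                                 const std::vector<uint8_t> &postings,
                                 uint32_t target) {
        // Binary search the skip list for the last block starting at or before target.
        auto it = std::upper_bound(skips.begin(), skips.end(), target,
            [](uint32_t t, const SkipEntry &e) { return t < e.first_doc_id; });
        if (it != skips.begin())
            --it;
        // Decompress candidate blocks and scan them linearly.
        for (; it != skips.end(); ++it)
            for (uint32_t doc_id : decode_block(postings, it->byte_offset))
                if (doc_id >= target)
                    return doc_id;
        return std::nullopt;
    }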
2.2 Query Processing
Query processing involves a number of processes such as query parsing, query rewriting, and the computation of complex machine-learned ranking functions that may include hundreds or thousands of features derived from the documents, the query, and the user. To rank efficiently, it is common to separate query processing into multiple stages. The first stage is responsible for identifying which documents must be ranked, and the subsequent stages rank those documents. In the first stage, simple and fast retrieval filters such as WAND [4] and BlockMax-WAND [8] are often used. We do not consider these algorithms further as Boolean retrieval is common as the first stage in e-Commerce search.

A conjunctive Boolean query of the form “Last AND Jedi” requires an intersection calculation, whereas a disjunctive query of the form “Episode OR VIII” requires a union calculation. Union queries are solved by linearly traversing all postings lists for all terms in the expression and returning all documents containing either term. Efficiently resolving intersection queries requires complicated traversal of the postings lists and has been examined for decades [1, 5, 9]. It has been proven that the optimal intersection algorithm for two sets of length m and n with m ≤ n is O(m log(n/m)). The most popular algorithm for solving intersections is Set Versus Set (also known as Small Versus Small) [1].
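A minimal sketch of the Set Versus Set idea, assuming already-decoded doc_id lists, is shown below: the shorter list drives the search and each of its doc_ids is sought in the longer list with a forward-only search. Replacing the binary search with a galloping (doubling) probe is one way to reach the O(m log(n/m)) bound; the simpler form is shown for brevity.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Illustrative sketch only.
    // Set Versus Set (Small Versus Small) intersection of two sorted doc_id lists.
    std::vector<uint32_t> intersect_svs(const std::vector<uint32_t> &a,
                                        const std::vector<uint32_t> &b) {
        const auto &small = (a.size() <= b.size()) ? a : b;
        const auto &large = (a.size() <= b.size()) ? b : a;
        std::vector<uint32_t> result;
        auto pos = large.begin();
        for (uint32_t doc_id : small) {
            pos = std::lower_bound(pos, large.end(), doc_id);   // forward-only seek
            if (pos == large.end())
                break;                                          // no more candidates
            if (*pos == doc_id)
                result.push_back(doc_id);
        }
        return result;
    }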
In the second and subsequent phases, an increasingly expensive set of ranking functions is used to identify the most relevant documents. Each stage provides to the next stage a better ranking of the recall set, but also reduces the number of documents that are ranked (each time the top-k, for decreasing k, are re-ranked). The final stage might just rank the top-10 using thousands of features derived from the document, the query, meta-data (the price), and a user profile (for e-Commerce, a buyer profile and a seller profile).
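A minimal sketch of such a cascade (the stage scorers, cut-offs, and names here are invented purely for illustration) re-scores the survivors of the previous stage and keeps a smaller top-k each time:

    #include <algorithm>
    #include <cstdint>
    #include <functional>
    #include <vector>

    // Illustrative sketch only.
    struct Candidate {
        uint32_t doc_id;
        double   score = 0.0;
    };

    // One cascade step: score every candidate, then keep only the k best.
    std::vector<Candidate> rank_stage(std::vector<Candidate> docs,
                                      const std::function<double(uint32_t)> &scorer,
                                      size_t k) {
        for (auto &c : docs)
            c.score = scorer(c.doc_id);
        k = std::min(k, docs.size());
        std::partial_sort(docs.begin(), docs.begin() + k, docs.end(),
                          [](const Candidate &x, const Candidate &y) {
                              return x.score > y.score;
                          });
        docs.resize(k);
        return docs;
    }

    // Usage with hypothetical scorers: recall set -> cheap ranker (top 1000)
    // -> expensive machine-learned ranker (top 10).
    //   auto survivors = rank_stage(recall_set, cheap_score, 1000);
    //   auto final_ten = rank_stage(survivors, expensive_score, 10);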
In general, improving the efficiency of the first stage frees more resources for the subsequent stages, and thus increases the overall performance of the search engine.

2.3 Document Reordering
Document reordering is a well studied technique in web search [2, 3, 7, 14]. Most prior work has focused on reordering documents to achieve better compression. The approach is to perform text clustering on the collection to find similar documents and then assign doc_ids to minimize the d-gaps in the posting lists. Silvestri [12] explored a much simpler idea that takes advantage of an implicit organization of the documents on the Web. In his work he proposes to assign doc_ids sequentially according to the document URL, and showed that this simple technique achieved competitive results when compared to text clustering techniques. In essence, he roughly clustered on topic because different sites and different parts of the same sites are usually topically related.

Yan et al. [14] studied integer encoding for the special case of optimally re-ordered d-gaps and introduced variants of well-known encoding techniques for the purpose. They also showed that, when using document reordering, it is possible to effectively compress term frequencies, since similar frequency values are also (consequently) grouped together. Finally, they evaluated different query processing schemes and showed that document reordering can help search engine performance by reducing the number of doc_ids that need to be decompressed (because fewer encoded blocks are decompressed).




                                          Figure 1: User category constrained query on eBay.com



Figure 2: Document reordering diagram. Different gray levels represent different categories. On the left, we show a sketch of
how documents get new doc_id assignment. On the right, the effect on posting lists.


Our approach is motivated by the work of Silvestri [12]: we present a simple but non-optimal document id reordering technique. It takes advantage of the structured nature of documents in e-Commerce, specifically that items that are for sale are usually classified and categorized before being listed. We evaluate this from both compression and query efficiency perspectives. We analyze the benefits of employing document id reordering in our search engine for the different stages in query processing, taking into consideration special constraints that are common in e-Commerce search.

3 DOCUMENT REORDERING IN E-COMMERCE
E-Commerce search is a much more structured and constrained scenario than web search. In e-Commerce, much of the document content is given by item properties such as price, brand, model, category, color, and so on. It is natural to consider these features as filtering components of a search, and it is consequently common practice to generate posting lists for those kinds of features. For example, by generating posting lists for each instance of the feature “brand” (i.e. brand:apple, brand:samsung, etc.) the user can easily constrain their search to just the items made by a given manufacturer – and indeed they expect to be able to do so.

Category is a particularly interesting property of e-Commerce items (and queries), because it is not only used to divide the inventory into distinct types of products but is also commonly used to improve the precision of the results. Given a user query, the search engine can constrain the results to just those from the most relevant category. This can be done in an automatic way by training query-to-category classifiers, or by allowing the user to manually select a category. Figure 1 shows an example user query “star wars” being constrained to the category “art” on eBay.com.

If the search engine creates posting lists for each category, the first stage of query processing can be improved significantly, since it can perform a direct Boolean intersection between the (possibly term expanded) user query and the posting list for the given category. Since this cuts down the size of the recall base it increases the efficiency of the search engine, but since it restricts to the most likely category it also removes noise from the results list and so increases precision. This is the motivation for document id reordering based on item category.
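As a sketch of that first stage (reusing the illustrative intersect_svs routine above, and assuming uncompressed, sorted doc_id lists purely for exposition), a category-constrained conjunctive query simply adds one more list to the intersection:

    #include <cstdint>
    #include <vector>

    // Illustrative sketch only.
    // First-stage recall for a category-constrained conjunctive query:
    // intersect every term posting list with the category posting list.
    std::vector<uint32_t> recall_with_category(
            const std::vector<std::vector<uint32_t>> &term_postings,
            const std::vector<uint32_t> &category_postings) {
        std::vector<uint32_t> result = category_postings;
        for (const auto &postings : term_postings) {
            result = intersect_svs(result, postings);   // from the sketch above
            if (result.empty())
                break;                                  // nothing left to match
        }
        return result;
    }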
We re-order the collection so that the doc_ids are assigned in such a way that items belonging to the same category are given contiguous doc_ids. This reduces the size of the d-gaps in posting lists, which leads to better compression. However, this is not the only benefit: since posting lists are sorted by doc_id, it creates implicit category “divisions” within each posting list.
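A minimal sketch of this assignment (the Item fields and the function name are invented for illustration; the production pipeline is more involved) is a stable sort of the items by category before numbering them:

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Illustrative sketch only.
    struct Item {
        std::string category;   // e.g. a leaf category id or path
        std::string title;      // remaining item fields elided
    };

    // Assign doc_ids so that items in the same category are contiguous.
    // Returns doc_ids parallel to `items`; the stable sort keeps the original
    // (e.g. collection) order within each category.
    std::vector<uint32_t> assign_doc_ids_by_category(const std::vector<Item> &items) {
        std::vector<uint32_t> order(items.size());
        for (size_t i = 0; i < order.size(); ++i)
            order[i] = static_cast<uint32_t>(i);
        std::stable_sort(order.begin(), order.end(),
                         [&](uint32_t a, uint32_t b) {
                             return items[a].category < items[b].category;
                         });
        std::vector<uint32_t> doc_id(items.size());
        for (size_t pos = 0; pos < order.size(); ++pos)
            doc_id[order[pos]] = static_cast<uint32_t>(pos) + 1;   // doc_ids start at 1
        return doc_id;
    }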
Figure 2 illustrates this. On the top left, the collection of documents is shown in “collection order”, where the distinct shades of gray represent different categories. The top right gives example posting lists for words (w_1, w_2, w_3). The bottom left of the figure shows the collection category reordered where, for example, collection ordered d_2 becomes category ordered d_3. The bottom right shows the effect of document reordering on the posting lists: they are now implicitly divided by category.

This document reordering scheme not only helps compression, but also decreases latency: as the d-gaps are smaller, the decompression of the integers is faster since, in general, a smaller number of operations is required to decode a smaller integer. Equally, since similar documents are stored together, fewer accesses to the skip-lists are needed. It is obvious that when a query is category constrained the results must lie within a consecutive part of the postings list.


Figure 3: Distribution of d-gaps. The x-axis corresponds to the number of bits required to represent the d-gaps in binary plus one. The y-axis is the percent of such d-gaps in the collection.

4 EXPERIMENTS
4.1 Experimental Setup
We conducted our experiments on eBay’s search engine. We selected 12 million random items from our dataset and constructed two indexes:
     • Random Index: documents were assigned doc_ids at random.
     • Reordered Index: documents were sorted by category and then assigned doc_ids.

All of our experiments were conducted on a dedicated server with two 10-core Intel Xeon E5-2670 v2 CPUs running at 2.50GHz, 25 MB of cache, and 128 GB of RAM, running Ubuntu 16.04 on Linux kernel 4.4.0-57. Our search engine is written in C++ and was compiled using g++ 5.4.0 and maximum optimization level.

To evaluate the performance of our document reordering technique we use two sets of queries:
     • General Queries: approximately 3 million queries from production logs of one eBay data center. These queries included user issued queries as well as system-issued queries (such as those from the eBay public APIs). No country-specific filtering was performed, so queries are in many languages.
     • User Queries: a random sample of 57,058 queries from eBay.com exactly as entered into the search box by eBay.com users.

4.2 Index Space Results
Since compression improvements will depend on the d-gap values, we first analyzed the differences in the distribution of the d-gaps between random doc_id assignment and category-based doc_id assignment. Figure 3 shows, on a log-linear scale, the distribution of d-gaps. The x-axis is ⌈log2(gap size)⌉ while the y-axis is the percentage of d-gaps of that size. It can be seen that the reordered distribution has many more d-gaps on the left – the reordered index has substantially more small d-gaps than the random index.
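The statistic plotted in Figure 3 can be computed with a short routine like the following sketch (our tooling differs in detail): each d-gap of a posting list is bucketed by ⌈log2(gap)⌉.

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Illustrative sketch only.
    // Histogram of ceil(log2(gap)) over one posting list of sorted doc_ids,
    // as plotted in Figure 3 (a gap of 1 falls in bucket 0).
    std::vector<uint64_t> gap_histogram(const std::vector<uint32_t> &doc_ids) {
        std::vector<uint64_t> buckets(33, 0);      // gaps fit in 32 bits
        uint32_t previous = 0;
        for (uint32_t id : doc_ids) {
            uint32_t gap = id - previous;          // d-gap (first gap = first id)
            previous = id;
            buckets[static_cast<size_t>(std::ceil(std::log2(gap)))] += 1;
        }
        return buckets;
    }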
                               Random        Reordered       Change (%)
     Avg. log2(d-gaps)         5.73          4.11            -28%
     d-gaps = 1                890 × 10^6    1,510 × 10^6    +70%
     Avg. d-gap                1966          639             -67.5%
     Avg. Bytes/d-gap (vByte)  1.30          1.22            -6.1%
     Index Size                29.67 GB      28.72 GB        -3.2%
Table 1: Space savings and bits per d-gap obtained before and after applying document reordering.

Table 1 presents a summary of the figure. It shows that the number of d-gaps equal to 1 has increased by 70%, that the average d-gap has decreased by 67.5%, and that the average number of bits required to represent a d-gap is reduced by 28%. In practice, the actual size reduction in the index will depend on the integer encoding scheme. For the purpose of this paper, we constructed a special index that uses variable byte encoding to compress doc_ids. We see a reduction of 6.1% in the average number of bytes required to represent a doc_id using this encoding scheme, which represents a 3.2% space reduction of the complete index.

4.3 Query Processing Results
In order to evaluate the improvement in query processing, we executed two sets of experiments with the two different query sets.

The first experiment considered throughput using the General Queries – that is, a mirror of the production environment. We computed the average number of queries per second (QPS) that could be resolved when the CPU was held at 80% utilization (the other 20% is used at eBay for processes such as index maintenance). We found that on average the Reordered Index could process about 30% more queries per second than the Random Index.

Table 2 shows the average search time per query (in milliseconds) at the mean, median, 95th percentile, and 99th percentile. In all cases we see a substantial improvement: the mean latency improved by 26.9% while 95th percentile latency is almost halved when compared to the random document ordering.

                    Random     Reordered    Latency Reduction
     Mean           22.43      16.4         26.9%
     Median         4.35       3.21         28.3%
     95th Pct.      57         30.2         47%
     99th Pct.      375        224          40%
Table 2: Comparison of random doc_id assignment versus category doc_id reordering. Mean, median, 95th and 99th percentiles of query processing times in milliseconds for general queries.

For the second experiment, with user queries, we evaluated the impact of document reordering depending on the recall set size (as output by the first stage of the search process) and the latency before ranking and after ranking.




Figure 4: Recall size vs latency with full ranking enabled and without ranking. Blue triangles represent data points with the document reordered index, while red plus signs represent random document reordering.

                                 Without Category Constraint                   With Category Constraint
                         Random Reordered Latency Reduction Random Reordered Latency Reduction
              Mean          26.8        14.3             47%             18.9        8.4              55%
              Median        5.9         3.5              42%              3.2        1.7              48%
              95th Pct.     85.8        50.8             41%             61.6        27.6             55%
          Table 3: Execution time in milliseconds for user queries with and without category constraint enforcement.



The results are presented in Figure 4. The blue plus signs represent the Reordered Index, while the red triangles represent the Random Index. On the left we show the impact post ranking, where it can be seen that the largest latency improvements are obtained when the queries generate a large set of results. On the right we show the recall versus latency results when ranking is disabled, in other words, just recall identification. It can be seen that there is a strict boundary at the bottom, and there is no significant improvement for the queries that are located within that boundary. Latency improvements can be seen overall, but are large for expensive queries.

We also evaluated the impact of applying category constraints to the queries. The results are shown in Table 3. The left side shows the latency (in milliseconds) when the category constraint is not used. In this case the Reordered Index improved mean query latency by 47% and the 95th percentile by 41%. The right side shows the effect of enabling the category constraint on the queries. There, the mean query latency is reduced by 55% when the Reordered Index is used, and a similar effect is observed for the 95th percentile. Clearly both unconstrained and category constrained queries are materially improved.

4.4 Latency Improvement Analysis
Categories naturally cluster both document terms and query terms. Items satisfying a multi-term AND query will tend to come from a small number of categories. Ordering posting lists by category will put documents satisfying these queries near each other, both in posting lists and in the forward index. This should improve CPU cache hit rates and even improve the inherent efficiency of the posting list processing algorithms. The latter effect would result from having larger clusters of posting list entries that are either matched or skipped than in a random assignment of the documents.

Query expansion is likely to compound these effects, especially in an e-Commerce environment. As in most search engines, we make extensive use of query expansion to improve recall and precision. Rewritten queries often form a complex Boolean expression involving many posting lists and nested AND / OR constructs. Expansions involve not only text keywords, but also the structured meta-data associated with products. For example, the term “black” may expand to “color:black” and “samsung” to “brand:samsung”.

To examine these effects we used the CPU usage profiling tool perf, while processing a portion of the General Queries collection, to analyze and identify the exact locations where the improvement was most noticeable. We observed that the functions performing the task of iterating (and skipping) through posting lists were consuming about 5% of the total CPU time, and we observed a noticeable improvement especially in these parts. We also saw improvements in the doc_id variable byte decoding section. Finally, we analyzed how cache misses were affected by the document reordered index. We observed a 7% reduction in overall cache misses and a 13% reduction in last-level cache (LLC) misses.


These numbers show that document ordering by category yielded a significant improvement in overall CPU cache hit rates, and they are consistent with our hypothesis for the improvements in latency. Additional analysis is still needed to quantify the effects on posting list processing.

The probability that any given cluster of posting list entries will be referenced by a query is far from uniform in the General Queries collection, and more likely follows something like a Zipfian curve. This should reduce the number of CPU cache entries filled with posting list data during the processing of a query, and thus reduce the CPU cache working set size for the portion of query processing dedicated to processing posting lists. The reduction in CPU cache working set size for posting lists allows a larger amount of CPU cache to be used for other functions performing query processing work, which improves the CPU cache hit rate for those other functions.

The above discussion focuses on the recall component of query processing. As mentioned earlier, document reordering also better co-locates forward index data for items satisfying the recall expression for a query. Forward index data is used extensively in the ranking component of query processing. As such, it also has the potential to improve CPU cache hit rates. We have not measured this effect directly, but it is consistent with the results shown in Figure 4. These graphs show a latency improvement from ranking beyond the improvement from recall processing alone.

5    CONCLUSIONS AND FUTURE WORK
We presented a simple, yet efficient, document reordering technique that takes advantage of the structured components of the queries and the data. This structure is particularly common in e-Commerce search engines. We showed that by performing this simple re-arrangement of the doc_ids we can improve the capacity of the system by 30%, and process category-constrained queries in about half the time that it took without the re-arrangement.

As future work, we plan to extend the document reordering criteria to other dimensions, such as item aspects (for example color and brand) and the location of the item, by ordering by country. Another important task is to measure the behavior of the cache in this scenario. We also plan to perform a deeper analysis and quantification of the latency improvements seen in both Boolean-only queries and when ranking is enabled. A topic for future study is whether there is a material difference between these two cases when considering per-document latency benefits. The current analysis is insufficient to answer this question.

REFERENCES
 [1] Jeremy Barbay, Alejandro López-Ortiz, Tyler Lu, and Alejandro Salinger. 2009. An experimental investigation of set intersection algorithms for text searching. ACM Journal of Experimental Algorithmics 14 (2009), Article 7.
 [2] Roi Blanco and Álvaro Barreiro. 2005. Document identifier reassignment through dimensionality reduction. In European Conference on Information Retrieval. Springer, 375–387.
 [3] Dan Blandford and Guy Blelloch. 2002. Index compression through document reordering. In Proceedings of the Data Compression Conference (DCC 2002). IEEE, 342–351.
 [4] Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Zien. 2003. Efficient query evaluation using a two-level retrieval process. In Proceedings of the Twelfth International Conference on Information and Knowledge Management. ACM, 426–434.
 [5] Shane Culpepper and Alistair Moffat. 2010. Efficient Set Intersection for Inverted Indexing. ACM Transactions on Information Systems 29, 1, Article 1 (2010), 25 pages.
 [6] Shane J. Culpepper and Alistair Moffat. 2007. Compact Set Representation for Information Retrieval. In Proceedings of the 14th International Symposium on String Processing and Information Retrieval (SPIRE). 137–148.
 [7] Shuai Ding, Josh Attenberg, and Torsten Suel. 2010. Scalable Techniques for Document Identifier Assignment in Inverted Indexes. In Proceedings of the 19th International Conference on World Wide Web (WWW ’10). ACM, 311–320.
 [8] Shuai Ding and Torsten Suel. 2011. Faster top-k document retrieval using block-max indexes. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 993–1002.
 [9] Andrew Kane and Frank Tompa. 2014. Skewed Partial Bitvectors for List Intersection. In Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’14). 263–272.
[10] Peter Sanders and Frederik Transier. 2007. Intersection in Integer Inverted Indices. In Proceedings of the 9th Workshop on Algorithm Engineering and Experiments (ALENEX).
[11] Falk Scholer, Hugh Williams, John Yiannis, and Justin Zobel. 2002. Compression of inverted indexes for fast query evaluation. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 222–229.
[12] Fabrizio Silvestri. 2007. Sorting out the Document Identifier Assignment Problem. In Proceedings of the 29th European Conference on IR Research (ECIR ’07). 101–112.
[13] Andrew Trotman. 2014. Compression, SIMD, and Postings Lists. In Proceedings of the 2014 Australasian Document Computing Symposium (ADCS). Article 50, 8 pages.
[14] Hao Yan, Shuai Ding, and Torsten Suel. 2009. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th International Conference on World Wide Web. ACM, 401–410.
[15] Marcin Zukowski, Sandor Heman, Niels Nes, and Peter Boncz. 2006. Super-scalar RAM-CPU cache compression. In Proceedings of the 22nd International Conference on Data Engineering (ICDE ’06). IEEE, 59–59.