Document Reordering is Good, Especially for e-Commerce

Vishnusaran Ramaswamy, eBay Search, visramaswamy@ebay.com
Roberto Konow, eBay Search, rkonow@ebay.com
Andrew Trotman, University of Otago, andrew@cs.otago.ac.nz
Jon Degenhardt, eBay Search, jdegenhardt@ebay.com
Nick Whyte, eBay Search, nwhyte@ebay.com

ABSTRACT
Document id reordering is a well-known technique in web search for improving index compressibility and reducing query processing time. We explore and evaluate the benefits of document id reordering for large-scale e-Commerce search. We observe that many e-Commerce sites organize offerings according to an ontology (i.e. product category). We then present a simple extension to document-id reordering: ordering based on item category. Our results show that we not only achieve the expected index size reduction, but also achieve a latency reduction of over 20% (on average) per query.

KEYWORDS
E-Commerce search, Document id Re-ordering, Compression

ACM Reference format:
Vishnusaran Ramaswamy, Roberto Konow, Andrew Trotman, Jon Degenhardt, and Nick Whyte. 2017. Document Reordering is Good, Especially for e-Commerce. In Proceedings of ACM SIGIR Workshop on eCommerce, Tokyo, Japan, August 2017 (SIGIR 2017 eCom), 6 pages.

Copyright © 2017 by the paper’s authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, S. Kallumadi, M. de Rijke, L. Si, A. Trotman, Y. Xu (eds.): Proceedings of the SIGIR 2017 eCom workshop, August 2017, Tokyo, Japan, published at http://ceur-ws.org

1 INTRODUCTION
The search engine plays an essential role in e-Commerce: it connects the user’s need with a set of relevant items based on a query. This is not a simple task: millions of queries per second need to be processed over possibly billions of items. Moreover, it is expected that every query will be executed in just a few hundred milliseconds.

In order to solve this problem, search engines are implemented as large distributed systems where there is a limited budget in CPU and memory that every machine can use. Any improvement in efficiency could potentially be translated into a reduction in hardware, networking, and operating costs. Tremendous research and engineering effort has gone into addressing performance challenges: reduction of memory requirements by improving data compression, reduction of CPU use by implementing early termination techniques, and massively parallel execution engines are just a few of the techniques that have been extensively studied in the past. In this paper, we focus on one technique originally designed to improve data compression and reduce the size of the data that is loaded into main memory: document id reordering [3].

The inverted index is an old and simple, yet very efficient, data structure that is at the heart of most search engines and is used to support various search tasks. From a collection of documents, an inverted index stores for each unique term (or word) a list of postings. Each posting stores pairs ⟨d, w(t, d)⟩, where d is a unique document identifier (doc_id) and w(t, d) is a relevance measure of the term t with respect to document d (often the number of times term t occurs in document d). These posting lists can be extended to store additional information such as the positions of the term within the document. Posting lists are usually stored sorted by doc_id and processed a document at a time. To help with compression, doc_ids are often stored difference encoded: the value stored in the list is the difference (or d-gap) between this id and the preceding id. These d-gaps are further compressed using a variable-width integer codec such as variable byte encoding, PForDelta [15], QMX [13], or similar.
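As a small worked example (an illustrative list, not drawn from our data), the doc_id list ⟨3, 7, 11, 23⟩ is stored as the d-gaps ⟨3, 4, 4, 12⟩, and the original ids are recovered by a running sum during decoding.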
The final compression ratio depends on the number of bits required to represent the d-gaps, which depends on how the doc_ids are assigned to documents. If the d-gaps are small they compress better than if they are large. That is, if all the documents containing a given term have document ids that are close to each other then the index is smaller than if they are randomly distributed throughout the collection, simply because the d-gaps are smaller and so compress better.

This has motivated several authors in the past [2, 3, 7, 14] to study the doc_id assignment process in such a way as to optimize compression. In practice, search engines can assign doc_ids in a number of different ways: at random, based on document similarity, in the order documents are indexed (collection order), based on a global measure of quality such as PageRank, or, for web documents, in URL order. Others have noted [7] that document reordering not only reduces index size, but can also decrease query processing time.

A popular e-Commerce search technique to improve precision is to constrain a query to a particular set of categories in a category tree. This can be done automatically by a trained classifier, or manually by the user. For example, the results of the query iphone case can be constrained so that all the resulting items belong to the category “Cell Phones & Accessories Cases, Covers & Skins”. These categories also form natural partitions, clustering items according to popular search dimensions.

In this paper we explore a document id reordering technique for structured and categorized data that both improves compression and decreases query latency. Our results show that document id ordering on category substantially reduces the size of the index. It also reduces mean latency per query by 27% and 99th percentile latency by 45%. Latency improvements are seen both with and without category constraints applied.
                                                                                        and searched linearly.
2 BASIC CONCEPTS & RELATED WORK
2.1 Inverted Index
Given a collection D = {d_1, d_2, . . . , d_D} of text documents, a document d_i can be considered to be a sequence of terms (or words), the number of words, and a unique document identifier (doc_id) ∈ [1, D]. The number of words in a document is represented by |d_i|, and the total number of words in the collection is then Σ_{i=1}^{D} |d_i| = n.

An inverted index maintains the set of distinct terms of the collection (the vocabulary), which in most cases is small compared to the total number of words contained in the collection. More precisely, it is of size O(n^β), for 0 < β < 1 depending on the text type (β is normally small).

For every distinct word, the inverted index stores a list of postings. Each posting stores the document identifier (doc_id), the weight of the term in the document w(t, d) and, if needed, the positions of the term within the document. The weight w(t, d) of term t in d is a utility function that represents the importance of that word inside the document. That utility function is often just the number of times the term occurs in the document – the term frequency.

There are many different ways to encode posting lists, including term-frequency ordered, impact ordered, and so on, but the most common way appears to be ordering on increasing document id. Either way, in order to improve efficiency the inverted index is typically loaded into main memory when the search engine starts up, and parts of it are compressed in order to reduce the memory footprint.

The weights are hard to compress and usually small, so they are often stored uncompressed in a fixed number of bytes (often 1). The document ids, however, have been the subject of much research.

The list of doc_ids ⟨d_1, d_2, d_3, . . . , d_n⟩ is a strictly monotonically increasing sequence. These integers can be decreased in size by calculating the differences between each document id and the preceding document id, resulting in a list of d-gaps ⟨d_1, d_2 − d_1, d_3 − d_2, . . . , d_n − d_{n−1}⟩. The list of d-gaps is then encoded using a variable-length encoding scheme.

Bit-aligned codes were used in the past, but proved to be inefficient when decoding. Byte-aligned codes [11] and word-aligned codes [14] are now preferred as decoding speed is of concern. A simple, yet efficient, byte-aligned technique is Variable Byte Encoding (VByte), while an alternative approach is to word-align blocks of integers using an encoding such as PForDelta. Integer compression techniques for monotonic integer sequences have been studied for decades. We recommend the reader see the work of Trotman [13] for a comprehensive study and comparison of techniques.
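To make the difference encoding concrete, the following is a minimal VByte sketch in C++ (the language our search engine is written in). It is written for clarity rather than speed and is not the production codec.

    #include <cstdint>
    #include <vector>

    // Illustrative sketch only.
    // Encode a strictly increasing doc_id list as VByte-compressed d-gaps.
    // Each gap is emitted 7 bits at a time, low bits first; the high bit
    // marks the final byte of a value.
    std::vector<uint8_t> vbyte_encode(const std::vector<uint32_t> &doc_ids) {
        std::vector<uint8_t> out;
        uint32_t previous = 0;
        for (uint32_t id : doc_ids) {
            uint32_t gap = id - previous;          // d-gap (first gap = first id)
            previous = id;
            while (gap >= 128) {
                out.push_back(gap & 0x7F);         // continuation byte
                gap >>= 7;
            }
            out.push_back(gap | 0x80);             // terminator byte
        }
        return out;
    }

    // Decode back into absolute doc_ids by prefix-summing the gaps.
    std::vector<uint32_t> vbyte_decode(const std::vector<uint8_t> &bytes) {
        std::vector<uint32_t> doc_ids;
        uint32_t current = 0, gap = 0, shift = 0;
        for (uint8_t b : bytes) {
            gap |= static_cast<uint32_t>(b & 0x7F) << shift;
            if (b & 0x80) {                        // last byte of this gap
                current += gap;
                doc_ids.push_back(current);
                gap = 0;
                shift = 0;
            } else {
                shift += 7;
            }
        }
        return doc_ids;
    }

Small d-gaps need a single byte each, which is why clustered doc_ids compress and decode faster than scattered ones.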
Merging lists can be done by traversing the lists from start to finish, but in order to support more complicated query processing techniques, random access to postings in a list is needed. A recent approach is postings list encoding using Elias-Fano, but a more common approach is the use of skip-lists.

A skip-list for a postings list is generated by dividing the postings list into blocks. Each block starts with a given doc_id and is at a given offset from the start of the postings list. These ⟨doc_id, offset⟩ tuples provide entry points into the postings [6, 10].

To implement random access to a posting, a binary search is performed on the skip list, then the appropriate block is decompressed and searched linearly.
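As an illustration of that lookup (the structure and function names here are our own inventions, not our production layout), a sketch of finding the first posting with doc_id ≥ target might look as follows:

    #include <algorithm>
    #include <cstdint>
    #include <optional>
    #include <vector>

    // Illustrative sketch only.
    struct SkipEntry {
        uint32_t first_doc_id;   // first doc_id stored in the block
        size_t   byte_offset;    // where the block starts in the encoded postings
    };

    // Assumed helper: decompresses one block (e.g. with vbyte_decode above)
    // into absolute doc_ids.
    std::vector<uint32_t> decode_block(const std::vector<uint8_t> &postings,
                                       size_t byte_offset);

    // Return the first doc_id >= target, or std::nullopt if there is none.
    std::optional<uint32_t> seek(const std::vector<SkipEntry> &skips,
                                 const std::vector<uint8_t> &postings,
                                 uint32_t target) {
        // Binary search the skip list for the last block starting at or before target.
        auto it = std::upper_bound(skips.begin(), skips.end(), target,
            [](uint32_t t, const SkipEntry &e) { return t < e.first_doc_id; });
        if (it != skips.begin())
            --it;
        // Decompress candidate blocks and scan them linearly.
        for (; it != skips.end(); ++it)
            for (uint32_t doc_id : decode_block(postings, it->byte_offset))
                if (doc_id >= target)
                    return doc_id;
        return std::nullopt;
    }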
2.2 Query Processing
Query processing involves a number of processes such as query parsing, query rewriting, and the computation of complex machine-learned ranking functions that may include hundreds or thousands of features derived from the documents, the query, and the user. To rank efficiently, it is common to separate query processing into multiple stages. The first stage is responsible for identifying which documents must be ranked, and the subsequent stages rank those documents. In the first stage, simple and fast retrieval filters such as WAND [4] and BlockMax-WAND [8] are often used. We do not consider these algorithms further as Boolean retrieval is common as the first stage in e-Commerce search.

A conjunctive Boolean query of the form “Last AND Jedi” requires an intersection calculation, whereas a disjunctive query of the form “Episode OR VIII” requires a union calculation. Union queries are solved by linearly traversing all postings lists for all terms in the expression and returning all documents containing either term. Efficiently resolving intersection queries requires complicated traversal of the postings lists and has been examined for decades [1, 5, 9]. It has been proven that the optimal intersection algorithm for two sets of length m and n with m ≤ n is O(m log(n/m)). The most popular algorithm for solving intersections is Set Versus Set (also known as Small Versus Small) [1].
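A minimal sketch of the Set Versus Set idea, assuming already-decoded doc_id lists, is shown below: the shorter list drives the search and each of its doc_ids is sought in the longer list with a forward-only search. Replacing the binary search with a galloping (doubling) probe is one way to reach the O(m log(n/m)) bound; the simpler form is shown for brevity.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Illustrative sketch only.
    // Set Versus Set (Small Versus Small) intersection of two sorted doc_id lists.
    std::vector<uint32_t> intersect_svs(const std::vector<uint32_t> &a,
                                        const std::vector<uint32_t> &b) {
        const auto &small = (a.size() <= b.size()) ? a : b;
        const auto &large = (a.size() <= b.size()) ? b : a;
        std::vector<uint32_t> result;
        auto pos = large.begin();
        for (uint32_t doc_id : small) {
            pos = std::lower_bound(pos, large.end(), doc_id);   // forward-only seek
            if (pos == large.end())
                break;                                          // no more candidates
            if (*pos == doc_id)
                result.push_back(doc_id);
        }
        return result;
    }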
In the second and subsequent phases, an increasingly expensive set of ranking functions is used to identify the most relevant documents. Each stage provides to the next stage a better ranking of the recall set, but also reduces the number of documents that are ranked (each time the top-k, for decreasing k, are re-ranked). The final stage might just rank the top-10 using thousands of features derived from the document, the query, meta-data (the price), and a user profile (for e-Commerce, a buyer profile and a seller profile).
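A minimal sketch of such a cascade (the stage scorers, cut-offs, and names here are invented purely for illustration) re-scores the survivors of the previous stage and keeps a smaller top-k each time:

    #include <algorithm>
    #include <cstdint>
    #include <functional>
    #include <vector>

    // Illustrative sketch only.
    struct Candidate {
        uint32_t doc_id;
        double   score = 0.0;
    };

    // One cascade step: score every candidate, then keep only the k best.
    std::vector<Candidate> rank_stage(std::vector<Candidate> docs,
                                      const std::function<double(uint32_t)> &scorer,
                                      size_t k) {
        for (auto &c : docs)
            c.score = scorer(c.doc_id);
        k = std::min(k, docs.size());
        std::partial_sort(docs.begin(), docs.begin() + k, docs.end(),
                          [](const Candidate &x, const Candidate &y) {
                              return x.score > y.score;
                          });
        docs.resize(k);
        return docs;
    }

    // Usage with hypothetical scorers: recall set -> cheap ranker (top 1000)
    // -> expensive machine-learned ranker (top 10).
    //   auto survivors = rank_stage(recall_set, cheap_score, 1000);
    //   auto final_ten = rank_stage(survivors, expensive_score, 10);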
In general, improving the efficiency of the first stage frees more resources for the subsequent stages, and thus increases the overall performance of the search engine.

2.3 Document Reordering
Document reordering is a well studied technique in web search [2, 3, 7, 14]. Most prior work has focused on reordering documents to achieve better compression. The approach is to perform text clustering on the collection to find similar documents and then assign doc_ids to minimize the d-gaps in the posting lists. Silvestri [12] explored a much simpler idea that takes advantage of an implicit organization of the documents on the Web. In his work he proposes to assign doc_ids sequentially according to the document URL, and showed that this simple technique achieved competitive results when compared to text clustering techniques. In essence, he roughly clustered on topic because different sites and different parts of the same sites are usually topically related.

Yan et al. [14] studied integer encoding for the special case of optimally re-ordered d-gaps and introduced variants of well-known encoding techniques for the purpose. They also showed that, when using document reordering, it is possible to effectively compress term frequencies, since similar frequency values are also (consequently) grouped together. Finally, they evaluated different query processing schemes and showed that document reordering can help search engine performance by reducing the number of doc_ids that need to be decompressed (because fewer encoded blocks are decompressed).




                                          Figure 1: User category constrained query on eBay.com



Figure 2: Document reordering diagram. Different gray levels represent different categories. On the left, we show a sketch of
how documents get new doc_id assignment. On the right, the effect on posting lists.


Our approach is motivated by the work of Silvestri [12]: we present a simple but non-optimal document id reordering technique. It takes advantage of the structured nature of documents in e-Commerce, specifically that items that are for sale are usually classified and categorized before being listed. We evaluate this from both compression and query efficiency perspectives. We analyze the benefits of employing document id reordering in our search engine for the different stages in query processing, taking into consideration special constraints that are common in e-Commerce search.

3 DOCUMENT REORDERING IN E-COMMERCE
E-Commerce search is a much more structured and constrained scenario than web search. In e-Commerce, much of the document content is given by item properties such as price, brand, model, category, color, and so on. It is natural to consider these features as filtering components of a search, and it is consequently common practice to generate posting lists for those kinds of features. For example, by generating posting lists for each instance of the feature “brand” (i.e. brand:apple, brand:samsung, etc.) the user can easily constrain their search to just the items made by a given manufacturer – and indeed they expect to be able to do so.

Category is a particularly interesting property of e-Commerce items (and queries), because it is not only used to divide the inventory into distinct types of products but is also commonly used to improve the precision of the results. Given a user query, the search engine can constrain the results to just those from the most relevant category. This can be done in an automatic way by training query-to-category classifiers, or by allowing the user to manually select a category. Figure 1 shows an example user query “star wars” being constrained to the category “art” on eBay.com.

If the search engine creates posting lists for each category, the first stage of query processing can be improved significantly, since it can perform a direct Boolean intersection between the (possibly term expanded) user query and the posting list for the given category. Since this cuts down the size of the recall base it increases the efficiency of the search engine, but since it restricts to the most likely category it also removes noise from the results list and so increases precision. This is the motivation for document id reordering based on item category.
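As a sketch of that first stage (reusing the illustrative intersect_svs routine above, and assuming uncompressed, sorted doc_id lists purely for exposition), a category-constrained conjunctive query simply adds one more list to the intersection:

    #include <cstdint>
    #include <vector>

    // Illustrative sketch only.
    // First-stage recall for a category-constrained conjunctive query:
    // intersect every term posting list with the category posting list.
    std::vector<uint32_t> recall_with_category(
            const std::vector<std::vector<uint32_t>> &term_postings,
            const std::vector<uint32_t> &category_postings) {
        std::vector<uint32_t> result = category_postings;
        for (const auto &postings : term_postings) {
            result = intersect_svs(result, postings);   // from the sketch above
            if (result.empty())
                break;                                  // nothing left to match
        }
        return result;
    }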
We re-order the collection so that the doc_ids are assigned in such a way that items belonging to the same category are given contiguous doc_ids. This reduces the size of the d-gaps in posting lists, which leads to better compression. However, this is not the only benefit: since posting lists are sorted by doc_id, it creates implicit category “divisions” within each posting list.
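A minimal sketch of this assignment (the Item fields and the function name are invented for illustration; the production pipeline is more involved) is a stable sort of the items by category before numbering them:

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Illustrative sketch only.
    struct Item {
        std::string category;   // e.g. a leaf category id or path
        std::string title;      // remaining item fields elided
    };

    // Assign doc_ids so that items in the same category are contiguous.
    // Returns doc_ids parallel to `items`; the stable sort keeps the original
    // (e.g. collection) order within each category.
    std::vector<uint32_t> assign_doc_ids_by_category(const std::vector<Item> &items) {
        std::vector<uint32_t> order(items.size());
        for (size_t i = 0; i < order.size(); ++i)
            order[i] = static_cast<uint32_t>(i);
        std::stable_sort(order.begin(), order.end(),
                         [&](uint32_t a, uint32_t b) {
                             return items[a].category < items[b].category;
                         });
        std::vector<uint32_t> doc_id(items.size());
        for (size_t pos = 0; pos < order.size(); ++pos)
            doc_id[order[pos]] = static_cast<uint32_t>(pos) + 1;   // doc_ids start at 1
        return doc_id;
    }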
Figure 2 illustrates this. On the top left, the collection of documents is shown in “collection order”, where the distinct shades of gray represent different categories. The top right gives example posting lists for words (w_1, w_2, w_3). The bottom left of the figure shows the collection category reordered where, for example, collection ordered d_2 becomes category ordered d_3. The bottom right shows the effect of document reordering on the posting lists: they are now implicitly divided by category.

This document reordering scheme not only helps compression, but also decreases latency: as the d-gaps are smaller, the decompression of the integers is faster since, in general, a smaller number of operations is required to decode a smaller integer. Equally, since similar documents are stored together, fewer accesses to the skip-lists are needed. It is obvious that when a query is category constrained the results must lie within a consecutive part of the postings list.


Figure 3: Distribution of d-gaps. The x-axis corresponds to the number of bits required to represent the d-gaps in binary plus one. The y-axis is the percent of such d-gaps in the collection.

4 EXPERIMENTS
4.1 Experimental Setup
We conducted our experiments on eBay’s search engine. We selected 12 million random items from our dataset and constructed two indexes:
     • Random Index: documents were assigned doc_ids at random.
     • Reordered Index: documents were sorted by category and then assigned doc_ids.

All of our experiments were conducted on a dedicated server with two 10-core Intel Xeon E5-2670 v2 CPUs running at 2.50GHz, 25 MB of cache, and 128 GB of RAM, running Ubuntu 16.04 on Linux kernel 4.4.0-57. Our search engine is written in C++ and was compiled using g++ 5.4.0 and maximum optimization level.

To evaluate the performance of our document reordering technique we use two sets of queries:
     • General Queries: approximately 3 million queries from production logs of one eBay data center. These queries included user issued queries as well as system-issued queries (such as those from the eBay public APIs). No country-specific filtering was performed, so queries are in many languages.
     • User Queries: a random sample of 57,058 queries from eBay.com exactly as entered into the search box by eBay.com users.

4.2 Index Space Results
Since compression improvements will depend on the d-gap values, we first analyzed the differences in the distribution of the d-gaps between random doc_id assignment and category-based doc_id assignment. Figure 3 shows, on a log-linear scale, the distribution of d-gaps. The x-axis is ⌈log2(gap size)⌉ while the y-axis is the percentage of d-gaps of that size. It can be seen that the reordered distribution has many more d-gaps on the left – the reordered index has substantially more small d-gaps than the random index.
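The statistic plotted in Figure 3 can be computed with a short routine like the following sketch (our tooling differs in detail): each d-gap of a posting list is bucketed by ⌈log2(gap)⌉.

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Illustrative sketch only.
    // Histogram of ceil(log2(gap)) over one posting list of sorted doc_ids,
    // as plotted in Figure 3 (a gap of 1 falls in bucket 0).
    std::vector<uint64_t> gap_histogram(const std::vector<uint32_t> &doc_ids) {
        std::vector<uint64_t> buckets(33, 0);      // gaps fit in 32 bits
        uint32_t previous = 0;
        for (uint32_t id : doc_ids) {
            uint32_t gap = id - previous;          // d-gap (first gap = first id)
            previous = id;
            buckets[static_cast<size_t>(std::ceil(std::log2(gap)))] += 1;
        }
        return buckets;
    }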
                               Random        Reordered       Change (%)
     Avg. log2(d-gaps)         5.73          4.11            -28%
     d-gaps = 1                890 × 10^6    1,510 × 10^6    +70%
     Avg. d-gap                1966          639             -67.5%
     Avg. Bytes/d-gap (vByte)  1.30          1.22            -6.1%
     Index Size                29.67 GB      28.72 GB        -3.2%
Table 1: Space savings and bits per d-gap obtained before and after applying document reordering.

Table 1 presents a summary of the figure. It shows that the number of d-gaps equal to 1 has increased by 70%, that the average d-gap has decreased by 67.5%, and that the average number of bits required to represent a d-gap is reduced by 28%. In practice, the actual size reduction in the index will depend on the integer encoding scheme. For the purpose of this paper, we constructed a special index that uses variable byte encoding to compress doc_ids. We see a reduction of 6.1% in the average number of bytes required to represent a doc_id using this encoding scheme, which represents a 3.2% space reduction of the complete index.

4.3 Query Processing Results
In order to evaluate the improvement in query processing, we executed two sets of experiments with the two different query sets.

The first experiment considered throughput using the General Queries – that is, a mirror of the production environment. We computed the average number of queries per second (QPS) that could be resolved when the CPU was held at 80% utilization (the other 20% is used at eBay for processes such as index maintenance). We found that on average the Reordered Index could process about 30% more queries per second than the Random Index.

Table 2 shows the average search time per query (in milliseconds) at the mean, median, 95th percentile, and 99th percentile. In all cases we see a substantial improvement: the mean latency improved by 26.9% while 95th percentile latency is almost halved when compared to the random document ordering.

                    Random     Reordered    Latency Reduction
     Mean           22.43      16.4         26.9%
     Median         4.35       3.21         28.3%
     95th Pct.      57         30.2         47%
     99th Pct.      375        224          40%
Table 2: Comparison of random doc_id assignment versus category doc_id reordering. Mean, median, 95th and 99th percentiles of query processing times in milliseconds for general queries.

For the second experiment, with user queries, we evaluated the impact of document reordering depending on the recall set size (as output by the first stage of the search process) and the latency before ranking and after ranking.




Figure 4: Recall size vs latency with full ranking enabled and without ranking. Blue triangles represent data points with the document reordered index, while red plus signs represent random document reordering.

                                 Without Category Constraint                   With Category Constraint
                         Random Reordered Latency Reduction Random Reordered Latency Reduction
              Mean          26.8        14.3             47%             18.9        8.4              55%
              Median        5.9         3.5              42%              3.2        1.7              48%
              95th Pct.     85.8        50.8             41%             61.6        27.6             55%
          Table 3: Execution time in milliseconds for user queries with and without category constraint enforcement.



The results are presented in Figure 4. The blue plus signs represent the Reordered Index, while the red triangles represent the Random Index. On the left we show the impact post ranking, where it can be seen that the largest latency improvements are obtained when the queries generate a large set of results. On the right we show the recall versus latency results when ranking is disabled, in other words, just recall identification. It can be seen that there is a strict boundary at the bottom, and there is no significant improvement for the queries that are located within that boundary. Latency improvements can be seen overall, but are large for expensive queries.

We also evaluated the impact of applying category constraints to the queries. The results are shown in Table 3. The left side shows the latency (in milliseconds) when the category constraint is not used. In this case the Reordered Index improved mean query latency by 47% and the 95th percentile by 41%. The right side shows the effect of enabling the category constraint on the queries. There, the mean query latency is reduced by 55% when the Reordered Index is used, and a similar effect is observed for the 95th percentile. Clearly both unconstrained and category constrained queries are materially improved.

4.4 Latency Improvement Analysis
Categories naturally cluster both document terms and query terms. Items satisfying a multi-term AND query will tend to come from a small number of categories. Ordering posting lists by category will put documents satisfying these queries near each other, both in posting lists and in the forward index. This should improve CPU cache hit rates and even improve the inherent efficiency of the posting list processing algorithms. The latter effect would result from having larger clusters of posting list entries that are either matched or skipped than in a random assignment of the documents.

Query expansion is likely to compound these effects, especially in an e-Commerce environment. As in most search engines, we make extensive use of query expansion to improve recall and precision. Rewritten queries often form a complex Boolean expression involving many posting lists and nested AND / OR constructs. Expansions involve not only text keywords, but also the structured meta-data associated with products. For example, the term “black” may expand to “color:black” and “samsung” to “brand:samsung”.

To examine these effects we used the CPU usage profiling tool perf, while processing a portion of the General Queries collection, to analyze and identify the exact locations where the improvement was most noticeable. We observed that the functions performing the task of iterating (and skipping) through posting lists were consuming about 5% of the total CPU time, and we observed a noticeable improvement especially in these parts. We also saw improvements in the doc_id variable byte decoding section. Finally, we analyzed how cache misses were affected by the document reordered index. We observed a 7% reduction in overall cache misses and a 13% reduction in last-level cache (LLC) misses.


These numbers show that document ordering by category yielded a significant improvement in overall CPU cache hit rates, and they are consistent with our hypothesis for the improvements in latency. Additional analysis is still needed to quantify the effects on posting list processing.

The probability that any given cluster of posting list entries will be referenced by a query is far from uniform in the General Queries collection, and more likely follows something like a Zipfian curve. This should reduce the number of CPU cache entries filled with posting list data during the processing of a query, and thus reduce the CPU cache working set size for the portion of query processing dedicated to processing posting lists. The reduction in CPU cache working set size for posting lists allows a larger amount of CPU cache to be used for other functions performing query processing work, which improves the CPU cache hit rate for those other functions.

The above discussion focuses on the recall component of query processing. As mentioned earlier, document reordering also better co-locates forward index data for items satisfying the recall expression for a query. Forward index data is used extensively in the ranking component of query processing. As such, it also has the potential to improve CPU cache hit rates. We have not measured this effect directly, but it is consistent with the results shown in Figure 4. These graphs show a latency improvement from ranking beyond the improvement from recall processing alone.

5    CONCLUSIONS AND FUTURE WORK
We presented a simple, yet efficient, document reordering technique that takes advantage of the structured components of the queries and the data. This structure is particularly common in e-Commerce search engines. We showed that by performing this simple re-arrangement of the doc_ids we can improve the capacity of the system by 30%, and process category-constrained queries in about half the time that it took without the re-arrangement.

As future work, we plan to extend the document reordering criteria to other dimensions, such as item aspects (for example color and brand) and the location of the item, by ordering by country. Another important task is to measure the behavior of the cache in this scenario. We also plan to perform a deeper analysis and quantification of the latency improvements seen in both Boolean-only queries and when ranking is enabled. A topic for future study is whether there is a material difference between these two cases when considering per-document latency benefits. The current analysis is insufficient to answer this question.

REFERENCES
 [1] Jeremy Barbay, Alejandro López-Ortiz, Tyler Lu, and Alejandro Salinger. 2009. An experimental investigation of set intersection algorithms for text searching. ACM Journal of Experimental Algorithmics 14 (2009), Article 7.
 [2] Roi Blanco and Álvaro Barreiro. 2005. Document identifier reassignment through dimensionality reduction. In European Conference on Information Retrieval. Springer, 375–387.
 [3] Dan Blandford and Guy Blelloch. 2002. Index compression through document reordering. In Proceedings of the Data Compression Conference (DCC 2002). IEEE, 342–351.
 [4] Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Zien. 2003. Efficient query evaluation using a two-level retrieval process. In Proceedings of the Twelfth International Conference on Information and Knowledge Management. ACM, 426–434.
 [5] Shane Culpepper and Alistair Moffat. 2010. Efficient Set Intersection for Inverted Indexing. ACM Transactions on Information Systems 29, 1, Article 1 (2010), 25 pages.
 [6] Shane J. Culpepper and Alistair Moffat. 2007. Compact Set Representation for Information Retrieval. In Proceedings of the 14th International Symposium on String Processing and Information Retrieval (SPIRE). 137–148.
 [7] Shuai Ding, Josh Attenberg, and Torsten Suel. 2010. Scalable Techniques for Document Identifier Assignment in Inverted Indexes. In Proceedings of the 19th International Conference on World Wide Web (WWW ’10). ACM, 311–320.
 [8] Shuai Ding and Torsten Suel. 2011. Faster top-k document retrieval using block-max indexes. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 993–1002.
 [9] Andrew Kane and Frank Tompa. 2014. Skewed Partial Bitvectors for List Intersection. In Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’14). 263–272.
[10] Peter Sanders and Frederik Transier. 2007. Intersection in Integer Inverted Indices. In Proceedings of the 9th Workshop on Algorithm Engineering and Experiments (ALENEX).
[11] Falk Scholer, Hugh Williams, John Yiannis, and Justin Zobel. 2002. Compression of inverted indexes for fast query evaluation. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 222–229.
[12] Fabrizio Silvestri. 2007. Sorting out the Document Identifier Assignment Problem. In Proceedings of the 29th European Conference on IR Research (ECIR ’07). 101–112.
[13] Andrew Trotman. 2014. Compression, SIMD, and Postings Lists. In Proceedings of the 2014 Australasian Document Computing Symposium (ADCS). Article 50, 8 pages.
[14] Hao Yan, Shuai Ding, and Torsten Suel. 2009. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th International Conference on World Wide Web. ACM, 401–410.
[15] Marcin Zukowski, Sandor Heman, Niels Nes, and Peter Boncz. 2006. Super-scalar RAM-CPU cache compression. In Proceedings of the 22nd International Conference on Data Engineering (ICDE ’06). IEEE, 59–59.