Towards Practical Visual Search Engine Within Elasticsearch

Cun (Matthew) Mu, Jun (Raymond) Zhao, Guang Yang, Jing Zhang, Zheng (John) Yan
Jet.com/Walmart Labs, Hoboken, NJ
matthew.mu@jet.com, raymond@jet.com, guang@jet.com, jing@jet.com, john@jet.com

ABSTRACT
In this paper, we describe our end-to-end content-based image retrieval system built upon Elasticsearch, a well-known and popular textual search engine. As far as we know, this is the first time such a system has been implemented in eCommerce, and our efforts have turned out to be highly worthwhile. We end up with a novel and exciting visual search solution that is extremely easy to deploy, distribute, scale and monitor in a cost-friendly manner. Moreover, our platform is intrinsically flexible in supporting multimodal searches, where visual and textual information can be jointly leveraged in retrieval.

The core idea is to encode image feature vectors into a collection of string tokens in a way such that closer vectors will share more string tokens in common. By doing that, we can utilize Elasticsearch to efficiently retrieve similar images based on similarities within encoded string tokens. As part of the development, we propose a novel vector-to-string encoding method, which is shown to substantially outperform the previous ones in terms of both precision and latency.

First-hand experiences in implementing this Elasticsearch-based platform are extensively addressed, which should be valuable to practitioners also interested in building a visual search engine on top of Elasticsearch.

CCS CONCEPTS
• Information systems → Image search; • Applied computing → Online shopping;

KEYWORDS
Elasticsearch, visual search, content-based image retrieval, multimodal search, eCommerce

ACM Reference Format:
Cun (Matthew) Mu, Jun (Raymond) Zhao, Guang Yang, Jing Zhang, and Zheng (John) Yan. 2018. Towards Practical Visual Search Engine Within Elasticsearch. In Proceedings of ACM SIGIR Workshop on eCommerce (SIGIR 2018 eCom). ACM, New York, NY, USA, 8 pages.

1 INTRODUCTION
Elasticsearch [22], built on top of the Apache Lucene library [4, 15, 38], is an open-source, real-time, distributed and multi-tenant textual search engine. Since its first release in February 2010, Elasticsearch has been widely adopted by eCommerce websites (e.g., Ebay, Etsy, Jet, Netflix, Grubhub) to successfully help customers discover products based on the textual queries they request [13, 57].

But a picture is more often than not worth a thousand words. With the explosive usage of phone cameras, content-based image retrieval [16] is increasingly demanded by customers. Especially for categories like furniture, fashion and lifestyle (where buying decisions are largely influenced by products' visual appeal), uploading a picture of the product they like can be substantially more specific, expressive and straightforward than elaborating it into an abstract textual description.

Finding images relevant to the uploaded picture tends to be much more involved and vaguer than retrieving documents matching keywords [45, 48, 58] typed into the search box, as words (by themselves) are substantially more semantic and meaningful than image pixel values. Fortunately, modern AI techniques, especially the ones developed in the field of deep learning [3, 21], have made incredible strides in image feature extraction [17, 32, 39, 42–44, 59, 60], embedding images as points in a high-dimensional Euclidean space where similar images are located nearby. So, given a query image, we can simply retrieve its visually similar images by finding its nearest neighbors in this high-dimensional feature space.
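As a concrete (if naive) baseline for the nearest-neighbor retrieval just described, exact search can be done by brute force over the embedding matrix. The sketch below is our own illustration with toy data (it is not part of the paper's system, which avoids exactly this kind of in-memory scan):

```python
import numpy as np

def knn(query: np.ndarray, database: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k database vectors closest to `query`
    in Euclidean distance (exact, brute-force)."""
    dists = np.linalg.norm(database - query, axis=1)
    return np.argsort(dists)[:k]

# Toy example: four "image embeddings" in R^3.
db = np.array([[0.0, 0.0, 0.0],
               [1.0, 0.0, 0.0],
               [0.9, 0.1, 0.0],
               [5.0, 5.0, 5.0]])
q = np.array([1.0, 0.05, 0.0])
print(knn(q, db, 2))  # → [1 2]
```

Brute force is linear in both the number of images and the dimension, which is precisely what becomes untenable at catalog scale and motivates the token-based approach of this paper.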
However, Elasticsearch, as an inverted-index-based search engine, is not well equipped to accomplish this mathematically straightforward operation in an efficient manner (though efforts [6, 7, 19, 30, 36] have been made successfully in finding nearest neighbors over spaces of much lower dimension), which significantly limits the applicability of its nicely designed engineering system as well as the huge volume of product metadata already indexed into its database (for textual search). The gist of the paper is to conquer this difficulty, and thus make it feasible to conduct visual search within Elasticsearch.

Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes. In: J. Degenhardt, G. Di Fabbrizio, S. Kallumadi, M. Kumar, Y.-C. Lin, A. Trotman, H. Zhao (eds.): Proceedings of the SIGIR 2018 eCom workshop, 12 July, 2018, Ann Arbor, Michigan, USA, published at http://ceur-ws.org

Figure 1: Pipeline of our visual search system within Elasticsearch. The image vectors and their encoded string tokens are indexed together into Elasticsearch. At search time, the query vector x̂ will first be encoded into string tokens ŝ, based on which a small candidate set R is retrieved. We will then re-rank vectors in R according to their exact Euclidean distances with x̂, and output the top ones as our final visual search outcome.

In this paper, we describe our end-to-end visual search platform built upon Elasticsearch. As far as we know, this is the first attempt to achieve this goal, and our efforts turn out to be quite worthwhile. By taking advantage of the mature engineering design of Elasticsearch, we end up with a visual search solution that is extremely easy to deploy, distribute, scale and monitor. Moreover, due to Elasticsearch's disk-based (and partially memory-cached) inverted index mechanism, our system is quite cost-effective. In contrast to many existing systems (using hashing-based [2, 20, 23, 33, 54–56] or quantization-based [18, 25–27, 29] approximate nearest neighbor (ANN) methods), we do not need to load millions of (high-dimensional and dense) image feature vectors into RAM, one of the most expensive resources in large-scale computations. Furthermore, by integrating textual search and visual search into one engine, both types of product information can now be shared and utilized seamlessly in a single index. This paves a coherent way to support multimodal searches, allowing customers to express their interests in a variety of textual requests (e.g., keywords, brands, attributes, price ranges) jointly with visual queries, a capability at which most existing visual search systems fall short (if not impossible).

Since the image preprocessing step and the image feature extraction step involved in our system are standard and independent of Elasticsearch, in this paper we focus on how we empower Elasticsearch to retrieve close image feature vectors, i.e., the Elasticsearch-related part of the visual system. Our nearest neighbor retrieval approach falls under the general framework recently proposed by Rygl et al. [47]. The core idea is to create text documents from image feature vectors by encoding each vector into a collection of string tokens in a way such that closer vectors will share more string tokens in common. This enables Elasticsearch to approximately retrieve neighbors in image feature space based on their encoded textual similarities. The quality of the encoding procedure (as expected) is extremely critical to the success of this approach. In the paper, we propose a novel scheme called the subvector-wise clustering encoder, which substantially outperforms the element-wise rounding one proposed and examined by Rygl et al. [47] and Ruzicka et al. [46], in terms of both precision and latency. Note that our methodology should be generally applicable to any full-text search engine (e.g., Solr [51], Sphinx [1]) besides Elasticsearch, but in the paper we do share a number of Elasticsearch-specific implementation tips based on our first-hand experience, which should be valuable to practitioners interested in building their own visual search system on top of Elasticsearch.

The rest of the paper is organized as follows. In Section 2, we describe the general pipeline of our visual search system, and highlight a number of engineering tweaks we found useful when implementing the system on Elasticsearch. In Sections 3 and 4, we focus on how to encode an image feature vector into a collection of string tokens, the most crucial part in setting up the system. In Section 3, we first review the element-wise rounding encoder and address its drawbacks. As a remedy, we propose a new encoding scheme called the subvector-wise clustering encoder, which is empirically shown in Section 4 to much outperform the element-wise rounding one.
2 GENERAL FRAMEWORK OF VISUAL SEARCH WITHIN ELASTICSEARCH
The whole pipeline of our visual search engine is depicted in Figure 1, which primarily consists of two phases: indexing and searching.

Indexing. Given image feature vectors

X := {x_1, x_2, ..., x_n} ⊆ R^d,  (2.1)

we will first encode them into string tokens

S := {s_1, s_2, ..., s_n},  (2.2)

where s_i := E(x_i) for some encoder E(·) converting a d-dimensional vector into a collection of string tokens of cardinality m. The original numerical vectors X and encoded tokens S, together with their textual metadata (e.g., product titles, prices, attributes), will all be indexed into the Elasticsearch database, waiting to be searched.

Searching. Conceptually, the search phase consists of two steps: retrieval and reranking. Given a query vector x̂, we will first encode it into ŝ := E(x̂) via the same encoder used in indexing, and retrieve the r (r ≪ n) most similar vectors R := {x_{i_1}, x_{i_2}, ..., x_{i_r}} as candidates based on the overlap between the string token set ŝ and the ones in {s_1, s_2, ..., s_n}, i.e.,

{i_1, i_2, ..., i_r} = r-arg max_{i ∈ {1, 2, ..., n}} |ŝ ∩ s_i|.  (2.3)

We will then re-rank vectors in the candidate set R according to their exact Euclidean distances with respect to the query vector x̂, and choose the top-s (s ≤ r) ones as the final visual search result to output, i.e.,

s-arg min_{i ∈ {i_1, i_2, ..., i_r}} ∥x_i − x̂∥_2.  (2.4)

As expected, the choice of E(·) is extremely critical to the success of the above approach. A good encoder E(·) should encourage image feature vectors closer in Euclidean distance to share more string tokens in common, so that the retrieval set R obtained from the optimization problem (2.3) contains enough meaningful candidates to be fed into the exact search (2.4). We will elaborate and compare different choices of encoders in the next two sections (Sections 3 and 4).

Implementation. In this part, we will address how we implement the retrieval and reranking steps in the searching phase efficiently within just one JSON-encoded request body (i.e., JSON 1), which instructs the Elasticsearch server to compute (2.3) and (2.4) and then return the visual search result in the desired order (via Elasticsearch's RESTful API over HTTP).

For the retrieval piece, we construct a function score query [9] to rank database images based on (2.3). Specifically, our function score query (lines 3-29 in JSON 1) consists of m score functions, each of which is a term filter [14] (e.g., lines 6-14 in JSON 1) checking whether an encoded feature token of ŝ from the query image is matched or not. With all the m scores being summed up (line 26 in JSON 1) using the same weight (e.g., lines 13 and 23 in JSON 1), the ranking score for a database image is calculated exactly as the number of feature tokens it shares with ŝ.

For the reranking piece, our initial trial was to fetch the top-r image vectors from the retrieval step, and calculate (2.4) to re-rank them outside Elasticsearch. But this approach prevents our visual system from being an end-to-end one within Elasticsearch, and thus makes it hard to leverage many useful microservices (e.g., pagination) provided by Elasticsearch. More severely, this vanilla approach introduces substantial latency in communication, as thousands of high-dimensional and dense image embedding vectors have to be transported out of the Elasticsearch database. As a remedy, we design a query rescorer [12] (lines 30-52 in JSON 1) within Elasticsearch to execute a second query on the top-r database image vectors returned from the function score query, to tweak their scores and re-rank them based on their exact Euclidean distances with the query image vector. Specifically, we implement a custom Elasticsearch plugin [10] (lines 35-47 in JSON 1) to compute the negation of the Euclidean distance between the query image vector and the one from the database. As Elasticsearch ranks results by score from high to low, the output will be in the desired order, from the smallest distance to the largest one.

Multimodal search. More often than not, scenarios more complicated than pure visual search will be encountered. For instance, a customer might be fascinated with the design and style of an armoire at her friend's house, but she might want to change its color to better align with her own home design, or want the price to be within her budget (see Figure 2). Searching using the picture snapped is most likely in vain. To better enhance customers' shopping experiences, a visual search engine should be capable of retrieving results as a joint outcome by taking both the visual and textual requests from customers into consideration. Fortunately, our Elasticsearch-based visual system can immediately achieve this with one or two lines of modification in JSON 1. In particular, filters can be inserted within the function score query to search only among products of customers' interests (e.g., within a certain price range [11], attributes, colors). Moreover, a general full-text query [8] can also be handled, the score of which can be blended with the one from visual search in a weighted manner.

3 VECTOR TO STRING ENCODING
The success of our approach hinges upon the quality of the encoder E(·), which ideally should encourage closer vectors to share more string tokens in common, so that the retrieval set R found based on token matching contains enough meaningful candidates. In the following, we first review the element-wise rounding encoder proposed by Rygl et al. [47], and discuss its potential drawbacks. As a remedy, we propose a novel encoding scheme called the subvector-wise clustering encoder.
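To make the retrieve-then-rerank logic of (2.3) and (2.4) concrete before turning to specific encoders, it can be written out directly in a few lines. The sketch below is our own illustration, not the production implementation: a toy 0-decimal rounding encoder stands in for E(·), and the token-overlap retrieval runs in plain Python instead of inside Elasticsearch:

```python
import numpy as np

def encode(x: np.ndarray) -> set:
    """Toy stand-in for E(.): position + 0-decimal rounding tokens."""
    return {f"pos{i + 1}val{round(float(v))}" for i, v in enumerate(x)}

def search(query: np.ndarray, X: list, r: int, s: int) -> list:
    """Retrieve-then-rerank: (2.3) token overlap, then (2.4) exact distance."""
    S = [encode(x) for x in X]
    q_tokens = encode(query)
    # (2.3): keep the top-r candidates by token overlap |s_hat ∩ s_i|.
    overlaps = [len(q_tokens & tokens) for tokens in S]
    candidates = sorted(range(len(X)), key=lambda i: -overlaps[i])[:r]
    # (2.4): re-rank candidates by exact Euclidean distance, keep top-s.
    candidates.sort(key=lambda i: np.linalg.norm(X[i] - query))
    return candidates[:s]

X = [np.array([0.0, 0.0]), np.array([1.0, 1.0]),
     np.array([0.9, 1.1]), np.array([10.0, 10.0])]
print(search(np.array([1.0, 1.0]), X, r=3, s=2))  # → [1, 2]
```

In the actual system, the overlap count of (2.3) is computed by the summed term-filter scores of the function score query, and (2.4) by the rescorer plugin, as described above.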
JSON 1: Request body for visual search in Elasticsearch 6.1

 1  {
 2    "size": s,
 3    "query": {
 4      "function_score": {
 5        "functions": [
 6          {
 7            "filter": {
 8              "term": {
 9                "image_encoded_tokens":
10                  "query_encoded_token_1"
11              }
12            },
13            "weight": 1
14          },
15          ...,
16          {
17            "filter": {
18              "term": {
19                "image_encoded_tokens":
20                  "query_encoded_token_m"
21              }
22            },
23            "weight": 1
24          }
25        ],
26        "score_mode": "sum",
27        "boost_mode": "replace"
28      }
29    },
30    "rescore": {
31      "window_size": r,
32      "query": {
33        "rescore_query": {
34          "function_score": {
35            "script_score": {
36              "script": {
37                "lang": "custom_scripts",
38                "source": "negative_euclidean_distance",
39                "params": {
40                  "vector_field": "image_actual_vector",
41                  "query_vector":
42                    [0.1234, -0.2394, 0.0657, ...]
43                }
44              }
45            },
46            "boost_mode": "replace"
47          }
48        },
49        "query_weight": 0,
50        "rescore_query_weight": 1
51      }
52    }
53  }

Figure 2: Illustration of multimodal search. An armoire is searched using an image query jointly with the color ("dark expresso") and price range (< $200) specified by the customer. Our Elasticsearch-based visual search engine can be easily tailored to handle complicated business requests like the above by adding filters (e.g., term filter [14], range filter [11]) to JSON 1.

3.1 Element-wise Rounding
Proposed and examined by Rygl et al. [47] and Ruzicka et al. [46], the element-wise rounding encoder rounds each value in the numerical vector to p decimal places (where p ≥ 0 is a fixed integer), and then concatenates positional information and the rounded value into string tokens.

Example 1. For a vector x = [0.1234, −0.2394, 0.0657], rounding to two decimal places (i.e., p = 2) produces string tokens of x as

s = {"pos1val0.12", "pos2val-0.24", "pos3val0.07"}.

The encoded positional information is essential for the inverted-index-based search system to match (rounded) values at the same position without confusion. Suppose, on the other hand, that positional information is ignored, and thus

s = {"val0.12", "val-0.24", "val0.07"}.

Then the attribute "val0.12" could be mistakenly matched by another encoded token even when it is not produced from the first entry.

For a high-dimensional vector x ∈ R^d, this vanilla version of the element-wise rounding encoder will generate a large collection of string tokens (essentially with |E(x)| = d), which makes it infeasible for Elasticsearch to compute (2.3) in real time.

Filtering. As a remedy, Rygl et al. [47] present a useful filtering technique to sparsify the string tokens. Specifically, only the top-m entries in terms of magnitude are selected to create rounding tokens.

Example 2. For the same setting as Example 1, when m is set to 2, the string tokens will be produced as

s = {"pos1val0.12", "pos2val-0.24"},

with only the first and second entries being selected; and when m is set to 1, the string tokens will be produced as

s = {"pos2val-0.24"},

with only the second entry being selected.

Drawbacks. Although the filtering strategy is suggested to maintain a good balance between feature sparsity and search quality [46, 47], it might not be the best practice for reducing the number of string tokens with respect to finding nearest neighbors in general. First, for two points x̂, x ∈ R^d, their Euclidean distance

∥x̂ − x∥_2^2 = Σ_{i=1}^d (x̂_i − x_i)^2,  (3.1)

is summed along each axis equally, rather than weighted by the magnitude of x̂_i (or x_i). Specifically, a mismatch/match with a (rounded) value 0.01 does not imply that it is less important than a mismatch/match with a 0.99, in terms of their contributions to the sum (3.1). What essentially matters is the deviation ∆_i := x̂_i − x_i rather than the value of x̂_i (or x_i) by itself. Therefore, entries with small magnitude should not be considered less essential and be totally ignored. Second, the efficacy of the filtering strategy is vulnerable to data distributions. For example, when the embedding vectors are binary codes [24, 31, 34, 35, 52], choosing the top-m entries immediately leads to an entanglement.

In the next subsection, we will propose an alternative encoder, which keeps all value information in consideration and is also more robust with respect to the underlying data distribution.
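The element-wise rounding encoder (with its top-m magnitude filtering) is small enough to state exactly. The following sketch is our own rendering of the scheme reviewed in Section 3.1, and reproduces Examples 1 and 2:

```python
def elementwise_rounding_encoder(x, p, m=None):
    """Encode vector x into tokens 'pos{i}val{rounded value}'.
    p: number of decimal places to round to; m: if given, keep only
    the m entries of largest magnitude (the filtering of Rygl et al.)."""
    indices = range(len(x))
    if m is not None:
        indices = sorted(indices, key=lambda i: -abs(x[i]))[:m]
    return {f"pos{i + 1}val{round(x[i], p)}" for i in indices}

x = [0.1234, -0.2394, 0.0657]
print(elementwise_rounding_encoder(x, p=2))       # Example 1
print(elementwise_rounding_encoder(x, p=2, m=1))  # Example 2: {'pos2val-0.24'}
```

Note that with m = None the encoder emits one token per dimension, which is exactly the |E(x)| = d blow-up that makes the unfiltered scheme infeasible for high-dimensional vectors.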
3.2 Subvector-wise Clustering
Different from the element-wise rounding one, an encoder that operates on the subvector level will be presented in this part. The idea is quite natural and straightforward. For any vector x ∈ R^d, we divide it into m subvectors¹,

[ x_1, ..., x_{d/m} | x_{d/m+1}, ..., x_{2d/m} | ... | x_{d−d/m+1}, ..., x_d ],  (3.2)

where the i-th block is denoted as x^i for i = 1, 2, ..., m.

¹ For simplicity, we assume m divides d.

Denote X^i := {x_1^i, x_2^i, ..., x_n^i} as the collection of the i-th subvectors from X for i = 1, 2, ..., m. We will then separately apply the classical k-means algorithm [37] to divide each X^i into k clusters, with the learned assignment function

A^i : R^{d/m} → {1, 2, ..., k}

assigning each subvector to the index of the cluster it belongs to. Then for any x ∈ R^d, we will encode it into a collection of m string tokens

{"pos1cluster{A^1(x^1)}", "pos2cluster{A^2(x^2)}", ...}.  (3.3)

The whole idea is illustrated in Figure 3. The trade-off between search latency and quality is well controlled by the parameter m. Specifically, a larger m will tend to increase the search quality as well as the search latency, as more string tokens per vector will be indexed.

Figure 3: Illustration of the subvector-wise clustering encoder. The vector x ∈ R^d is divided into m subvectors. Subvectors at the same position are considered together to be classified into k clusters. Then each subvector is encoded into a string token by combining its position in x and the cluster it belongs to, so exactly m string tokens will be produced.

In contrast with the element-wise rounding encoder, our subvector-wise clustering encoder obtains m string tokens without throwing away any entry in x, and generates string tokens more adaptive to the data distribution, as the assignment function A^i(·) for each subspace is learned from X^i (or data points sampled from X^i).

4 EXPERIMENT
In this section, we compare the performance of the subvector-wise clustering encoder and the element-wise rounding one in terms of both precision and latency, when they are used in our content-based image retrieval system built upon Elasticsearch.

Settings. Our image dataset consists of around half a million images selected from Jet.com's furniture catalog [28]. For each image, we extract its image feature vector using the pre-trained Inception-ResNet-V2 model [53]. Specifically, each image is embedded into a vector in R^1536 by taking the output of the penultimate layer (i.e., the last average pooling layer) of the neural network model. String tokens are produced with the encoding schemes at different configurations. For the element-wise rounding encoder, we select p ∈ {0, 1, 2, 3} and m ∈ {32, 64, 128, 256}. For the subvector-wise clustering encoder, we experiment with k ∈ {32, 64, 128, 256} and m ∈ {32, 64, 128, 256}. Under each scenario, we index the image feature vectors and their string tokens into a single-node Elasticsearch cluster deployed on a Microsoft Azure virtual machine [40] with 12 cores and 112 GiB of RAM. To better focus on comparing the efficacy of the encoding schemes, only the vanilla setting of Elasticsearch (one shard and zero replicas) is used in creating each index.
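The subvector-wise clustering encoder being benchmarked here can be sketched in a few lines. The following is our own minimal illustration: a bare-bones Lloyd's iteration stands in for a full k-means implementation [37], and the token format follows (3.3):

```python
import numpy as np

def fit_subvector_encoder(X, m, k, iters=20, seed=0):
    """Learn one k-means codebook per subvector position (assumes m divides d)."""
    n, d = X.shape
    sub = X.reshape(n, m, d // m)  # sub[:, i, :] is the collection X^i
    rng = np.random.default_rng(seed)
    codebooks = []
    for i in range(m):
        pts = sub[:, i, :]
        # Initialize centers from the data, then run Lloyd's iterations.
        centers = pts[rng.choice(n, size=k, replace=False)]
        for _ in range(iters):
            assign = np.argmin(
                np.linalg.norm(pts[:, None, :] - centers[None], axis=2), axis=1)
            for c in range(k):
                if np.any(assign == c):
                    centers[c] = pts[assign == c].mean(axis=0)
        codebooks.append(centers)
    return codebooks

def subvector_encode(x, codebooks):
    """Encode x into m tokens 'pos{i}cluster{A^i(x^i)}' as in (3.3)."""
    m = len(codebooks)
    subs = x.reshape(m, -1)
    return {
        f"pos{i + 1}cluster"
        f"{np.argmin(np.linalg.norm(codebooks[i] - subs[i], axis=1))}"
        for i in range(m)
    }

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))                      # toy data: n=60, d=8
books = fit_subvector_encoder(X, m=4, k=3)
print(subvector_encode(X[0], books))              # exactly m=4 tokens
```

Exactly m tokens are produced per vector regardless of d, and vectors whose subvectors fall into the same learned clusters share the corresponding tokens, which is what the token-overlap retrieval of (2.3) exploits.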
Evaluation. To evaluate the two encoding schemes, we randomly select 1,000 images to act as our visual queries. For each query image, we find the set of its 24 nearest neighbors in Euclidean distance, which is treated as the gold standard. We use Precision@24 [49], which measures the overlap between the 24 images retrieved from Elasticsearch (with r ∈ {24, 48, 96, ..., 6144} respectively) and the gold standard, to evaluate the retrieval efficacy of different encoding methods under various settings. We also record the latency for Elasticsearch to execute the retrieval and reranking steps in the searching phase.

Results. In Table 1, we report the Precision@24 and search latency averaged over the 1,000 randomly selected queries. Results corresponding to p ∈ {2, 3} or r ∈ {24, 48} are skipped, as they are largely outperformed by other settings. Configurations that achieve precision ≥ 80% and latency ≤ 0.5s are highlighted in bold. From Table 1, we can see that the subvector-wise encoder outperforms the element-wise one: for every result obtained by the element-wise encoder, we can find a better result from the subvector-wise one in both precision and latency. To better visualize this fact, we plot the Pareto frontier curve over the space of precision and latency in Figure 4. Specifically, the dashed (resp. solid) curve in Figure 4 plots the best average Precision@24 achieved among all configurations we experiment with for the element-wise rounding (resp. subvector-wise clustering) encoder, under different latency constraints. From Figure 4, we can more clearly observe that the subvector-wise encoder surpasses the element-wise one. Notably, when we require the search latency to be smaller than 0.3 seconds, the subvector-wise encoder is able to achieve an average Precision@24 of 92.14%, yielding an improvement of more than 11% over the best average Precision@24 that can be obtained by the element-wise one.

Figure 4: Pareto frontier for the element-wise rounding and the subvector-wise clustering encoders in the space of latency and precision. It can be clearly seen that our subvector-wise encoding scheme is capable of achieving higher precision with smaller latency.

5 FUTURE WORK
Although our subvector-wise clustering encoder outperforms the element-wise rounding one, it might still be restrictive to enforce a vector to be divided into subvectors exclusively using (3.2), which could potentially downgrade the performance of the encoder. Our next step is to preprocess the data (e.g., transform the data through some linear operation x ↦ T[x] with T[·] learned from the data) before applying our subvector-wise clustering encoder. We believe this flexibility will make our encoding scheme more robust and adaptive with respect to different image feature vectors extracted from various image descriptors. Another interesting research direction is to evaluate the performance of different encoding schemes in other information retrieval contexts, e.g., neural-ranking-model-based textual searches [5, 41, 50], where relevances between user-issued queries and catalog products are modeled by their Euclidean distances in the embedding space to better match customers' intents with products.

ACKNOWLEDGEMENT
We are grateful to three anonymous reviewers for their helpful suggestions and comments that substantially improved the paper. We would also like to thank Eliot P. Brenner and Aliasgar Kutiyanawala for proofreading the first draft of the paper.

REFERENCES
[1] A. Aksyonoff. 2011. Introduction to Search with Sphinx: From installation to relevance tuning. O'Reilly Media, Inc.
[2] A. Andoni and P. Indyk. 2006. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proceedings of FOCS. 459–468.
[3] Y. Bengio. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1 (2009), 1–127.
[4] A. Bialecki, R. Muir, G. Ingersoll, and L. Imagination. 2012. Apache Lucene 4. In Proceedings of SIGIR workshop on open source information retrieval.
[5] E. P. Brenner, J. Zhao, A. Kutiyanawala, and Z. Yan. 2018. End-to-End Neural Ranking for eCommerce Product Search. In Proceedings of SIGIR eCom'18.
[6] Elasticsearch contributors. 2016. Multi-dimensional points, coming in Apache Lucene 6.0. Retrieved June 16, 2018 from https://www.elastic.co/blog/lucene-points-6.0.
[7] Elasticsearch contributors. 2017. Numeric and Date Ranges in Elasticsearch: Just Another Brick in the Wall. Retrieved June 16, 2018 from https://www.elastic.co/blog/numeric-and-date-ranges-in-elasticsearch-just-another-brick-in-the-wall.
[8] Elasticsearch contributors. 2018. Full text queries. Retrieved May 01, 2018 from https://www.elastic.co/guide/en/elasticsearch/reference/6.1/full-text-queries.html.
[9] Elasticsearch contributors. 2018. Function score query. Retrieved May 01, 2018 from https://www.elastic.co/guide/en/elasticsearch/reference/6.1/query-dsl-function-score-query.html.
[10] Elasticsearch contributors. 2018. Plugins. Retrieved May 01, 2018 from https://www.elastic.co/guide/en/elasticsearch/reference/6.1/modules-plugins.html.
[11] Elasticsearch contributors. 2018. Range query. Retrieved May 01, 2018 from https://www.elastic.co/guide/en/elasticsearch/reference/6.1/query-dsl-range-query.html.
[12] Elasticsearch contributors. 2018. Rescoring. Retrieved May 01, 2018 from https://www.elastic.co/guide/en/elasticsearch/reference/6.1/search-request-rescore.html.
[13] Elasticsearch contributors. 2018. Stories from Users Like You. Retrieved May 01, 2018 from https://www.elastic.co/use-cases.
[14] Elasticsearch contributors. 2018. Term query. Retrieved May 01, 2018 from https://www.elastic.co/guide/en/elasticsearch/reference/6.1/query-dsl-term-query.html.
[15] Lucene contributors. 2018. Apache Lucene library. Retrieved May 06, 2018 from https://lucene.apache.org.
[16] R. Datta, D. Joshi, J. Li, and J. Wang. 2008. Image retrieval: Ideas, influences, and trends of the new age. Comput. Surveys 40, 2 (2008), 5.
[17] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. 2014. Decaf: A deep convolutional activation feature for generic visual recognition. In Proceedings of ICML. 647–655.
                                       # of feature tokens (m)
r     Encoding   Round./Cluster.   32                64                128               256
96    element    0-decimal-place   53.43% | 0.1237   64.35% | 0.2339   76.44% | 0.5256   88.64% | 1.5342
96    element    1-decimal-place   26.56% | 0.0920   37.94% | 0.1592   50.35% | 0.3370   63.71% | 0.8207
96    subvector  32-centroids      34.80% | 0.1111   54.45% | 0.1914   74.22% | 0.3760   87.44% | 0.8914
96    subvector  64-centroids      39.52% | 0.0963   58.51% | 0.1630   76.70% | 0.3426   87.28% | 0.7563
96    subvector  128-centroids     44.43% | 0.0914   61.93% | 0.1544   78.89% | 0.3088   85.58% | 0.7186
96    subvector  256-centroids     50.00% | 0.0900   66.22% | 0.1480   79.05% | 0.2970   82.89% | 0.6757
192   element    0-decimal-place   63.72% | 0.1405   74.63% | 0.2499   85.38% | 0.5416   94.13% | 1.5536
192   element    1-decimal-place   32.49% | 0.1084   45.50% | 0.1748   59.05% | 0.3529   72.12% | 0.8424
192   subvector  32-centroids      43.73% | 0.1256   64.88% | 0.2080   83.13% | 0.3917   93.56% | 0.9146
192   subvector  64-centroids      48.84% | 0.1130   69.14% | 0.1795   85.12% | 0.3594   93.28% | 0.7745
192   subvector  128-centroids     55.14% | 0.1082   72.62% | 0.1714   87.08% | 0.3250   91.97% | 0.7367
192   subvector  256-centroids     61.41% | 0.1066   77.08% | 0.1644   87.32% | 0.3137   89.28% | 0.6915
384   element    0-decimal-place   73.30% | 0.1749   82.76% | 0.2852   91.19% | 0.5756   97.03% | 1.5963
384   element    1-decimal-place   38.94% | 0.1431   53.43% | 0.2093   67.12% | 0.3877   79.25% | 0.8741
384   subvector  32-centroids      53.37% | 0.1603   73.92% | 0.2417   89.06% | 0.4262   96.82% | 0.9509
384   subvector  64-centroids      59.01% | 0.1479   78.15% | 0.2139   91.25% | 0.3935   96.59% | 0.8097
384   subvector  128-centroids     66.20% | 0.1433   81.56% | 0.2061   92.75% | 0.3596   95.44% | 0.7705
384   subvector  256-centroids     73.01% | 0.1415   85.88% | 0.1995   92.67% | 0.3520   93.38% | 0.7243
768   element    0-decimal-place   81.27% | 0.2455   89.09% | 0.3547   94.98% | 0.6443   98.60% | 1.6613
768   element    1-decimal-place   45.83% | 0.2130   61.30% | 0.2801   74.60% | 0.4574   84.87% | 0.9427
768   subvector  32-centroids      63.45% | 0.2297   81.30% | 0.3117   93.40% | 0.4974   98.58% | 1.0195
768   subvector  64-centroids      69.01% | 0.2182   85.47% | 0.2837   95.41% | 0.4647   98.38% | 0.8798
768   subvector  128-centroids     76.70% | 0.2133   88.91% | 0.2762   96.13% | 0.4288   97.50% | 0.8402
768   subvector  256-centroids     83.55% | 0.2112   92.14% | 0.2701   95.90% | 0.4267   95.94% | 0.7970
1536  element    0-decimal-place   87.55% | 0.3923   93.45% | 0.5027   97.47% | 0.8012   99.29% | 1.8486
1536  element    1-decimal-place   53.76% | 0.3656   68.68% | 0.4361   81.05% | 0.6069   89.48% | 1.0931
1536  subvector  32-centroids      72.75% | 0.3703   87.30% | 0.4524   96.14% | 0.6400   99.36% | 1.1574
1536  subvector  64-centroids      78.85% | 0.3581   91.52% | 0.4218   97.74% | 0.6045   99.28% | 1.0188
1536  subvector  128-centroids     86.00% | 0.3537   94.12% | 0.4158   98.03% | 0.5665   98.60% | 0.9763
1536  subvector  256-centroids     91.16% | 0.3512   95.97% | 0.4087   97.70% | 0.5582   97.44% | 0.9281
3072  element    0-decimal-place   92.38% | 0.6843   96.40% | 0.8166   98.80% | 1.0909   99.63% | 2.1638
3072  element    1-decimal-place   61.50% | 0.6625   75.62% | 0.7380   86.32% | 0.9135   92.85% | 1.3946
3072  subvector  32-centroids      81.25% | 0.6645   92.11% | 0.7483   97.95% | 0.9375   99.68% | 1.4589
3072  subvector  64-centroids      87.82% | 0.6556   96.32% | 0.7131   99.00% | 0.9006   99.68% | 1.3189
3072  subvector  128-centroids     93.26% | 0.6508   97.72% | 0.7126   99.08% | 0.8604   99.21% | 1.2756
3072  subvector  256-centroids     96.06% | 0.6470   97.94% | 0.7074   98.72% | 0.8566   98.37% | 1.2230
6144  element    0-decimal-place   95.52% | 1.2630   98.22% | 1.3778   99.45% | 1.6737   99.82% | 2.7669
6144  element    1-decimal-place   68.26% | 1.2535   81.75% | 1.2942   90.69% | 1.4800   95.24% | 1.9542
6144  subvector  32-centroids      89.61% | 1.2081   95.86% | 1.2938   99.10% | 1.4892   99.85% | 2.0124
6144  subvector  64-centroids      95.43% | 1.2031   98.87% | 1.2537   99.65% | 1.4459   99.82% | 1.8647
6144  subvector  128-centroids     97.56% | 1.1985   99.13% | 1.2565   99.54% | 1.3959   99.52% | 1.8200
6144  subvector  256-centroids     98.20% | 1.1957   98.90% | 1.2542   99.25% | 1.4037   98.97% | 1.7586

Table 1: Mean Precision@24 | ES average latency. For each setting, we average the Precision@24 and the number of seconds used over the 1,000 query images randomly selected from the furniture dataset. Settings with mean precision ≥ 80% and latency ≤ 0.5s are highlighted in bold.

[18] T. Ge, K. He, Q. Ke, and J. Sun. 2013. Optimized product quantization for approximate nearest neighbor search. In Proceedings of CVPR. 2946–2953.
[19] C. Gennaro, G. Amato, P. Bolettieri, and P. Savino. 2010. An approach to content-based image retrieval based on the Lucene search engine library. In Proceedings of TPDL. 55–66.
[20] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. 2013. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 12 (2013), 2916–2929.
[21] I. Goodfellow, Y. Bengio, and A. Courville. 2016. Deep learning. Vol. 1. MIT Press, Cambridge.
[22] C. Gormley and Z. Tong. 2015. Elasticsearch: The Definitive Guide: A Distributed Real-Time Search and Analytics Engine. O'Reilly Media, Inc.
[23] K. He, F. Wen, and J. Sun. 2013. K-means hashing: An affinity-preserving quantization method for learning binary compact codes. In Proceedings of CVPR. 2938–2945.
[24] J. Heinly, E. Dunn, and J. Frahm. 2012. Comparative evaluation of binary features. In ECCV. 759–773.
[25] H. Jegou, M. Douze, and C. Schmid. 2011. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 1 (2011), 117–128.
[26] H. Jegou, F. Perronnin, M. Douze, J. Sánchez, P. Perez, and C. Schmid. 2012. Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 9 (2012), 1704–1716.
[27] H. Jegou, R. Tavenard, M. Douze, and L. Amsaleg. 2011. Searching in one billion vectors: re-rank with source coding. In Proceedings of ICASSP. IEEE, 861–864.
[28] Jet.com. 2018. Furniture. Retrieved May 01, 2018 from https://jet.com/search?category=18000000.
[29] Y. Kalantidis and Y. Avrithis. 2014. Locally optimized product quantization for approximate nearest neighbor search. In Proceedings of CVPR. 2321–2328.
[30] Nicholas Knize. 2018. Geo Capabilities in Elasticsearch. Retrieved June 16, 2018 from https://www.elastic.co/assets/blt827a0a9db0f2e04e/webinar-geo-capabilities.pdf.
[31] H. Lai, Y. Pan, Y. Liu, and S. Yan. 2015. Simultaneous feature learning and hash coding with deep neural networks. In Proceedings of CVPR. 3270–3278.
[32] Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng. 2012. Building high-level features using large scale unsupervised learning. In Proceedings of ICML. 507–514.
[33] W. Liu, C. Mu, S. Kumar, and S. Chang. 2014. Discrete graph hashing. In Advances in NIPS. 3419–3427.
[34] M. Loncaric, B. Liu, and R. Weber.
[53] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of AAAI. 4278–4284.
[54] A. Torralba, R. Fergus, and Y. Weiss. 2008. Small codes and large image databases for recognition. In Proceedings of CVPR. 1–8.
[55] J. Wang, S. Kumar, and S. Chang. 2010. Semi-supervised hashing for scalable image retrieval. In Proceedings of CVPR. 3424–3431.
[56] Y. Weiss, A. Torralba, and R. Fergus. 2009. Spectral hashing. In Advances in NIPS. 1753–1760.
[57] Wikipedia contributors. 2018. Elasticsearch. Retrieved May 06, 2018 from https://en.wikipedia.org/wiki/Elasticsearch.
[58] Wikipedia contributors. 2018. Tf-idf. Retrieved May 06, 2018 from https://en.wikipedia.org/wiki/tf-idf.
[59] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. 2014. How transferable are features in deep neural networks. In Advances in NIPS. 3320–3328.
[60] M. Zeiler and R. Fergus. 2014. Visualizing and understanding convolutional networks. In Proceedings of ECCV. 818–833.
2018. Convolutional Hashing for Auto- mated Scene Matching. arXiv preprint arXiv:1802.03101 (2018). [35] X. Lu, L. Song, R. Xie, X. Yang, and W. Zhang. 2017. Deep Binary Rep- resentation for Efficient Image Retrieval. Advances in Multimedia 2017 (2017). [36] M. Lux, M. Riegler, P. Halvorsen, K. Pogorelov, and N. Anagnostopoulos. 2016. LIRE: open source visual information retrieval. In Proceedings of MMSys. [37] J. MacQueen. 1967. Some methods for classification and analysis of mul- tivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1. Oakland, CA, USA, 281–297. [38] M. McCandless, E. Hatcher, and O. Gospodnetic. 2010. Lucene in action: covers Apache Lucene 3.0. Manning Publications Co. [39] G. Mesnil, Y. Dauphin, X. Glorot, S. Rifai, Y. Bengio, I. Goodfellow, E. Lavoie, X. Muller, G. Desjardins, D. Warde-Farley, P. Vincent, A. Courville, and J. Bergstra. 2011. Unsupervised and transfer learning challenge: a deep learning approach. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning. JMLR. org, 97–111. [40] Microsoft Azure. 2018. Virtual machines. Retrieved May 01, 2018 from https://azure.microsoft.com/en-us/services/virtual-machines/. [41] B. Mitra and N. Craswell. 2017. Neural Models for Information Retrieval. arXiv preprint arXiv:1705.01509 (2017). [42] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. 2014. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of CVPR. 1717–1724. [43] R. Raina, A. Battle, H. Lee, B. Packer, and A. Ng. 2007. Self-taught learning: transfer learning from unlabeled data. In Proceedings of ICML. 759–766. [44] A. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. 2014. CNN features off-the-shelf: an astounding baseline for recognition. In CVPR workshop. 512–519. [45] S. Robertson and H. Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. 
Foundations and Trends® in Information Retrieval 3, 4 (2009), 333–389. [46] M. Ruzicka, V. Novotny, P. Sojka, J. Pomikalek, and R. Rehurek. 2018. Flex- ible Similarity Search of Semantic Vectors Using Fulltext Search Engines. http://ceur-ws.org/Vol-1923/article-01.pdf (2018). [47] J. Rygl, J. Pomikalek, R. Rehurek, M. Ruzicka, V. Novotny, and P. Sojka. 2017. Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines. In Proceedings of the 2nd Workshop on Representation Learning for NLP. 81–90. [48] G. Salton, A. Wong, and C. Yang. 1975. A vector space model for automatic indexing. Commun. ACM 18, 11 (1975), 613–620. [49] H. Schütze, C. D. Manning, and P. Raghavan. 2008. Introduction to infor- mation retrieval. Vol. 39. Cambridge University Press. [50] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil. 2014. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of CIKM. 101–110. [51] D. Smiley, E. Pugh, K. Parisa, and M. Mitchell. 2015. Apache Solr enterprise search server. Packt Publishing Ltd. [52] J. Song. 2017. Binary Generative Adversarial Networks for Image Retrieval. arXiv preprint arXiv:1708.04150 (2017). 8
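The settings compared in Table 1 fall into two vector-to-token encoding families: element-wise rounding ("element d-decimal-place"), where each coordinate is rounded and emitted as a position-value token, and subvector quantization ("subvector k-centroids"), where the vector is split into subvectors and each is mapped to its nearest k-means [37] centroid. The sketch below illustrates both; the function names, the `position_value` token format, and the offline-trained codebooks are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def element_tokens(vec, decimals=1):
    # "element d-decimal-place" encoding (illustrative): round each
    # coordinate and emit one token per (position, rounded value) pair.
    return [f"{i}_{round(float(v), decimals)}" for i, v in enumerate(vec)]

def subvector_tokens(vec, codebooks):
    # "subvector k-centroids" encoding (illustrative): split the vector
    # into len(codebooks) equal-length subvectors, snap each subvector to
    # its nearest centroid, and emit one (position, centroid id) token.
    # `codebooks[i]` is a (k, sub_dim) array of k-means centroids assumed
    # to be trained offline on the i-th subvector of the corpus.
    parts = np.split(np.asarray(vec, dtype=float), len(codebooks))
    tokens = []
    for i, (part, centroids) in enumerate(zip(parts, codebooks)):
        cid = int(np.argmin(np.linalg.norm(centroids - part, axis=1)))
        tokens.append(f"{i}_{cid}")
    return tokens
```

Under either scheme, nearby vectors share more identical tokens, so indexing the tokens as a text field lets Elasticsearch's ordinary term matching serve as a proxy for vector similarity; a "subvector 32-centroids" setting in Table 1 would correspond to codebooks holding 32 centroids each.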