<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Benchmarking Nearest Neighbor Search: Influence of Local Intrinsic Dimensionality and Result Diversity in Real-World Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martin Aumuller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Ceccarello</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>IT University of Copenhagen</institution>
          ,
          <country country="DK">Denmark</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper reconsiders common benchmarking approaches to nearest neighbor search. It is shown that the concept of local intrinsic dimensionality (LID) allows to choose query sets of a wide range of difficulty for real-world datasets. Moreover, the effect of different LID distributions on the running time performance of implementations is empirically studied. To this end, different visualization concepts are introduced that allow to get a more fine-grained overview of the inner workings of nearest neighbor search principles. The paper closes with remarks about the diversity of datasets commonly used for nearest neighbor search benchmarking. It is shown that such real-world datasets are not diverse: results on a single dataset predict results on all other datasets well.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>might serve as a characterization of the di culty of For a given query, they either nd all true nearest
k-NN queries. One purpose of this paper is to shed neighbors or not a single one. This behavior is
light on this statement. not shared by the two other approaches that we</p>
      <p>A focus of this paper is an empirical study of consider; both yield a continuous transition from
how the LID in uences the performance of NN \ nding no nearest neighbors" to \ nding all nearest
algorithms. To be precise, we will benchmark neighbors".
four di erent implementations [12] which employ Finally, we will challenge two more common
di erent approaches to NN search. Three of them benchmarking approaches to NN search. First, we
(HNSW [15], FAISS-IVF [10], Annoy [3]) stood out as will empirically observe how using small query sets
most performant in the empirical study conducted (less than 1% of the dataset) a ects the robustness
by Aumuller et al. in [2]. Another one (ONNG) of measurements. Second, we will discuss how
diwas proposed very recently [8] and shown to be verse results on the considered real-world datasets
competitive to the former approaches there. We are. Ideally, we want to benchmark on a collection
base our experiments on [2] and describe their of \interesting" datasets that show the strengths
benchmarking approach and the changes we made and weaknesses of individual approaches. We will
to their system in Section 3. We analyze the LID conclude that there is little diversity among the
distribution of real-world datasets in Section 4. We considered datasets: While the individual
perforwill see that there is a substantial di erence between mance observations change from dataset to dataset,
the LID distributions among these datasets. We the relative performance between implementations
will then conduct two experiments: First, we x a stays the same.
dataset and choose as query set the set of points
with smallest, medium, and largest estimated LIDs. 1.1 Our Contributions The main
contribuAs we will see, the larger the LID, the more di cult tions of this paper are
the query for all implementations. Next, we will
study how the di erent LID distributions between a detailed evaluation of the LID distribution of
datasets in uence the running time performance. many real-world datasets used in benchmarking
In a nutshell, it cannot be concluded that LID by frameworks,
itself is a good indicator for the relative performance an evaluation of the in uence of the LID to the
of a xed implementation over datasets. These performance of NN search implementations,
statements will be made precise in the evaluation
that is discussed in Section 5. considerations about the diversity of results</p>
      <p>In the rst part of our evaluation, we will over many di erent datasets, and
work in the \classical evaluation setting of nearest an exploration of di erent visualization
techneighbor search". This means that we relate niques that shed light on individual properties
a performance measure (such as the achieved of certain implementation principles.
throughput measured in queries per second) to a
quality measure (such as the average fraction of true 1.2 Related Work on Benchmarking
Framenearest neighbors found over all queries). While works for NN We use the benchmarking system
this is the most commonly employed evaluation described in [2] as the starting point for our study.
method, we will reason that this way of representing Di erent approaches to benchmarking nearest
neighresults in fact hides interesting details about the bor search are described in [4, 5, 14]. We refer to [2]
inner workings of an implementation. Using non- for a detailed comparison between the frameworks.
traditional visualization techniques will provide
new insights into their query behavior on real- 2 Local Intrinsic Dimensionality
world datasets. As one example, we will see We consider a distance-space (Rd; D) with a
disthat reporting average recall on the graph-based tance function D : Rd Rd ! R. As described in [1],
approaches from [15, 8] hides an important detail: we consider the distribution of distances within this
space with respect to a reference point x. Such a dis- building) and search (query) methods with certain
tribution is induced by sampling n points from the parameter choices. Implementations are tested on
space Rd under a certain probability distribution. a large set of parameters usually provided by the
We let F : R ! [0; 1] be the cumulative distribution original authors of an implementation. The answers
function of distances to the reference point x. to queries are recorded as the indices of the points
returned. Ann-benchmarks stores these parameters
De nition 2.1 ([6]). The local continuous intrinsic together with further statistics such as individual
dimension of F at distance r is given by query times, index size, and auxiliary information
ln(F ((1 + ")r)=F (r)) provided by the implementation. We refer to [2] for
IDF (r) = "li!m0 ln((1 + ")r=r) ; more details.</p>
      <p>Compared to the system described in [2], we
whenever this limit exists. added tools to estimate the LID based on (2.1),</p>
      <p>The measure relates the increase in distance pick \challenging query sets" according to the LID
to the increase in probability mass (the fraction of individual points, and added new datasets and
imof points that are within the ball of radius r and plementations. Moreover, we implemented a
mech(1 + ")r around the query point). Intuitively, the anism that allows an implementation to provide
larger the LID, the more di cult it is to distinguish further query statistics after answering a query. To
true nearest neighbors at distance r from the rest showcase this feature, all implementations
considof the dataset. As described in [7], in the context ered in this study report the number of distance
of k-NN search we set r as the distance of the k-th computations performed to answer a query.1
nearest neighbor to the reference point x.</p>
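      <p>For intuition, consider points distributed uniformly within an m-dimensional ball of radius R around x, so that F(r) = (r/R)^m for 0 ≤ r ≤ R. The definition then recovers the ambient dimension:</p>
      <p>\[ \mathrm{ID}_F(r) = \lim_{\epsilon \to 0} \frac{\ln\big(F((1+\epsilon)r)/F(r)\big)}{\ln(1+\epsilon)} = \lim_{\epsilon \to 0} \frac{m \ln(1+\epsilon)}{\ln(1+\epsilon)} = m. \]</p>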
      <p>Estimating LID. We use the Maximum-Likelihood estimator (MLE) described in [13, 1] to estimate the LID of x at distance r. Let r_1 ≤ … ≤ r_k be the sequence of distances of the k-NN of x. The MLE is then</p>
      <p>\[ \widehat{\mathrm{ID}}_x = -\left( \frac{1}{k} \sum_{i=1}^{k} \ln \frac{r_i}{r_k} \right)^{-1}. \tag{2.1} \]</p>
      <p>Amsaleg et al. showed in [1] that the MLE estimates the LID well.</p>
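      <p>Estimator (2.1) translates directly into code. The following is a minimal Python sketch of (2.1), assuming positive k-NN distances that are not all identical:</p>
      <preformat>
import numpy as np

def lid_mle(knn_dists):
    """MLE estimate (2.1) of the LID of a point, given the distances
    r_1, ..., r_k to its k nearest neighbors."""
    r = np.sort(np.asarray(knn_dists, dtype=float))
    # (2.1): ID_x = -( (1/k) * sum_i ln(r_i / r_k) )^(-1)
    return -1.0 / np.mean(np.log(r / r[-1]))
      </preformat>
      <p>As a sanity check, for distances of the form r_i = r_k · u_i^{1/m} with u_i uniform in (0, 1), which is how k-NN distances behave under the uniform-ball model above, the estimate concentrates around m as k grows.</p>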
    </sec>
    <sec id="sec-3">
      <title>3 Overview over the Benchmarking Framework</title>
      <p>We use the ann-benchmarks system described in [2] to conduct our experimental study. Ann-benchmarks is a framework for benchmarking NN search algorithms. It covers dataset creation, performing the actual experiment, and storing the results of these experiments in a transparent and easy-to-share way. Moreover, results can be explored through various plotting functionalities, e.g., by creating a website containing interactive plots for all experimental runs.</p>
      <p>Ann-benchmarks interfaces with a NN search implementation by calling its preprocess (index building) and search (query) methods with certain parameter choices. Implementations are tested on a large set of parameters usually provided by the original authors of an implementation. The answers to queries are recorded as the indices of the points returned. Ann-benchmarks stores these parameters together with further statistics such as individual query times, index size, and auxiliary information provided by the implementation. We refer to [2] for more details.</p>
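      <p>To make this contract concrete, the following sketch mimics such a wrapper around a brute-force scan. The class and method names here are illustrative only; the framework's actual adapter base class lives in the ann-benchmarks repository:</p>
      <preformat>
import numpy as np

class BruteForceWrapper:
    """Illustrative adapter exposing the two entry points the framework
    calls: preprocess (index building) and search (query answering)."""

    def __init__(self):
        self.data = None
        self.distance_computations = 0   # auxiliary statistic (see below)

    def preprocess(self, dataset):
        # A brute-force "index" is just the dataset itself.
        self.data = np.asarray(dataset, dtype=float)

    def search(self, q, k):
        # Answers are recorded as the indices of the returned points.
        dists = np.linalg.norm(self.data - q, axis=1)
        self.distance_computations += len(self.data)
        return np.argsort(dists)[:k]
      </preformat>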
      <p>Compared to the system described in [2], we added tools to estimate the LID based on (2.1), pick "challenging query sets" according to the LID of individual points, and added new datasets and implementations. Moreover, we implemented a mechanism that allows an implementation to provide further query statistics after answering a query. To showcase this feature, all implementations considered in this study report the number of distance computations performed to answer a query.¹</p>
      <p>¹ We thank the authors of the implementations for their help and responsiveness in adding this feature to their library.</p>
    </sec>
    <sec id="sec-4">
      <title>4 Algorithms and Datasets</title>
      <sec id="sec-4-1">
        <title>4.1 Algorithms</title>
        <p>Nearest neighbor search algorithms for high dimensions are usually graph-, tree-, or hashing-based. We refer the reader to [2] for an overview over these principles and available implementations. In this study, we concentrate on the implementations considered most performant in [2], namely HNSW [15], Annoy [3], and FAISS-IVF [10] (named IVF from now on), and we consider the very recent graph-based approach ONNG [8] as well.</p>
        <p>HNSW and ONNG are graph-based approaches. This means that they build a k-NN graph during the preprocessing step. In this graph, each vertex is a data point and a directed edge (u, v) means that the data point associated with v is "close" to the data point associated with u in the dataset. At query time, the graph is traversed to generate candidate points. Algorithms differ in details of the graph construction, how they build a navigation structure on top of the graph, and how the graph is traversed.</p>
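        <p>The query-time traversal that these approaches share can be sketched as a greedy beam search. The following simplified Python sketch illustrates the principle only; it is not the actual HNSW or ONNG procedure, and the beam-width parameter ef is named after HNSW's:</p>
        <preformat>
import heapq
import numpy as np

def graph_search(graph, data, q, k, ef, entry):
    """Greedy beam search over a neighborhood graph.
    graph[u] lists the out-neighbors of node u; ef is the beam width."""
    dist = lambda u: float(np.linalg.norm(data[u] - q))
    visited = {entry}
    frontier = [(dist(entry), entry)]        # min-heap of nodes to expand
    best = [(-dist(entry), entry)]           # max-heap of the ef best so far
    while frontier:
        d_u, u = heapq.heappop(frontier)
        if d_u > -best[0][0] and len(best) >= ef:
            break                            # frontier cannot improve the beam
        for v in graph[u]:
            if v not in visited:
                visited.add(v)
                d_v = dist(v)
                if ef > len(best) or -best[0][0] > d_v:
                    heapq.heappush(frontier, (d_v, v))
                    heapq.heappush(best, (-d_v, v))
                    if len(best) > ef:
                        heapq.heappop(best)  # drop the worst beam entry
    return [v for _, v in sorted((-d, v) for d, v in best)[:k]]
        </preformat>
        <p>Increasing the beam width trades time for recall, which is exactly the parameter dimension explored in Section 5.</p>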
    <sec id="sec-2">
      <title>Dimensions</title>
      <p>avg(LID)
median(LID)</p>
    </sec>
    <sec id="sec-3">
      <title>Metric</title>
        <p>Annoy is an implementation of a random projection forest, which is a collection of random projection trees. Each node in a tree is associated with a set of data points. It splits these points into two subsets according to a chosen hyperplane. If the dataset in a node is small enough, it is stored directly and the node is a leaf. Annoy employs a data-dependent splitting mechanism in which a splitting hyperplane is chosen as the one splitting two "average points" obtained by repeatedly sampling dataset points. In the query phase, trees are traversed using a priority queue until a predefined number of points is found.</p>
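        <p>The data-dependent split can be sketched as follows. This is a simplified illustration of the idea described above, with sample sizes and iteration counts chosen arbitrarily:</p>
        <preformat>
import numpy as np

def split_node(points, iters=5, seed=0):
    """Choose a hyperplane between two 'average points' and split the
    node's points (a simplified sketch of the splitting idea)."""
    rng = np.random.default_rng(seed)
    a, b = points[rng.choice(len(points), size=2, replace=False)]
    for _ in range(iters):
        # Refine the two average points from a sample of the node's points.
        sample = points[rng.choice(len(points), size=min(len(points), 32))]
        da = np.linalg.norm(sample - a, axis=1)
        db = np.linalg.norm(sample - b, axis=1)
        if (db > da).any():
            a = sample[db > da].mean(axis=0)   # average of points closer to a
        if (da > db).any():
            b = sample[da > db].mean(axis=0)   # average of points closer to b
    normal = a - b                             # hyperplane normal ...
    offset = normal.dot((a + b) / 2.0)         # ... through the midpoint
    mask = points.dot(normal) > offset
    return points[mask], points[~mask], (normal, offset)
        </preformat>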
        <p>IVF builds an inverted file based on clustering the dataset around a predefined number of centroids. It splits the dataset based on these centroids by associating each point with its closest centroid. During a query, it finds the closest centroids and checks the points in the dataset associated with those.</p>
        <p>We remark that both the IVF and HNSW implementations are the ones provided with FAISS.²</p>
        <p>² https://github.com/facebookresearch/faiss</p>
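        <p>As an illustration of the inverted-file principle (not of FAISS's optimized implementation; parameter names are ours), a plain k-means sketch suffices:</p>
        <preformat>
import numpy as np

def build_ivf(data, n_lists, iters=10, seed=0):
    """Cluster the data around n_lists centroids and bucket every
    point under its closest centroid."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=n_lists, replace=False)].copy()
    for _ in range(iters):                     # Lloyd iterations
        d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        assign = np.argmin(d, axis=1)
        for c in range(n_lists):
            if (assign == c).any():
                centroids[c] = data[assign == c].mean(axis=0)
    lists = [np.flatnonzero(assign == c) for c in range(n_lists)]
    return centroids, lists

def ivf_query(q, data, centroids, lists, k, n_probe):
    """Scan only the points whose centroid is among the n_probe closest."""
    order = np.argsort(np.linalg.norm(centroids - q, axis=1))[:n_probe]
    cand = np.concatenate([lists[c] for c in order])
    return cand[np.argsort(np.linalg.norm(data[cand] - q, axis=1))[:k]]
        </preformat>
        <p>Note that locating the closest centroids itself takes one distance computation per centroid, which matters when distance computations are counted in Section 5.3.</p>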
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Datasets</title>
        <p>Table 1 presents an overview over the datasets that we consider in this study. We restrict our attention to datasets that are usually used in connection with Euclidean distance and Angular/Cosine distance. For each dataset, we compute the LID distribution with respect to the 100-NN as discussed in Section 2. SIFT, MNIST, and GLOVE are among the most-widely used datasets for benchmarking nearest neighbor search algorithms. Fashion-MNIST is considered as a replacement for MNIST, which is usually considered too easy for machine learning tasks [19].</p>
        <p>[Table 1: Overview of the considered datasets, with their number of dimensions, metric, avg(LID), and median(LID).]</p>
        <p>Figure 1 provides a visual representation of the estimated LID distribution of each dataset, for k = 1 000. While the datasets differ widely in their original dimensionality, the median LID ranges from around 13 for MNIST to about 23 for GLOVE-2M. The distribution of LID values is asymmetric and shows a long tail behavior. MNIST, Fashion-MNIST, SIFT, and GNEWS are much more concentrated around the median compared to the two glove-based datasets.</p>
        <p>[Figure 1: LID distribution for each dataset. Ticks below the distribution curves represent single queries. Lines within each distribution curve correspond to the 25, 50 and 75 percentile. The red line marks the 10 000 largest estimated LID, which we use as a threshold value to define hard query sets.]</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>This section reports on the results of our
experiments. Due to space constraints, we only present
some selected results. More results can be explored
2https://github.com/facebookresearch/faiss
via interactive plots at http://ann-benchmarks. Choosing Query Sets All runs are based on
com/edml19/, which also contains a link to the 10 000 queries chosen from the dataset. Depending
source code repository. For a xed implementation, on the question we try to answer, we employ
the plots presented here consider the pareto frontier di erent selection strategies for choosing the query
over all parameter choices [2]. Tested parameter set.
choices and the associated plots are available on the To answer Q1 we employ random splits as done
website. in the cross-validation technique: 10 000 randomly</p>
      <p>Experimental Setup Experiments were run selected points are used as queries, while the rest
on 2x 14-core Intel Xeon E5-2690v4 (2.60GHz) with are used as data points. Repeating the procedure
512GB of RAM using Ubuntu 16.10 with kernel 10 times yields 10 queryset/dataset pairs.
4.4.0. Index building was multi-threaded, queries For the discussion of Q2{Q5 we select three
where answered sequentially in a single thread. di erent querysets from each dataset: (i) The</p>
      <p>Quality and Performance Metrics As qual- dataset that contains the 10 000 points with the
ity metric we measure the individual recall of each lowest estimated LID (which we denote easy ), 10 000
query, i.e., the fraction of points reported by the points around the data point with median estimated
implementation that are among the true k-NN. As LID (denoted medium), and the 10 000 points with
performance metric, we record individual query the largest estimated LID (dubbed hard ). Figure 1
times and the total number of distance computa- marks with a red line the LID used as a threshold
tions needed to answer all queries. We usually to build the hard queryset.
report on the throughput (the average number of
queries that can be answered in one second, in the 5.1 Robustness of Random Split Figure 2
plots denoted as QPS for queries per second ), but shows the result of ten cross-validation runs on
we will also inspect individual query times. random splits of GLOVE for ONNG and Annoy. The</p>
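      <p>In code, the two plot axes used throughout this section reduce to the following minimal sketch:</p>
      <preformat>
import numpy as np

def recall(reported, true_knn):
    """Fraction of the true k-NN that the implementation reported."""
    return len(set(reported).intersection(true_knn)) / len(true_knn)

def plot_point(query_times, recalls):
    """One point per parameter configuration: average recall (x-axis)
    versus throughput QPS, the inverse of the mean query time (y-axis)."""
    return float(np.mean(recalls)), 1.0 / float(np.mean(query_times))
      </preformat>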
      <p>Objectives of the Experiments. Our experiments are tailored to answer the following questions:</p>
      <list list-type="simple">
        <list-item><p>(Q1) How robust are measurements when splitting query set and dataset at random multiple times? (Section 5.1)</p></list-item>
        <list-item><p>(Q2) How does the LID of a query set influence the running time performance? (Section 5.2)</p></list-item>
        <list-item><p>(Q3) How diverse are measurements obtained on datasets? Do relative differences between the performance of different implementations stay the same over multiple datasets? (Section 5.3)</p></list-item>
        <list-item><p>(Q4) How well does the number of distance computations reflect the relative running time performance of the tested implementations? (Section 5.3)</p></list-item>
        <list-item><p>(Q5) How concentrated are quality and performance measures around their mean for the tested implementations? (Section 5.4)</p></list-item>
      </list>
      <p>Choosing Query Sets. All runs are based on 10 000 queries chosen from the dataset. Depending on the question we try to answer, we employ different selection strategies for choosing the query set. To answer Q1 we employ random splits as done in the cross-validation technique: 10 000 randomly selected points are used as queries, while the rest are used as data points. Repeating the procedure 10 times yields 10 queryset/dataset pairs. For the discussion of Q2-Q5 we select three different querysets from each dataset: (i) the 10 000 points with the lowest estimated LID (which we denote easy), (ii) the 10 000 points around the data point with median estimated LID (denoted medium), and (iii) the 10 000 points with the largest estimated LID (dubbed hard). Figure 1 marks with a red line the LID used as a threshold to build the hard queryset. A sketch of this selection is given below.</p>
      <sec id="sec-5-1">
        <title>5.1 Robustness of Random Split</title>
        <p>Figure 2 shows the result of ten cross-validation runs on random splits of GLOVE for ONNG and Annoy. The plot relates the average number of queries that can be answered in one second to the average recall achieved with a certain parameter configuration. This means that we want data points to be up and to the right. The plot shows that the 10 splits give very similar results and represent the performance of the implementation accurately. It is interesting to see that a query set representing less than 1% of the dataset points shows such a robust behavior. We conclude that the random split allows for robust measurements.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2 Influence of LID on Performance</title>
        <p>Figure 3 shows results for the influence of using only points with low, middle, and large estimated LID as query points, in SIFT and GLOVE-2M. We observe a clear influence of the LID of the query set on the performance: the larger the LID, the more down and to the left the graphs move. This means that, for higher LID, it is more expensive, in terms of time, to answer queries with good recall. For all datasets except GLOVE-2M, all implementations were still able to achieve close to perfect recall with the parameters set. This means that all but one of the tested datasets do not contain too many "noisy queries". Already the queries around the median prove challenging for most implementations. For the most difficult queries (according to LID), only IVF and ONNG achieve close to perfect recall on GLOVE-2M.</p>
        <p>Figure 4 reports on the results of ONNG and Annoy on selected datasets. Comparing results to the LID measurements depicted in Figure 1, the estimated median LID gives a good estimate on the relative performance of the algorithms on the datasets. As an exception, SIFT is much easier than predicted by its LID distribution. In particular for Annoy, the hard SIFT instance is as challenging as the medium GLOVE version. The easy version of GLOVE-2M turns out to be efficiently solvable by both implementations (taking about the same time as it takes to answer the hard instance of Fashion-MNIST, which has a much higher LID). From this, we cannot conclude that LID as a single indicator explains performance differences of an implementation on different datasets. However, more careful experimentation is needed before drawing a final conclusion. In our setting, the LID estimation is conducted for k = 1 000, while queries are only searching for the 10 nearest neighbors. Moreover, the estimation using MLE might not be accurate enough on these datasets. We leave the investigation of these two directions as future work.</p>
        <p>As a side note, we remark that Fashion-MNIST is as difficult to solve as MNIST for all implementations, and it is by far the easiest dataset for all implementations. Thus, while there is a big difference in the difficulty of solving the classification task [19], there is no measurable difference between these two datasets in the context of NN search.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3 Diversity of Results</title>
        <p>Figure 5 gives an overview over how algorithms compare to each other among all "medium difficulty" datasets. We consider two metrics, namely the number of queries per second (top plot), and the number of distance computations (bottom plot). For two different average recall thresholds (0.75 and 0.9) we then select, for each algorithm, the best performing parameter configuration that attains at least that recall. For each dataset, the plots report the ratio with the best performing algorithm on that dataset; therefore, the best performer is reported with ratio 1.</p>
        <p>[Figure 5: Ranking of algorithms on five different datasets, according to recall 0.75 and 0.9, and according to two different performance measures: number of queries per second (top) and number of distance computations (bottom). Both plots report the ratio with the best performing algorithm on each dataset: for the queries per second metric a larger ratio is better; for the number of distance computations metric, a smaller ratio is better.]</p>
        <p>Considering different datasets, we see that there is little variation in the ranking of the algorithms. Only the two graph-based approaches trade ranks; all other rankings are stable. Interestingly, Annoy makes much fewer distance computations but is consistently outperformed by IVF.³</p>
        <p>³ We note that IVF counts the initial comparisons to find the closest centroids as distance computations, whereas Annoy did not count the inner product computations during tree traversal. We plan to report on updated results in the workshop.</p>
        <p>Comparing the number of distance computations to running time performance, we see that an increase in the number of distance computations is not reflected in a proportional decrease in the number of queries per second. This means that the candidate set generation is in general more expensive for graph-based approaches, but the resulting candidate set is of much higher quality and fewer distance computations have to be carried out. Generally, both graph-based algorithms are within a factor 2 from each other, whereas the other two need much larger candidate lists to achieve a certain recall. The relative difference usually ranges from 5x to 30x more distance computations for the non-graph-based approaches, in particular at high recall. This translates well into the performance differences we see in this setting: consider for instance Figure 3, where the lines corresponding to HNSW and ONNG upper bound the lines relative to the other two algorithms.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4 Reporting the Distribution of Performance</title>
        <p>In the previous sections, we made extensive use of recall/queries-per-second plots, where each configuration of each algorithm results in a single point, namely the average recall and the inverse of the average query time. As we shall see in this section, concentrating on averages can hide interesting information in the context of k-NN queries. In fact, not all queries are equally difficult to answer. Consider the plots in Figure 6, which report the performance of the four algorithms⁴ on the GLOVE-2M dataset, medium difficulty. The left plot reports the recall versus the number of queries per second, and black dots correspond to the averages. Additionally, for each configuration, we report the distribution of the recall scores: the baseline of each recall curve is positioned at the corresponding queries-per-second performance. Similarly, the right plot reports on the inverse of the individual query times (the average of these is the QPS in the left plot) against the average recall. In both plots, the best performance is achieved towards the top-right corner.</p>
        <p>⁴ In order not to clutter the plots, we fixed parameters as follows: IVF: number of lists 8192; Annoy: number of trees 100; HNSW: efConstruction 500; ONNG: edge 100, outdegree 10, indegree 120.</p>
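        <p>The recall distributions underlying these curves are cheap to compute from the per-query recalls; a minimal sketch:</p>
        <preformat>
import numpy as np

def recall_histogram(per_query_recalls, bins=20):
    """Distribution of per-query recall for one parameter
    configuration, rather than only its mean."""
    hist, edges = np.histogram(per_query_recalls, bins=bins, range=(0.0, 1.0))
    return hist / len(per_query_recalls), edges   # fraction of queries per bin
        </preformat>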
        <p>Plotting the distributions, instead of just reporting the averages, uncovers some interesting behaviour that might otherwise go unnoticed, in particular with respect to the recall. The average recall gradually shifts towards the right as an effect of more and more queries achieving good recalls. Perhaps surprisingly, for graph-based algorithms this shift is very sudden: most queries go from having recall 0 to having recall 1, taking no intermediate values. Taking the average recall as a performance metric is convenient in that it is a single number to compare algorithms with. However, the same average recall can be attained with very different distributions: looking at such distributions can provide more insight.</p>
        <p>For the plot on the right, we observe that individual query times of Annoy, IVF, and ONNG are well concentrated around their mean. However, for HNSW query times fluctuate widely around their mean; in fact, the average query time is not well reflected in the query time distribution. For space reasons, we do not report other parameter configurations and datasets, which nonetheless show similar behaviours.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6 Summary</title>
      <p>In this paper we studied the influence of LID on the performance of nearest neighbor search algorithms. We showed that LID allows to choose query sets of a wide range of difficulty from a given dataset. We also showed how different LID distributions influence the running time performance of the algorithms. In this respect, we could not conclude that the LID alone can predict running time differences well. In particular, SIFT is usually easier for the algorithms, while GLOVE's LID distribution would predict it to be the easier dataset of the two. We introduced novel visualization techniques to show the uncertainty within the answer to a set of queries, which made it possible to show a clear difference between the graph-based algorithms and the other approaches.</p>
      <p>We hope that this study initiates the search for more diverse datasets, or for theoretical reasoning why certain algorithmic principles are generally better suited for nearest neighbor search. From a more practical side, we would be interested in seeing whether the LID can be used in the design of NN algorithms to guide the search process or parameter selection. For example, algorithms could adapt to the LID of the current set of k-NN found so far, stopping early or spending more time depending on the LID of the candidates.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Trans. Pattern</given-names>
            <surname>Anal</surname>
          </string-name>
          . Mach. Intell.
          <volume>33</volume>
          (
          <issue>1</issue>
          ),
          <volume>117</volume>
          {
          <fpage>128</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          (
          <year>2011</year>
          ) [1]
          <string-name>
            <surname>Amsaleg</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chelly</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Furon</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girard</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , [
          <volume>10</volume>
          ]
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Douze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jegou</surname>
          </string-name>
          , H.:
          <article-title>Billion-scale</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Houle</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kawarabayashi</surname>
            ,
            <given-names>K.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nett</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Esti- similarity search with gpus</article-title>
          .
          <source>CoRR abs/1702.08734</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>mating local intrinsic dimensionality</article-title>
          . In: KDD'
          <fpage>15</fpage>
          . (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          pp.
          <volume>29</volume>
          {
          <fpage>38</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2015</year>
          ) [11]
          <string-name>
            <surname>Johnson</surname>
          </string-name>
          , W.B.,
          <string-name>
            <surname>Lindenstrauss</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Schechtman, [2] Aumuller,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Bernhardsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Faithfull</surname>
          </string-name>
          , A.: G.:
          <article-title>Extensions of lipschitz maps into banach</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>ANN-Benchmarks</surname>
          </string-name>
          :
          <article-title>A Benchmarking Tool for spaces</article-title>
          .
          <source>Israel Journal of Mathematics</source>
          <volume>54</volume>
          (
          <issue>2</issue>
          ),
          <volume>129</volume>
          {
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Approximate</given-names>
            <surname>Nearest</surname>
          </string-name>
          <article-title>Neighbor Algorithms</article-title>
          . In:
          <volume>138</volume>
          (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>SISAP'17</source>
          . pp.
          <volume>34</volume>
          {
          <issue>49</issue>
          (
          <year>2017</year>
          ) [12]
          <string-name>
            <surname>Kriegel</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schubert</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zimek</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <source>The (black)</source>
          [3]
          <string-name>
            <surname>Bernhardsson</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          : Annoy, https://github.com
          <article-title>/ art of runtime evaluation: Are we comparing</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <article-title>spotify/annoy algorithms or implementations? Knowl</article-title>
          . Inf. Syst. [4]
          <string-name>
            <surname>Curtin</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cline</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Slagle</surname>
            ,
            <given-names>N.P.</given-names>
          </string-name>
          ,
          <source>March</source>
          ,
          <volume>52</volume>
          (
          <issue>2</issue>
          ),
          <volume>341</volume>
          {
          <fpage>378</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>W.B.</given-names>
            ,
            <surname>Ram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.A.</given-names>
            ,
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.G.</given-names>
            : ML- [13]
            <surname>Levina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Bickel</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.J.</surname>
          </string-name>
          : Maximum likelihood
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>PACK: A scalable C+</surname>
          </string-name>
          <article-title>+ machine learning library. estimation of intrinsic dimension</article-title>
          .
          <source>In: NIPS'15</source>
          . pp.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>J. of Machine Learning Research</source>
          <volume>14</volume>
          ,
          <volume>801</volume>
          {
          <fpage>805</fpage>
          (
          <year>2013</year>
          )
          <volume>777</volume>
          {
          <fpage>784</fpage>
          (
          <year>2005</year>
          ) [5]
          <string-name>
            <surname>Edel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soni</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Curtin</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          : An automatic [
          <volume>14</volume>
          ]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , Zhang,
          <string-name>
            <given-names>Y.</given-names>
            ,
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            ,
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>W.</surname>
          </string-name>
          , Zhang, W.,
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>benchmarking system</article-title>
          .
          <source>In: NIPS 2014 Workshop on Lin</source>
          ,
          <string-name>
            <surname>X.</surname>
          </string-name>
          :
          <article-title>Approximate nearest neighbor search on</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <article-title>Software Engineering for Machine Learning (</article-title>
          <year>2014</year>
          )
          <article-title>high dimensional data - experiments, analyses</article-title>
          , and [6]
          <string-name>
            <surname>Houle</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          : Dimensionality, discriminability,
          <source>improvement (v1.0)</source>
          .
          <source>CoRR abs/1610</source>
          .02455 (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>density and distance distributions</article-title>
          .
          <source>In: Data Mining</source>
          [
          <volume>15</volume>
          ]
          <string-name>
            <surname>Malkov</surname>
            ,
            <given-names>Y.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yashunin</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          : E cient and
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Workshops</surname>
          </string-name>
          (ICDMW). pp.
          <volume>468</volume>
          {
          <fpage>473</fpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2013</year>
          )
          <article-title>robust approximate nearest neighbor search using [7]</article-title>
          <string-name>
            <surname>Houle</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schubert</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zimek</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>On the Hierarchical Navigable Small World graphs</article-title>
          . ArXiv
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <article-title>correlation between local intrinsic dimensionality e-prints (</article-title>
          <year>Mar 2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>and outlierness</article-title>
          .
          <source>In: SISAP'18</source>
          . pp.
          <volume>177</volume>
          {
          <issue>191</issue>
          (
          <year>2018</year>
          ) [16]
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          : [8]
          <string-name>
            <surname>Iwasaki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miyazaki</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Optimization of In- E cient estimation of word representations in</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <article-title>dexing Based on k-Nearest Neighbor Graph for vector space</article-title>
          .
          <source>CoRR abs/1301</source>
          .3781 (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <article-title>Proximity Search in High-dimensional Data</article-title>
          .
          <source>ArXiv</source>
          [17]
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
          </string-name>
          , C.D.: Glove:
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          e-prints (
          <year>Oct 2018</year>
          )
          <article-title>Global vectors for word representation</article-title>
          . In: Em[9]
          <string-name>
            <surname>Jegou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Douze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Product pirical Methods in Natural Language Processing</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <article-title>quantization for nearest neighbor search</article-title>
          .
          <source>IEEE (EMNLP)</source>
          . pp.
          <volume>1532</volume>
          {
          <issue>1543</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>