=Paper=
{{Paper
|id=Vol-3946/TGD_paper5
|storemode=property
|title=ScaDANN: A Scalable Disk-based Graph Indexing Method for ANN
|pdfUrl=https://ceur-ws.org/Vol-3946/TGD-5.pdf
|volume=Vol-3946
|authors=Hyein Gu,Min-soo Kim
}}
==ScaDANN: A Scalable Disk-based Graph Indexing Method for ANN==
Hyein Gu¹, Min-soo Kim¹,* (hyein99@kaist.ac.kr; minsoo.k@kaist.ac.kr)

¹ Korea Advanced Institute of Science and Technology, Republic of Korea

* Corresponding author. ORCID: 0009-0009-8582-5805 (H. Gu); 0000-0002-5065-0226 (M. Kim)

Published in the Proceedings of the Workshops of the EDBT/ICDT 2025 Joint Conference (March 25-28, 2025), Barcelona, Spain.
Abstract
Approximate Nearest Neighbor (ANN) indexing constitutes a fundamental component of modern vector-based databases, facilitating
efficient and accurate information retrieval for applications such as retrieval-augmented generation. However, scaling graph-based
ANN indices to billion-scale datasets poses substantial challenges, including high memory demands and inefficiencies in handling
partitioned graphs. To address these issues, we propose ScaDANN, a scalable disk-based graph indexing method tailored for large-scale
datasets under limited memory conditions. ScaDANN introduces two novel techniques: overlapping block-level insertion and grid
block merge, which enable the efficient construction of unified graph indices while preserving high search performance. Our approach
achieves notable advancements in index construction speed, search accuracy, and memory efficiency, establishing ScaDANN as a robust
and effective solution for scalable ANN indexing in resource-limited environments.
Keywords
Graph, ANN, Nearest neighbor search, Vector search
1. Introduction

Large Language Models (LLMs) have revolutionized natural language processing by excelling in tasks such as text generation [1], summarization [2], and question answering [3]. Their ability to leverage pre-trained knowledge has facilitated applications across diverse domains, including healthcare [4], education [5], and customer service [6]. However, despite their strengths, LLMs face critical limitations, such as hallucination [7], where they generate plausible but incorrect outputs, outdated knowledge due to static training data, and privacy concerns when processing sensitive information [8].

Retrieval-Augmented Generation (RAG) [9] has been introduced as a promising framework to mitigate these limitations by integrating LLMs with external knowledge retrieval systems. By leveraging vector databases (VectorDBs) [10], RAG retrieves relevant and up-to-date information to enhance the accuracy and reliability of generated responses. This framework not only reduces the impact of hallucination but also addresses privacy concerns by accessing only the information necessary for a given query.

Approximate Nearest Neighbor (ANN) indexing serves as a foundational component in RAG systems, enabling efficient information retrieval from VectorDBs. ANN indexes can be broadly categorized into clustering-based [11, 12, 13], hash-based [14, 15, 16], and graph-based methods [17, 18, 19, 20, 21]. Clustering-based methods partition datasets into smaller clusters, facilitating memory-efficient index construction but often incurring longer construction times and reduced search accuracy. Hash-based methods employ hash functions to compress datasets and reduce dimensionality, which accelerates construction and search but at the cost of reduced accuracy due to compression. Graph-based methods, in contrast, represent vectors as nodes and their relationships as edges, connecting nodes based on proximity [22]. These methods construct neighbor lists through vector operations between nodes (as illustrated in Figure 1) and are particularly effective for large-scale datasets, offering superior search accuracy and performance despite higher memory demands during construction.

The need for scalable ANN indexing becomes increasingly apparent as data is segmented into smaller chunks to meet the input constraints of LLMs. For instance, dividing a 100 MB document into 4 KB chunks suitable for LLM input results in approximately 25,600 chunks, a 25,600-fold increase relative to the original document count. This dramatic growth in the number of chunks significantly increases the size of the ANN index, necessitating methods capable of efficiently handling billion-scale datasets.

Building large-scale ANN indexes, however, poses significant challenges [23, 24]. First, the substantial size of datasets and indexes requires considerable memory resources. Second, achieving competitive search performance on disk-based systems, which are necessary for handling such datasets, is particularly challenging. To address these issues, we propose ScaDANN, a Scalable Disk-based Graph Indexing Method for ANN, specifically designed to overcome these limitations.

This study pursues two primary objectives: (1) to enable the construction of large-scale ANN indexes in memory-constrained environments and (2) to enhance search performance through the use of merged graph structures that improve efficiency and accuracy during the search process. By focusing on these goals, we provide a scalable and efficient solution for constructing and querying ANN indexes in scenarios where both large-scale datasets and memory limitations are critical challenges.

This work makes the following key contributions:

• We propose a graph-based ANN method that leverages grid blocks to efficiently construct and scale billion-scale indexes in memory-constrained environments.
• Our approach outperforms existing disk-based methods, achieving index construction speeds 1.3 to 1.5 times faster and search speeds 1.2 to 1.4 times faster, while maintaining comparable accuracy.
• We demonstrate that our method enables the construction of single indexes up to 10 times larger than those of existing approaches within the same memory constraints.
These contributions mark a significant advancement in ANN indexing, particularly in terms of efficiency and scalability for large-scale, memory-constrained applications.
2. Preliminaries

2.1. ANN search

Nearest Neighbor (NN) search identifies the data point(s) in a dataset closest to a given query point based on a defined distance metric, such as Euclidean distance or cosine similarity. While exact NN search computes the distance between the query and every data point to determine the nearest neighbor, this approach becomes computationally infeasible for large datasets. ANN search addresses this challenge by finding points that are approximately close to the true nearest neighbors, significantly reducing computational costs. Although ANN sacrifices some accuracy, it delivers faster query response times, which is particularly advantageous in high-dimensional and large-scale datasets.

Definition 1 (ANN). Given a dataset D = {x₁, x₂, …, xₙ} and a query point q, the task of nearest neighbor search (whether exact or approximate) is to find the point x_NN ∈ D such that the distance d(q, x_NN) is minimized, where d(·, ·) represents a distance metric. In the case of ANN search, the algorithm guarantees that the returned point x* satisfies:

    d(q, x*) ≤ (1 + ε) · d(q, x_NN),

where ε is a small approximation factor, and x_NN is the exact nearest neighbor of the query point q.

To evaluate ANN algorithms, two primary metrics are used: Recall and Queries Per Second (QPS). Recall quantifies the effectiveness of an algorithm in retrieving true nearest neighbors, defined as the ratio of true nearest neighbors retrieved to the total number of true nearest neighbors. Higher recall indicates greater accuracy but may increase search time. QPS measures the system's throughput, i.e., the number of queries processed per second. Higher QPS reflects greater efficiency, which is critical for large-scale and real-time applications.
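To make these metrics concrete, the following short Python sketch (our illustration, not code from the paper) computes Recall@k against ground-truth neighbor lists and measures QPS as queries divided by wall-clock time:

```python
import time

def recall_at_k(retrieved, ground_truth, k):
    """Mean fraction of the true top-k neighbors found in the retrieved top-k."""
    hits = sum(len(set(r[:k]) & set(g[:k]))
               for r, g in zip(retrieved, ground_truth))
    return hits / (k * len(ground_truth))

def measure_qps(search_fn, queries):
    """Throughput of a search function in queries per second."""
    start = time.perf_counter()
    results = [search_fn(q) for q in queries]
    elapsed = time.perf_counter() - start
    return len(queries) / elapsed, results
```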
2.2. Existing graph ANN indexing methods

2.2.1. Vamana graph
Vamana [25] is a widely used graph-based ANN indexing algorithm and serves as the foundational graph structure in many recent ANN studies. Its construction is based on two key techniques: Greedy Search and Robust Pruning.

Greedy Search enables efficient exploration of nearest neighbors within the graph. Beginning from an initial node, the algorithm iteratively identifies and prioritizes nodes closest to the query. It maintains a priority queue of visited nodes, extracting the nearest unvisited node at each step, adding its neighbors to the queue, and repeating this process until all relevant nodes are explored.

Robust Pruning optimizes the graph by removing unnecessary edges discovered during Greedy Search. This technique utilizes the Sparse Neighborhood Graph (SNG) property [26] and applies a distance threshold parameter, alpha (α), to determine whether an edge should be retained:

    α · d(p*, p′) ≤ d(p, p′),  with α > 1.

Specifically, an edge between nodes p and p′ is removed if the distance between these nodes exceeds α times the distance between p′ and a third, already-retained node p*. By retaining longer edges under relaxed conditions, Vamana enhances search efficiency while maintaining graph connectivity.
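The sketch below is our simplified Python rendering of these two techniques (not the reference Vamana implementation; the real algorithm bounds exploration with a search-list parameter rather than the fixed `pool_size` used here). `robust_prune` drops a candidate p′ whenever a kept neighbor p* satisfies α · d(p*, p′) ≤ d(p, p′):

```python
import heapq

def greedy_search(graph, vectors, dist, query, start, pool_size):
    """Best-first traversal: repeatedly expand the unvisited node closest to the query."""
    visited, candidates = set(), []
    frontier = [(dist(vectors[start], query), start)]  # min-heap keyed by distance
    while frontier and len(candidates) < pool_size:
        d, node = heapq.heappop(frontier)
        if node in visited:
            continue
        visited.add(node)
        candidates.append((d, node))
        for nbr in graph[node]:  # enqueue the neighbors of the expanded node
            if nbr not in visited:
                heapq.heappush(frontier, (dist(vectors[nbr], query), nbr))
    return sorted(candidates)

def robust_prune(p, candidates, vectors, dist, alpha, max_degree):
    """Keep the closest candidate p*, then discard any p' it dominates."""
    remaining = sorted(candidates, key=lambda c: dist(vectors[p], vectors[c]))
    kept = []
    while remaining and len(kept) < max_degree:
        p_star = remaining.pop(0)
        kept.append(p_star)
        remaining = [pp for pp in remaining
                     if alpha * dist(vectors[p_star], vectors[pp])
                        > dist(vectors[p], vectors[pp])]
    return kept
```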
2.2.2. DiskANN

DiskANN [25] extends the Vamana graph to a disk-based environment, making it well-suited for large-scale datasets. It employs a divide-and-conquer approach, partitioning the dataset into subgraphs and merging them into a unified structure.

Figure 1 illustrates the partitioning and merging process of DiskANN. (1) The dataset is first divided into multiple overlapping partitions using a clustering technique, with each node assigned to at least two partitions. (2) Independent subgraphs are constructed within each partition. (3) These subgraphs are then merged by randomly selecting nodes from the partitioned graphs, followed by merging and pruning operations. This process updates the neighbor lists with merged results, ensuring connectivity between partitions while reducing memory and computational costs. DiskANN's partitioning and merging strategy enables scalable graph construction but introduces challenges in maintaining search performance across partitions.

Figure 1: DiskANN partitioning and merging process: (1) partitioning based on overlapping clustering (two overlapping clusters, each of size 8 in the example), (2) sub-graph construction, (3) merging via random shuffle and prune of the neighbor lists.
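As a rough sketch of step (1) (our simplification; DiskANN actually runs k-means over a sample of the data), each point can be assigned to its two closest cluster centers so that every point lands in at least two partitions:

```python
import numpy as np

def overlapping_partitions(points, centers, k_overlap=2):
    """Assign each point to its k_overlap closest centers.

    Returns center_id -> list of point ids; with k_overlap=2 every point
    appears in two partitions, mirroring DiskANN's overlapping clustering."""
    parts = {c: [] for c in range(len(centers))}
    for i, x in enumerate(points):
        d = np.linalg.norm(centers - x, axis=1)  # distance to every center
        for c in np.argsort(d)[:k_overlap]:      # the two nearest centers
            parts[int(c)].append(i)
    return parts
```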
2.2.3. Batch insertion for graph construction

Efficient parallelization is essential for graph construction in large-scale datasets. Batch insertion, as introduced in ParlayANN [27], addresses the challenge of concurrently adding nodes to a graph while ensuring conflict-free operations. Conflicts are avoided by grouping nodes into independent batches and processing each batch in isolation, ensuring that nodes within a batch do not interfere with each other during insertion.

The batch insertion process occurs in two steps. First, nodes in a batch are independently connected to their nearest neighbors within the existing graph. These connections are established in parallel, as nodes within a batch do not share dependencies, preventing conflicts during edge creation. This step ensures that all newly added nodes are integrated into the graph without disrupting its structure. Second, reversed edges are added to the graph. A reversed edge is defined as an edge where the source vertex and destination vertex are swapped, ensuring bidirectional connectivity between nodes. For example, if an edge connects node vᵢ (source) to node vⱼ (destination), the reversed edge connects vⱼ back to vᵢ. During this step, the neighbors of each newly added node are updated with reversed edges, completing the bidirectional connections required for robust graph-based indexing.

By separating the insertion of neighbors and the addition of reversed edges into distinct steps, batch insertion guarantees structural consistency while leveraging parallel processing. This approach is particularly effective in large-scale datasets, as it minimizes computational overhead while maintaining the integrity and quality of the graph. Moreover, the absence of interdependencies between batches ensures that the graph remains stable throughout the construction process, enabling efficient and scalable graph-based ANN indexing.
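A minimal single-threaded sketch of the two-step scheme follows (our own illustration; ParlayANN executes the first step in parallel across the batch, and `find_neighbors` stands in for greedy search plus pruning):

```python
def batch_insert(graph, batch, find_neighbors):
    """Insert a batch of nodes in two conflict-free steps.

    graph: node id -> list of neighbor ids (mutated in place).
    """
    # Step 1: every new node independently picks neighbors from the
    # existing graph. These are read-only lookups, so they cannot conflict.
    forward = {v: list(find_neighbors(v)) for v in batch}
    for v, nbrs in forward.items():
        graph[v] = nbrs

    # Step 2: add reversed edges so connectivity is bidirectional.
    for v, nbrs in forward.items():
        for u in nbrs:
            if v not in graph[u]:
                graph[u].append(v)  # a real system re-prunes if the degree cap is exceeded
    return graph
```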
3. ScaDANN Method

3.1. Grid partitioning

Figure 3 illustrates the structure of Graph ANN using a grid format, where each node in the graph consists of a base vector and a neighbor list. Grid partitioning divides large-scale datasets into grid blocks, with rows representing source vertices and columns representing destination vertices. This approach organizes datasets into manageable segments, enabling memory- and disk-efficient graph construction. For example, constructing a Vamana graph for a 1-billion-point dataset with 100 dimensions, requiring 400 GB for vector storage, is infeasible under memory constraints of 256 GB or 512 GB. However, grid partitioning ensures scalability by allowing the dataset to be partitioned into grid blocks, making it feasible to construct the graph even when the dataset size exceeds the available memory.

Each grid block captures intra- and inter-partition relationships. A grid block where the source and destination partitions are the same forms a subgraph, which follows the Vamana graph structure. Diagonal blocks represent subgraphs within individual partitions, while off-diagonal blocks denote connections between partitions. For example, the red block (p2, p1) in Figure 3 corresponds to connections from partition p2 (green) to partition p1 (blue). This structure improves scalability and ensures conflict-free parallel processing during node insertion.

Figure 3: Graph ANN with grid format. The grid format represents the structure of Graph ANN, where each node consists of a base vector and a neighbor list.
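The 1-billion-point example can be checked with a few lines of arithmetic. The sketch below is our illustration (the two-blocks-resident rule anticipates the overlapping insertion described next; ScaDANN itself sizes partitions dynamically, and neighbor lists add further memory on top of the raw vectors):

```python
def vector_storage_gb(n_points, dims, bytes_per_value=4):
    """Raw base-vector storage, e.g. 1e9 points x 100 dims x 4 B = 400 GB."""
    return n_points * dims * bytes_per_value / 1e9

def min_partitions(n_points, dims, memory_budget_gb, bytes_per_value=4):
    """Smallest P such that two of P equal partitions fit in memory together."""
    total_gb = vector_storage_gb(n_points, dims, bytes_per_value)
    P = 2
    while 2 * total_gb / P > memory_budget_gb:
        P += 1
    return P

print(vector_storage_gb(1_000_000_000, 100))    # 400.0 GB, as in the text
print(min_partitions(1_000_000_000, 100, 256))  # 4: two 100 GB blocks fit in 256 GB
```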
3.2. Overlapping block-level insertion

The overlapping block-level insertion method incrementally constructs the graph by processing data in grid block units. This method employs overlapping blocks, where a block from the previous step is reused in the next step, allowing two blocks to be processed simultaneously. As detailed in Algorithm 1, the process begins with the construction of the graph for the first partition (B1 = 0). This step (line 3) generates the subgraph for the first partition, ensuring that the intra-partition edges are established within the initial grid block.

In subsequent steps, pairs of overlapping blocks are processed to extend the graph. For each step, the data corresponding to the current blocks is loaded into memory to construct the graph (lines 9-11). In particular, line 11 fills four grid blocks by constructing the subgraphs for two partitions and establishing connections between them. This ensures that intra- and inter-partition edges are created simultaneously, improving efficiency and maintaining the global structure of the graph. Once the graph is constructed, the earlier block is offloaded from memory (lines 12-15), and the next block is loaded (lines 5-10).

By reusing overlapping blocks during the insertion process, ScaDANN integrates graph construction and merging into a single operation, eliminating the need for a separate, computationally expensive merging step. This approach ensures that inter-block connections are captured as the graph is built, thereby avoiding the creation of disjoint subgraphs. Additionally, the overlapping mechanism significantly reduces I/O overhead by combining insertion and merging, which are traditionally distinct steps.

Figure 2 illustrates the construction process. Initially, blocks 0 and 1 are loaded into memory, and edges between the two blocks are created. In the second step, block 0 is offloaded, block 1 is retained, and block 2 is loaded. The graph is extended based on connections between blocks 1 and 2. Finally, in the third step, block 1 is offloaded, block 2 is retained, and block 0 is reloaded to extend the graph further by connecting blocks 2 and 0. This iterative process efficiently captures all relevant connections between blocks, enabling the construction of a unified graph while minimizing memory usage and I/O costs.

Figure 2: ScaDANN building process with three partitions and corresponding grid blocks at each step. The graph is constructed by loading two partitions into memory at a time, ensuring efficient graph generation. In the next step, the previously used partition is overlapped with a new partition to continue the building process.
Algorithm 1: ScaDANN Construction

    Input:  number of blocks (partitions) P and dataset D
    Output: graph (neighbor lists NBR)

       /* Initialize first block and interval */
    1  first block number B1 = 0, second block number B2 = 1,
       block interval itv = 1, neighbor lists NBR_i = ∅;
       /* Read dataset for the first block */
    2  D1 = ReadDataset(B1);
    3  NBR1 = vamana(D1, NBR1);
    4  while itv < P do
    5      B2 = (B1 + itv) mod P;
           /* Increase interval size */
    6      if B1 = B2 then
    7          itv = itv * 2;
    8          continue;
           /* Read the second dataset and graph */
    9      D2 = ReadDataset(B2);
    10     NBR2 = ReadGraph(B2);
    11     NBR1, NBR2 = vamana((D1, D2), (NBR1, NBR2));
    12     WriteGraph(NBR1, B1);
    13     D1 = D2;
    14     NBR1 = NBR2;
    15     B1 = B2;
    16 WriteGraph(NBR1, B1);
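To make the schedule concrete, this small sketch (ours, not the authors' code) enumerates the overlapping block pairs for one pass at interval 1, matching Figure 2 for P = 3; Algorithm 1 additionally grows the interval (line 7) so that more distant block pairs are also connected:

```python
def overlap_schedule(P):
    """Yield (resident_block, newly_loaded_block) pairs for one pass at interval 1.
    Only the second block of each pair is read from disk; the first is reused."""
    b1 = 0
    for _ in range(P):
        b2 = (b1 + 1) % P
        yield b1, b2
        b1 = b2  # the newly loaded block stays resident for the next step

print(list(overlap_schedule(3)))  # [(0, 1), (1, 2), (2, 0)], as in Figure 2
```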
3.3. Grid block merge

The grid block merge process establishes connections between two blocks loaded into memory and ensures bidirectional connectivity through reverse edges. Reverse edges, where the source and destination vertices are swapped, maintain symmetric inter-block connections, ensuring consistent and efficient search performance.

In conventional methods, merging requires all data to be loaded into memory to calculate distances between nodes. ScaDANN overcomes this limitation by storing distance information directly within the neighbor list of each node. This pre-stored distance means that block 1 retains the neighbor list formed in step 1. For example, after step 1, v5 in block 1 (green) has already stored v3 and v2 from block 0 (yellow) as its neighbors. This allows distance comparisons between existing and new neighbors without reloading offloaded data, reducing memory usage and improving scalability.

Figure 4 illustrates the merging process during step 2 in Figure 2, where blocks 1 and 2 are loaded into memory while block 0 remains stored on disk. When updating the neighbor list of v5, the algorithm uses pre-stored distance data from step 1 to compare block 0 neighbors with block 2 neighbors, avoiding redundant distance calculations. As a result, v5's neighbors are updated to v11, v3, and v2.

Figure 4: Grid block merge process with distance information. Neighbor lists store (neighbor ID, distance) pairs; block 0 is stored on disk (not loaded in main memory) while blocks 1 and 2 are in memory, with a maximum degree of 3 in the example.

The merging process follows the Vamana graph construction method, incorporating a greedy search to identify candidate neighbors and a robust pruning step to refine the neighbor list. Inter-block edges are established by adding neighbors from the second block to the neighbor list of the first block. The reverse edges are then added, ensuring bidirectional connectivity, where nodes in the first block are also linked as neighbors of nodes in the second block. This comprehensive approach ensures a robust and efficient merging process.
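Roughly, storing (neighbor ID, distance) pairs lets the merge rank an offloaded block's neighbors against fresh candidates without re-reading the offloaded vectors. The sketch below is our illustration and uses a simple distance cutoff; the actual merge applies Vamana's greedy search and robust pruning as described above:

```python
import numpy as np

def merge_neighbors(stored, candidates, new_vectors, query_vec, max_degree):
    """Merge a node's stored neighbor list with candidates from the new block.

    stored:      (neighbor_id, distance) pairs kept from an earlier step, so
                 the offloaded block's vectors never need to be reloaded.
    candidates:  neighbor ids found in the newly loaded block.
    new_vectors: id -> vector, for the newly loaded block only.
    """
    merged = list(stored)   # reuse pre-stored distances for offloaded neighbors
    for nid in candidates:  # distances to new-block neighbors computed in memory
        merged.append((nid, float(np.linalg.norm(query_vec - new_vectors[nid]))))
    merged.sort(key=lambda pair: pair[1])  # closest first
    return merged[:max_degree]
```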
3.4. Complexity

Table 1 compares the time and space complexities of Vamana, DiskANN, and ScaDANN in the index-building process, focusing on scalability and efficiency. The space complexity considers storage requirements, including neighbor lists.

Vamana constructs a single large-scale graph, resulting in a time complexity of O(S^1.16) and a space complexity of O(S · R). Since it does not partition the dataset, the entire graph must fit in memory, making it less scalable for large-scale data. Both DiskANN and ScaDANN partition the dataset to construct subgraphs. Partitioning reduces memory usage and enables disk-based processing. The time complexity is inversely proportional to the number of partitions, as increasing P allows for more localized graph construction, reducing the computational cost.

In DiskANN, each node is assigned to multiple overlapping partitions (typically K = 2), as shown in Figure 1. Since DiskANN performs clustering to determine partitions, the build time complexity includes the clustering time (C). In contrast, ScaDANN utilizes dynamic partitioning, where the data structure size is determined based on the dataset size. This process does not require separate clustering and is treated as an internal computational step, which is why it is not included in the time complexity. Unlike DiskANN, which incurs overhead due to overlapping partitions, ScaDANN eliminates these redundancies, streamlining subgraph construction and merging. This results in a computationally efficient process that scales well with increasing dataset size.

Methods   Build Time                         Space
Vamana    O(S^1.16)                          O(S · R)
DiskANN   O(K^1.16 · P^-0.84 · S^1.16 + C)   O(S · R · K · P^-1)
ScaDANN   O(P^-0.84 · S^1.16)                O(S · R · P^-1)

Table 1: Comparison of time and space complexities for Vamana, DiskANN, and ScaDANN, where S is the dataset size, R is the maximum degree of the graph, K is the number of overlapping partitions per node, P is the number of partitions, and C is the clustering time.
4. Experiments

4.1. Experimental setup

The experiments were conducted to evaluate the performance of ScaDANN under controlled conditions. The hardware environment included two Intel Xeon Gold 6326 16-core processors and 1.9 TB of main memory, while the software environment was based on Ubuntu 20.04.6 LTS. To simulate memory-constrained scenarios, the available memory was deliberately restricted to 256 GB during the experiments. Under these constraints, both Vamana and ParlayANN encountered Out of Memory (OOM) errors, making them unsuitable for comparison. As a result, our evaluation focused on comparing ScaDANN with DiskANN, a state-of-the-art disk-based ANN indexing method.

Table 2 provides a summary of the datasets used in the experiments, including their data types and dimensions. The datasets used were BIGANN, Microsoft SPACEV, Yandex DEEP, and Yandex Text-to-Image, each posing unique challenges for ANN indexing. BIGANN consists of 1 billion 128-dimensional SIFT descriptors from a large-scale image dataset, serving as a benchmark for high-dimensional similarity search. Microsoft SPACEV includes 1 billion 100-dimensional document and query vectors reflecting real-world web search scenarios. Yandex DEEP comprises 1 billion 96-dimensional image descriptors designed to evaluate deep learning-based image representations. Yandex Text-to-Image contains 1 billion multi-modal embeddings, where image embeddings serve as the database and text embeddings form the query set, representing a cross-modal retrieval task with distinct distributions for database and query vectors.

Dataset                Data type   Dimensions
BigANN                 uint8       128
Microsoft SPACEV       int8        100
Yandex DEEP            float32     96
Yandex Text-to-Image   float32     200

Table 2: Summary of datasets used in the experiments.
4.2. Results and Analysis

4.2.1. Billion-scale index building in limited memory

Figure 5 presents the results of indexing at a billion scale under a 256 GB memory constraint. The top graph in Figure 5 illustrates the index building time across different datasets. ScaDANN significantly reduced the index-building time compared to DiskANN across all datasets. For example, on the BIGANN dataset, ScaDANN completed the indexing task in 10.7 hours, compared to 16.5 hours for DiskANN, achieving approximately a 35% reduction in build time. This efficiency improvement can be attributed to ScaDANN's optimized data partitioning and overlapping block techniques, which minimize disk I/O and improve memory access patterns.

The bottom graph in Figure 5 shows the memory usage during the index construction process. Both DiskANN and ScaDANN employed dynamic partitioning, which adjusts the number of partitions based on available memory capacity and efficiently divides the data. Since the graph size and memory usage vary depending on the degree of the neighbor list, dynamic partitioning enables optimized partitioning by adapting to these variations. Notably, ScaDANN required a higher number of partitions than DiskANN. This is because ScaDANN stores additional distance information within the graph, which increases memory consumption per block. As a result, ScaDANN divides the data into more partitions than DiskANN to accommodate the additional memory usage. For instance, on the BIGANN dataset, ScaDANN used around 200 GB of memory for indexing, while DiskANN remained under the 256 GB limit but required fewer partitions to store its graph.

Figure 5: Billion-scale index building experiment results, including building time and memory usage, conducted under a 256 GB memory constraint (building time: DiskANN 16.5/18.8/16.3 hours vs. ScaDANN 10.7/14.6/11.3 hours on BIGANN-1B/MSSPACEV-1B/DEEP-1B). The memory usage values indicate the number of partitions dynamically determined during the indexing process.

We conducted a detailed comparison of the index building process with DiskANN, as illustrated in Figure 6. In the subgraph construction phase, our method completed approximately twice as fast as DiskANN, with the BIGANN dataset showing a significant speedup. However, the subgraph merging phase took longer in our approach. This is due to the finer-grained distance comparisons performed during the merging process, which operates at the level of grid blocks. Unlike DiskANN, which uses a more straightforward merging process, ScaDANN's overlapping block approach requires additional computations during the merging phase to ensure accurate distance calculations between partitioned blocks.

In terms of I/O costs, our method demonstrated greater efficiency. This improvement can be attributed to the use of overlapping block techniques, which optimize data loading between memory and disk. Specifically, ScaDANN's approach minimizes unnecessary disk reads and efficiently manages memory, leading to a faster construction process despite the increased memory overhead. This is evident in the time breakdown in Figure 6, where ScaDANN exhibits a more balanced distribution of time between partitioning, subgraph building, and merging, compared to DiskANN, which spends a significant portion of time in I/O operations.

Figure 6: Time taken for each step in the building process on BIGANN-1B, categorized into partitioning, sub-graph building, merging, and I/O time (DiskANN: 3/886/27/72 minutes vs. ScaDANN: 0/373/208/16 minutes, respectively).
Figure 7: Disk-based search performance on BIGANN-1B, MSSPACEV-1B, and DEEP-1B. The graph compares Recall@10 (x-axis) and QPS (y-axis), where higher placement indicates lower latency at a given recall level. A method positioned higher in the graph achieves faster query processing while maintaining similar accuracy.
4.2.2. Disk-based search performance

Figure 7 presents the comparison of disk-based search performance between ScaDANN and DiskANN. The experiments were carried out using the disk search method for evaluation. Our proposed ScaDANN method demonstrated better search performance than DiskANN for most datasets in terms of QPS, especially for the BIGANN-1B dataset. As shown in the graph, ScaDANN achieved a higher QPS while maintaining competitive recall rates. For example, ScaDANN processed queries at a faster rate than DiskANN for BIGANN-1B and MSSPACEV-1B, achieving a significant speedup while still maintaining similar or slightly better recall at top-10 results. However, for the DEEP-1B dataset, ScaDANN showed a slight performance drop when Recall@10 exceeded 95. This discrepancy can be attributed to the disk search method used, which is specifically optimized for DiskANN.

4.3. Ablation study

4.3.1. Impact of partition size on index construction

Figure 8 illustrates the effects of varying the number of partitions on the BIGANN 100M dataset. Increasing the number of partitions led to a significant increase in building time, with approximately a 70% increase per step. However, RAM usage decreased by around 40% per step, demonstrating the trade-off between memory efficiency and construction time. Despite the changes in partition numbers, there were no significant differences observed in QPS relative to Recall@10 across the tested configurations. This indicates that increasing the partition number primarily affects the construction phase without adversely impacting search performance.

Figure 8: Experimental results of BIGANN 100M index building with varying numbers of partitions (build time: 33/42/86 minutes and RAM usage: 56/31/18 GB for 4/8/16 partitions, respectively), together with QPS versus Recall@10 for each configuration.
4.3.2. In-memory graph construction experiments

The in-memory graph construction experiments on BIGANN-100M and Yandex Text-to-Image-100M compare ParlayANN, ScaDANN, and DiskANN in terms of indexing time and query performance. Since ParlayANN encounters out-of-memory (OOM) issues at the billion scale, the experiments were conducted on 100 million data points without partitioning for ScaDANN and DiskANN. ScaDANN achieves indexing speeds comparable to or faster than ParlayANN by storing distances within the graph, reducing redundant computations. Both methods also benefit from batch insertion, leading to faster construction than DiskANN. ScaDANN not only achieves a faster build time but also maintains strong search performance, striking a balance between efficient indexing and high query throughput.

Figure 9: In-memory experimental results of BIGANN 100M and Yandex Text-to-Image 100M. The numbers in parentheses next to each method indicate the index building time in minutes (BIGANN-100M: ParlayANN 40, ScaDANN 42, DiskANN 55; Text-to-Image-100M: ParlayANN 50, ScaDANN 42, DiskANN 82).

5. Conclusion

This study proposed ScaDANN, a scalable disk-based graph indexing method designed for billion-scale ANN datasets in memory-constrained environments. By introducing overlapping block-level insertion and grid block merge, ScaDANN effectively mitigates the high memory demands and inefficiencies of existing methods. These techniques enable seamless integration of blocks and efficient merging of partitioned graphs while minimizing I/O overhead and leveraging stored distance information.

Experimental results demonstrated that ScaDANN achieves up to 1.5× faster index construction and 1.4× higher query throughput compared to DiskANN, while maintaining competitive accuracy across diverse datasets. Furthermore, ScaDANN supports the construction of single indexes 10 times larger than existing approaches within the same memory constraints, making it a robust solution for large-scale ANN indexing. This work establishes ScaDANN as an efficient and scalable approach, with significant potential for real-world applications requiring high-performance ANN indexing.
References

[1] W. Yu, C. Zhu, Z. Li, Z. Hu, Q. Wang, H. Ji, M. Jiang, A survey of knowledge-enhanced text generation, ACM Computing Surveys 54 (11s) (2022) 1-38.
[2] H. Jin, Y. Zhang, D. Meng, J. Wang, J. Tan, A comprehensive survey on process-oriented automatic text summarization with exploration of llm-based methods, arXiv preprint arXiv:2403.02901 (2024).
[3] Y. Zhuang, Y. Yu, K. Wang, H. Sun, C. Zhang, Toolqa: A dataset for llm question answering with external tools, Advances in Neural Information Processing Systems 36 (2023) 50117-50143.
[4] M. Cascella, J. Montomoli, V. Bellini, E. Bignami, Evaluating the feasibility of chatgpt in healthcare: an analysis of multiple clinical and research scenarios, Journal of Medical Systems 47 (1) (2023) 33.
[5] M. S. Orenstrakh, O. Karnalim, C. A. Suarez, M. Liut, Detecting llm-generated text in computing education: Comparative study for chatgpt cases, in: 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), IEEE, 2024, pp. 121-126.
[6] K. Pandya, M. Holia, Automating customer service using langchain: Building custom open-source gpt chatbot for organizations, arXiv preprint arXiv:2310.05421 (2023).
[7] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, et al., A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions, ACM Transactions on Information Systems (2023).
[8] Y. Yao, J. Duan, K. Xu, Y. Cai, Z. Sun, Y. Zhang, A survey on large language model (llm) security and privacy: The good, the bad, and the ugly, High-Confidence Computing (2024) 100211.
[9] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive nlp tasks, Advances in Neural Information Processing Systems 33 (2020) 9459-9474.
[10] J. J. Pan, J. Wang, G. Li, Survey of vector database management systems, The VLDB Journal 33 (5) (2024) 1591-1615.
[11] Q. Chen, B. Zhao, H. Wang, M. Li, C. Liu, Z. Li, M. Yang, J. Wang, Spann: Highly-efficient billion-scale approximate nearest neighborhood search, Advances in Neural Information Processing Systems 34 (2021) 5199-5212.
[12] J. V. Munoz, M. A. Gonçalves, Z. Dias, R. d. S. Torres, Hierarchical clustering-based graphs for large scale approximate nearest neighbor search, Pattern Recognition 96 (2019) 106970.
[13] S. Yu, J. Engels, Y. Huang, J. Shun, Pecann: Parallel efficient clustering with graph-based approximate nearest neighbor search, arXiv preprint arXiv:2312.03940 (2023).
[14] H. Jegou, M. Douze, C. Schmid, Product quantization for nearest neighbor search, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (1) (2010) 117-128.
[15] Y. Kalantidis, Y. Avrithis, Locally optimized product quantization for approximate nearest neighbor search, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2321-2328.
[16] A. Babenko, V. Lempitsky, The inverted multi-index, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (6) (2014) 1247-1260.
[17] W. Dong, C. Moses, K. Li, Efficient k-nearest neighbor graph construction for generic similarity measures, in: Proceedings of the 20th International Conference on World Wide Web, 2011, pp. 577-586.
[18] C. Fu, C. Xiang, C. Wang, D. Cai, Fast approximate nearest neighbor search with the navigating spreading-out graph, arXiv preprint arXiv:1707.00143 (2017).
[19] Y. Malkov, A. Ponomarenko, A. Logvinov, V. Krylov, Approximate nearest neighbor algorithm based on navigable small world graphs, Information Systems 45 (2014) 61-68.
[20] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (4) (2018) 824-836.
[21] J. Chen, H.-r. Fang, Y. Saad, Fast approximate knn graph construction for high dimensional data via recursive lanczos bisection, Journal of Machine Learning Research 10 (9) (2009).
[22] M. Wang, X. Xu, Q. Yue, Y. Wang, A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search, arXiv preprint arXiv:2101.12631 (2021).
[23] K. Echihabi, K. Zoumpatianos, T. Palpanas, New trends in high-d vector similarity search: ai-driven, progressive, and distributed, Proceedings of the VLDB Endowment 14 (12) (2021) 3198-3201.
[24] H. V. Simhadri, G. Williams, M. Aumüller, M. Douze, A. Babenko, D. Baranchuk, Q. Chen, L. Hosseini, R. Krishnaswamy, G. Srinivasa, et al., Results of the neurips'21 challenge on billion-scale approximate nearest neighbor search, in: NeurIPS 2021 Competitions and Demonstrations Track, PMLR, 2022, pp. 177-189.
[25] S. Jayaram Subramanya, F. Devvrit, H. V. Simhadri, R. Krishnaswamy, R. Kadekodi, Diskann: Fast accurate billion-point nearest neighbor search on a single node, Advances in Neural Information Processing Systems 32 (2019).
[26] S. Arya, D. M. Mount, Approximate nearest neighbor queries in fixed dimensions, in: SODA, Vol. 93, Citeseer, 1993, pp. 271-280.
[27] M. D. Manohar, Z. Shen, G. Blelloch, L. Dhulipala, Y. Gu, H. V. Simhadri, Y. Sun, Parlayann: Scalable and deterministic parallel graph-based approximate nearest neighbor search algorithms, in: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024, pp. 270-285.