ScaDANN: A Scalable Disk-Based Graph Indexing Method for ANN

Hyein Gu¹, Min-soo Kim¹ ∗
¹ Korea Advanced Institute of Science and Technology, Republic of Korea

Abstract
Approximate Nearest Neighbor (ANN) indexing constitutes a fundamental component of modern vector-based databases, facilitating efficient and accurate information retrieval for applications such as retrieval-augmented generation. However, scaling graph-based ANN indices to billion-scale datasets poses substantial challenges, including high memory demands and inefficiencies in handling partitioned graphs. To address these issues, we propose ScaDANN, a scalable disk-based graph indexing method tailored for large-scale datasets under limited memory conditions. ScaDANN introduces two novel techniques: overlapping block-level insertion and grid block merge, which enable the efficient construction of unified graph indices while preserving high search performance. Our approach achieves notable advancements in index construction speed, search accuracy, and memory efficiency, establishing ScaDANN as a robust and effective solution for scalable ANN indexing in resource-limited environments.

Keywords: Graph, ANN, Nearest neighbor search, Vector search

Published in the Proceedings of the Workshops of the EDBT/ICDT 2025 Joint Conference (March 25-28, 2025), Barcelona, Spain.
∗ Corresponding author.
hyein99@kaist.ac.kr (H. Gu); minsoo.k@kaist.ac.kr (M. Kim)
ORCID: 0009-0009-8582-5805 (H. Gu); 0000-0002-5065-0226 (M. Kim)
Copyright © 2025 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

Large Language Models (LLMs) have revolutionized natural language processing by excelling in tasks such as text generation [1], summarization [2], and question answering [3]. Their ability to leverage pre-trained knowledge has facilitated applications across diverse domains, including healthcare [4], education [5], and customer service [6]. However, despite their strengths, LLMs face critical limitations, such as hallucination [7], where they generate plausible but incorrect outputs, outdated knowledge due to static training data, and privacy concerns when processing sensitive information [8].

Retrieval-Augmented Generation (RAG) [9] has been introduced as a promising framework to mitigate these limitations by integrating LLMs with external knowledge retrieval systems. By leveraging vector databases (VectorDBs) [10], RAG retrieves relevant and up-to-date information to enhance the accuracy and reliability of generated responses. This framework not only reduces the impact of hallucination but also addresses privacy concerns by accessing only the information necessary for a given query.

Approximate Nearest Neighbor (ANN) indexing serves as a foundational component in RAG systems, enabling efficient information retrieval from VectorDBs. ANN indexes can be broadly categorized into clustering-based [11, 12, 13], hash-based [14, 15, 16], and graph-based methods [17, 18, 19, 20, 21]. Clustering-based methods partition datasets into smaller clusters, facilitating memory-efficient index construction but often incurring longer construction times and reduced search accuracy. Hash-based methods employ hash functions to compress datasets and reduce dimensionality, which accelerates construction and search but at the cost of reduced accuracy due to compression. Graph-based methods, in contrast, represent vectors as nodes and their relationships as edges, connecting nodes based on proximity [22]. These methods construct neighbor lists through vector operations between nodes (as illustrated in Figure 1) and are particularly effective for large-scale datasets, offering superior search accuracy and performance despite higher memory demands during construction.

The need for scalable ANN indexing becomes increasingly apparent as data is segmented into smaller chunks to meet the input constraints of LLMs. For instance, dividing a 100 MB document into 4 KB chunks suitable for LLM input results in approximately 25,600 chunks, a 25,600-fold increase relative to the original document count. This rapid growth in the number of chunks significantly increases the size of the ANN index, necessitating methods capable of efficiently handling billion-scale datasets.

Building large-scale ANN indexes, however, poses significant challenges [23, 24]. First, the substantial size of datasets and indexes requires considerable memory resources. Second, achieving competitive search performance on disk-based systems, which are necessary for handling such datasets, is particularly challenging. To address these issues, we propose ScaDANN, a Scalable Disk-based Graph Indexing Method for ANN, specifically designed to overcome these limitations.

This study pursues two primary objectives: (1) to enable the construction of large-scale ANN indexes in memory-constrained environments and (2) to enhance search performance through the use of merged graph structures that improve efficiency and accuracy during the search process. By focusing on these goals, we provide a scalable and efficient solution for constructing and querying ANN indexes in scenarios where both large-scale datasets and memory limitations are critical challenges.

This work makes the following key contributions:

• We propose a graph-based ANN method that leverages grid blocks to efficiently construct and scale billion-scale indexes in memory-constrained environments.
• Our approach outperforms existing disk-based methods, achieving index construction speeds 1.3 to 1.5 times faster and search speeds 1.2 to 1.4 times faster, while maintaining comparable accuracy.
• We demonstrate that our method enables the construction of single indexes up to 10 times larger than those of existing approaches within the same memory constraints.

These contributions mark a significant advancement in ANN indexing, particularly in terms of efficiency and scalability for large-scale, memory-constrained applications.

2. Preliminaries

2.1. ANN search

Nearest Neighbor (NN) search identifies the data point(s) in a dataset closest to a given query point based on a defined distance metric, such as Euclidean distance or cosine similarity. While exact NN search computes the distance between the query and every data point to determine the nearest neighbor, this approach becomes computationally infeasible for large datasets. ANN search addresses this challenge by finding points that are approximately close to the true nearest neighbors, significantly reducing computational costs. Although ANN sacrifices some accuracy, it delivers faster query response times, which is particularly advantageous in high-dimensional and large-scale datasets.

Definition 1 (ANN). Given a dataset D = {x_1, x_2, …, x_n} and a query point q, the task of nearest neighbor search (whether exact or approximate) is to find the point x_NN ∈ D such that the distance d(q, x_NN) is minimized, where d(⋅, ⋅) represents a distance metric. In the case of ANN search, the algorithm guarantees that the returned point x* satisfies:

    d(q, x*) ≤ (1 + ε) ⋅ d(q, x_NN),

where ε is a small approximation factor, and x_NN is the exact nearest neighbor of the query point q.

To evaluate ANN algorithms, two primary metrics are used: Recall and Queries Per Second (QPS). Recall quantifies the effectiveness of an algorithm in retrieving true nearest neighbors, defined as the ratio of true nearest neighbors retrieved to the total number of true nearest neighbors. Higher recall indicates greater accuracy but may increase search time. QPS measures the system's throughput, i.e., the number of queries processed per second. Higher QPS reflects greater efficiency, which is critical for large-scale and real-time applications.
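As a concrete reference for these metrics, the recall computation and the exact ground truth it is measured against can be sketched in a few lines (an illustrative sketch; the function names are ours, not from the paper):

```python
import numpy as np

def brute_force_knn(database, query, k=10):
    """Exact k-NN by exhaustive Euclidean distance; this is the ground
    truth that an ANN index only approximates."""
    dists = np.linalg.norm(database - query, axis=1)
    return np.argsort(dists)[:k]

def recall_at_k(retrieved_ids, true_ids, k=10):
    """Recall@k: fraction of the true top-k neighbors present in the
    retrieved top-k."""
    return len(set(retrieved_ids[:k]) & set(true_ids[:k])) / k
```

An exact search trivially achieves Recall@k = 1.0; an ANN index trades some of that recall for much higher QPS.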
2.2. Existing graph ANN indexing methods

2.2.1. Vamana graph

Vamana [25] is a widely used graph-based ANN indexing algorithm and serves as the foundational graph structure in many recent ANN studies. Its construction is based on two key techniques: Greedy Search and Robust Pruning.

Greedy Search enables efficient exploration of nearest neighbors within the graph. Beginning from an initial node, the algorithm iteratively identifies and prioritizes nodes closest to the query. It maintains a priority queue of visited nodes, extracting the nearest unvisited node at each step, adding its neighbors to the queue, and repeating this process until all relevant nodes are explored.

Robust Pruning optimizes the graph by removing unnecessary edges discovered during Greedy Search. This technique utilizes the Sparse Neighborhood Graph (SNG) property [26] and applies a distance threshold parameter, alpha (α), to determine whether an edge should be retained. Specifically, an edge between nodes p and p′ is removed if a third node p∗ satisfies

    α ⋅ d(p∗, p′) ≤ d(p, p′)    (α > 1),

that is, if the distance between p and p′ exceeds α times the distance between p′ and p∗. By retaining longer edges under this relaxed condition, Vamana enhances search efficiency while maintaining graph connectivity.
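The two Vamana building blocks above can be sketched as follows. This is a simplified illustration rather than the DiskANN/Vamana implementation; `beam_width`, `max_degree`, and the helper names are ours, but the pruning test mirrors the α condition stated above:

```python
import heapq
import numpy as np

def greedy_search(graph, vectors, start, query, beam_width=8):
    """Beam search over the graph: repeatedly expand the closest
    unexplored candidate, keeping the beam_width nearest nodes seen."""
    d = lambda i: float(np.linalg.norm(vectors[i] - query))
    best = {start: d(start)}                  # node -> distance to query
    candidates = [(best[start], start)]       # min-heap frontier
    visited = set()
    while candidates:
        _, node = heapq.heappop(candidates)
        if node in visited:
            continue
        visited.add(node)
        for nbr in graph[node]:
            if nbr not in best:
                best[nbr] = d(nbr)
                heapq.heappush(candidates, (best[nbr], nbr))
        # retain only the beam_width closest nodes discovered so far
        best = dict(sorted(best.items(), key=lambda kv: kv[1])[:beam_width])
        candidates = [(v, k) for k, v in best.items() if k not in visited]
        heapq.heapify(candidates)
    return sorted(best, key=best.get), visited

def robust_prune(p, candidates, vectors, alpha=1.2, max_degree=8):
    """Keep a diverse neighbor list: a candidate c is dropped when an
    already selected neighbor p_star satisfies alpha*d(p_star, c) <= d(p, c)."""
    dist = lambda a, b: float(np.linalg.norm(vectors[a] - vectors[b]))
    pool = sorted(set(candidates) - {p}, key=lambda c: dist(p, c))
    selected = []
    while pool and len(selected) < max_degree:
        p_star = pool.pop(0)                  # closest remaining candidate
        selected.append(p_star)
        pool = [c for c in pool if alpha * dist(p_star, c) > dist(p, c)]
    return selected
```

During construction, the nodes visited by the greedy search form the candidate set that robust pruning then trims to the maximum degree.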
2.2.2. DiskANN

DiskANN [25] extends the Vamana graph to a disk-based environment, making it well-suited for large-scale datasets. It employs a divide-and-conquer approach, partitioning the dataset into subgraphs and merging them into a unified structure.

Figure 1 illustrates the partitioning and merging process of DiskANN. (1) The dataset is first divided into multiple overlapping partitions using a clustering technique, with each node assigned to at least two partitions. (2) Independent subgraphs are constructed within each partition. (3) These subgraphs are then merged by randomly selecting nodes from the partitioned graphs, followed by merging and pruning operations. This process updates the neighbor lists with merged results, ensuring connectivity between partitions while reducing memory and computational costs. DiskANN's partitioning and merging strategy enables scalable graph construction but introduces challenges in maintaining search performance across partitions.

Figure 1: DiskANN partitioning and merging process. (1) Partitioning based on overlapping clustering (two overlapping clusters, each of size 8); (2) sub-graph construction; (3) merging via random shuffle and prune, updating the neighbor lists.

2.2.3. Batch insertion for graph construction

Efficient parallelization is essential for graph construction in large-scale datasets. Batch insertion, as introduced in ParlayANN [27], addresses the challenge of concurrently adding nodes to a graph while ensuring conflict-free operations. Conflicts are avoided by grouping nodes into independent batches and processing each batch in isolation, ensuring that nodes within a batch do not interfere with each other during insertion.

The batch insertion process occurs in two steps. First, nodes in a batch are independently connected to their nearest neighbors within the existing graph. These connections are established in parallel, as nodes within a batch do not share dependencies, preventing conflicts during edge creation. This step ensures that all newly added nodes are integrated into the graph without disrupting its structure. Second, reversed edges are added to the graph. A reversed edge is defined as an edge where the source vertex and destination vertex are swapped, ensuring bidirectional connectivity between nodes. For example, if an edge connects node v_i (source) to node v_j (destination), the reversed edge connects v_j back to v_i. During this step, the neighbors of each newly added node are updated with reversed edges, completing the bidirectional connections required for robust graph-based indexing.
By separating the insertion of neighbors and the addition of reversed edges into distinct steps, batch insertion guarantees structural consistency while leveraging parallel processing. This approach is particularly effective in large-scale datasets, as it minimizes computational overhead while maintaining the integrity and quality of the graph. Moreover, the absence of interdependencies between batches ensures that the graph remains stable throughout the construction process, enabling efficient and scalable graph-based ANN indexing.
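The two-phase scheme can be sketched minimally as follows (our illustration, not ParlayANN's code; phase 1 is conflict-free in parallel because batch nodes only read the existing graph, and phase 2 adds the reversed edges afterwards; real systems would also re-prune overfull lists):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def insert_batch(graph, vectors, batch, max_degree=8):
    """Two-phase batch insertion sketch.
    Phase 1: each batch node independently finds its nearest existing
             nodes (embarrassingly parallel; batch nodes never touch
             each other's neighbor lists).
    Phase 2: reversed edges are added so every new edge is bidirectional."""
    existing = list(graph)                    # snapshot of current node ids

    def nearest(v):
        d = np.linalg.norm(vectors[existing] - vectors[v], axis=1)
        return [existing[i] for i in np.argsort(d)[:max_degree]]

    with ThreadPoolExecutor() as pool:        # phase 1: parallel, conflict-free
        neighbor_lists = list(pool.map(nearest, batch))
    for v, nbrs in zip(batch, neighbor_lists):
        graph[v] = nbrs
    for v, nbrs in zip(batch, neighbor_lists):  # phase 2: reversed edges
        for u in nbrs:
            if v not in graph[u]:
                graph[u].append(v)            # real systems re-prune here
    return graph
```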
3. ScaDANN Method

3.1. Grid partitioning

Figure 3 illustrates the structure of Graph ANN using a grid format, where each node in the graph consists of a base vector and a neighbor list. Grid partitioning divides large-scale datasets into grid blocks, with rows representing source vertices and columns representing destination vertices. This approach organizes datasets into manageable segments, enabling memory- and disk-efficient graph construction. For example, constructing a Vamana graph for a 1-billion-point dataset with 100 dimensions, requiring 400GB for vector storage, is infeasible under memory constraints of 256GB or 512GB. However, grid partitioning ensures scalability by allowing the dataset to be partitioned into grid blocks, making it feasible to construct the graph even when the dataset size exceeds the available memory.

Each grid block captures intra- and inter-partition relationships. A grid block where the source and destination partitions are the same forms a subgraph, which follows the Vamana graph structure. Diagonal blocks represent subgraphs within individual partitions, while off-diagonal blocks denote connections between partitions. For example, the red block (p2, p1) in Figure 3 corresponds to connections from partition p2 (green) to partition p1 (blue). This structure improves scalability and ensures conflict-free parallel processing during node insertion.

Figure 3: Graph ANN with grid format. The grid format represents the structure of Graph ANN, where each node consists of a base vector and a neighbor list.

3.2. Overlapping block-level insertion

The overlapping block-level insertion method incrementally constructs the graph by processing data in grid block units. This method employs overlapping blocks, where a block from the previous step is reused in the next step, allowing two blocks to be processed simultaneously. As detailed in Algorithm 1, the process begins with the construction of the graph for the first partition (B1 = 0). This step (line 3) generates the subgraph for the first partition, ensuring that the intra-partition edges are established within the initial grid block.

In subsequent steps, pairs of overlapping blocks are processed to extend the graph. For each step, the data corresponding to the current blocks is loaded into memory to construct the graph (lines 9–11). In particular, line 11 fills four grid blocks by constructing the subgraphs for two partitions and establishing connections between them. This ensures that intra- and inter-partition edges are created simultaneously, improving efficiency and maintaining the global structure of the graph. Once the graph is constructed, the earlier block is offloaded from memory (lines 12–15), and the next block is loaded (lines 5–10).

By reusing overlapping blocks during the insertion process, ScaDANN integrates graph construction and merging into a single operation, eliminating the need for a separate, computationally expensive merging step. This approach ensures that inter-block connections are captured as the graph is built, thereby avoiding the creation of disjoint subgraphs. Additionally, the overlapping mechanism significantly reduces I/O overhead by combining insertion and merging, which are traditionally distinct steps.

Figure 2: ScaDANN building process with three partitions and corresponding grid blocks at each step. The graph is constructed by loading two partitions into memory at a time, ensuring efficient graph generation. In the next step, the previously used partition is overlapped with a new partition to continue the building process.
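The grid-block bookkeeping can be sketched as follows, assuming contiguous equal-sized partitions (an assumption for illustration only; the paper sizes partitions dynamically based on available memory):

```python
def partition_of(node_id, n, num_parts):
    """Contiguous range partitioning: node ids [0, n) are split into
    num_parts equal-sized partitions (illustrative assumption)."""
    size = -(-n // num_parts)          # ceiling division
    return node_id // size

def grid_block(src, dst, n, num_parts):
    """A directed edge src -> dst lives in grid block (row, col) =
    (partition of the source vertex, partition of the destination vertex).
    Diagonal blocks hold intra-partition subgraphs; off-diagonal blocks
    hold inter-partition connections."""
    return partition_of(src, n, num_parts), partition_of(dst, n, num_parts)
```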
Algorithm 1: ScaDANN Construction
Input: block size P and dataset D
Output: Graph
 1: /* Initialize first block and interval */
    first block number B1 = 0, second block number B2 = 1,
    block interval itv = 1, neighbor lists NBR_i = ∅
 2: /* Read dataset for the first block */
    D1 = ReadDataset(B1)
 3: NBR1 = vamana(B1, NBR1)
 4: while itv < P do
 5:     B2 = (B1 + itv) mod P
 6:     if B1 = B2 then
 7:         itv = itv * 2        /* increase interval size */
 8:         continue
 9:     /* Read the second dataset and graph */
        D2 = ReadDataset(B2)
10:     NBR2 = ReadGraph(B2)
11:     NBR1, NBR2 = vamana((D1, D2), (NBR1, NBR2))
12:     WriteGraph(NBR1, B1)
13:     D1 = D2
14:     NBR1 = NBR2
15:     B1 = B2
16: WriteGraph(NBR1, B1)
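Algorithm 1's block schedule can be transcribed into a short sketch. The published pseudocode leaves the number of steps taken per interval implicit, so the sweep loop below is one plausible reading (ours): all P blocks are visited at interval 1, the interval doubles, and the loop stops once the interval reaches P.

```python
def build_schedule(P):
    """Enumerate (resident_block, loaded_block) pairs in ScaDANN's
    overlapping order: sweep all blocks at the current interval, then
    double the interval until it reaches the partition count P.
    (One reading of Algorithm 1; the per-interval step count is implicit.)"""
    schedule, b1, itv = [], 0, 1
    while itv < P:
        for _ in range(P):                 # one full sweep at this interval
            b2 = (b1 + itv) % P
            if b1 == b2:                   # interval wrapped onto itself
                break
            schedule.append((b1, b2))
            b1 = b2                        # overlap: keep b2 resident
        itv *= 2                           # widen the interval
    return schedule
```

For P = 3 the first sweep produces (0, 1), (1, 2), (2, 0), matching the three steps shown in Figure 2, and every consecutive pair overlaps in one block, so only one new block is read from disk per step.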
Figure 2 illustrates the construction process. Initially, blocks 0 and 1 are loaded into memory, and edges between the two blocks are created. In the second step, block 0 is offloaded, block 1 is retained, and block 2 is loaded. The graph is extended based on connections between blocks 1 and 2. Finally, in the third step, block 1 is offloaded, block 2 is retained, and block 0 is reloaded to extend the graph further by connecting blocks 2 and 0. This iterative process efficiently captures all relevant connections between blocks, enabling the construction of a unified graph while minimizing memory usage and I/O costs.

3.3. Grid block merge

The grid block merge process establishes connections between two blocks loaded into memory and ensures bidirectional connectivity through reverse edges. Reverse edges, where the source and destination vertices are swapped, maintain symmetric inter-block connections, ensuring consistent and efficient search performance.

In conventional methods, merging requires all data to be loaded into memory to calculate distances between nodes. ScaDANN overcomes this limitation by storing distance information directly within the neighbor list of each node. Because distances are pre-stored, block 1 retains the neighbor list formed in step 1. For example, after step 1, v5 in block 1 (green) has already stored v3 and v2 from block 0 (yellow) as its neighbors. This allows distance comparisons between existing and new neighbors without reloading offloaded data, reducing memory usage and improving scalability.

Figure 4 illustrates the merging process during step 2 in Figure 2, where blocks 1 and 2 are loaded into memory while block 0 remains stored on disk. When updating the neighbor list of v5, the algorithm uses pre-stored distance data from step 1 to compare block 0 neighbors with block 2 neighbors, avoiding redundant distance calculations. As a result, v5's neighbors are updated to v11, v3, and v2.

Figure 4: Grid block merge process with distance information. Each neighbor-list entry stores a (neighbor ID, distance) pair; with maximum degree 3, neighbors added from the loaded block are compared against stored entries to produce the updated neighbor list.

The merging process follows the Vamana graph construction method, incorporating a greedy search to identify candidate neighbors and a robust pruning step to refine the neighbor list. Inter-block edges are established by adding neighbors from the second block to the neighbor list of the first block. The reverse edges are then added, ensuring bidirectional connectivity, where nodes in the first block are also linked as neighbors of nodes in the second block. This comprehensive approach ensures a robust and efficient merging process.
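The distance-carrying merge step can be sketched as follows (an illustrative sketch with names of our choosing): the stored (neighbor ID, distance) pairs let the merge rank offloaded neighbors against new candidates without reloading their vectors. The usage in the test mirrors the v5 example from Figure 4, where the updated list becomes v11, v3, v2.

```python
import numpy as np

def merge_neighbors(stored, new_ids, v_vec, vectors_loaded, max_degree=3):
    """Merge a node's stored neighbor list with candidates from the newly
    loaded block. `stored` is a list of (neighbor_id, distance) pairs kept
    inside the index, so offloaded neighbors need no vector reload; fresh
    distances are computed only for resident candidates."""
    merged = dict(stored)
    for nid in new_ids:                       # distances only for loaded data
        merged[nid] = float(np.linalg.norm(vectors_loaded[nid] - v_vec))
    # keep the max_degree closest neighbors overall
    return sorted(merged.items(), key=lambda kv: kv[1])[:max_degree]
```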
3.4. Complexity

Table 1 compares the time and space complexities of Vamana, DiskANN, and ScaDANN in the index-building process, focusing on scalability and efficiency. The space complexity considers storage requirements, including neighbor lists.

Vamana constructs a single large-scale graph, resulting in a time complexity of O(S^1.16) and a space complexity of O(S ⋅ R). Since it does not partition the dataset, the entire graph must fit in memory, making it less scalable for large-scale data. Both DiskANN and ScaDANN partition the dataset to construct subgraphs. Partitioning reduces memory usage and enables disk-based processing. The time complexity is inversely proportional to the number of partitions, as increasing P allows for more localized graph construction, reducing the computational cost.

In DiskANN, each node is assigned to multiple overlapping partitions (typically K = 2), as shown in Figure 1. Since DiskANN performs clustering to determine partitions, the build time complexity includes the clustering time C. In contrast, ScaDANN utilizes dynamic partitioning, where the data structure size is determined based on the dataset size. This process does not require separate clustering and is treated as an internal computational step, which is why it is not included in the time complexity. Unlike DiskANN, which incurs overhead due to overlapping partitions, ScaDANN eliminates these redundancies, streamlining subgraph construction and merging. This results in a computationally efficient process that scales well with increasing dataset size.

Methods    Build Time                              Space
Vamana     O(S^1.16)                               O(S ⋅ R)
DiskANN    O(K^1.16 ⋅ P^-0.84 ⋅ S^1.16 + C)        O(S ⋅ R ⋅ K ⋅ P^-1)
ScaDANN    O(P^-0.84 ⋅ S^1.16)                     O(S ⋅ R ⋅ P^-1)

Table 1: Comparison of time and space complexities for Vamana, DiskANN, and ScaDANN, where S is the dataset size, R is the maximum degree of the graph, K is the number of overlapping partitions per node, P is the number of partitions, and C is the clustering time.
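For intuition, the Table 1 build-time expressions can be evaluated directly (constant factors are dropped, so only ratios between methods are meaningful; with the clustering term C set to 0, DiskANN's overhead over ScaDANN reduces to the factor K^1.16 ≈ 2.23 for K = 2):

```python
def build_cost(method, S, P=1, K=2, R=64, C=0.0):
    """Evaluate the asymptotic build-cost expressions from Table 1.
    Constant factors are dropped, so only ratios are meaningful."""
    if method == "vamana":
        return S ** 1.16
    if method == "diskann":
        return (K ** 1.16) * (P ** -0.84) * (S ** 1.16) + C
    if method == "scadann":
        return (P ** -0.84) * (S ** 1.16)
    raise ValueError(method)
```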
4. Experiments

4.1. Experimental setup

The experiments were conducted to evaluate the performance of ScaDANN under controlled conditions. The hardware environment included two Intel Xeon Gold 6326 16-core processors and 1.9 TB of main memory, while the software environment was based on Ubuntu 20.04.6 LTS. To simulate memory-constrained scenarios, the available memory was deliberately restricted to 256 GB during the experiments. Under these constraints, both Vamana and ParlayANN encountered Out of Memory (OOM) errors, making them unsuitable for comparison. As a result, our evaluation focused on comparing ScaDANN with DiskANN, a state-of-the-art disk-based ANN indexing method.

Table 2 provides a summary of the datasets used in the experiments, including their data types and dimensions. The datasets used were BIGANN, Microsoft SPACEV, Yandex DEEP, and Yandex Text-to-Image, each posing unique challenges for ANN indexing. BIGANN consists of 1 billion 128-dimensional SIFT descriptors from a large-scale image dataset, serving as a benchmark for high-dimensional similarity search. Microsoft SPACEV includes 1 billion 100-dimensional document and query vectors reflecting real-world web search scenarios. Yandex DEEP comprises 1 billion 96-dimensional image descriptors designed to evaluate deep learning-based image representations. Yandex Text-to-Image contains 1 billion multi-modal embeddings, where image embeddings serve as the database and text embeddings form the query set, representing a cross-modal retrieval task with distinct distributions for database and query vectors.

Dataset                 Data type   Dimensions
BigANN                  uint8       128
Microsoft SPACEV        int8        100
Yandex DEEP             float32     96
Yandex Text-to-Image    float32     200

Table 2: Summary of datasets used in the experiments.

4.2. Results and Analysis

4.2.1. Billion-scale index building in limited memory

Figure 5 presents the results of indexing at a billion scale under a 256GB memory constraint. The top graph in Figure 5 illustrates the index building time across different datasets. ScaDANN significantly reduced the index-building time compared to DiskANN across all datasets. For example, in the BIGANN dataset, ScaDANN completed the indexing task in 10.7 hours, compared to 16.5 hours for DiskANN, achieving approximately a 35% reduction in build time. This efficiency improvement can be attributed to ScaDANN's optimized data partitioning and overlapping block techniques, which minimize disk I/O and improve memory access patterns.

The bottom graph in Figure 5 shows the memory usage during the index construction process. Both DiskANN and ScaDANN employed dynamic partitioning, which adjusts the number of partitions based on available memory capacity and efficiently divides the data. Since the graph size and memory usage vary depending on the degree of the neighbor list, dynamic partitioning enables optimized partitioning by adapting to these variations. Notably, ScaDANN required a higher number of partitions than DiskANN. This is because ScaDANN stores additional distance information within the graph, which increases memory consumption per block. As a result, ScaDANN divides the data into more partitions than DiskANN to accommodate the additional memory usage. For instance, for the BIGANN dataset, ScaDANN used around 200GB of memory for indexing, while DiskANN remained under the 256GB limit but required fewer partitions to store its graph.

We conducted a detailed comparison of the index building process with DiskANN, as illustrated in Figure 6. In the subgraph construction phase, our method completed approximately twice as fast as DiskANN, with the BIGANN dataset showing a significant speedup. However, the subgraph merging phase took longer in our approach. This is due to the finer-grained distance comparisons performed during the merging process, which operates at the level of grid blocks. Unlike DiskANN, which uses a more straightforward merging process, ScaDANN's overlapping block approach requires additional computations during the merging phase to ensure accurate distance calculations between partitioned blocks.

In terms of I/O costs, our method demonstrated greater efficiency. This improvement can be attributed to the use of overlapping block techniques, which optimize data loading between memory and disk. Specifically, ScaDANN's approach minimizes unnecessary disk reads and efficiently manages memory, leading to a faster construction process despite the increased memory overhead. This is evident in the I/O time breakdown in Figure 6, where ScaDANN exhibits a more balanced distribution of time between subgraph building, partitioning, and merging, compared to DiskANN, which spends a significant portion of time in I/O operations.

Figure 5: Billion-scale index building experiment results, including building time and memory usage, conducted under a 256GB memory constraint, for BIGANN-1B, MSSPACEV-1B, and DEEP-1B. The memory usage values indicate the number of partitions dynamically determined during the indexing process.

Figure 6: Time taken for each step in the building process on BIGANN-1B, categorized into partitioning, subgraph building, merging, and I/O time for comparison.
Figure 7: Disk-based search performance on BIGANN-1B, MSSPACEV-1B, and DEEP-1B. The graph compares Recall@10 (x-axis) and QPS (y-axis), where higher placement indicates lower latency at a given recall level. A method positioned higher in the graph achieves faster query processing while maintaining similar accuracy.

4.2.2. Disk-based search performance

Figure 7 presents the comparison of disk-based search performance between ScaDANN and DiskANN. The experiments were carried out using the disk search method for evaluation. Our proposed ScaDANN method demonstrated better search performance than DiskANN for most datasets in terms of QPS, especially for the BIGANN-1B dataset. As shown in the graph, ScaDANN achieved a higher QPS while maintaining competitive recall rates. For example, ScaDANN processed queries at a faster rate than DiskANN for BIGANN-1B and MSSPACEV-1B, achieving a significant speedup while still maintaining similar or slightly better recall at top-10 results. However, for the DEEP-1B dataset, ScaDANN showed a slight performance drop when Recall@10 exceeded 95. This discrepancy can be attributed to the disk search method used, which is specifically optimized for DiskANN.

4.3. Ablation study

4.3.1. Impact of partition size on index construction

Figure 8 illustrates the effects of varying the number of partitions on the BIGANN 100M dataset. Increasing the number of partitions led to a significant increase in building time, with approximately a 70% increase per step. However, RAM usage decreased by around 40% per step, demonstrating the trade-off between memory efficiency and construction time. Despite the changes in partition numbers, there were no significant differences observed in QPS relative to Recall@10 across the tested configurations. This indicates that increasing the partition number primarily affects the construction phase without adversely impacting search performance.

Figure 8: Experimental results of BIGANN 100M index building with varying numbers of partitions (4, 8, and 16), showing build time, RAM usage, and QPS versus Recall@10.

4.3.2. In-memory graph construction experiments

The in-memory graph construction experiments on BIGANN-100M and Yandex Text-to-Image-100M compare ParlayANN, ScaDANN, and DiskANN in terms of indexing time and query performance. Since ParlayANN encounters out-of-memory (OOM) issues at the billion scale, the experiments were conducted on 100 million data points without partitioning for ScaDANN and DiskANN. ScaDANN achieves indexing speeds comparable to or faster than ParlayANN by storing distances within the graph, reducing redundant computations.
Both methods also benefit times larger than existing approaches within the same mem- from batch insertion, leading to faster construction than ory constraints, making it a robust solution for large-scale DiskANN. ScaDANN not only achieves a faster build time ANN indexing. This work establishes ScaDANN as an effi- but also maintains strong search performance, striking a cient and scalable approach, with significant potential for balance between efficient indexing and high query through- real-world applications requiring high-performance ANN put. indexing. References 117–128. [15] Y. Kalantidis, Y. Avrithis, Locally optimized product [1] W. Yu, C. Zhu, Z. Li, Z. Hu, Q. Wang, H. Ji, M. Jiang, A quantization for approximate nearest neighbor search, survey of knowledge-enhanced text generation, ACM in: Proceedings of the IEEE conference on computer Computing Surveys 54 (11s) (2022) 1–38. vision and pattern recognition, 2014, pp. 2321–2328. [2] H. Jin, Y. Zhang, D. Meng, J. Wang, J. Tan, A com- [16] A. Babenko, V. Lempitsky, The inverted multi-index, prehensive survey on process-oriented automatic text IEEE transactions on pattern analysis and machine summarization with exploration of llm-based methods, intelligence 37 (6) (2014) 1247–1260. arXiv preprint arXiv:2403.02901 (2024). [17] W. Dong, C. Moses, K. Li, Efficient k-nearest neighbor [3] Y. Zhuang, Y. Yu, K. Wang, H. Sun, C. Zhang, Toolqa: A graph construction for generic similarity measures, in: dataset for llm question answering with external tools, Proceedings of the 20th international conference on Advances in Neural Information Processing Systems World wide web, 2011, pp. 577–586. 36 (2023) 50117–50143. [18] C. Fu, C. Xiang, C. Wang, D. Cai, Fast approximate [4] M. Cascella, J. Montomoli, V. Bellini, E. Bignami, Eval- nearest neighbor search with the navigating spreading- uating the feasibility of chatgpt in healthcare: an anal- out graph, arXiv preprint arXiv:1707.00143 (2017). 
ysis of multiple clinical and research scenarios, Journal [19] Y. Malkov, A. Ponomarenko, A. Logvinov, V. Krylov, of medical systems 47 (1) (2023) 33. Approximate nearest neighbor algorithm based on [5] M. S. Orenstrakh, O. Karnalim, C. A. Suarez, M. Liut, navigable small world graphs, Information Systems 45 Detecting llm-generated text in computing education: (2014) 61–68. Comparative study for chatgpt cases, in: 2024 IEEE [20] Y. A. Malkov, D. A. Yashunin, Efficient and robust ap- 48th Annual Computers, Software, and Applications proximate nearest neighbor search using hierarchical Conference (COMPSAC), IEEE, 2024, pp. 121–126. navigable small world graphs, IEEE transactions on [6] K. Pandya, M. Holia, Automating customer service us- pattern analysis and machine intelligence 42 (4) (2018) ing langchain: Building custom open-source gpt chat- 824–836. bot for organizations, arXiv preprint arXiv:2310.05421 [21] J. Chen, H.-r. Fang, Y. Saad, Fast approximate knn (2023). graph construction for high dimensional data via re- [7] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, cursive lanczos bisection., Journal of Machine Learn- Q. Chen, W. Peng, X. Feng, B. Qin, et al., A survey on ing Research 10 (9) (2009). hallucination in large language models: Principles, tax- [22] M. Wang, X. Xu, Q. Yue, Y. Wang, A comprehensive onomy, challenges, and open questions, ACM Trans- survey and experimental comparison of graph-based actions on Information Systems (2023). approximate nearest neighbor search, arXiv preprint [8] Y. Yao, J. Duan, K. Xu, Y. Cai, Z. Sun, Y. Zhang, A survey arXiv:2101.12631 (2021). on large language model (llm) security and privacy: [23] K. Echihabi, K. Zoumpatianos, T. Palpanas, New trends The good, the bad, and the ugly, High-Confidence in high-d vector similarity search: al-driven, progres- Computing (2024) 100211. sive, and distributed, Proceedings of the VLDB Endow- [9] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. 
Karpukhin, ment 14 (12) (2021) 3198–3201. N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rock- [24] H. V. Simhadri, G. Williams, M. Aumüller, M. Douze, täschel, et al., Retrieval-augmented generation for A. Babenko, D. Baranchuk, Q. Chen, L. Hosseini, knowledge-intensive nlp tasks, Advances in Neural R. Krishnaswamny, G. Srinivasa, et al., Results of Information Processing Systems 33 (2020) 9459–9474. the neurips’21 challenge on billion-scale approximate [10] J. J. Pan, J. Wang, G. Li, Survey of vector database nearest neighbor search, in: NeurIPS 2021 Compe- management systems, The VLDB Journal 33 (5) (2024) titions and Demonstrations Track, PMLR, 2022, pp. 1591–1615. 177–189. [11] Q. Chen, B. Zhao, H. Wang, M. Li, C. Liu, Z. Li, [25] S. Jayaram Subramanya, F. Devvrit, H. V. Simhadri, M. Yang, J. Wang, Spann: Highly-efficient billion-scale R. Krishnawamy, R. Kadekodi, Diskann: Fast accurate approximate nearest neighborhood search, Advances billion-point nearest neighbor search on a single node, in Neural Information Processing Systems 34 (2021) Advances in Neural Information Processing Systems 5199–5212. 32 (2019). [12] J. V. Munoz, M. A. Gonçalves, Z. Dias, R. d. S. Torres, [26] S. Arya, D. M. Mount, Approximate nearest neigh- Hierarchical clustering-based graphs for large scale bor queries in fixed dimensions., in: SODA, Vol. 93, approximate nearest neighbor search, Pattern Recog- Citeseer, 1993, pp. 271–280. nition 96 (2019) 106970. [27] M. D. Manohar, Z. Shen, G. Blelloch, L. Dhulipala, [13] S. Yu, J. Engels, Y. Huang, J. Shun, Pecann: Parallel ef- Y. Gu, H. V. Simhadri, Y. Sun, Parlayann: Scalable ficient clustering with graph-based approximate near- and deterministic parallel graph-based approximate est neighbor search, arXiv preprint arXiv:2312.03940 nearest neighbor search algorithms, in: Proceedings (2023). of the 29th ACM SIGPLAN Annual Symposium on [14] H. Jegou, M. Douze, C. 
Schmid, Product quantization Principles and Practice of Parallel Programming, 2024, for nearest neighbor search, IEEE transactions on pat- pp. 270–285. tern analysis and machine intelligence 33 (1) (2010)