<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>ScaDANN: A Scalable Disk-Based Graph Indexing Method for ANN</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hyein Gu</string-name>
          <email>hyein99@kaist.ac.k</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Min-soo Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Korea Advanced Institute of Science and Technology</institution>
          ,
          <country>Republic of Korea</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <fpage>25</fpage>
      <lpage>28</lpage>
      <abstract>
<p>Approximate Nearest Neighbor (ANN) indexing constitutes a fundamental component of modern vector-based databases, facilitating efficient and accurate information retrieval for applications such as retrieval-augmented generation. However, scaling graph-based ANN indices to billion-scale datasets poses substantial challenges, including high memory demands and inefficiencies in handling partitioned graphs. To address these issues, we propose ScaDANN, a scalable disk-based graph indexing method tailored to large-scale datasets under limited memory conditions. ScaDANN introduces two novel techniques, overlapping block-level insertion and grid block merge, which enable the efficient construction of unified graph indices while preserving high search performance. Our approach achieves notable advancements in index construction speed, search accuracy, and memory efficiency, establishing ScaDANN as a robust and effective solution for scalable ANN indexing in resource-limited environments.</p>
      </abstract>
      <kwd-group>
        <kwd>Graph</kwd>
        <kwd>ANN</kwd>
        <kwd>Nearest neighbor search</kwd>
        <kwd>Vector search</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Large Language Models (LLMs) have revolutionized natural language processing by excelling in tasks such as text generation [1], summarization [2], and question answering [3]. Their ability to leverage pre-trained knowledge has facilitated applications across diverse domains, including healthcare [4], education [5], and customer service [6]. However, despite their strengths, LLMs face critical limitations, such as hallucination [7], where they generate plausible but incorrect outputs, outdated knowledge due to static training data, and privacy concerns when processing sensitive information [8].</p>
      <p>Retrieval-Augmented Generation (RAG) [9] has been introduced as a promising framework to mitigate these limitations by integrating LLMs with external knowledge retrieval. By leveraging vector databases (VectorDBs) [10], RAG retrieves relevant and up-to-date information to enhance the accuracy and reliability of generated responses. This framework not only reduces the impact of hallucination but also addresses privacy concerns by accessing only the information necessary for a given query.</p>
      <p>Approximate Nearest Neighbor (ANN) indexing serves as a foundational component in RAG systems, enabling efficient information retrieval from VectorDBs. ANN indexes can be broadly categorized into clustering-based [11, 12, 13], hash-based [14, 15, 16], and graph-based methods [17, 18, 19, 20, 21]. Clustering-based methods partition datasets into smaller clusters, facilitating memory-efficient index construction but often incurring longer construction times and reduced search accuracy. Hash-based methods employ hash functions to compress datasets and reduce dimensionality, which accelerates construction and search but at the cost of reduced accuracy due to compression. Graph-based methods, in contrast, represent vectors as nodes and their relationships as edges, connecting nodes based on proximity [22]. These methods construct neighbor lists through vector operations between nodes (as illustrated in Figure 1) and are particularly effective for large-scale datasets, offering superior search accuracy and performance despite higher memory demands during construction.</p>
      <p>The need for scalable ANN indexing becomes increasingly apparent as data is segmented into smaller chunks to meet the input constraints of LLMs. For instance, dividing a 100 MB document into 4 KB chunks suitable for LLM input results in approximately 25,600 chunks, a 25,600-fold increase relative to the original document count. This rapid growth in the number of chunks significantly increases the size of the ANN index, necessitating methods capable of efficiently handling billion-scale datasets.</p>
      <p>Building large-scale ANN indexes, however, poses significant challenges [23, 24]. First, the substantial size of datasets and indexes requires considerable memory resources. Second, achieving competitive search performance on disk-based systems, which are necessary for handling such datasets, is particularly challenging. To address these issues, we propose ScaDANN, a Scalable Disk-based Graph Indexing Method for ANN, specifically designed to overcome these limitations.</p>
      <p>This study pursues two primary objectives: (1) to enable the construction of large-scale ANN indexes in memory-constrained environments and (2) to enhance search performance through the use of merged graph structures that improve efficiency and accuracy during the search process. By focusing on these goals, we provide a scalable and efficient solution for constructing and querying ANN indexes in scenarios where both large-scale datasets and memory limitations are critical challenges.</p>
      <p>This work makes the following key contributions:
• We propose a graph-based ANN method that leverages grid blocks to efficiently construct and scale billion-scale indexes in memory-constrained environments.
• Our approach outperforms existing disk-based methods, achieving index construction speeds 1.3 to 1.5 times faster and search speeds 1.2 to 1.4 times faster, while maintaining comparable accuracy.
• We demonstrate that our method enables the construction of single indexes up to 10 times larger than those of existing approaches within the same memory constraints.
These contributions mark a significant advancement in ANN indexing, particularly in terms of efficiency and scalability for large-scale, memory-constrained applications.</p>
      <p>Published in the Proceedings of the Workshops of the EDBT/ICDT 2025. ∗Corresponding author. ORCID: 0009-0009-8582-5805 (H. Gu); 0000-0002-5065-0226 (M. Kim). Copyright © 2025 for this paper by its authors. Use permitted under Creative Commons License. CEUR Workshop Proceedings (ceur-ws.org).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminaries</title>
      <sec id="sec-2-1">
        <title>2.1. ANN search</title>
        <p>Nearest Neighbor (NN) search identifies the data point(s) in a dataset closest to a given query point based on a defined distance metric, such as Euclidean distance or cosine similarity. While exact NN search computes the distance between the query and every data point to determine the nearest neighbor, this approach becomes computationally infeasible for large datasets. ANN search addresses this challenge by finding points that are approximately close to the true nearest neighbors, significantly reducing computational costs. Although ANN sacrifices some accuracy, it delivers faster query response times, which is particularly advantageous in high-dimensional and large-scale datasets.</p>
        <p>Definition 1 (ANN). Given a dataset X = {x1, x2, …, xn} and a query point q, the task of nearest neighbor search (whether exact or approximate) is to find the point p ∈ X such that the distance d(q, p) is minimized, where d(·, ·) is a distance metric. In the case of ANN search, the algorithm guarantees that the returned point p∗ satisfies d(q, p∗) ≤ (1 + ε) · d(q, p), where ε is a small approximation factor and p is the exact nearest neighbor of the query point q.</p>
        <p>To evaluate ANN algorithms, two primary metrics are used: Recall and Queries Per Second (QPS). Recall quantifies the effectiveness of an algorithm in retrieving true nearest neighbors, defined as the ratio of true nearest neighbors retrieved to the total number of true nearest neighbors. Higher recall indicates greater accuracy but may increase search time. QPS measures the system's throughput, i.e., the number of queries processed per second. Higher QPS reflects greater efficiency, which is critical for large-scale and real-time applications.</p>
      </sec>
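The two metrics above can be made concrete with a short sketch (illustrative Python, not part of the paper's implementation): brute-force search ranks every point by distance to provide the ground truth, and Recall@k is the fraction of true nearest neighbors that an approximate result actually contains.

```python
import numpy as np

def exact_knn(data, query, k):
    # Brute-force k-NN: compute the distance from the query to every
    # point and keep the k closest. Infeasible at billion scale, but it
    # provides the ground truth against which recall is measured.
    dists = np.linalg.norm(data - query, axis=1)
    return np.argsort(dists)[:k].tolist()

def recall_at_k(retrieved, ground_truth):
    # Fraction of the true nearest neighbors that were retrieved.
    hits = set(retrieved).intersection(ground_truth)
    return len(hits) / len(ground_truth)

data = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
query = np.array([0.1, 0.0])
truth = exact_knn(data, query, k=2)   # indices of the 2 closest points
approx = [truth[0], 3]                # a hypothetical ANN result
print(recall_at_k(approx, truth))     # prints 0.5: one of two true neighbors found
```

QPS is simply the number of such queries answered per second, so a system sweep traces a QPS-versus-recall curve as in the paper's figures.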
      <sec id="sec-2-2">
        <title>2.2. Existing graph ANN indexing methods</title>
        <sec>
          <title>2.2.1. Vamana graph</title>
          <p>Vamana [25] is a widely used graph-based ANN indexing algorithm and serves as the foundational graph structure in many recent ANN studies. Its construction is based on two key techniques: Greedy Search and Robust Pruning.</p>
          <p>Greedy Search enables efficient exploration of nearest neighbors within the graph. Beginning from an initial node, the algorithm iteratively identifies and prioritizes nodes closest to the query. It maintains a priority queue of visited nodes, extracting the nearest unvisited node at each step, adding its neighbors to the queue, and repeating this process until all relevant nodes are explored.</p>
          <p>Robust Pruning optimizes the graph by removing unnecessary edges discovered during Greedy Search. This technique utilizes the Sparse Neighborhood Graph (SNG) property [26] and applies a distance threshold parameter, alpha (α > 1), to determine whether an edge should be retained. Specifically, an edge between nodes p and p′ is removed if α · d(p∗, p′) ≤ d(p, p′), i.e., if the distance between these nodes exceeds α times the distance between p′ and a third node p∗. By retaining longer edges under this relaxed condition, Vamana enhances search efficiency while maintaining graph connectivity.</p>
        </sec>
        <sec>
          <title>2.2.2. DiskANN</title>
          <p>DiskANN [25] extends the Vamana graph to a disk-based environment, making it well-suited for large-scale datasets. It employs a divide-and-conquer approach, partitioning the dataset into subgraphs and merging them into a unified structure.</p>
          <p>Figure 1 illustrates the partitioning and merging process of DiskANN. (1) The dataset is first divided into multiple overlapping partitions using a clustering technique, with each node assigned to at least two partitions. (2) Independent subgraphs are constructed within each partition. (3) These subgraphs are then merged by randomly selecting nodes from the partitioned graphs, followed by merging and pruning operations. This process updates the neighbor lists with merged results, ensuring connectivity between partitions while reducing memory and computational costs. DiskANN's partitioning and merging strategy enables scalable graph construction but introduces challenges in maintaining search performance across partitions.</p>
          <p>Figure 1: DiskANN partitioning and merging process.</p>
        </sec>
        <sec>
          <title>2.2.3. Batch insertion for graph construction</title>
          <p>Efficient parallelization is essential for graph construction in large-scale datasets. Batch insertion, as introduced in ParlayANN [27], addresses the challenge of concurrently adding nodes to a graph while ensuring conflict-free operations. Conflicts are avoided by grouping nodes into independent batches and processing each batch in isolation, ensuring that nodes within a batch do not interfere with each other during insertion.</p>
          <p>The batch insertion process occurs in two steps. First, nodes in a batch are independently connected to their nearest neighbors within the existing graph. These connections are established in parallel, as nodes within a batch do not share dependencies, preventing conflicts during edge creation. This step ensures that all newly added nodes are integrated into the graph without disrupting its structure. Second, reversed edges are added to the graph. A reversed edge is defined as an edge whose source and destination vertices are swapped, ensuring bidirectional connectivity between nodes. For example, if an edge connects a source node u to a destination node v, the reversed edge connects v back to u. During this step, the neighbors of each newly added node are updated with reversed edges, completing the bidirectional connections required for robust graph-based indexing.</p>
          <p>By separating the insertion of neighbors and the addition of reversed edges into distinct steps, batch insertion guarantees structural consistency while leveraging parallel processing. This approach is particularly effective in large-scale datasets, as it minimizes computational overhead while maintaining the integrity and quality of the graph. Moreover, the absence of interdependencies between batches ensures that the graph remains stable throughout the construction process, enabling efficient and scalable graph-based ANN indexing.</p>
        </sec>
      </sec>
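The two-phase batch insertion described above can be sketched as follows (a minimal Python illustration, not ParlayANN's actual code; `neighbor_fn` is a hypothetical stand-in for the greedy-search-plus-pruning step that selects each new node's neighbors):

```python
def batch_insert(graph, batch, neighbor_fn):
    # Phase 1: forward edges. Each new node is connected to its candidate
    # neighbors in the existing graph. Nodes in a batch share no
    # dependencies, so this loop is conceptually parallel.
    for node in batch:
        graph[node] = list(neighbor_fn(node))
    # Phase 2: reversed edges. If node u points to v, add an edge from v
    # back to u so that connectivity is bidirectional.
    for node in batch:
        for nbr in graph[node]:
            back = graph.setdefault(nbr, [])
            if node not in back:
                back.append(node)
    return graph

g = {0: [1], 1: [0]}
batch_insert(g, [2], lambda n: [0])  # insert node 2 with chosen neighbor 0
print(g)  # node 0 now also points back to node 2
```

Separating the two phases is what keeps phase 1 conflict-free: no node's neighbor list is mutated by another node in the same batch until all forward edges exist.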
      <sec id="sec-2-3">
        <title>3. ScaDANN Method</title>
        <sec id="sec-2-3-1">
          <title>3.1. Grid partitioning</title>
          <p>Figure 3 illustrates the structure of a graph ANN index in a grid format, where each node in the graph consists of a base vector and a neighbor list. Grid partitioning divides large-scale datasets into grid blocks, with rows representing source vertices and columns representing destination vertices. This approach organizes datasets into manageable segments, enabling memory- and disk-efficient graph construction. For example, constructing a Vamana graph for a 1-billion-point dataset with 100 dimensions, which requires 400 GB for vector storage alone, is infeasible under memory constraints of 256 GB or 512 GB. Grid partitioning, however, ensures scalability by dividing the dataset into grid blocks, making it feasible to construct the graph even when the dataset size exceeds the available memory.</p>
          <p>Each grid block captures intra- or inter-partition relationships. A grid block whose source and destination partitions are the same forms a subgraph, which follows the Vamana graph structure. These diagonal blocks represent subgraphs within individual partitions, while off-diagonal blocks denote connections between partitions. For example, the red block (2, 1) in Figure 3 corresponds to connections from partition 2 (green) to partition 1 (blue). This structure improves scalability and ensures conflict-free parallel processing during node insertion.</p>
        </sec>
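The grid layout can be illustrated with a small sketch (Python; a simple equal-sized contiguous partitioning is assumed for clarity, whereas the paper sizes partitions dynamically from the memory budget). Each edge (source, destination) falls into exactly one grid block; row equals column for intra-partition blocks, and differing row and column mark inter-partition edges.

```python
def partition_of(node_id, num_nodes, num_partitions):
    # Assumption for illustration: nodes are split into equal-sized,
    # contiguous partitions.
    size = -(-num_nodes // num_partitions)  # ceiling division
    return node_id // size

def grid_block(src, dst, num_nodes, num_partitions):
    # Row is the partition of the source vertex, column the partition of
    # the destination vertex. Equal row and column means a diagonal
    # (intra-partition) block following the Vamana structure.
    row = partition_of(src, num_nodes, num_partitions)
    col = partition_of(dst, num_nodes, num_partitions)
    return row, col

# 8 nodes, 2 partitions: nodes 0-3 in partition 0, nodes 4-7 in partition 1.
print(grid_block(5, 2, num_nodes=8, num_partitions=2))  # off-diagonal: (1, 0)
print(grid_block(1, 3, num_nodes=8, num_partitions=2))  # diagonal: (0, 0)
```

Because every edge maps to one block, inserts touching different blocks can proceed in parallel without conflicts, which is the property the section relies on.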
        <sec id="sec-2-3-2">
          <title>3.2. Overlapping block-level insertion</title>
          <p>The overlapping block-level insertion method incrementally constructs the graph by processing data in grid block units. This method employs overlapping blocks, where a block from the previous step is reused in the next step, allowing two blocks to be processed simultaneously. As detailed in Algorithm 1, the process begins with the construction of the graph for the first partition (line 3), which generates the subgraph for the first partition, ensuring that the intra-partition edges are established within the initial grid block.</p>
          <p>In subsequent steps, pairs of overlapping blocks are processed to extend the graph. For each step, the data corresponding to the current blocks is loaded into memory to construct the graph (lines 9–11). In particular, line 11 involves filling four grid blocks by constructing the subgraphs for two partitions and establishing connections between them. This creates intra- and inter-partition edges in a single operation, eliminating the need for a separate, computationally expensive merge step over disjoint subgraphs. Additionally, the overlapping mechanism significantly reduces I/O overhead by combining insertion and merging, which are traditionally distinct steps.</p>
          <p>Figure 2 illustrates the construction process. Initially, blocks 0 and 1 are loaded into memory, and edges between the two blocks are created. In the second step, block 0 is offloaded, block 1 is retained, and block 2 is loaded; the graph is extended based on connections between blocks 1 and 2. Finally, in the third step, block 1 is offloaded, block 2 is retained, and block 0 is reloaded to extend the graph further by connecting blocks 2 and 0. This iterative process efficiently captures all relevant connections between blocks, enabling the construction of a unified graph while minimizing memory usage and I/O costs.</p>
          <p>Figure 2: Overlapping block-level insertion process, where each node stores a base vector and a neighbor list.</p>
        </sec>
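The three-step walkthrough above generalizes to a simple schedule: step i processes blocks (i, i+1 mod B), so each step reuses one resident block and reads only one new block from disk. A small sketch (illustrative Python, not the paper's Algorithm 1):

```python
def overlapping_schedule(num_blocks):
    # Each step pairs the block kept from the previous step with one
    # newly loaded block, wrapping around at the end.
    return [(i, (i + 1) % num_blocks) for i in range(num_blocks)]

def simulate_io(num_blocks):
    # Count disk loads: a block is read only when it is not already
    # resident from the previous step.
    resident, loads = set(), 0
    for a, b in overlapping_schedule(num_blocks):
        for blk in (a, b):
            if blk not in resident:
                loads += 1
        resident = {a, b}  # offload everything except the current pair
    return loads

print(overlapping_schedule(3))  # [(0, 1), (1, 2), (2, 0)]
print(simulate_io(3))           # 4 loads, versus 6 if every pair were read fresh
```

The simulation makes the I/O claim concrete: with B blocks, overlap needs roughly B+1 block reads instead of 2B for independently loaded pairs.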
        <sec id="sec-2-3-3">
          <title>3.3. Grid block merge</title>
          <p>The grid block merge process establishes connections between the two blocks loaded into memory and ensures bidirectional connectivity through reverse edges. Reverse edges, where the source and destination vertices are swapped, maintain symmetric inter-block connections, ensuring consistent and efficient search performance.</p>
          <p>In conventional methods, merging requires all data to be loaded into memory to calculate distances between nodes. ScaDANN overcomes this limitation by storing distance information directly within the neighbor list of each node, so that a block retains the neighbor list, with distances, formed in earlier steps. For example, after step 1, node 5 in block 1 (green) has already stored nodes 3 and 2 from block 0 (yellow) as its neighbors. This allows distance comparisons between existing and new neighbors without reloading offloaded data, reducing memory usage and improving scalability.</p>
          <p>Figure 4 illustrates the merging process during step 2 in Figure 2, where blocks 1 and 2 are loaded into memory while block 0 remains stored on disk. When updating the neighbor list of node 5, the algorithm uses the pre-stored distance data from step 1 to compare block 0 neighbors with block 2 neighbors, avoiding redundant distance calculations. As a result, node 5's neighbors are updated to 11, 3, and 2.</p>
          <p>The merging process follows the Vamana graph construction method, incorporating a greedy search to identify candidate neighbors and a robust pruning step to refine the neighbor list. Inter-block edges are established by adding neighbors from the second block to the neighbor list of the first block. Reverse edges are then added, ensuring bidirectional connectivity, where nodes in the first block are also linked as neighbors of nodes in the second block. This comprehensive approach ensures a robust and efficient merging process.</p>
        </sec>
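Because each neighbor entry carries its distance, merging two candidate lists needs no access to offloaded vectors. A minimal sketch of this idea (illustrative Python; entries are (distance, id) pairs, and the node ids mirror the node-5 example above, with invented distance values):

```python
import heapq

def merge_neighbor_lists(current, incoming, max_degree):
    # Keep the best distance seen for each neighbor id, then retain the
    # max_degree closest. Distances are pre-stored alongside the ids, so
    # no vector from an offloaded block has to be reloaded to compare.
    best = {}
    for dist, nid in current + incoming:
        if best.get(nid, float("inf")) > dist:
            best[nid] = dist
    pairs = [(dist, nid) for nid, dist in best.items()]
    return heapq.nsmallest(max_degree, pairs)

# Node 5 already stores neighbors 3 and 2 (from offloaded block 0);
# block 2 contributes candidates 11 and 7.
current = [(0.9, 3), (1.2, 2)]
incoming = [(0.4, 11), (1.5, 7)]
print(merge_neighbor_lists(current, incoming, max_degree=3))
# neighbor ids 11, 3, and 2 survive, matching the example in the text
```

The paper's actual merge additionally runs greedy search and robust pruning over these candidates; the sketch shows only the stored-distance comparison that avoids reloading block 0.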
        <sec id="sec-2-3-5">
          <title>3.4. Complexity</title>
          <p>Table 1 compares the time and space complexities of Vamana, DiskANN, and ScaDANN in the index-building process, focusing on scalability and efficiency. The space complexity considers storage requirements, including neighbor lists. Vamana constructs a single large-scale graph, resulting in a time complexity of O(n^1.16) and a space complexity of O(n · R). Since it does not partition the dataset, the entire graph must fit in memory, making it less scalable for large-scale data. Both DiskANN and ScaDANN partition the dataset to construct subgraphs. Partitioning reduces memory usage and enables disk-based processing. The time complexity is inversely proportional to the number of partitions P, as increasing P allows for more localized graph construction, reducing the computational cost.</p>
          <p>In DiskANN, each node is assigned to multiple overlapping partitions (typically l = 2), as shown in Figure 1. Since DiskANN performs clustering to determine partitions, its build time complexity includes the clustering time C. In contrast, ScaDANN utilizes dynamic partitioning, where the data structure size is determined from the dataset size. This process does not require separate clustering and is treated as an internal computational step, which is why it does not appear in the time complexity. Unlike DiskANN, which incurs overhead due to overlapping partitions, ScaDANN eliminates these redundancies, streamlining subgraph construction and merging. This results in a computationally efficient process that scales well with increasing dataset size.</p>
          <p>Table 1: Comparison of time and space complexities for Vamana, DiskANN, and ScaDANN, where n is the dataset size, R is the maximum degree of the graph, l is the number of overlapping partitions per node, P is the number of partitions, and C is the clustering time.
Methods | Build Time | Space
Vamana | O(n^1.16) | O(n · R)
DiskANN | O(n^1.16 · P^−0.84 · l^1.16 + C) | O(n · R · l · P^−1)
ScaDANN | O(P^−0.84 · n^1.16) | O(n · R · P^−1)</p>
        </sec>
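The build-time formulas in Table 1 can be compared numerically (a sketch; constants are ignored and the clustering term C is set to zero, so this reflects only the asymptotic terms):

```python
def build_cost_diskann(n, P, l=2, C=0.0):
    # Table 1: O(n^1.16 * P^-0.84 * l^1.16 + C)
    return (n ** 1.16) * (P ** -0.84) * (l ** 1.16) + C

def build_cost_scadann(n, P):
    # Table 1: O(P^-0.84 * n^1.16)
    return (P ** -0.84) * (n ** 1.16)

ratio = build_cost_diskann(1e9, P=8) / build_cost_scadann(1e9, P=8)
print(round(ratio, 2))  # 2.23: the l^1.16 overlap factor, roughly a 2x gap
```

With C dropped, the ratio reduces to l^1.16, which is consistent with the roughly twofold subgraph-construction speedup reported in the experiments.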
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <sec id="sec-3-1">
        <title>4.1. Experimental setup</title>
        <p>The experiments were conducted to evaluate the performance of ScaDANN under controlled conditions. The hardware environment included two Gold 6326 16-core processors and 1.9 TB of main memory, while the software environment was based on Ubuntu 20.04.6 LTS. To simulate memory-constrained scenarios, the available memory was deliberately restricted to 256 GB during the experiments. Under these constraints, both Vamana and ParlayANN encountered Out of Memory (OOM) errors, making them unsuitable for comparison. As a result, our evaluation focused on comparing ScaDANN with DiskANN, a state-of-the-art disk-based ANN indexing method.</p>
        <p>Table 2 provides a summary of the datasets used in the experiments, including their data types and dimensions. The datasets were BIGANN, Microsoft SPACEV, Yandex DEEP, and Yandex Text-to-Image, each posing unique challenges for ANN indexing. BIGANN consists of 1 billion 128-dimensional SIFT descriptors from a large-scale image dataset, serving as a benchmark for high-dimensional similarity search. Microsoft SPACEV includes 1 billion 100-dimensional document and query vectors reflecting real-world web search scenarios. Yandex DEEP comprises 1 billion 96-dimensional image descriptors designed to evaluate deep learning-based image representations. Yandex Text-to-Image contains 1 billion multi-modal embeddings, where image embeddings serve as the database and text embeddings form the query set, representing a cross-modal retrieval task with distinct distributions for database and query vectors.</p>
      </sec>
      <sec id="sec-3-1-1">
        <title>4.2. Results and Analysis</title>
        <sec>
          <title>4.2.1. Billion-scale index building in limited memory</title>
          <p>Figure 5 presents the results of indexing at billion scale under a 256 GB memory constraint. The top plot in Figure 5 illustrates the index building time across different datasets. ScaDANN significantly reduced the index-building time compared to DiskANN across all datasets. For example, on the BIGANN dataset, ScaDANN completed the indexing task in 10.7 hours, compared to 16.5 hours for DiskANN, achieving approximately a 35% reduction in build time. This efficiency improvement can be attributed to ScaDANN's optimized data partitioning and overlapping block techniques, which minimize disk I/O and improve memory access patterns.</p>
          <p>The bottom plot in Figure 5 shows the memory usage during indexing. Both DiskANN and ScaDANN employed dynamic partitioning, which adjusts the number of partitions based on available memory capacity and efficiently divides the data. Since the graph size and memory usage vary depending on the degree of the neighbor list, dynamic partitioning enables optimized partitioning by adapting to these variations. Notably, ScaDANN required a higher number of partitions than DiskANN. This is because ScaDANN stores additional distance information within the graph, which increases memory consumption per block. As a result, ScaDANN divides the data into more partitions than DiskANN to accommodate the additional memory usage. For instance, on the BIGANN dataset, ScaDANN used around 200 GB of memory for indexing, while DiskANN remained under the 256 GB limit but required fewer partitions to store its graph.</p>
          <p>We conducted a detailed comparison of the index building process with DiskANN, as illustrated in Figure 6. In the subgraph construction phase, our method completed approximately twice as fast as DiskANN, with the BIGANN dataset showing a significant speedup. However, the subgraph merging phase took longer in our approach. This is due to the finer-grained distance comparisons performed during the merging process, which operates at the level of grid blocks. Unlike DiskANN, which uses a more straightforward merging process, ScaDANN's overlapping block approach requires additional computations during the merging phase to ensure accurate distance calculations between partitioned blocks.</p>
          <p>In terms of I/O costs, our method demonstrated greater efficiency. This improvement can be attributed to the use of overlapping block techniques, which optimize data loading between memory and disk. Specifically, ScaDANN's approach minimizes unnecessary disk reads and efficiently manages memory, leading to a faster construction process despite the increased memory overhead. This is evident in the time breakdown in Figure 6, where ScaDANN exhibits a more balanced distribution of time between partitioning, subgraph building, and merging, compared to DiskANN, which spends a significant portion of its time in I/O operations.</p>
          <p>Figure 5: Billion-scale index building experiment results, including building time and memory usage, conducted under a 256 GB memory constraint. The memory usage values indicate the number of partitions dynamically determined during the indexing process.</p>
          <p>Figure 6: Time taken for each step in the building process, categorized into partitioning, subgraph building, merging, and I/O time for comparison.</p>
        </sec>
      </sec>
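Dynamic partitioning as described, with more partitions needed when per-node state (vectors, neighbor ids, and, for ScaDANN, stored distances) is larger, can be approximated with a back-of-the-envelope sketch (illustrative Python; the 4-byte costs and the sizing rule are assumptions for illustration, not the paper's exact model):

```python
def estimate_partitions(num_vectors, dim, max_degree, mem_budget_bytes,
                        store_distances):
    # Per-node footprint: the vector itself (4-byte floats), neighbor ids
    # (4-byte ints), and, when distances are pre-stored as in ScaDANN,
    # one extra 4-byte float per neighbor.
    per_node = dim * 4 + max_degree * 4
    if store_distances:
        per_node += max_degree * 4
    total = num_vectors * per_node
    return max(1, -(-total // mem_budget_bytes))  # ceiling division

GB = 1024 ** 3
args = dict(num_vectors=10 ** 9, dim=128, max_degree=64,
            mem_budget_bytes=256 * GB)
print(estimate_partitions(store_distances=False, **args))  # DiskANN-style layout
print(estimate_partitions(store_distances=True, **args))   # stored distances need more partitions
```

Under these assumed byte costs the distance-storing layout needs one more partition for the same 256 GB budget, mirroring the observation that ScaDANN requires more partitions than DiskANN.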
      <sec id="sec-3-2">
        <title>4.2.2. Disk-based search performance</title>
        <p>Figure 7 presents the comparison of disk-based search performance between ScaDANN and DiskANN. The experiments were carried out using the disk search method for evaluation. ScaDANN demonstrated better search performance than DiskANN for most datasets in terms of QPS, especially for the BIGANN-1B dataset. As shown in the graph, ScaDANN achieved a higher QPS while maintaining competitive recall rates. For example, ScaDANN processed queries at a faster rate than DiskANN for BIGANN-1B and MSSPACEV-1B, achieving a significant speedup while still maintaining similar or slightly better recall at top-10 results. However, for the DEEP-1B dataset, ScaDANN showed a slight performance drop when Recall@10 exceeded 95. This discrepancy can be attributed to the disk search method used, which is specifically optimized for DiskANN.</p>
      </sec>
      <sec id="sec-3-2-1">
        <title>4.3. Ablation study</title>
        <sec>
          <title>4.3.1. Impact of partition size on index construction</title>
          <p>Figure 8 illustrates the effects of varying the number of partitions on the BIGANN 100M dataset. Increasing the number of partitions led to a significant increase in building time, with approximately a 70% increase per step. However, RAM usage decreased by around 40% per step, demonstrating the trade-off between memory efficiency and construction time. Despite the changes in partition numbers, there were no significant differences observed in QPS relative to Recall@10 across the tested configurations. This indicates that increasing the partition number primarily affects the construction phase without adversely impacting search performance.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>4.3.2. In-memory graph construction experiments</title>
        <p>The in-memory graph construction experiments on BIGANN-100M and Yandex Text-to-Image-100M compare ParlayANN, ScaDANN, and DiskANN in terms of indexing time and query performance. Since ParlayANN encounters out-of-memory (OOM) issues at the billion scale, the experiments were conducted on 100 million data points, without partitioning for ScaDANN and DiskANN. ScaDANN achieves indexing speeds comparable to or faster than ParlayANN by storing distances within the graph, reducing redundant computations. Both methods also benefit from batch insertion, leading to faster construction than DiskANN. ScaDANN not only achieves a faster build time but also maintains strong search performance, striking a balance between efficient indexing and high query throughput.</p>
        <p>Figure 9: In-memory experimental results on BIGANN 100M and Yandex Text-to-Image 100M. The numbers in parentheses next to each method indicate the index building time in minutes.</p>
      </sec>
      <sec>
        <title>5. Conclusion</title>
        <p>This study proposed ScaDANN, a scalable disk-based graph indexing method designed for billion-scale ANN datasets in memory-constrained environments. By introducing overlapping block-level insertion and grid block merge, ScaDANN effectively mitigates the high memory demands and inefficiencies of existing methods. These techniques enable seamless integration of blocks and efficient merging of partitioned graphs while minimizing I/O overhead and leveraging stored distance information.</p>
        <p>Experimental results demonstrated that ScaDANN achieves up to 1.5× faster index construction and 1.4× higher query throughput than DiskANN, while maintaining competitive accuracy across diverse datasets. Furthermore, ScaDANN supports the construction of single indexes 10 times larger than those of existing approaches within the same memory constraints, making it a robust solution for large-scale ANN indexing. This work establishes ScaDANN as an efficient and scalable approach, with significant potential for real-world applications requiring high-performance ANN indexing.</p>
      </sec>
      <sec id="sec-3-4">
        <title>DiskANN. ScaDANN not only achieves a faster build time ANN indexing. This work establishes ScaDANN as an efibut also maintains strong search performance, striking acient and scalable approach, with significant potential for balance between eficient indexing and high query through- real-world applications requiring high-performance ANN put. indexing.</title>
        <p>[1] W. Yu, C. Zhu, Z. Li, Z. Hu, Q. Wang, H. Ji, M. Jiang, A survey of knowledge-enhanced text generation, ACM Computing Surveys 54 (11s) (2022) 1–38.</p>
        <p>[2] H. Jin, Y. Zhang, D. Meng, J. Wang, J. Tan, A comprehensive survey on process-oriented automatic text summarization with exploration of llm-based methods, arXiv preprint arXiv:2403.02901 (2024).</p>
        <p>[3] Y. Zhuang, Y. Yu, K. Wang, H. Sun, C. Zhang, Toolqa: A dataset for llm question answering with external tools, Advances in Neural Information Processing Systems 36 (2023) 50117–50143.</p>
        <p>[4] M. Cascella, J. Montomoli, V. Bellini, E. Bignami, Evaluating the feasibility of chatgpt in healthcare: an analysis of multiple clinical and research scenarios, Journal of medical systems 47 (1) (2023) 33.</p>
        <p>[5] M. S. Orenstrakh, O. Karnalim, C. A. Suarez, M. Liut, Detecting llm-generated text in computing education: Comparative study for chatgpt cases, in: 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), IEEE, 2024, pp. 121–126.</p>
        <p>[6] K. Pandya, M. Holia, Automating customer service using langchain: Building custom open-source gpt chatbot for organizations, arXiv preprint arXiv:2310.05421 (2023).</p>
        <p>[7] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, et al., A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions, ACM Transactions on Information Systems (2023).</p>
        <p>[8] Y. Yao, J. Duan, K. Xu, Y. Cai, Z. Sun, Y. Zhang, A survey on large language model (llm) security and privacy: The good, the bad, and the ugly, High-Confidence Computing (2024) 100211.</p>
        <p>[9] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive nlp tasks, Advances in Neural Information Processing Systems 33 (2020) 9459–9474.</p>
        <p>[10] J. J. Pan, J. Wang, G. Li, Survey of vector database management systems, The VLDB Journal 33 (5) (2024) 1591–1615.</p>
        <p>[11] Q. Chen, B. Zhao, H. Wang, M. Li, C. Liu, Z. Li, M. Yang, J. Wang, Spann: Highly-efficient billion-scale approximate nearest neighborhood search, Advances in Neural Information Processing Systems 34 (2021) 5199–5212.</p>
        <p>[12] J. V. Munoz, M. A. Gonçalves, Z. Dias, R. d. S. Torres, Hierarchical clustering-based graphs for large scale approximate nearest neighbor search, Pattern Recognition 96 (2019) 106970.</p>
        <p>[13] S. Yu, J. Engels, Y. Huang, J. Shun, Pecann: Parallel efficient clustering with graph-based approximate nearest neighbor search, arXiv preprint arXiv:2312.03940 (2023).</p>
        <p>[14] H. Jegou, M. Douze, C. Schmid, Product quantization for nearest neighbor search, IEEE transactions on pattern analysis and machine intelligence 33 (1) (2010) 117–128.</p>
        <p>[15] Y. Kalantidis, Y. Avrithis, Locally optimized product quantization for approximate nearest neighbor search, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 2321–2328.</p>
        <p>[16] A. Babenko, V. Lempitsky, The inverted multi-index, IEEE transactions on pattern analysis and machine intelligence 37 (6) (2014) 1247–1260.</p>
        <p>[17] W. Dong, C. Moses, K. Li, Efficient k-nearest neighbor graph construction for generic similarity measures, in: Proceedings of the 20th international conference on World wide web, 2011, pp. 577–586.</p>
        <p>[18] C. Fu, C. Xiang, C. Wang, D. Cai, Fast approximate nearest neighbor search with the navigating spreading-out graph, arXiv preprint arXiv:1707.00143 (2017).</p>
        <p>[19] Y. Malkov, A. Ponomarenko, A. Logvinov, V. Krylov, Approximate nearest neighbor algorithm based on navigable small world graphs, Information Systems 45 (2014) 61–68.</p>
        <p>[20] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, IEEE transactions on pattern analysis and machine intelligence 42 (4) (2018) 824–836.</p>
        <p>[21] J. Chen, H.-r. Fang, Y. Saad, Fast approximate knn graph construction for high dimensional data via recursive lanczos bisection, Journal of Machine Learning Research 10 (9) (2009).</p>
        <p>[22] M. Wang, X. Xu, Q. Yue, Y. Wang, A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search, arXiv preprint arXiv:2101.12631 (2021).</p>
        <p>[23] K. Echihabi, K. Zoumpatianos, T. Palpanas, New trends in high-d vector similarity search: ai-driven, progressive, and distributed, Proceedings of the VLDB Endowment 14 (12) (2021) 3198–3201.</p>
        <p>[24] H. V. Simhadri, G. Williams, M. Aumüller, M. Douze, A. Babenko, D. Baranchuk, Q. Chen, L. Hosseini, R. Krishnaswamy, G. Srinivasa, et al., Results of the neurips'21 challenge on billion-scale approximate nearest neighbor search, in: NeurIPS 2021 Competitions and Demonstrations Track, PMLR, 2022, pp. 177–189.</p>
        <p>[25] S. Jayaram Subramanya, F. Devvrit, H. V. Simhadri, R. Krishnaswamy, R. Kadekodi, Diskann: Fast accurate billion-point nearest neighbor search on a single node, Advances in Neural Information Processing Systems 32 (2019).</p>
        <p>[26] S. Arya, D. M. Mount, Approximate nearest neighbor queries in fixed dimensions, in: SODA, Vol. 93, Citeseer, 1993, pp. 271–280.</p>
        <p>[27] M. D. Manohar, Z. Shen, G. Blelloch, L. Dhulipala, Y. Gu, H. V. Simhadri, Y. Sun, Parlayann: Scalable and deterministic parallel graph-based approximate nearest neighbor search algorithms, in: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024, pp. 270–285.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>