https://ceur-ws.org/Vol-3946/TGD-5.pdf
                         ScaDANN: A Scalable Disk-Based Graph Indexing Method for
                         ANN
Hyein Gu1, Min-soo Kim1,∗
1 Korea Advanced Institute of Science and Technology, Republic of Korea


                                           Abstract
                                           Approximate Nearest Neighbor (ANN) indexing constitutes a fundamental component of modern vector-based databases, facilitating
                                           efficient and accurate information retrieval for applications such as retrieval-augmented generation. However, scaling graph-based
                                           ANN indices to billion-scale datasets poses substantial challenges, including high memory demands and inefficiencies in handling
                                           partitioned graphs. To address these issues, we propose ScaDANN, a scalable disk-based graph indexing method tailored for large-scale
                                           datasets under limited memory conditions. ScaDANN introduces two novel techniques: overlapping block-level insertion and grid
                                           block merge, which enable the efficient construction of unified graph indices while preserving high search performance. Our approach
                                           achieves notable advancements in index construction speed, search accuracy, and memory efficiency, establishing ScaDANN as a robust
                                           and effective solution for scalable ANN indexing in resource-limited environments.

                                           Keywords
                                           Graph, ANN, Nearest neighbor search, Vector search



Published in the Proceedings of the Workshops of the EDBT/ICDT 2025 Joint Conference (March 25-28, 2025), Barcelona, Spain.
∗ Corresponding author.
Email: hyein99@kaist.ac.kr (H. Gu); minsoo.k@kaist.ac.kr (M. Kim)
ORCID: 0009-0009-8582-5805 (H. Gu); 0000-0002-5065-0226 (M. Kim)
Copyright © 2025 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Large Language Models (LLMs) have revolutionized natural language processing by excelling in tasks such as text generation [1], summarization [2], and question answering [3]. Their ability to leverage pre-trained knowledge has facilitated applications across diverse domains, including healthcare [4], education [5], and customer service [6]. However, despite their strengths, LLMs face critical limitations, such as hallucination [7], where they generate plausible but incorrect outputs, outdated knowledge due to static training data, and privacy concerns when processing sensitive information [8].

Retrieval-Augmented Generation (RAG) [9] has been introduced as a promising framework to mitigate these limitations by integrating LLMs with external knowledge retrieval systems. By leveraging vector databases (VectorDBs) [10], RAG retrieves relevant and up-to-date information to enhance the accuracy and reliability of generated responses. This framework not only reduces the impact of hallucination but also addresses privacy concerns by accessing only the information necessary for a given query.

Approximate Nearest Neighbor (ANN) indexing serves as a foundational component in RAG systems, enabling efficient information retrieval from VectorDBs. ANN indexes can be broadly categorized into clustering-based [11, 12, 13], hash-based [14, 15, 16], and graph-based methods [17, 18, 19, 20, 21]. Clustering-based methods partition datasets into smaller clusters, facilitating memory-efficient index construction but often incurring longer construction times and reduced search accuracy. Hash-based methods employ hash functions to compress datasets and reduce dimensionality, which accelerates construction and search but at the cost of reduced accuracy due to compression. Graph-based methods, in contrast, represent vectors as nodes and their relationships as edges, connecting nodes based on proximity [22]. These methods construct neighbor lists through vector operations between nodes (as illustrated in Figure 1) and are particularly effective for large-scale datasets, offering superior search accuracy and performance despite higher memory demands during construction.

The need for scalable ANN indexing becomes increasingly apparent as data is segmented into smaller chunks to meet the input constraints of LLMs. For instance, dividing a 100 MB document into 4 KB chunks suitable for LLM input results in approximately 25,600 chunks, a 25,600-fold increase in the number of items to index relative to the single original document. This rapid growth in the number of chunks significantly increases the size of the ANN index, necessitating methods capable of efficiently handling billion-scale datasets.

Building large-scale ANN indexes, however, poses significant challenges [23, 24]. First, the substantial size of datasets and indexes requires considerable memory resources. Second, achieving competitive search performance on disk-based systems, which are necessary for handling such datasets, is particularly challenging. To address these issues, we propose ScaDANN, a Scalable Disk-based Graph Indexing Method for ANN, specifically designed to overcome these limitations.

This study pursues two primary objectives: (1) to enable the construction of large-scale ANN indexes in memory-constrained environments, and (2) to enhance search performance through merged graph structures that improve efficiency and accuracy during the search process. By focusing on these goals, we provide a scalable and efficient solution for constructing and querying ANN indexes in scenarios where both large-scale datasets and memory limitations are critical challenges.

This work makes the following key contributions:

    • We propose a graph-based ANN method that leverages grid blocks to efficiently construct and scale billion-scale indexes in memory-constrained environments.
    • Our approach outperforms existing disk-based methods, achieving index construction speeds 1.3 to 1.5 times faster and search speeds 1.2 to 1.4 times faster, while maintaining comparable accuracy.
    • We demonstrate that our method enables the construction of single indexes up to 10 times larger than those of existing approaches within the same memory constraints.
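The chunk-count arithmetic in the motivating example above can be checked with a short script. This is only a sanity check of the stated figure, assuming binary units (1 MB = 2^20 bytes, 1 KB = 2^10 bytes):

```python
# Chunk-count arithmetic for the RAG motivation example:
# a 100 MB document split into 4 KB chunks (binary units assumed).
doc_bytes = 100 * 2**20    # 100 MB
chunk_bytes = 4 * 2**10    # 4 KB
num_chunks = doc_bytes // chunk_bytes
print(num_chunks)          # 25600 chunks from a single document
```

Each chunk becomes one vector to index, so a corpus of documents multiplies into a vastly larger vector collection, which is what drives the index toward billion scale.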
   These contributions mark a significant advancement in ANN indexing, particularly in terms of efficiency and scalability for large-scale, memory-constrained applications.

2. Preliminaries

2.1. ANN search

Nearest Neighbor (NN) search identifies the data point(s) in a dataset closest to a given query point based on a defined distance metric, such as Euclidean distance or cosine similarity. While exact NN search computes the distance between the query and every data point to determine the nearest neighbor, this approach becomes computationally infeasible for large datasets. ANN search addresses this challenge by finding points that are approximately close to the true nearest neighbors, significantly reducing computational costs. Although ANN sacrifices some accuracy, it delivers faster query response times, which is particularly advantageous in high-dimensional and large-scale datasets.

Definition 1 (ANN). Given a dataset 𝐷 = {𝑥1 , 𝑥2 , … , 𝑥𝑛 } and a query point 𝑞, the task of nearest neighbor search (whether exact or approximate) is to find the point 𝑥_NN ∈ 𝐷 such that the distance 𝑑(𝑞, 𝑥_NN ) is minimized, where 𝑑(⋅, ⋅) represents a distance metric. In the case of ANN search, the algorithm guarantees that the returned point 𝑥 ∗ satisfies:

                𝑑(𝑞, 𝑥 ∗ ) ≤ (1 + 𝜖) ⋅ 𝑑(𝑞, 𝑥_NN ),

where 𝜖 is a small approximation factor, and 𝑥_NN is the exact nearest neighbor of the query point 𝑞.

   To evaluate ANN algorithms, two primary metrics are used: Recall and Queries Per Second (QPS). Recall quantifies the effectiveness of an algorithm in retrieving true nearest neighbors, defined as the ratio of true nearest neighbors retrieved to the total number of true nearest neighbors. Higher recall indicates greater accuracy but may increase search time. QPS measures the system's throughput, i.e., the number of queries processed per second. Higher QPS reflects greater efficiency, which is critical for large-scale and real-time applications.

2.2. Existing graph ANN indexing methods

2.2.1. Vamana graph

Vamana [25] is a widely used graph-based ANN indexing algorithm and serves as the foundational graph structure in many recent ANN studies. Its construction is based on two key techniques: Greedy Search and Robust Pruning.

   Greedy Search enables efficient exploration of nearest neighbors within the graph. Beginning from an initial node, the algorithm iteratively identifies and prioritizes nodes closest to the query. It maintains a priority queue of visited nodes, extracting the nearest unvisited node at each step, adding its neighbors to the queue, and repeating this process until all relevant nodes are explored.

   Robust Pruning optimizes the graph by removing unnecessary edges discovered during Greedy Search. This technique utilizes the Sparse Neighborhood Graph (SNG) property [26] and applies a distance threshold parameter, alpha (𝛼), to determine whether an edge should be retained. Specifically, an edge between nodes 𝑝 and 𝑝′ is removed if the scaled distance between 𝑝′ and a third node 𝑝∗ does not exceed the distance between 𝑝 and 𝑝′:

              𝛼 ⋅ 𝑑(𝑝 ∗ , 𝑝 ′ ) ≤ 𝑑(𝑝, 𝑝 ′ )   (𝛼 > 1)

By retaining longer edges under this relaxed condition, Vamana enhances search efficiency while maintaining graph connectivity.

2.2.2. DiskANN

DiskANN [25] extends the Vamana graph to a disk-based environment, making it well-suited for large-scale datasets. It employs a divide-and-conquer approach, partitioning the dataset into subgraphs and merging them into a unified structure.

   Figure 1 illustrates the partitioning and merging process of DiskANN. (1) The dataset is first divided into multiple overlapping partitions using a clustering technique, with each node assigned to at least two partitions. (2) Independent subgraphs are constructed within each partition. (3) These subgraphs are then merged by randomly selecting nodes from the partitioned graphs, followed by merging and pruning operations. This process updates the neighbor lists with merged results, ensuring connectivity between partitions while reducing memory and computational costs. DiskANN's partitioning and merging strategy enables scalable graph construction but introduces challenges in maintaining search performance across partitions.

Figure 1: DiskANN partitioning and merging process. (1) Partitioning based on overlapping clustering (two overlapping clusters, each of size 8); (2) sub-graph construction; (3) merging via random shuffle and pruning of the neighbor lists.

2.2.3. Batch insertion for graph construction

Efficient parallelization is essential for graph construction on large-scale datasets. Batch insertion, as introduced in ParlayANN [27], addresses the challenge of concurrently adding nodes to a graph while ensuring conflict-free operations. Conflicts are avoided by grouping nodes into independent batches and processing each batch in isolation, ensuring that nodes within a batch do not interfere with each other during insertion.

   The batch insertion process occurs in two steps. First, nodes in a batch are independently connected to their nearest neighbors within the existing graph. These connections are established in parallel, as nodes within a batch do not share dependencies, preventing conflicts during edge creation. This step ensures that all newly added nodes are integrated into the graph without disrupting its structure. Second, reversed edges are added to the graph.
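As a toy illustration of this two-step scheme, the sketch below first computes out-edges for every batch node against the existing graph only (step 1, trivially parallelizable since batch nodes do not read each other), then adds the reversed edges (step 2). Helper names are ours, not ParlayANN's, and brute-force search over 1-D points stands in for greedy graph search:

```python
import heapq

def nearest_in_graph(graph, points, query, k):
    """Pick the k existing graph nodes closest to `query` (brute force)."""
    return heapq.nsmallest(k, graph, key=lambda v: abs(points[v] - points[query]))

def batch_insert(graph, points, batch, k=2):
    # Step 1: out-edges for every batch node, computed independently
    # against the pre-existing graph only, so there are no conflicts.
    new_edges = {v: nearest_in_graph(graph, points, v, k) for v in batch}
    for v, nbrs in new_edges.items():
        graph[v] = list(nbrs)
    # Step 2: add reversed edges, making every new connection bidirectional.
    for v, nbrs in new_edges.items():
        for u in nbrs:
            if v not in graph[u]:
                graph[u].append(v)
    return graph

points = {0: 0.0, 1: 1.0, 2: 5.0, 3: 1.2, 4: 4.8}
graph = {0: [1], 1: [0], 2: []}      # existing graph over nodes 0, 1, 2
batch_insert(graph, points, batch=[3, 4])
```

A real implementation would additionally prune each neighbor list (e.g., with Robust Pruning) when reversed edges push it past the degree bound; the sketch omits that for brevity.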
Figure 2: ScaDANN building process with three partitions and corresponding grid blocks at each step (Step 1: build the graph with blocks 0 and 1; Step 2: with blocks 1 and 2; Step 3: with blocks 2 and 0). The graph is constructed by loading two partitions into memory at a time, ensuring efficient graph generation. In the next step, the previously used partition is overlapped with a new partition to continue the building process.

A reversed edge is defined as an edge where the source vertex and destination vertex are swapped, ensuring bidirectional connectivity between nodes. For example, if an edge connects node 𝑣𝑖 (source) to node 𝑣𝑗 (destination), the reversed edge connects 𝑣𝑗 back to 𝑣𝑖. During this step, the neighbors of each newly added node are updated with reversed edges, completing the bidirectional connections required for robust graph-based indexing.

   By separating the insertion of neighbors and the addition of reversed edges into distinct steps, batch insertion guarantees structural consistency while leveraging parallel processing. This approach is particularly effective on large-scale datasets, as it minimizes computational overhead while maintaining the integrity and quality of the graph. Moreover, the absence of interdependencies between batches ensures that the graph remains stable throughout the construction process, enabling efficient and scalable graph-based ANN indexing.

3. ScaDANN Method

3.1. Grid partitioning

Figure 3 illustrates the structure of a graph ANN index in a grid format, where each node in the graph consists of a base vector and a neighbor list. Grid partitioning divides large-scale datasets into grid blocks, with rows representing source vertices and columns representing destination vertices. This approach organizes datasets into manageable segments, enabling memory- and disk-efficient graph construction. For example, constructing a Vamana graph for a 1-billion-point dataset with 100 dimensions, requiring 400 GB for vector storage, is infeasible under memory constraints of 256 GB or 512 GB. However, grid partitioning ensures scalability by allowing the dataset to be partitioned into grid blocks, making it feasible to construct the graph even when the dataset size exceeds the available memory.

   Each grid block captures intra- and inter-partition relationships. A grid block whose source and destination partitions are the same forms a subgraph, which follows the Vamana graph structure. These diagonal blocks represent subgraphs within individual partitions, while off-diagonal blocks denote connections between partitions. For example, the red block (𝑝2, 𝑝1) in Figure 3 corresponds to connections from partition 𝑝2 (green) to partition 𝑝1 (blue). This structure improves scalability and ensures conflict-free parallel processing during node insertion.

[Figure 3: grid layout of the index, with columns as destination-vertex partitions (𝑝0, 𝑝1, 𝑝2) and rows as source-vertex partitions; the diagonal block (𝑝0, 𝑝0) holds subgraph 0.]

3.2. Overlapping block-level insertion

The overlapping block-level insertion method incrementally constructs the graph by processing data in grid block units. It employs overlapping blocks, where a block from the previous step is reused in the next step, allowing two blocks to be processed simultaneously. As detailed in Algorithm 1, the process begins with the construction of the graph for the first partition (𝐵1 = 0). This step (line 3) generates the subgraph for the first partition, ensuring that the intra-partition edges are established within the initial grid block.

   In subsequent steps, pairs of overlapping blocks are processed to extend the graph. For each step, the data corresponding to the current blocks is loaded into memory to construct the graph (lines 9–11). In particular, line 11 fills four grid blocks by constructing the subgraphs for two partitions and establishing connections between them. This ensures that intra- and inter-partition edges are created simultaneously, improving efficiency and maintaining the global structure of the graph. Once the graph is constructed, the earlier block is offloaded from memory (lines 12–15), and the next block is loaded (lines 5–10).

   By reusing overlapping blocks during the insertion pro-




             𝑣,
                         𝑣"              𝑣'                                                      Sub                          cess, ScaDANN integrates graph construction and merging
                                                                          𝒑𝟏
                                                                                               graph 1                        into a single operation, eliminating the need for a separate,
 𝑣# [0.71, 1.21, …, 0.92]           [𝑣! , 𝑣" , 𝑣!# ]
                                                                                                                              computationally expensive merging step. This approach en-
 𝑣! [1.91, 1.32, …, 0.49]           [𝑣$ , 𝑣% , 𝑣& ]                                             Edges
                                                                                                               Sub            sures that inter-block connections are captured as the graph
                                                                          𝒑𝟐                   from p2
 …


             …




                                         …




                                                                                                             graph 2
 𝑣!! [0.01, 1.21, …, 0.43]          [𝑣% , 𝑣' , 𝑣" ]
                                                                                                 to p1                        is built, thereby avoiding the creation of disjoint subgraphs.
                                                                                                                              Additionally, the overlapping mechanism significantly re-
          Base vector           Neighbor list                                                   Grid
                                                                                                                              duces I/O overhead by combining insertion and merging,
Figure 3: Graph ANN with grid format. The grid format repre-                                                                  which are traditionally distinct steps.
sents the structure of Graph ANN, where each node consists of a                                                                  Figure 2 illustrates the construction process. Initially,
base vector and a neighbor list.
 Algorithm 1: ScaDANN Construction                                              𝑣!
                                                                                                          𝑣$
                                                                                                                                  Main memory                   Disk
                                                                                          𝑣"
  Input: block size 𝑃 and dataset 𝐷                                                                  𝑣&
                                                                     𝑣"#                                                            Block 1
  Output: Graph                                                                      𝑣#
                                                                                                𝑣(
                                                                                                               𝑣'                                             Block 0
     /* Initialize first block and interval            */                                            𝑣""                            Block 2
                                                                              𝑣%
1    first block number 𝐵1 = 0, second block number                                       𝑣)           𝑣*                             Stored on disk
       𝐵2 = 1, block interval 𝑖𝑡𝑣 = 1, neighbor list                                                                                  (not loaded in main memory)

       𝑁 𝐵𝑅𝑖 = ∅;                                                  𝑣!      Neighbor list (Neighbor ID, Distance)
                                                                                                                                                              Max degree:3
     /* Read dataset for the first block               */
                                                                  ( 𝑣𝑣&" , 236) ( 𝑣𝑣$! , 387)            $$ , 219) ( 𝑣                                 𝑣&" , 236) ( 𝑣𝑣$! , 387)
                                                                                                                                           $$ , 219) ( 𝑣
2 𝐷1 = ReadDataset(𝐵1 );                                                                             ( 𝑣𝑣$$          𝑣*# , 473)        ( 𝑣𝑣$$

3 𝑁 𝐵𝑅1 = vamana(𝐵1 , 𝑁 𝐵𝑅1 );                                                                          Added neighbors                       Updated neighbor list

4 while 𝑖𝑡𝑣 < 𝑃 do
5    𝐵2 ≡ (𝐵1 + 𝑖𝑡𝑣) (mod 𝑃);                                    Figure 4: Grid block merge process with distance information.
        /* Increase interval size                      */
6       if 𝐵1 = 𝐵2 then                                             The merging process follows the Vamana graph construc-
 7          𝑖𝑡𝑣 = 𝑖𝑡𝑣 ∗ 2;                                       tion method, incorporating a greedy search to identify can-
 8          continue                                             didate neighbors and a robust pruning step to refine the
        /* Read the second dataset and graph           */        neighbor list. Inter-block edges are established by adding
9    𝐷2 = ReadDataset(𝐵2 );                                      neighbors from the second block to the neighbor list of
10   𝑁 𝐵𝑅2 = ReadGraph(𝐵2 );                                     the first block. The reverse edges are then added, ensuring
11   𝑁 𝐵𝑅1 , 𝑁 𝐵𝑅2 =                                             bidirectional connectivity, where nodes in the first block
       vamana((𝐷1 , 𝐷2 ), (𝑁 𝐵𝑅1 , 𝑁 𝐵𝑅2 ));                     are also linked as neighbors of nodes in the second block.
12   WriteGraph(𝑁 𝐵𝑅1 , 𝐵1 );                                    This comprehensive approach ensures a robust and efficient
13   𝐷1 = 𝐷 2 ;                                                  merging process.
14   𝑁 𝐵𝑅1 = 𝑁 𝐵𝑅2 ;
15   𝐵1 = 𝐵 2 ;                                                  3.4. Complexity
16 WriteGraph(𝑁 𝐵𝑅1 , 𝐵1 );
                                                                 Table 1 compares the time and space complexities of Vamana,
                                                                 DiskANN, and ScaDANN in the index-building process, fo-
                                                                 cusing on scalability and efficiency. The space complexity
blocks 0 and 1 are loaded into memory, and edges between
                                                                 considers storage requirements, including neighbor lists.
the two blocks are created. In the second step, block 0 is
                                                                    Vamana constructs a single large-scale graph, resulting in
offloaded, block 1 is retained, and block 2 is loaded. The
                                                                 a time complexity of 𝑂(𝑆 1.16 ) and a space complexity of 𝑂(𝑆 ⋅
graph is extended based on connections between blocks 1
                                                                 𝑅). Since it does not partition the dataset, the entire graph
and 2. Finally, in the third step, block 1 is offloaded, block
                                                                 must fit in memory, making it less scalable for large-scale
2 is retained, and block 0 is reloaded to extend the graph
                                                                 data. Both DiskANN and ScaDANN partition the dataset to
further by connecting blocks 2 and 0. This iterative pro-
                                                                 construct subgraphs. Partitioning reduces memory usage
cess efficiently captures all relevant connections between
                                                                 and enables disk-based processing. The time complexity
blocks, enabling the construction of a unified graph while
                                                                 is inversely proportional to the number of partitions, as
minimizing memory usage and I/O costs.
                                                                 increasing 𝑃 allows for more localized graph construction,
                                                                 reducing the computational cost.
3.3. Grid block merge                                               In DiskANN, each node is assigned to multiple overlap-
The grid block merge process establishes connections be-         ping partitions (typically 𝐾 = 2 ), as shown in Figure 1.
tween two blocks loaded into memory and ensures bidirec-         Since DiskANN performs clustering to determine partitions,
tional connectivity through reverse edges. Reverse edges,        the build time complexity includes the clustering time (
where the source and destination vertices are swapped,           𝐶 ). In contrast, ScaDANN utilizes dynamic partitioning,
maintain symmetric inter-block connections, ensuring con-        where the data structure size is determined based on the
sistent and efficient search performance.                        dataset size. This process does not require separate cluster-
   In conventional methods, merging requires all data to be      ing and is treated as an internal computational step, which
loaded into memory to calculate distances between nodes.         is why it is not included in the time complexity. Unlike
ScaDANN overcomes this limitation by storing distance in-        DiskANN, which incurs overhead due to overlapping par-
formation directly within the neighbor list of each node.        titions, ScaDANN eliminates these redundancies, stream-
This pre-stored distance means that block 1 retains the          lining subgraph construction and merging. This results in
neighbor list formed in step 1. For example, after step 1, 𝑣5    a computationally efficient process that scales well with
in block 1 (green) has already stored 𝑣3 and 𝑣2 from block       increasing dataset size.
0 (yellow) as its neighbors. This allows distance compar-
isons between existing and new neighbors without reload-            Methods                            Build Time                                    Space
ing offloaded data, reducing memory usage and improving              Vamana                                𝑂(𝑆 1.16 )                               𝑂(𝑆 ⋅ 𝑅)
                                                                    DiskANN                     𝑂(𝐾 1.16 ⋅ 𝑃 −0.84 ⋅ 𝑆 1.16 + 𝐶)                𝑂(𝑆 ⋅ 𝑅 ⋅ 𝐾 ⋅ 𝑃 −1 )
scalability.
                                                                    ScaDANN                          𝑂(𝑃 −0.84 ⋅ 𝑆 1.16 )                        𝑂(𝑆 ⋅ 𝑅 ⋅ 𝑃 −1 )
   Figure 4 illustrates the merging process during step 2
in Figure 2, where blocks 1 and 2 are loaded into memory         Table 1
while block 0 remains stored on disk. When updating the          Comparison of time and space complexities for Vamana,
neighbor list of 𝑣5 , the algorithm uses pre-stored distance     DiskANN, and ScaDANN, where 𝑆 is the dataset size, 𝑅 is the
data from step 1 to compare block 0 neighbors with block 2       maximum degree of the graph, 𝐾 is the number of overlapping
neighbors, avoiding redundant distance calculations. As a        partitions per node, 𝑃 is the number of partitions, and 𝐶 is the
                                                                 clustering time.
result, 𝑣5 ’s neighbors are updated to 𝑣11 , 𝑣3 , and 𝑣2 .
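The distance-aware neighbor update of Figure 4 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper name `update_neighbors`, the (id, distance) tuple layout, and the use of a plain distance sort in place of Vamana's robust pruning are assumptions for exposition, and the identity of the fourth vertex (distance 473) is illustrative.

```python
# Each node keeps its neighbor list as (neighbor_id, distance) pairs, so
# distances to neighbors in offloaded blocks never need to be recomputed.

def update_neighbors(stored, candidates, max_degree):
    """Merge pre-stored neighbors (from earlier, offloaded blocks) with new
    candidates from the block currently in memory, keeping the max_degree
    closest. A plain distance sort stands in for Vamana's robust pruning."""
    merged = {}
    for nid, dist in stored + candidates:
        merged[nid] = min(dist, merged.get(nid, float("inf")))
    ranked = sorted(merged.items(), key=lambda p: p[1])
    return ranked[:max_degree]

# Numbers follow Figure 4: v5 already stores v2 and v3 (block 0, offloaded);
# v11 and a further candidate arrive from the newly loaded block.
v5_stored = [("v2", 236), ("v3", 387)]
v5_candidates = [("v11", 219), ("v9", 473)]
print(update_neighbors(v5_stored, v5_candidates, max_degree=3))
# → [('v11', 219), ('v2', 236), ('v3', 387)]
```

With a maximum degree of 3, the three closest candidates survive, matching the update in Figure 4 where v5's neighbors become v11, v3, and v2 without reloading block 0 from disk.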
4. Experiments

4.1. Experimental setup

The experiments were conducted to evaluate the performance of ScaDANN under controlled conditions. The hardware environment included two Gold 6326 16-core processors and 1.9 TB of main memory, while the software environment was based on Ubuntu 20.04.6 LTS. To simulate memory-constrained scenarios, the available memory was deliberately restricted to 256 GB during the experiments. Under these constraints, both Vamana and ParlayANN encountered Out of Memory (OOM) errors, making them unsuitable for comparison. As a result, our evaluation focused on comparing ScaDANN with DiskANN, a state-of-the-art disk-based ANN indexing method.

Table 2 provides a summary of the datasets used in the experiments, including their data types and dimensions. The datasets used were BIGANN, Microsoft SPACEV, Yandex DEEP, and Yandex Text-to-Image, each posing unique challenges for ANN indexing. BIGANN consists of 1 billion 128-dimensional SIFT descriptors from a large-scale image dataset, serving as a benchmark for high-dimensional similarity search. Microsoft SPACEV includes 1 billion 100-dimensional document and query vectors reflecting real-world web search scenarios. Yandex DEEP comprises 1 billion 96-dimensional image descriptors designed to evaluate deep learning-based image representations. Yandex Text-to-Image contains 1 billion multi-modal embeddings, where image embeddings serve as the database and text embeddings form the query set, representing a cross-modal retrieval task with distinct distributions for database and query vectors.

  Dataset              | Data type | Dimensions
  BIGANN               | uint8     | 128
  Microsoft SPACEV     | int8      | 100
  Yandex DEEP          | float32   | 96
  Yandex Text-to-Image | float32   | 200

Table 2: Summary of datasets used in the experiments.

4.2. Results and Analysis

4.2.1. Billion-scale index building in limited memory

Figure 5 presents the results of indexing at billion scale under a 256GB memory constraint. The top graph in Figure 5 illustrates the index building time across different datasets. ScaDANN significantly reduced the index-building time compared to DiskANN across all datasets. For example, on the BIGANN dataset, ScaDANN completed the indexing task in 10.7 hours, compared to 16.5 hours for DiskANN, achieving approximately a 35% reduction in build time. This efficiency improvement can be attributed to ScaDANN's optimized data partitioning and overlapping block techniques, which minimize disk I/O and improve memory access patterns.

The bottom graph in Figure 5 shows the memory usage during the index construction process. Both DiskANN and ScaDANN employed dynamic partitioning, which adjusts the number of partitions based on available memory capacity and efficiently divides the data. Since the graph size and memory usage vary depending on the degree of the neighbor list, dynamic partitioning enables optimized partitioning by adapting to these variations. Notably, ScaDANN required a higher number of partitions than DiskANN. This is because ScaDANN stores additional distance information within the graph, which increases memory consumption per block. As a result, ScaDANN divides the data into more partitions than DiskANN to accommodate the additional memory usage. For instance, on the BIGANN dataset, ScaDANN used around 200GB of memory for indexing; DiskANN also remained under the 256GB limit but required fewer partitions to store its graph.

Figure 5: Billion-scale index building experiment results, including building time and memory usage, conducted under a 256GB memory constraint. The memory usage values indicate the number of partitions dynamically determined during the indexing process.

We conducted a detailed comparison of the index building process with DiskANN, as illustrated in Figure 6. In the subgraph construction phase, our method completed approximately twice as fast as DiskANN, with the BIGANN dataset showing a significant speedup. However, the subgraph merging phase took longer in our approach. This is due to the finer-grained distance comparisons performed during the merging process, which operates at the level of grid blocks. Unlike DiskANN, which uses a more straightforward merging process, ScaDANN's overlapping block approach requires additional computations during the merging phase to ensure accurate distance calculations between partitioned blocks.

In terms of I/O costs, our method demonstrated greater efficiency. This improvement can be attributed to the use of overlapping block techniques, which optimize data loading between memory and disk. Specifically, ScaDANN's approach minimizes unnecessary disk reads and efficiently manages memory, leading to a faster construction process despite the increased memory overhead. This is evident in the time breakdown in Figure 6, where ScaDANN exhibits a more balanced distribution of time between subgraph building, partitioning, and merging, compared to DiskANN, which spends a significant portion of its time in I/O operations.

Figure 6: Time taken for each step in the building process, categorized into partitioning, subgraph building, merging, and I/O time for comparison.
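The roughly two-fold gap in the subgraph construction phase is consistent with the build-time expressions in Table 1: for equal P and S, and setting DiskANN's clustering term C aside, DiskANN's term exceeds ScaDANN's by a factor of K^1.16, which is about 2.23 for K = 2. The quick check below treats the asymptotic expressions as proportional costs, which is an informal reading since the O-notation hides constants.

```python
# Ratio of the Table 1 build-time terms (DiskANN / ScaDANN) for equal P and S,
# ignoring DiskANN's clustering term C:
#   (K^1.16 * P^-0.84 * S^1.16) / (P^-0.84 * S^1.16) = K^1.16
# The hidden constants are assumed comparable, which the asymptotics alone
# do not guarantee.
def build_time_ratio(k):
    return k ** 1.16

print(round(build_time_ratio(2), 2))  # → 2.23
```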
Figure 7: Disk-based search performance. The graph compares Recall@10 (x-axis) and QPS (y-axis), where higher placement indicates lower latency at a given recall level. A method positioned higher in the graph achieves faster query processing while maintaining similar accuracy.
4.2.2. Disk-based search performance

Figure 7 presents the comparison of disk-based search performance between ScaDANN and DiskANN. The experiments were carried out using the disk search method for evaluation. Our proposed ScaDANN method demonstrated better search performance than DiskANN on most datasets in terms of QPS, especially on the BIGANN-1B dataset. As shown in the graph, ScaDANN achieved a higher QPS while maintaining competitive recall rates. For example, ScaDANN processed queries at a faster rate than DiskANN on BIGANN and MSSPACEV-1B, achieving a significant speedup while still maintaining similar or slightly better recall at top-10 results. However, on the DEEP-1B dataset, ScaDANN showed a slight performance drop when Recall@10 exceeded 95. This discrepancy can be attributed to the disk search method used, which is specifically optimized for DiskANN.

Figure 8: Experimental results of BIGANN 100M index building with varying numbers of partitions.
                                                                                                           Queries per second




                                                                                                                                                                                                         Queries per second
                                                                                                                                1.E+05

Figure 9: In-memory experimental results (queries per second
vs. Recall@10) on (a) BIGANN-100M and (b) Yandex Text-to-
Image-100M. The numbers in parentheses next to each method
indicate the index building time in minutes: ParlayANN (40/50),
ScaDANN (42/42), and DiskANN (55/82) for (a)/(b), respectively.

4.3. Ablation study

4.3.1. Impact of partition size on index construction

Figure 8 illustrates the effects of varying the number of parti-
tions on the BIGANN 100M dataset. Increasing the number
of partitions led to a significant increase in building time,
with approximately a 70% increase per step. However, RAM
usage decreased by around 40% per step, demonstrating the
trade-off between memory efficiency and construction time.
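The trade-off above can be sketched numerically. The following is a minimal illustrative model, assuming the reported per-step factors (~1.7× build time, ~0.6× RAM per doubling of the partition count) apply multiplicatively; the baseline build time and RAM figures below are hypothetical placeholders, not measured values from the paper.

```python
# Illustrative model of the partition-count trade-off on BIGANN 100M:
# each doubling of the partition count multiplies build time by ~1.7
# (+70%) and RAM usage by ~0.6 (-40%).
# Baseline numbers are hypothetical, chosen only for demonstration.

def project(base_time_min: float, base_ram_gb: float, partitions: list):
    """Project build time and peak RAM for each partition setting."""
    rows = []
    time_min, ram_gb = base_time_min, base_ram_gb
    for i, p in enumerate(partitions):
        if i > 0:            # apply per-step factors after the first setting
            time_min *= 1.7  # ~70% longer build per doubling step
            ram_gb *= 0.6    # ~40% less peak RAM per doubling step
        rows.append((p, round(time_min, 1), round(ram_gb, 1)))
    return rows

for p, t, r in project(base_time_min=100.0, base_ram_gb=64.0,
                       partitions=[4, 8, 16]):
    print(f"partitions={p:2d}  build~{t} min  peak RAM~{r} GB")
```

Under this model, halving peak RAM costs roughly a 70% longer build per step, which matches the paper's observation that partitioning trades construction time for memory.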
   Despite the changes in partition numbers, there were no
significant differences observed in QPS relative to Recall@10
across the tested configurations. This indicates that increas-
ing the partition number primarily affects the construction
phase without adversely impacting search performance.

4.3.2. In-memory graph construction experiments

The in-memory graph construction experiments on
BIGANN-100M and Yandex Text-to-Image-100M compare
ParlayANN, ScaDANN, and DiskANN in terms of indexing
time and query performance. Since ParlayANN encounters
out-of-memory (OOM) issues at the billion scale, the ex-
periments were conducted on 100 million data points with-
out partitioning for ScaDANN and DiskANN. ScaDANN
achieves indexing speeds comparable to or faster than
ParlayANN by storing distances within the graph, reduc-
ing redundant computations. Both methods also benefit
from batch insertion, leading to faster construction than
DiskANN. ScaDANN not only achieves a faster build time
but also maintains strong search performance, striking a
balance between efficient indexing and high query through-
put.

5. Conclusion

This study proposed ScaDANN, a scalable disk-based graph
indexing method designed for billion-scale ANN datasets
in memory-constrained environments. By introducing
overlapping block-level insertion and grid block merge,
ScaDANN effectively mitigates the high memory demands
and inefficiencies of existing methods. These techniques
enable seamless integration of blocks and efficient merging
of partitioned graphs while minimizing I/O overhead and
leveraging stored distance information.
   Experimental results demonstrated that ScaDANN
achieves up to 1.5× faster index construction and 1.4× higher
query throughput compared to DiskANN, while maintaining
competitive accuracy across diverse datasets. Furthermore,
ScaDANN supports the construction of single indexes 10
times larger than existing approaches within the same mem-
ory constraints, making it a robust solution for large-scale
ANN indexing. This work establishes ScaDANN as an effi-
cient and scalable approach, with significant potential for
real-world applications requiring high-performance ANN
indexing.