<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>ScaDANN: A Scalable Disk-Based Graph Indexing Method for ANN</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hyein Gu</string-name>
          <email>hyein99@kaist.ac.k</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Min-soo Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Korea Advanced Institute of Science and Technology</institution>
          ,
          <country>Republic of Korea</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <fpage>25</fpage>
      <lpage>28</lpage>
      <abstract>
<p>Approximate Nearest Neighbor (ANN) indexing constitutes a fundamental component of modern vector-based databases, facilitating efficient and accurate information retrieval for applications such as retrieval-augmented generation. However, scaling graph-based ANN indices to billion-scale datasets poses substantial challenges, including high memory demands and inefficiencies in handling partitioned graphs. To address these issues, we propose ScaDANN, a scalable disk-based graph indexing method tailored to large-scale datasets under limited memory conditions. ScaDANN introduces two novel techniques, overlapping block-level insertion and grid block merge, which enable the efficient construction of unified graph indices while preserving high search performance. Our approach achieves notable advancements in index construction speed, search accuracy, and memory efficiency, establishing ScaDANN as a robust and effective solution for scalable ANN indexing in resource-limited environments.</p>
      </abstract>
      <kwd-group>
        <kwd>Graph</kwd>
        <kwd>ANN</kwd>
        <kwd>Nearest neighbor search</kwd>
        <kwd>Vector search</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Large Language Models (LLMs) have revolutionized natural language processing by excelling in tasks such as text generation [1], summarization [2], and question answering [3]. Their ability to leverage pre-trained knowledge has facilitated applications across diverse domains, including healthcare [4], education [5], and customer service [6]. However, despite their strengths, LLMs face critical limitations, such as hallucination [7], where they generate plausible but incorrect outputs, outdated knowledge due to static training data, and privacy concerns when processing sensitive information [8].</p>
      <p>Retrieval-Augmented Generation (RAG) [9] has been introduced as a promising framework to mitigate these limitations by integrating LLMs with external knowledge retrieval. By leveraging vector databases (VectorDBs) [10], RAG retrieves relevant and up-to-date information to enhance the accuracy and reliability of generated responses. This framework not only reduces the impact of hallucination but also addresses privacy concerns by accessing only the information necessary for a given query.</p>
      <p>Approximate Nearest Neighbor (ANN) indexing serves as a foundational component in RAG systems, enabling efficient information retrieval from VectorDBs. ANN indexes can be broadly categorized into clustering-based [11, 12, 13], hash-based [14, 15, 16], and graph-based methods [17, 18, 19, 20, 21]. Clustering-based methods partition datasets into smaller clusters, facilitating memory-efficient index construction but often incurring longer construction times and reduced search accuracy. Hash-based methods employ hash functions to compress datasets and reduce dimensionality, which accelerates construction and search but at the cost of reduced accuracy due to compression. Graph-based methods, in contrast, represent vectors as nodes and their relationships as edges, connecting nodes based on proximity [22]. These methods construct neighbor lists through vector operations between nodes (as illustrated in Figure 1) and are particularly effective for large-scale datasets, offering superior search accuracy and performance despite higher memory demands during construction.</p>
      <p>The need for scalable ANN indexing becomes increasingly apparent as data is segmented into smaller chunks to meet the input constraints of LLMs. For instance, dividing a 100 MB document into 4 KB chunks suitable for LLM input results in approximately 25,600 chunks, a 25,600-fold increase relative to the original document count. This rapid growth in the number of chunks significantly increases the size of the ANN index, necessitating methods capable of efficiently handling billion-scale datasets.</p>
      <p>Building large-scale ANN indexes, however, poses significant challenges [23, 24]. First, the substantial size of datasets and indexes requires considerable memory resources. Second, achieving competitive search performance on disk-based systems, which are necessary for handling such datasets, is particularly challenging. To address these issues, we propose ScaDANN, a Scalable Disk-based Graph Indexing Method for ANN, specifically designed to overcome these limitations.</p>
      <p>This study pursues two primary objectives: (1) to enable the construction of large-scale ANN indexes in memory-constrained environments and (2) to enhance search performance through the use of merged graph structures that improve efficiency and accuracy during the search process. By focusing on these goals, we provide a scalable and efficient solution for constructing and querying ANN indexes in scenarios where both large-scale datasets and memory limitations are critical challenges.</p>
      <p>This work makes the following key contributions:
• We propose a graph-based ANN method that leverages grid blocks to efficiently construct and scale billion-scale indexes in memory-constrained environments.
• Our approach outperforms existing disk-based methods, achieving index construction speeds 1.3 to 1.5 times faster and search speeds 1.2 to 1.4 times faster, while maintaining comparable accuracy.
• We demonstrate that our method enables the construction of single indexes up to 10 times larger than those of existing approaches within the same memory constraints.
These contributions mark a significant advancement in ANN indexing, particularly in terms of efficiency and scalability for large-scale, memory-constrained applications.</p>
      <p>Published in the Proceedings of the Workshops of the EDBT/ICDT 2025. ∗Corresponding author. ORCID: 0009-0009-8582-5805 (H. Gu); 0000-0002-5065-0226 (M. Kim). Copyright © 2025 for this paper by its authors. Use permitted under Creative Commons License. CEUR Workshop Proceedings (ceur-ws.org).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminaries</title>
      <sec id="sec-2-1">
        <title>2.1. ANN search</title>
        <p>Nearest Neighbor (NN) search identifies the data point(s) in a dataset closest to a given query point based on a defined distance metric, such as Euclidean distance or cosine similarity. While exact NN search computes the distance between the query and every data point to determine the nearest neighbor, this approach becomes computationally infeasible for large datasets. ANN search addresses this challenge by finding points that are approximately close to the true nearest neighbors, significantly reducing computational costs. Although ANN sacrifices some accuracy, it delivers faster query response times, which is particularly advantageous in high-dimensional and large-scale datasets.</p>
        <p>Definition 1 (ANN). Given a dataset X = {x1, x2, …, xn} and a query point q, the task of nearest neighbor search (whether exact or approximate) is to find the point p ∈ X such that the distance d(q, p) is minimized, where d(·, ·) is a distance metric. In the case of ANN search, the algorithm guarantees that the returned point p∗ satisfies d(q, p∗) ≤ (1 + ε) · d(q, p), where ε is a small approximation factor and p is the exact nearest neighbor of the query point q.</p>
        <p>To evaluate ANN algorithms, two primary metrics are used: Recall and Queries Per Second (QPS). Recall quantifies the effectiveness of an algorithm in retrieving true nearest neighbors, defined as the ratio of true nearest neighbors retrieved to the total number of true nearest neighbors. Higher recall indicates greater accuracy but may increase search time. QPS measures the system's throughput, i.e., the number of queries processed per second. Higher QPS reflects greater efficiency, which is critical for large-scale and real-time applications.</p>
      </sec>
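The two metrics above can be made concrete with a short sketch (illustrative Python, not part of the paper's implementation): brute-force search ranks every point by distance to provide the ground truth, and Recall@k is the fraction of true nearest neighbors that an approximate result actually contains.

```python
import numpy as np

def exact_knn(data, query, k):
    # Brute-force k-NN: compute the distance from the query to every
    # point and keep the k closest. Infeasible at billion scale, but it
    # provides the ground truth against which recall is measured.
    dists = np.linalg.norm(data - query, axis=1)
    return np.argsort(dists)[:k].tolist()

def recall_at_k(retrieved, ground_truth):
    # Fraction of the true nearest neighbors that were retrieved.
    hits = set(retrieved).intersection(ground_truth)
    return len(hits) / len(ground_truth)

data = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
query = np.array([0.1, 0.0])
truth = exact_knn(data, query, k=2)   # indices of the 2 closest points
approx = [truth[0], 3]                # a hypothetical ANN result
print(recall_at_k(approx, truth))     # prints 0.5: one of two true neighbors found
```

QPS is simply the number of such queries answered per second, so a system sweep traces a QPS-versus-recall curve as in the paper's figures.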
      <sec id="sec-2-2">
        <title>2.2. Existing graph ANN indexing methods</title>
        <sec>
          <title>2.2.1. Vamana graph</title>
          <p>Vamana [25] is a widely used graph-based ANN indexing algorithm and serves as the foundational graph structure in many recent ANN studies. Its construction is based on two key techniques: Greedy Search and Robust Pruning.</p>
          <p>Greedy Search enables efficient exploration of nearest neighbors within the graph. Beginning from an initial node, the algorithm iteratively identifies and prioritizes nodes closest to the query. It maintains a priority queue of visited nodes, extracting the nearest unvisited node at each step, adding its neighbors to the queue, and repeating this process until all relevant nodes are explored.</p>
          <p>Robust Pruning optimizes the graph by removing unnecessary edges discovered during Greedy Search. This technique utilizes the Sparse Neighborhood Graph (SNG) property [26] and applies a distance threshold parameter, alpha (α > 1), to determine whether an edge should be retained. Specifically, an edge between nodes p and p′ is removed if α · d(p∗, p′) ≤ d(p, p′), i.e., if the distance between these nodes exceeds α times the distance between p′ and a third node p∗. By retaining longer edges under this relaxed condition, Vamana enhances search efficiency while maintaining graph connectivity.</p>
        </sec>
        <sec>
          <title>2.2.2. DiskANN</title>
          <p>DiskANN [25] extends the Vamana graph to a disk-based environment, making it well-suited for large-scale datasets. It employs a divide-and-conquer approach, partitioning the dataset into subgraphs and merging them into a unified structure.</p>
          <p>Figure 1 illustrates the partitioning and merging process of DiskANN. (1) The dataset is first divided into multiple overlapping partitions using a clustering technique, with each node assigned to at least two partitions. (2) Independent subgraphs are constructed within each partition. (3) These subgraphs are then merged by randomly selecting nodes from the partitioned graphs, followed by merging and pruning operations. This process updates the neighbor lists with merged results, ensuring connectivity between partitions while reducing memory and computational costs. DiskANN's partitioning and merging strategy enables scalable graph construction but introduces challenges in maintaining search performance across partitions.</p>
          <p>Figure 1: DiskANN partitioning and merging process.</p>
        </sec>
        <sec>
          <title>2.2.3. Batch insertion for graph construction</title>
          <p>Efficient parallelization is essential for graph construction in large-scale datasets. Batch insertion, as introduced in ParlayANN [27], addresses the challenge of concurrently adding nodes to a graph while ensuring conflict-free operations. Conflicts are avoided by grouping nodes into independent batches and processing each batch in isolation, ensuring that nodes within a batch do not interfere with each other during insertion.</p>
          <p>The batch insertion process occurs in two steps. First, nodes in a batch are independently connected to their nearest neighbors within the existing graph. These connections are established in parallel, as nodes within a batch do not share dependencies, preventing conflicts during edge creation. This step ensures that all newly added nodes are integrated into the graph without disrupting its structure. Second, reversed edges are added to the graph. A reversed edge is defined as an edge whose source and destination vertices are swapped, ensuring bidirectional connectivity between nodes. For example, if an edge connects a source node u to a destination node v, the reversed edge connects v back to u. During this step, the neighbors of each newly added node are updated with reversed edges, completing the bidirectional connections required for robust graph-based indexing.</p>
          <p>By separating the insertion of neighbors and the addition of reversed edges into distinct steps, batch insertion guarantees structural consistency while leveraging parallel processing. This approach is particularly effective in large-scale datasets, as it minimizes computational overhead while maintaining the integrity and quality of the graph. Moreover, the absence of interdependencies between batches ensures that the graph remains stable throughout the construction process, enabling efficient and scalable graph-based ANN indexing.</p>
        </sec>
      </sec>
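The two-phase batch insertion described above can be sketched as follows (a minimal Python illustration, not ParlayANN's actual code; `neighbor_fn` is a hypothetical stand-in for the greedy-search-plus-pruning step that selects each new node's neighbors):

```python
def batch_insert(graph, batch, neighbor_fn):
    # Phase 1: forward edges. Each new node is connected to its candidate
    # neighbors in the existing graph. Nodes in a batch share no
    # dependencies, so this loop is conceptually parallel.
    for node in batch:
        graph[node] = list(neighbor_fn(node))
    # Phase 2: reversed edges. If node u points to v, add an edge from v
    # back to u so that connectivity is bidirectional.
    for node in batch:
        for nbr in graph[node]:
            back = graph.setdefault(nbr, [])
            if node not in back:
                back.append(node)
    return graph

g = {0: [1], 1: [0]}
batch_insert(g, [2], lambda n: [0])  # insert node 2 with chosen neighbor 0
print(g)  # node 0 now also points back to node 2
```

Separating the two phases is what keeps phase 1 conflict-free: no node's neighbor list is mutated by another node in the same batch until all forward edges exist.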
      <sec id="sec-2-3">
        <title>3. ScaDANN Method</title>
        <sec id="sec-2-3-1">
          <title>3.1. Grid partitioning</title>
          <p>Figure 3 illustrates the structure of a graph ANN index in a grid format, where each node in the graph consists of a base vector and a neighbor list. Grid partitioning divides large-scale datasets into grid blocks, with rows representing source vertices and columns representing destination vertices. This approach organizes datasets into manageable segments, enabling memory- and disk-efficient graph construction. For example, constructing a Vamana graph for a 1-billion-point dataset with 100 dimensions, which requires 400 GB for vector storage alone, is infeasible under memory constraints of 256 GB or 512 GB. Grid partitioning, however, ensures scalability by dividing the dataset into grid blocks, making it feasible to construct the graph even when the dataset size exceeds the available memory.</p>
          <p>Each grid block captures intra- or inter-partition relationships. A grid block whose source and destination partitions are the same forms a subgraph, which follows the Vamana graph structure. These diagonal blocks represent subgraphs within individual partitions, while off-diagonal blocks denote connections between partitions. For example, the red block (2, 1) in Figure 3 corresponds to connections from partition 2 (green) to partition 1 (blue). This structure improves scalability and ensures conflict-free parallel processing during node insertion.</p>
        </sec>
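The grid layout can be illustrated with a small sketch (Python; a simple equal-sized contiguous partitioning is assumed for clarity, whereas the paper sizes partitions dynamically from the memory budget). Each edge (source, destination) falls into exactly one grid block; row equals column for intra-partition blocks, and differing row and column mark inter-partition edges.

```python
def partition_of(node_id, num_nodes, num_partitions):
    # Assumption for illustration: nodes are split into equal-sized,
    # contiguous partitions.
    size = -(-num_nodes // num_partitions)  # ceiling division
    return node_id // size

def grid_block(src, dst, num_nodes, num_partitions):
    # Row is the partition of the source vertex, column the partition of
    # the destination vertex. Equal row and column means a diagonal
    # (intra-partition) block following the Vamana structure.
    row = partition_of(src, num_nodes, num_partitions)
    col = partition_of(dst, num_nodes, num_partitions)
    return row, col

# 8 nodes, 2 partitions: nodes 0-3 in partition 0, nodes 4-7 in partition 1.
print(grid_block(5, 2, num_nodes=8, num_partitions=2))  # off-diagonal: (1, 0)
print(grid_block(1, 3, num_nodes=8, num_partitions=2))  # diagonal: (0, 0)
```

Because every edge maps to one block, inserts touching different blocks can proceed in parallel without conflicts, which is the property the section relies on.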
        <sec id="sec-2-3-2">
          <title>3.2. Overlapping block-level insertion</title>
          <p>The overlapping block-level insertion method incrementally constructs the graph by processing data in grid block units. This method employs overlapping blocks, where a block from the previous step is reused in the next step, allowing two blocks to be processed simultaneously. As detailed in Algorithm 1, the process begins with the construction of the graph for the first partition (line 3), which generates the subgraph for the first partition, ensuring that the intra-partition edges are established within the initial grid block.</p>
          <p>In subsequent steps, pairs of overlapping blocks are processed to extend the graph. For each step, the data corresponding to the current blocks is loaded into memory to construct the graph (lines 9–11). In particular, line 11 involves filling four grid blocks by constructing the subgraphs for two partitions and establishing connections between them. This creates intra- and inter-partition edges in a single operation, eliminating the need for a separate, computationally expensive merge step over disjoint subgraphs. Additionally, the overlapping mechanism significantly reduces I/O overhead by combining insertion and merging, which are traditionally distinct steps.</p>
          <p>Figure 2 illustrates the construction process. Initially, blocks 0 and 1 are loaded into memory, and edges between the two blocks are created. In the second step, block 0 is offloaded, block 1 is retained, and block 2 is loaded; the graph is extended based on connections between blocks 1 and 2. Finally, in the third step, block 1 is offloaded, block 2 is retained, and block 0 is reloaded to extend the graph further by connecting blocks 2 and 0. This iterative process efficiently captures all relevant connections between blocks, enabling the construction of a unified graph while minimizing memory usage and I/O costs.</p>
          <p>Figure 2: Overlapping block-level insertion process, where each node stores a base vector and a neighbor list.</p>
        </sec>
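The three-step walkthrough above generalizes to a simple schedule: step i processes blocks (i, i+1 mod B), so each step reuses one resident block and reads only one new block from disk. A small sketch (illustrative Python, not the paper's Algorithm 1):

```python
def overlapping_schedule(num_blocks):
    # Each step pairs the block kept from the previous step with one
    # newly loaded block, wrapping around at the end.
    return [(i, (i + 1) % num_blocks) for i in range(num_blocks)]

def simulate_io(num_blocks):
    # Count disk loads: a block is read only when it is not already
    # resident from the previous step.
    resident, loads = set(), 0
    for a, b in overlapping_schedule(num_blocks):
        for blk in (a, b):
            if blk not in resident:
                loads += 1
        resident = {a, b}  # offload everything except the current pair
    return loads

print(overlapping_schedule(3))  # [(0, 1), (1, 2), (2, 0)]
print(simulate_io(3))           # 4 loads, versus 6 if every pair were read fresh
```

The simulation makes the I/O claim concrete: with B blocks, overlap needs roughly B+1 block reads instead of 2B for independently loaded pairs.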
        <sec id="sec-2-3-3">
          <title>3.3. Grid block merge</title>
          <p>The grid block merge process establishes connections between the two blocks loaded into memory and ensures bidirectional connectivity through reverse edges. Reverse edges, where the source and destination vertices are swapped, maintain symmetric inter-block connections, ensuring consistent and efficient search performance.</p>
          <p>In conventional methods, merging requires all data to be loaded into memory to calculate distances between nodes. ScaDANN overcomes this limitation by storing distance information directly within the neighbor list of each node, so that a block retains the neighbor list, with distances, formed in earlier steps. For example, after step 1, node 5 in block 1 (green) has already stored nodes 3 and 2 from block 0 (yellow) as its neighbors. This allows distance comparisons between existing and new neighbors without reloading offloaded data, reducing memory usage and improving scalability.</p>
          <p>Figure 4 illustrates the merging process during step 2 in Figure 2, where blocks 1 and 2 are loaded into memory while block 0 remains stored on disk. When updating the neighbor list of node 5, the algorithm uses the pre-stored distance data from step 1 to compare block 0 neighbors with block 2 neighbors, avoiding redundant distance calculations. As a result, node 5's neighbors are updated to 11, 3, and 2.</p>
          <p>The merging process follows the Vamana graph construction method, incorporating a greedy search to identify candidate neighbors and a robust pruning step to refine the neighbor list. Inter-block edges are established by adding neighbors from the second block to the neighbor list of the first block. Reverse edges are then added, ensuring bidirectional connectivity, where nodes in the first block are also linked as neighbors of nodes in the second block. This comprehensive approach ensures a robust and efficient merging process.</p>
        </sec>
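Because each neighbor entry carries its distance, merging two candidate lists needs no access to offloaded vectors. A minimal sketch of this idea (illustrative Python; entries are (distance, id) pairs, and the node ids mirror the node-5 example above, with invented distance values):

```python
import heapq

def merge_neighbor_lists(current, incoming, max_degree):
    # Keep the best distance seen for each neighbor id, then retain the
    # max_degree closest. Distances are pre-stored alongside the ids, so
    # no vector from an offloaded block has to be reloaded to compare.
    best = {}
    for dist, nid in current + incoming:
        if best.get(nid, float("inf")) > dist:
            best[nid] = dist
    pairs = [(dist, nid) for nid, dist in best.items()]
    return heapq.nsmallest(max_degree, pairs)

# Node 5 already stores neighbors 3 and 2 (from offloaded block 0);
# block 2 contributes candidates 11 and 7.
current = [(0.9, 3), (1.2, 2)]
incoming = [(0.4, 11), (1.5, 7)]
print(merge_neighbor_lists(current, incoming, max_degree=3))
# neighbor ids 11, 3, and 2 survive, matching the example in the text
```

The paper's actual merge additionally runs greedy search and robust pruning over these candidates; the sketch shows only the stored-distance comparison that avoids reloading block 0.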
        <sec id="sec-2-3-5">
          <title>3.4. Complexity</title>
          <p>Table 1 compares the time and space complexities of Vamana, DiskANN, and ScaDANN in the index-building process, focusing on scalability and efficiency. The space complexity considers storage requirements, including neighbor lists. Vamana constructs a single large-scale graph, resulting in a time complexity of O(n^1.16) and a space complexity of O(n · R). Since it does not partition the dataset, the entire graph must fit in memory, making it less scalable for large-scale data. Both DiskANN and ScaDANN partition the dataset to construct subgraphs. Partitioning reduces memory usage and enables disk-based processing. The time complexity is inversely proportional to the number of partitions P, as increasing P allows for more localized graph construction, reducing the computational cost.</p>
          <p>In DiskANN, each node is assigned to multiple overlapping partitions (typically l = 2), as shown in Figure 1. Since DiskANN performs clustering to determine partitions, its build time complexity includes the clustering time C. In contrast, ScaDANN utilizes dynamic partitioning, where the data structure size is determined from the dataset size. This process does not require separate clustering and is treated as an internal computational step, which is why it does not appear in the time complexity. Unlike DiskANN, which incurs overhead due to overlapping partitions, ScaDANN eliminates these redundancies, streamlining subgraph construction and merging. This results in a computationally efficient process that scales well with increasing dataset size.</p>
          <p>Table 1: Comparison of time and space complexities for Vamana, DiskANN, and ScaDANN, where n is the dataset size, R is the maximum degree of the graph, l is the number of overlapping partitions per node, P is the number of partitions, and C is the clustering time.
Methods | Build Time | Space
Vamana | O(n^1.16) | O(n · R)
DiskANN | O(n^1.16 · P^−0.84 · l^1.16 + C) | O(n · R · l · P^−1)
ScaDANN | O(P^−0.84 · n^1.16) | O(n · R · P^−1)</p>
        </sec>
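The build-time formulas in Table 1 can be compared numerically (a sketch; constants are ignored and the clustering term C is set to zero, so this reflects only the asymptotic terms):

```python
def build_cost_diskann(n, P, l=2, C=0.0):
    # Table 1: O(n^1.16 * P^-0.84 * l^1.16 + C)
    return (n ** 1.16) * (P ** -0.84) * (l ** 1.16) + C

def build_cost_scadann(n, P):
    # Table 1: O(P^-0.84 * n^1.16)
    return (P ** -0.84) * (n ** 1.16)

ratio = build_cost_diskann(1e9, P=8) / build_cost_scadann(1e9, P=8)
print(round(ratio, 2))  # 2.23: the l^1.16 overlap factor, roughly a 2x gap
```

With C dropped, the ratio reduces to l^1.16, which is consistent with the roughly twofold subgraph-construction speedup reported in the experiments.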
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <sec id="sec-3-1">
        <title>4.1. Experimental setup</title>
        <p>The experiments were conducted to evaluate the performance of ScaDANN under controlled conditions. The hardware environment included two Gold 6326 16-core processors and 1.9 TB of main memory, while the software environment was based on Ubuntu 20.04.6 LTS. To simulate memory-constrained scenarios, the available memory was deliberately restricted to 256 GB during the experiments. Under these constraints, both Vamana and ParlayANN encountered Out of Memory (OOM) errors, making them unsuitable for comparison. As a result, our evaluation focused on comparing ScaDANN with DiskANN, a state-of-the-art disk-based ANN indexing method.</p>
        <p>Table 2 provides a summary of the datasets used in the experiments, including their data types and dimensions. The datasets were BIGANN, Microsoft SPACEV, Yandex DEEP, and Yandex Text-to-Image, each posing unique challenges for ANN indexing. BIGANN consists of 1 billion 128-dimensional SIFT descriptors from a large-scale image dataset, serving as a benchmark for high-dimensional similarity search. Microsoft SPACEV includes 1 billion 100-dimensional document and query vectors reflecting real-world web search scenarios. Yandex DEEP comprises 1 billion 96-dimensional image descriptors designed to evaluate deep learning-based image representations. Yandex Text-to-Image contains 1 billion multi-modal embeddings, where image embeddings serve as the database and text embeddings form the query set, representing a cross-modal retrieval task with distinct distributions for database and query vectors.</p>
      </sec>
      <sec id="sec-3-1-1">
        <title>4.2. Results and Analysis</title>
        <sec>
          <title>4.2.1. Billion-scale index building in limited memory</title>
          <p>Figure 5 presents the results of indexing at billion scale under a 256 GB memory constraint. The top plot in Figure 5 illustrates the index building time across different datasets. ScaDANN significantly reduced the index-building time compared to DiskANN across all datasets. For example, on the BIGANN dataset, ScaDANN completed the indexing task in 10.7 hours, compared to 16.5 hours for DiskANN, achieving approximately a 35% reduction in build time. This efficiency improvement can be attributed to ScaDANN's optimized data partitioning and overlapping block techniques, which minimize disk I/O and improve memory access patterns.</p>
          <p>The bottom plot in Figure 5 shows the memory usage during indexing. Both DiskANN and ScaDANN employed dynamic partitioning, which adjusts the number of partitions based on available memory capacity and efficiently divides the data. Since the graph size and memory usage vary depending on the degree of the neighbor list, dynamic partitioning enables optimized partitioning by adapting to these variations. Notably, ScaDANN required a higher number of partitions than DiskANN. This is because ScaDANN stores additional distance information within the graph, which increases memory consumption per block. As a result, ScaDANN divides the data into more partitions than DiskANN to accommodate the additional memory usage. For instance, on the BIGANN dataset, ScaDANN used around 200 GB of memory for indexing, while DiskANN remained under the 256 GB limit but required fewer partitions to store its graph.</p>
          <p>We conducted a detailed comparison of the index building process with DiskANN, as illustrated in Figure 6. In the subgraph construction phase, our method completed approximately twice as fast as DiskANN, with the BIGANN dataset showing a significant speedup. However, the subgraph merging phase took longer in our approach. This is due to the finer-grained distance comparisons performed during the merging process, which operates at the level of grid blocks. Unlike DiskANN, which uses a more straightforward merging process, ScaDANN's overlapping block approach requires additional computations during the merging phase to ensure accurate distance calculations between partitioned blocks.</p>
          <p>In terms of I/O costs, our method demonstrated greater efficiency. This improvement can be attributed to the use of overlapping block techniques, which optimize data loading between memory and disk. Specifically, ScaDANN's approach minimizes unnecessary disk reads and efficiently manages memory, leading to a faster construction process despite the increased memory overhead. This is evident in the time breakdown in Figure 6, where ScaDANN exhibits a more balanced distribution of time between partitioning, subgraph building, and merging, compared to DiskANN, which spends a significant portion of its time in I/O operations.</p>
          <p>Figure 5: Billion-scale index building experiment results, including building time and memory usage, conducted under a 256 GB memory constraint. The memory usage values indicate the number of partitions dynamically determined during the indexing process.</p>
          <p>Figure 6: Time taken for each step in the building process, categorized into partitioning, subgraph building, merging, and I/O time for comparison.</p>
        </sec>
      </sec>
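Dynamic partitioning as described, with more partitions needed when per-node state (vectors, neighbor ids, and, for ScaDANN, stored distances) is larger, can be approximated with a back-of-the-envelope sketch (illustrative Python; the 4-byte costs and the sizing rule are assumptions for illustration, not the paper's exact model):

```python
def estimate_partitions(num_vectors, dim, max_degree, mem_budget_bytes,
                        store_distances):
    # Per-node footprint: the vector itself (4-byte floats), neighbor ids
    # (4-byte ints), and, when distances are pre-stored as in ScaDANN,
    # one extra 4-byte float per neighbor.
    per_node = dim * 4 + max_degree * 4
    if store_distances:
        per_node += max_degree * 4
    total = num_vectors * per_node
    return max(1, -(-total // mem_budget_bytes))  # ceiling division

GB = 1024 ** 3
args = dict(num_vectors=10 ** 9, dim=128, max_degree=64,
            mem_budget_bytes=256 * GB)
print(estimate_partitions(store_distances=False, **args))  # DiskANN-style layout
print(estimate_partitions(store_distances=True, **args))   # stored distances need more partitions
```

Under these assumed byte costs the distance-storing layout needs one more partition for the same 256 GB budget, mirroring the observation that ScaDANN requires more partitions than DiskANN.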
      <sec id="sec-3-2">
        <title>4.2.2. Disk-based search performance</title>
        <p>Figure 7 presents the comparison of disk-based search performance between ScaDANN and DiskANN. The experiments were carried out using the disk search method for evaluation. ScaDANN demonstrated better search performance than DiskANN for most datasets in terms of QPS, especially for the BIGANN-1B dataset. As shown in the graph, ScaDANN achieved a higher QPS while maintaining competitive recall rates. For example, ScaDANN processed queries at a faster rate than DiskANN for BIGANN-1B and MSSPACEV-1B, achieving a significant speedup while still maintaining similar or slightly better recall at top-10 results. However, for the DEEP-1B dataset, ScaDANN showed a slight performance drop when Recall@10 exceeded 95. This discrepancy can be attributed to the disk search method used, which is specifically optimized for DiskANN.</p>
      </sec>
      <sec id="sec-3-2-1">
        <title>4.3. Ablation study</title>
        <sec>
          <title>4.3.1. Impact of partition size on index construction</title>
          <p>Figure 8 illustrates the effects of varying the number of partitions on the BIGANN 100M dataset. Increasing the number of partitions led to a significant increase in building time, with approximately a 70% increase per step. However, RAM usage decreased by around 40% per step, demonstrating the trade-off between memory efficiency and construction time. Despite the changes in partition numbers, there were no significant differences observed in QPS relative to Recall@10 across the tested configurations. This indicates that increasing the partition number primarily affects the construction phase without adversely impacting search performance.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>4.3.2. In-memory graph construction experiments</title>
        <p>The in-memory graph construction experiments on BIGANN-100M and Yandex Text-to-Image-100M compare ParlayANN, ScaDANN, and DiskANN in terms of indexing time and query performance. Since ParlayANN encounters out-of-memory (OOM) issues at the billion scale, the experiments were conducted on 100 million data points, without partitioning for ScaDANN and DiskANN. ScaDANN achieves indexing speeds comparable to or faster than ParlayANN by storing distances within the graph, reducing redundant computations. Both methods also benefit from batch insertion, leading to faster construction than DiskANN. ScaDANN not only achieves a faster build time but also maintains strong search performance, striking a balance between efficient indexing and high query throughput.</p>
        <p>Figure 9: In-memory experimental results on BIGANN 100M and Yandex Text-to-Image 100M. The numbers in parentheses next to each method indicate the index building time in minutes.</p>
      </sec>
      <sec>
        <title>5. Conclusion</title>
        <p>This study proposed ScaDANN, a scalable disk-based graph indexing method designed for billion-scale ANN datasets in memory-constrained environments. By introducing overlapping block-level insertion and grid block merge, ScaDANN effectively mitigates the high memory demands and inefficiencies of existing methods. These techniques enable seamless integration of blocks and efficient merging of partitioned graphs while minimizing I/O overhead and leveraging stored distance information.</p>
        <p>Experimental results demonstrated that ScaDANN achieves up to 1.5× faster index construction and 1.4× higher query throughput than DiskANN, while maintaining competitive accuracy across diverse datasets. Furthermore, ScaDANN supports the construction of single indexes 10 times larger than those of existing approaches within the same memory constraints, making it a robust solution for large-scale ANN indexing. This work establishes ScaDANN as an efficient and scalable approach, with significant potential for real-world applications requiring high-performance ANN indexing.</p>
      </sec>
      <sec id="sec-3-4">
        <title>DiskANN. ScaDANN not only achieves a faster build time ANN indexing. This work establishes ScaDANN as an efibut also maintains strong search performance, striking acient and scalable approach, with significant potential for balance between eficient indexing and high query through- real-world applications requiring high-performance ANN put. indexing.</title>
        <p>[1] W. Yu, C. Zhu, Z. Li, Z. Hu, Q. Wang, H. Ji, M. Jiang, A survey of knowledge-enhanced text generation, ACM Computing Surveys 54 (11s) (2022) 1–38.</p>
        <p>[2] H. Jin, Y. Zhang, D. Meng, J. Wang, J. Tan, A comprehensive survey on process-oriented automatic text summarization with exploration of llm-based methods, arXiv preprint arXiv:2403.02901 (2024).</p>
        <p>[3] Y. Zhuang, Y. Yu, K. Wang, H. Sun, C. Zhang, Toolqa: A dataset for llm question answering with external tools, Advances in Neural Information Processing Systems 36 (2023) 50117–50143.</p>
        <p>[4] M. Cascella, J. Montomoli, V. Bellini, E. Bignami, Evaluating the feasibility of chatgpt in healthcare: an analysis of multiple clinical and research scenarios, Journal of medical systems 47 (1) (2023) 33.</p>
        <p>[5] M. S. Orenstrakh, O. Karnalim, C. A. Suarez, M. Liut, Detecting llm-generated text in computing education: Comparative study for chatgpt cases, in: 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), IEEE, 2024, pp. 121–126.</p>
        <p>[6] K. Pandya, M. Holia, Automating customer service using langchain: Building custom open-source gpt chatbot for organizations, arXiv preprint arXiv:2310.05421 (2023).</p>
        <p>[7] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, et al., A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions, ACM Transactions on Information Systems (2023).</p>
        <p>[8] Y. Yao, J. Duan, K. Xu, Y. Cai, Z. Sun, Y. Zhang, A survey on large language model (llm) security and privacy: The good, the bad, and the ugly, High-Confidence Computing (2024) 100211.</p>
        <p>[9] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive nlp tasks, Advances in Neural Information Processing Systems 33 (2020) 9459–9474.</p>
        <p>[10] J. J. Pan, J. Wang, G. Li, Survey of vector database management systems, The VLDB Journal 33 (5) (2024) 1591–1615.</p>
        <p>[11] Q. Chen, B. Zhao, H. Wang, M. Li, C. Liu, Z. Li, M. Yang, J. Wang, Spann: Highly-efficient billion-scale approximate nearest neighborhood search, Advances in Neural Information Processing Systems 34 (2021) 5199–5212.</p>
        <p>[12] J. V. Munoz, M. A. Gonçalves, Z. Dias, R. d. S. Torres, Hierarchical clustering-based graphs for large scale approximate nearest neighbor search, Pattern Recognition 96 (2019) 106970.</p>
        <p>[13] S. Yu, J. Engels, Y. Huang, J. Shun, Pecann: Parallel efficient clustering with graph-based approximate nearest neighbor search, arXiv preprint arXiv:2312.03940 (2023).</p>
        <p>[14] H. Jegou, M. Douze, C. Schmid, Product quantization for nearest neighbor search, IEEE transactions on pattern analysis and machine intelligence 33 (1) (2010) 117–128.</p>
        <p>[15] Y. Kalantidis, Y. Avrithis, Locally optimized product quantization for approximate nearest neighbor search, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 2321–2328.</p>
        <p>[16] A. Babenko, V. Lempitsky, The inverted multi-index, IEEE transactions on pattern analysis and machine intelligence 37 (6) (2014) 1247–1260.</p>
        <p>[17] W. Dong, C. Moses, K. Li, Efficient k-nearest neighbor graph construction for generic similarity measures, in: Proceedings of the 20th international conference on World wide web, 2011, pp. 577–586.</p>
        <p>[18] C. Fu, C. Xiang, C. Wang, D. Cai, Fast approximate nearest neighbor search with the navigating spreading-out graph, arXiv preprint arXiv:1707.00143 (2017).</p>
        <p>[19] Y. Malkov, A. Ponomarenko, A. Logvinov, V. Krylov, Approximate nearest neighbor algorithm based on navigable small world graphs, Information Systems 45 (2014) 61–68.</p>
        <p>[20] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, IEEE transactions on pattern analysis and machine intelligence 42 (4) (2018) 824–836.</p>
        <p>[21] J. Chen, H.-r. Fang, Y. Saad, Fast approximate knn graph construction for high dimensional data via recursive lanczos bisection, Journal of Machine Learning Research 10 (9) (2009).</p>
        <p>[22] M. Wang, X. Xu, Q. Yue, Y. Wang, A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search, arXiv preprint arXiv:2101.12631 (2021).</p>
        <p>[23] K. Echihabi, K. Zoumpatianos, T. Palpanas, New trends in high-d vector similarity search: ai-driven, progressive, and distributed, Proceedings of the VLDB Endowment 14 (12) (2021) 3198–3201.</p>
        <p>[24] H. V. Simhadri, G. Williams, M. Aumüller, M. Douze, A. Babenko, D. Baranchuk, Q. Chen, L. Hosseini, R. Krishnaswamy, G. Srinivasa, et al., Results of the neurips'21 challenge on billion-scale approximate nearest neighbor search, in: NeurIPS 2021 Competitions and Demonstrations Track, PMLR, 2022, pp. 177–189.</p>
        <p>[25] S. Jayaram Subramanya, F. Devvrit, H. V. Simhadri, R. Krishnaswamy, R. Kadekodi, Diskann: Fast accurate billion-point nearest neighbor search on a single node, Advances in Neural Information Processing Systems 32 (2019).</p>
        <p>[26] S. Arya, D. M. Mount, Approximate nearest neighbor queries in fixed dimensions, in: SODA, Vol. 93, Citeseer, 1993, pp. 271–280.</p>
        <p>[27] M. D. Manohar, Z. Shen, G. Blelloch, L. Dhulipala, Y. Gu, H. V. Simhadri, Y. Sun, Parlayann: Scalable and deterministic parallel graph-based approximate nearest neighbor search algorithms, in: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024, pp. 270–285.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>