<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hiding Latencies in Network-Based Image Loading for Deep Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francesco Versaci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Busonera</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CRS4 - Center for Advanced Studies, Research and Development</institution>
          ,
          <addr-line>Cagliari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <issue>0</issue>
      <abstract>
<p>In the last decades, the computational power of GPUs has grown exponentially, allowing current deep learning (DL) applications to handle increasingly large amounts of data at a progressively higher throughput. However, network and storage latencies cannot decrease at a similar pace due to physical constraints, leading to data stalls and creating a bottleneck for DL tasks. Additionally, managing vast quantities of data and their associated metadata has proven challenging, hampering and slowing the productivity of data scientists. Moreover, existing data loaders have limited network support, necessitating, for maximum performance, that data be stored on local filesystems close to the GPUs, overloading the storage of computing nodes. In this paper we propose a strategy, aimed at DL image applications, to address these challenges by: storing data and metadata in fast, scalable NoSQL databases; connecting the databases to state-of-the-art loaders for DL frameworks; and enabling high-throughput data loading over high-latency networks through our out-of-order, incremental prefetching techniques. To evaluate our approach, we showcase our implementation and assess its data loading capabilities through local, medium- and high-latency (intercontinental) experiments.</p>
      </abstract>
      <kwd-group>
        <kwd>Data loading</kwd>
        <kwd>Deep learning</kwd>
        <kwd>High-Throughput</kwd>
        <kwd>Latency optimizations</kwd>
        <kwd>Scalable storage</kwd>
        <kwd>Image classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Over the last two decades, the rapid increase in GPU computational power has transformed the field
of machine learning. This surge in processing capabilities has allowed deep learning (DL) models to
manage vast datasets and conduct intricate computations with unmatched efficiency. Consequently, DL
techniques have gained widespread traction across various sectors, leading to advancements in areas
such as cybersecurity, natural language processing, bioinformatics, and healthcare, among others [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        However, despite these advancements in computational power, the performance of DL systems is
increasingly constrained by bottlenecks related to data movement and access. While GPUs continue to
achieve remarkable gains in processing throughput, which is matched by an increase in bandwidth for
networking and storage, the latencies in these areas cannot decrease at the same pace due to inherent
physical limitations. These discrepancies result in significant data stalls, as the transfer of data between
storage systems, memory, and processing units becomes a critical bottleneck, leading to inefficient
resource utilization in DL workflows [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
      </p>
      <p>
        Furthermore, managing the vast amounts of data and associated metadata required for DL has
become an increasingly complex challenge, significantly impacting the productivity of data scientists.
As datasets scale from sources such as ImageNet [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which comprises millions of images and spans
hundreds of gigabytes, to more extensive datasets like LAION-5B [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], containing billions of images and
reaching hundreds of terabytes, the exponential growth in data volume places substantial strain on
traditional filesystem-based approaches. These systems are often inflexible and ill-suited for dynamic
data management requirements. The limitations of these approaches are particularly evident when
tasks such as balancing class distributions [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], or adjusting the proportions of training, validation, and
test sets, must be performed, especially when considering metadata to avoid introducing unwanted
biases. The static nature of traditional filesystems hampers the ability to efficiently update or modify
datasets, restricting the flexibility needed for iterative development and fine-tuning of DL models.
      </p>
      <p>
        Moreover, DL applications often involve datasets comprising numerous small inputs that are fully
scanned and randomly permuted at each training epoch. This access pattern diminishes the benefits
of caching since the data is continuously reloaded and reshuffled [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Additionally, high-performance
computing (HPC) storage systems are typically optimized for sequential access to large files, which
contrasts sharply with the small, random access patterns typical of DL workloads. As a result, parallel file
systems such as GPFS or Lustre in HPC environments struggle to handle these workloads efficiently [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
A common mitigation strategy is to maintain local copies of datasets on all compute nodes. However,
this approach introduces substantial storage overhead and capacity constraints. An alternative is to
partition the dataset into shards distributed across nodes, which becomes essential when the dataset
exceeds the storage capacity of individual nodes. Yet, ensuring unbiased data sharding poses significant
challenges [
        <xref ref-type="bibr" rid="ref8">8</xref>
]. This strategy also compromises the ability to perform uniform random shuffling, which
can negatively impact training performance and model accuracy [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>To address these limitations, specialized data loading systems and strategies have been developed, as
elaborated in the next section. However, these methods have critical shortcomings, leaving unresolved
issues such as:
• Decoupling of data and metadata, leading to potential inconsistencies;
• Inflexibility of record file formats, which impose constraints on shuffling;
• Limited support for network-based data loading, resulting in local storage overload on computing
nodes.</p>
      <p>
        In response to these issues, in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] we proposed leveraging scalable NoSQL databases to store both data
and metadata, with preliminary performance evaluations conducted within the DeepHealth Toolkit [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
a DL framework tailored for biomedical applications. In this work, we build upon this concept by
introducing and evaluating a novel data loader. Specifically, our key contributions are as follows:
• We develop an efficient data loader implemented in C++ with a Python API, designed to integrate
seamlessly with Cassandra-compatible NoSQL databases and NVIDIA DALI [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. This loader
supports data loading across the network and is compatible with popular DL frameworks such as
PyTorch and TensorFlow.
• We introduce out-of-order, incremental prefetching techniques that enable high-throughput data
loading, even in high-latency network environments;
• We conduct a comprehensive evaluation of our approach, demonstrating its implementation and
benchmarking its performance through extensive experiments in local, medium and high-latency
settings, comparing it against the state-of-the-art tools.
      </p>
      <p>Note that, as DALI is primarily optimized for image processing, our examples will focus on DL
applications involving images; however, the techniques presented are of general applicability.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>In this section we briefly review record file formats used in DL, followed by a summary of modern data
loading tools and their tradeoffs. Finally, we introduce the NoSQL Cassandra-compatible databases
leveraged by our data loader.</p>
      <sec id="sec-2-1">
        <title>2.1. Record File Formats in DL</title>
        <p>
          Many DL applications, such as image classification, exhibit limited temporal and spatial locality due
to their scan-and-reshuffle data access patterns [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], impeding I/O optimization strategies such as
block reading, which involves retrieving multiple images in a single request. To address this, various
record file formats group data to artificially enforce locality, enabling more efficient prefetching and
reducing file system stress. Examples include TFRecord [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], RecordIO, Beton [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], and MDS (used by
MosaicML [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]).
        </p>
        <p>
          However, optimization algorithms like Stochastic Gradient Descent and Adam require uniform
shuffling of data for optimal convergence [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Block reading conflicts with this requirement, leading
to the implementation of workarounds in data loaders. Furthermore, the primary drawback of the
file-batching approach is that it further rigidifies the dataset. Writing record files is time-consuming,
consumes additional storage space, and makes it even more challenging to modify datasets – an already
cumbersome task when dealing with numerous files in a filesystem.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Data Loading Frameworks</title>
        <p>
          Efficient data loading is critical for DL training performance. Conventional Python-based pipelines suffer
from GIL-induced bottlenecks and underutilized GPUs due to copy overhead between processes [
          <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
          ]
and CPU-bound preprocessing. NVIDIA DALI [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] addresses these issues with GPU-accelerated
loading, decoding, and preprocessing. It integrates with PyTorch and TensorFlow, using an asynchronous
pipeline to reduce idle GPU time. However, networked data access support is still limited and
experimental. TensorFlow tf.data [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] provides an efficient, highly parallel data pipeline API within
TensorFlow, supporting local and Google Cloud-based sources. The experimental tf.data service [19]
further extends this functionality for distributed pipelines by decoupling data processing and training.
It can also be utilized to enable network-based data loading, with the dispatcher and workers supplying
data to the nodes responsible for model training. MosaicML Streaming Dataset [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] is designed
for training on datasets larger than memory by streaming and buffering data dynamically. It uses a
proprietary format (MDS) and metadata to support sharding and determinism. Deep Lake [20] offers a
hybrid between data lakes and vector databases, storing various data types via a proprietary format with
remote access support. Its high-performance data loader is closed-source and cloud-dependent, which
may limit adoption. MADlib [21] takes a database-centric approach, embedding ML operations within
relational database management systems (RDBMS) to reduce data movement. While recent extensions
support DL (e.g., with Keras), its capabilities are nascent and limited to basic image classification with
uncompressed tensors and rigid minibatch layouts.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Cassandra-Compatible Databases</title>
        <p>Our data loader, described in the following section, interfaces with Apache Cassandra or ScyllaDB, two
distributed NoSQL databases designed for scalability and low-latency access. Apache Cassandra is an
open-source, Java-based system known for high availability and geographic distribution. It supports
sub-10 ms response times and is used at scale by companies like Netflix and Spotify. ScyllaDB offers a
C++-based, high-performance alternative compatible with the Cassandra API. It avoids Java GC overhead
and employs a shard-per-core design to achieve lower latency and more predictable performance, with
adopters including Discord, Ticketmaster, and Rakuten.</p>
        <p>3. High-performance network data loading: design and implementation</p>
        <p>Our data loader is designed to address key challenges in deep learning workflows through the following
principles:</p>
        <p>Unified data and metadata storage: Storing data alongside its metadata improves consistency and
reliability, preventing errors during updates.</p>
        <p>Fast, scalable network access: The architecture supports high-throughput, low-latency access across
varied network setups, from local clusters to cloud environments.</p>
        <sec id="sec-2-3-1">
          <title>Figure 1</title>
          <p>[Figure 1: Architecture of the data loader. The original dataset is extracted into data and metadata, which are ingested into a Cassandra-compatible DB. Split generators (via Pandas) produce lists of Cassandra UUIDs; the batch loader retrieves raw files and labels by UUID, and the DALI pipeline (decoding, augmenting, ...) delivers images and labels to PyTorch / TensorFlow.]</p>
        </sec>
        <sec id="sec-2-3-6">
          <p>Full random access: Enables efficient, unbiased shuffling by allowing retrieval of any sample, even
in distributed, large-scale datasets.</p>
          <p>Simplified data partitioning: Decouples storage from partitioning, allowing dynamic splits (e.g.,
cross-validation) and efficient distribution across compute nodes.</p>
          <p>These principles underpin a robust, scalable loader tailored for large-scale DL tasks, especially image
classification. To implement this, we built a plugin integrating NVIDIA DALI with a Cassandra-compatible
NoSQL backend. Written in C++, it avoids Python interprocess communication overhead,
significantly improving throughput and latency.</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>3.1. Data flow and model</title>
        <p>Samples are retrieved by UUID, preprocessed via DALI, and passed to the DL engine
(PyTorch/TensorFlow).</p>
        <p>This architecture enables data loading over the network using TCP, allowing full random access to
datasets. Additionally, by leveraging Cassandra or ScyllaDB for storage, it offers significant advantages
such as easy scalability and secure data access through SSL. It also allows for the definition of roles
with detailed permissions and supports straightforward geographic replication.</p>
        <p>As an example of image classification (which initially motivated our work) that involves a complex set
of metadata, in the appendix we present SQL tables in Listing 1 that are applicable to medical imaging,
specifically for tumor detection in digital pathology [22]. In this context, gigapixel images are divided
into small patches, with each patch being labeled to indicate the degree or severity of cancerous tissue
(Gleason score).</p>
        <p>Our data loader supports tasks beyond standard image classification. It accepts features as generic
BLOBs in any DALI-decodable format (e.g., JPEG, TIFF, PNG) and allows custom decoders. Annotations
are optional and can be integers or BLOBs, supporting use cases like multilabel classification (e.g.,
NumPy tensors) and semantic segmentation (e.g., PNG masks).</p>
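        <p>To make this data model concrete, the following minimal sketch mimics the UUID-keyed layout described above, using an in-memory dict as a stand-in for a Cassandra-compatible table; all names and sample contents are illustrative, not the plugin's actual API:</p>

```python
import uuid

# Stand-in for a Cassandra table: the partition key is a UUID,
# columns hold the encoded feature (BLOB) and its annotation.
table = {}

def insert_sample(data: bytes, label: int) -> uuid.UUID:
    """Ingest one sample; returns the UUID used as its key."""
    key = uuid.uuid4()
    table[key] = {"data": data, "label": label}
    return key

def get_sample(key: uuid.UUID):
    """A single query returns feature and annotation together,
    as the out-of-order prefetching strategy requires."""
    row = table[key]
    return row["data"], row["label"]

# Usage: ingest two samples and retrieve one by UUID.
k1 = insert_sample(b"...jpeg bytes...", 0)
k2 = insert_sample(b"...jpeg bytes...", 3)
blob, lbl = get_sample(k2)
```

        <p>In the real system the BLOB would be any DALI-decodable format (JPEG, TIFF, PNG, ...) and the annotation could itself be a BLOB, e.g. a PNG mask.</p>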
      </sec>
      <sec id="sec-2-5">
        <title>3.2. Automatic Split Creation</title>
        <p>Creating training, validation, and test splits requires metadata-aware strategies to ensure data
independence and class balance. Entity separation – e.g., assigning patients to distinct splits – is essential to
avoid data leakage and ensure generalizability. Adjusting class distributions further complicates the
process.</p>
        <p>Standard approaches, like reorganizing directories or regenerating TFRecord files, are error-prone
and labor-intensive. Our plugin automates split creation and data loading, decoupling storage from
partitioning. This streamlines workflows, eliminates manual overhead, and improves reproducibility,
allowing users to focus on model development.</p>
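        <p>The kind of metadata-aware, entity-separated split described above can be sketched as follows; this is a simplified standard-library illustration with hypothetical field names, not the plugin's actual split generator:</p>

```python
import random

# Each patch carries metadata: its patient and class label.
patches = [{"patch_id": i, "patient": "pat%d" % (i % 10), "label": i % 3}
           for i in range(1000)]

def split_by_patient(patches, fractions=(0.7, 0.15, 0.15), seed=42):
    """Assign whole patients to train/val/test so that no patient
    appears in more than one split (avoiding data leakage)."""
    patients = sorted({p["patient"] for p in patches})
    random.Random(seed).shuffle(patients)
    n = len(patients)
    cut1 = int(fractions[0] * n)
    cut2 = cut1 + int(fractions[1] * n)
    assign = {}
    for i, pat in enumerate(patients):
        if i >= cut2:
            assign[pat] = "test"
        elif i >= cut1:
            assign[pat] = "val"
        else:
            assign[pat] = "train"
    splits = {"train": [], "val": [], "test": []}
    for p in patches:
        splits[assign[p["patient"]]].append(p["patch_id"])
    return splits

splits = split_by_patient(patches)
```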
      </sec>
      <sec id="sec-2-6">
        <title>3.3. Multi-threaded asynchronous data loading</title>
        <p>To optimize data retrieval efficiency, we leverage extensive parallelization. Images are retrieved
asynchronously across multiple threads and TCP connections, thereby minimizing overall latency. Each
TCP connection can handle up to 1024 concurrent requests, with the number of connections being a
tunable parameter. Once the images are retrieved, batches of data are assembled in shared memory,
which eliminates the need for additional copying and accelerates the process.</p>
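        <p>The retrieval scheme just described can be sketched with a thread pool standing in for the parallel TCP connections; here a hypothetical fetch function simulates a single keyed request with a fixed round-trip delay (all names and timings are illustrative):</p>

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(key):
    """Simulate one request/response over a TCP connection."""
    time.sleep(0.01)  # stand-in for the network round-trip time
    return key, b"image-bytes-%d" % key

def load_batch(keys, n_threads=8):
    """Issue all requests at once; requests overlap in flight, so the
    batch takes roughly one round-trip instead of len(keys) of them."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = dict(pool.map(fetch, keys))
    # Assemble the batch in one pass once every response is in.
    return [results[k] for k in keys]

batch = load_batch(list(range(32)))
```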
        <p>In detail, our batch loading workflow starts with the batch loader receiving a list of UUIDs, which it
then uses to send all requests to the Cassandra driver at once. Communications for different batches
are handled concurrently via a thread pool. To manage these requests efficiently, multiple low-level I/O
threads are employed, each utilizing two TCP connections. Results are processed through callbacks,
which minimizes latency by eliminating busy waiting. After all results for a batch have been received,
the output tensor is allocated contiguously in a single operation. Data is then copied into the output
tensor concurrently, again using a thread pool. The batch becomes available for output as soon as the
copying is complete.</p>
        <p>3.4. Prefetching techniques and strategies for high-latency environments</p>
        <p>Efficient data loading is essential for optimizing training throughput, particularly in high-latency
network environments. A common approach leverages the fact that, even if the dataset is shuffled at
the start of each epoch, the permutation is predetermined, so all future requests are known at the
beginning of the epoch. This makes it possible to apply prefetching techniques, enabling subsequent batches
to be retrieved while the GPUs process the current one, thereby minimizing idle time and improving
overall efficiency.</p>
        <p>Our data loader features a prefetching mechanism with a configurable number of batch buffers,
designed to mask latencies of varying magnitudes. Despite this, during preliminary tests over real
high-latency internet connections, we observed significant underperformance compared to internal tests
with artificially induced latency using tc-netem [23]. Traffic analysis indicated significant bandwidth
variability due to multiple TCP connections traversing different network routes, some of which
experienced congestion, resulting in a wide disparity between the best and worst-performing connections.
This directly impacts batch loading times: since images are retrieved in parallel but assembled in order,
the system must wait for the slowest connection before proceeding, which ultimately slows down the
entire process.</p>
        <p>To address this issue, we implemented an out-of-order prefetching strategy. Given that DL training
is robust to uniformly random permutations of the dataset, we can concurrently request multiple
batches (e.g., 8) and reassemble them based on the arrival time of the images they contain. This approach
reduces the impact of slow connections by prioritizing images that arrive first. For this strategy to
function correctly, it is essential that labels are retrieved together with their corresponding features, a
requirement met by our architecture, which retrieves both features and annotations with a single query.</p>
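        <p>The out-of-order reassembly can be illustrated with a toy model: several batches' worth of requests are in flight at once, and output batches are filled purely in arrival order, with each sample carrying its label alongside its image bytes (connection speeds and names are hypothetical):</p>

```python
import queue
import threading
import time

BATCH = 4
IN_FLIGHT = 8  # batches requested concurrently
arrivals = queue.Queue()

def connection(samples, delay):
    """One TCP connection delivering (image, label) pairs at its own pace."""
    for s in samples:
        time.sleep(delay)  # per-item latency on this route
        arrivals.put(s)

# Two connections, one fast and one congested, serving 8 batches' worth.
n = BATCH * IN_FLIGHT
samples = [(b"img-%d" % i, i % 10) for i in range(n)]
threads = [threading.Thread(target=connection, args=(samples[:n // 2], 0.001)),
           threading.Thread(target=connection, args=(samples[n // 2:], 0.01))]
for t in threads:
    t.start()

# Reassemble: a batch ships as soon as BATCH samples have arrived,
# regardless of which connection delivered them.
batches = []
for _ in range(IN_FLIGHT):
    batches.append([arrivals.get() for _ in range(BATCH)])
for t in threads:
    t.join()
```

        <p>Early batches are dominated by the fast connection; the slow route only delays the tail, not every batch, which is exactly the effect exploited by the loader.</p>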
        <p>Further testing over real high-latency internet connections revealed another, second-order, issue:
aggressive filling of the prefetch buffers (e.g., 8 buffers per GPU across 8 GPUs) can cause a burst
of requests that temporarily overwhelms the network capacity, leading to unstable throughput during
buffer filling. To mitigate this, we can stagger the prefetch requests over time. For example, instead of
front-loading all prefetch requests, we can request an extra batch every four regular ones: for every
four batches consumed, five new ones are requested until the buffer is full. This approach limits the
increase in transient throughput to only 25% above the steady-state level.</p>
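        <p>The 25% figure follows directly from the pacing rule: five batches are requested for every four consumed. A minimal simulation of the fill schedule (parameters are illustrative):</p>

```python
def fill_schedule(n_buffers, extra_every=4):
    """Return the number of requests issued per step while the prefetch
    buffers fill: each step consumes one batch and requests a replacement,
    plus one extra request every `extra_every` steps until full."""
    filled, rates, step = 0, [], 0
    while filled != n_buffers:
        step += 1
        requests = 1  # replacement for the consumed batch
        if step % extra_every == 0:
            requests += 1  # extra request grows the buffer by one
            filled += 1
        rates.append(requests)
    return rates

rates = fill_schedule(n_buffers=8)
# Over any 4-step window, 5 requests are issued for 4 batches consumed,
# i.e. the transient rate is 5/4 of steady state: +25%.
```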
        <p>These optimizations collectively enhance throughput and stabilize data loading in high-latency
environments, significantly improving resource utilization, as detailed in Sec. 4.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Software evaluation and results</title>
      <sec id="sec-3-1">
        <title>4.1. Availability and usage</title>
        <p>Our data loader is free software released under the Apache-2.0 license and is accessible on GitHub at
the following URL: https://github.com/crs4/cassandra-dali-plugin/. The repository includes a Docker
container that provides a pre-configured environment with DALI, Cassandra DB, a sample dataset for
experimentation, and comprehensive instructions for conducting further tests. The repository provides
scripts to streamline the ingestion of datasets into Cassandra, either serially or in parallel via Apache
Spark. It also includes examples of multi-GPU DL training in PyTorch, using both plain PyTorch and
PyTorch Lightning. Additionally, it features tools for automatic dataset splitting and high-performance
inference using NVIDIA Triton (https://developer.nvidia.com/triton-inference-server). This setup enables
clients to request inference of images stored on a remote Cassandra server to be processed on a
different GPU-powered remote server.</p>
        <p>Listing 2 in the appendix presents the Python code used to initialize a typical DALI pipeline, which
includes data loading via the standard file reader, decoding, and standard preprocessing steps such as
resizing, cropping, and normalization. To use our custom data loader, which supports data reading from
a Cassandra-compatible DB, one simply needs to replace the standard file reader with our module, as
demonstrated in Listing 3. The modified lines are highlighted in blue.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Comparative analysis at varying latencies</title>
        <p>We evaluated our network data loader alongside two state-of-the-art competitors, leveraging Amazon
EC2 instances located in different geographical regions to introduce varying latencies. Specifically,
we tested data consumption in Oregon using an 8-GPU p4d.24xlarge node, while storing images at
locations characterized by the following latencies:
• Low: Data stored in Oregon, round-trip time (RTT) &lt; 1 ms.
• Medium: Data stored in Northern California, RTT ≃ 20 ms.</p>
        <p>• High: Data stored in Stockholm, Sweden, RTT ≃ 150 ms.</p>
        <p>While storing images on a different continent from the GPUs is not a common practice, the high-latency
scenario was included to highlight the challenges posed by latency-induced bottlenecks. Such challenges
are expected to further intensify in the future, as computational power and bandwidth improve, whereas
latency remains constrained by physical limits.</p>
        <p>For our experiments, we utilized the standard ImageNet-1k dataset (training set: 1,281,167 images,
average image size: 115 kB, total size: 147 GB, batch size: 512 images). The dataset was prepared
in the formats required by each data loader and stored on a single node equipped with four NVMe
SSDs, configured as a single striped logical volume for optimized data access. Specifically, we used
r5dn.24xlarge instances in Oregon and Stockholm, and the similar m6in.24xlarge instance in Northern
California, where the previous instance type was not available. As the RAM capacity of these machines
surpasses the size of our test dataset (i.e., ImageNet-1k), we reserved memory to maintain approximately
only 70 GB of free RAM (i.e., half the size of the dataset). This approach prevents dataset caching in
main memory, ensuring that data is consistently read from the disks during testing. By doing so, we
simulate conditions involving larger datasets that exceed the available memory capacity.</p>
        <p>Three data loaders, summarized in Table 1, were compared in this study: our Cassandra-DALI data
loader (with data stored in high-performance ScyllaDB), MosaicML SD (with data hosted on an S3
MinIO server), and TensorFlow’s tf.data service (with data stored as TFRecords in the filesystem). All
servers were hosted on the same node within Docker containers, with all data residing on the same
logical volume. The test code, including the Dockerfiles, is available in the following GitHub repository,
under the paper branch: https://github.com/fversaci/cassandra-dali-plugin/.</p>
        <p>We conducted two experiments to evaluate performance under varying latencies:
Tight-loop read: This benchmark assesses raw data-loading capabilities by maximizing data reads
without GPU processing or image decoding.</p>
        <p>Training: A standard PyTorch multi-GPU ResNet-50 training workload, including image decoding and
preprocessing steps such as resizing, normalization, and cropping.</p>
        <p>It is important to note that the tight-loop read test utilizes a single data loader, whereas the training
process employs a separate data loader for each GPU. Consequently, the tight-loop read test establishes
an upper bound on the data throughput that can be consumed by a single GPU.</p>
        <p>To minimize AWS usage costs, each data loader was evaluated in a single test run for up to four
epochs. The experimental results presented below compare the performance of the data loaders under
varying latency conditions.</p>
        <p>4.2.1. Tight-loop reading</p>
        <p>As a baseline for comparing network data loaders, we measured the performance of NVIDIA DALI when
reading images stored as TFRecords from the local filesystem, without performing image decoding.
This configuration achieved a data throughput of 7.4 GB/s.</p>
        <p>The p4d.24xlarge instance offers a total network bandwidth of 100 Gb/s; however, only half of this
bandwidth is accessible through its public interface (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-network-bandwidth.html). Thus, the maximum raw bandwidth available
for data transfers in our tests was limited to 50 Gb/s (6.25 GB/s). As shown in Fig. 2a and Tab. 2,
our data loader nearly saturated the available bandwidth when reading from ScyllaDB in both local
and medium-latency settings, achieving a throughput of approximately 6 GB/s in each case. In the
high-latency setting, throughput decreased to around 4 GB/s.</p>
        <p>In contrast, MosaicML SD demonstrated significantly lower performance, with throughput measured
at 326 MB/s in the low-latency setting, 308 MB/s in the medium-latency setting, and 203 MB/s in the
high-latency setting. tf.data service exhibited better performance than MosaicML SD in the low-latency
configuration, achieving a throughput of 437 MB/s. However, its performance degraded substantially in
higher-latency environments, with throughput dropping to 57 MB/s in the medium-latency setting and
just 12 MB/s in the high-latency setting.</p>
        <p>4.2.2. Training</p>
        <p>[Figure 2: (a) Normalized tight-loop reading throughput; (b) normalized train reading throughput, for DALI tfr (no IO), MosaicML SD, tf.data service, and Cassandra-DALI.]</p>
        <p>For a data loader to be effective, its performance must integrate smoothly into the DL pipeline, ensuring
that tensors are efficiently delivered to DL engines without introducing bottlenecks or delays. To
evaluate this, we assessed the performance of data loaders within a standard image classification
training workflow using the ResNet-50 architecture. Due to the differences in training performance
between TensorFlow and PyTorch, we focused exclusively on one framework to ensure a fair comparison.
Given that MosaicML SD outperformed tf.data service in the medium- and high-latency settings in the
previous evaluation, we chose to test it against our data loader in a PyTorch training run.</p>
        <p>To establish a performance upper bound, we first performed training using a fixed input tensor,
thereby eliminating the overhead associated with data loading and preprocessing. This setup enabled
us to measure the maximum achievable data throughput during training. Specifically, we recorded
the number of images processed per second on a single GPU and across all 8 GPUs during multi-GPU
training. Our results indicate that a single NVIDIA A100 GPU consumes about 1450 images/s, while
an 8-GPU configuration reaches 11200 images/s. Given the average ImageNet-1k training image size
of 115 kB, the data loaders must sustain a steady throughput of approximately 1.3 GB/s to meet these
requirements.</p>
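        <p>The 1.3 GB/s requirement is a back-of-the-envelope computation from the figures above:</p>

```python
# Measured GPU consumption rates during training (images/s).
eight_gpu_rate = 11200

# Average ImageNet-1k training image size: 115 kB, in bytes.
img_size = 115 * 1000

# Required steady data-loading throughput for full 8-GPU utilization.
required_gbps = eight_gpu_rate * img_size / 1e9
# roughly 1.3 GB/s, matching the requirement stated in the text
```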
        <p>As demonstrated in Fig. 2b and Tab. 3, the MosaicML SD data loader is unable to sustain the throughput
required to fully utilize all 8 GPUs, achieving 57%, 49%, and 33% of the target throughput under low,
medium, and high-latency conditions, respectively. In contrast, our data loader achieves 94%, 95%, and
96% of the theoretical upper bound in these settings.</p>
        <p>4.3. Impact of prefetching and database choice on data loader performance</p>
        <p>Finally, we conducted further tests to better investigate our data loader’s performance, focusing
specifically on the impact of the proposed out-of-order prefetching optimization. Additionally, we analyzed
how the choice of the underlying database – either Cassandra or ScyllaDB – affects data loading
performance.</p>
        <p>4.3.1. Impact of out-of-order, incremental prefetching</p>
        <p>We evaluated the tight-loop reading performance under high-latency conditions, comparing results
with and without our out-of-order, incremental prefetching optimization.</p>
        <p>High-latency, high-bandwidth internet communications are prone to significant variability in TCP
throughput. In fact, these conditions exacerbate the effects of packet loss, as TCP congestion control
mechanisms respond conservatively to retransmissions and recover slowly due to extended RTTs,
resulting in reduced throughput [24, 25].</p>
        <p>Figure 3 highlights the significant variance in batch loading times between in-order and out-of-order
prefetching. In the in-order case (Figure 3a), when the prefetching queue is exhausted, the system
experiences delays of up to several seconds while waiting for all transfers, including those over congested
routes, to complete. This results in a cyclical pattern: once all transfers are completed, the queue is
refilled, but it is quickly depleted again, triggering a new cycle. In contrast, the out-of-order approach
(Figure 3b) maintains a highly consistent batch loading time, always remaining below 30 ms after the
initial transient period.</p>
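        <p>The core of the out-of-order scheme can be sketched in a few lines: instead of dequeuing prefetched transfers in issue order, a batch is emitted as soon as any batch_size transfers have completed. The Python sketch below is a simplification with a hypothetical fetch() standing in for a database read; it is not the code of our loader, which operates inside the DALI plugin.</p>
```python
# Minimal sketch of out-of-order batch assembly: a batch is dispatched as
# soon as ANY batch_size transfers complete, regardless of which worker
# (connection) served them. fetch() is a hypothetical stand-in.
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(key):
    # Simulate transfers with heterogeneous latencies (congested routes).
    time.sleep(random.uniform(0.001, 0.01))
    return key

def batches_out_of_order(keys, batch_size=4, io_threads=8):
    with ThreadPoolExecutor(max_workers=io_threads) as pool:
        futures = [pool.submit(fetch, k) for k in keys]
        batch = []
        for fut in as_completed(futures):  # completion order, not issue order
            batch.append(fut.result())
            if len(batch) == batch_size:
                yield batch
                batch = []

for b in batches_out_of_order(range(16)):
    print(b)  # items may appear in any order
```
        <p>Because no batch waits for a straggling transfer, batch latency is bounded by the fastest completions still in flight rather than by the slowest connection.</p>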
        <p>Figure 4a illustrates the throughput over time for 16 of the 32 connections utilized by our data loader
when employing the standard in-order prefetching mechanism. The throughput curves exhibit a strong
correlation, highlighting that simultaneous transfers are constrained by the in-order batch assembly
process, since the system must wait for the slowest transfer to finish before dispatching a batch to the
DL pipeline and requesting a new batch from the database. As a result, the throughputs tend to converge
and the aggregated throughput exhibits considerable fluctuations, ranging roughly from 300 MB/s to
1300 MB/s.</p>
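        <p>A toy model makes the straggler effect concrete: when a batch must wait for all of its parallel transfers, its duration is the maximum of the per-connection times, which grows with the number of connections. The numbers below are illustrative, not measurements.</p>
```python
# Toy model of the in-order penalty: each batch waits for the slowest of
# n parallel transfers, so batch time = max of n samples.
import random

random.seed(0)
n_conns, trials = 32, 10_000
mean_t = 0.010                               # assumed 10 ms typical transfer
in_order = 0.0
ideal = 0.0
for _ in range(trials):
    ts = [random.expovariate(1 / mean_t) for _ in range(n_conns)]
    in_order += max(ts) / trials             # wait for the straggler
    ideal += (sum(ts) / n_conns) / trials    # per-transfer average
print(f"avg transfer: {ideal * 1e3:.1f} ms, in-order batch: {in_order * 1e3:.1f} ms")
```
        <p>With 32 exponentially distributed transfer times, the expected maximum is roughly four times the mean, so in-order assembly pays a multiple of the typical transfer time on every batch.</p>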
        <p>In contrast, relaxing this in-order constraint allows transfers to proceed independently, as shown
in Figure 4b. In this approach, batches are formed as soon as a sufficient number of images are
available, irrespective of their originating connections. This optimization significantly enhances overall
throughput, resulting in higher and more consistent performance, with an average throughput
of approximately 4 GB/s.</p>
        <p>4.3.2. Cassandra vs ScyllaDB</p>
        <p>The tight-loop reading test under high-latency conditions was also performed using Cassandra as
the storage backend for images, replacing ScyllaDB. Cassandra achieved a throughput of 1.6 GB/s,
significantly lower than the 4.0 GB/s observed with ScyllaDB, highlighting the superior performance
of the latter. Notably, Cassandra exhibited a disk I/O rate considerably higher than its achieved data
throughput (3.6 GB/s versus 1.6 GB/s), likely attributable to differences in its block-reading strategy
compared to ScyllaDB.</p>
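        <p>From the figures above one can derive Cassandra's effective read amplification, i.e. bytes read from disk per byte delivered:</p>
```python
# Read amplification implied by the measurements reported above.
cassandra_disk_gbps = 3.6   # disk I/O rate observed for Cassandra
cassandra_net_gbps = 1.6    # data throughput achieved by Cassandra
scylla_net_gbps = 4.0       # data throughput achieved by ScyllaDB

amplification = cassandra_disk_gbps / cassandra_net_gbps
print(f"Cassandra read amplification: {amplification:.2f}x")  # 2.25x
ratio = scylla_net_gbps / cassandra_net_gbps
print(f"ScyllaDB/Cassandra throughput ratio: {ratio:.1f}x")   # 2.5x
```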
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>Advances in GPU power have propelled deep learning but also highlighted bottlenecks in data access
and movement, due to the increasing discrepancy between processing throughput and data access
latencies. We address these challenges by integrating scalable NoSQL databases with a high-performance,
image-optimized data loader.</p>
      <p>Our key contribution is a novel loader using advanced prefetching, including out-of-order strategies
to reduce the effects of network latency. By coupling data with metadata in a database-driven design, we
offer a scalable and consistent solution for DL datasets. Experiments under varying latency conditions
show significant gains in throughput and stability over existing methods.</p>
      <p>The source code for our implementation is publicly available, providing a resource for further research
and practical deployment in diverse DL scenarios.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was supported by the Italian Ministry of Health under the program grant H2ub (Hybrid Hub:
Cellular and computation models, micro- and nano-technologies for personalized innovative therapies,
project code T4-AN-10) and by the Regione Autonoma della Sardegna, Sardegna Ricerche, under the
program grant XDATA.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT-4 to check grammar and spelling.
After using this tool, the authors reviewed and edited the content as needed and take full responsibility
for the publication's content.</p>
      <p>[18] A. Ho, [RFC] Polylithic: Enabling multi-threaded dataloading through non-monolithic parallelism,
https://github.com/pytorch/data/issues/1318, 2024.
[19] A. Audibert, Y. Chen, D. Graur, A. Klimovic, J. Šimša, C. A. Thekkath, tf.data service: A case
for disaggregating ML input data processing, in: Proceedings of the 2023 ACM Symposium on
Cloud Computing, SoCC '23, Association for Computing Machinery, New York, NY, USA, 2023,
pp. 358-375. URL: https://doi.org/10.1145/3620678.3624666. doi:10.1145/3620678.3624666.
[20] S. Hambardzumyan, A. Tuli, L. Ghukasyan, F. Rahman, H. Topchyan, D. Isayan, M. McQuade,
M. Harutyunyan, T. Hakobyan, I. Stranic, et al., Deep Lake: A lakehouse for deep learning, in:
Conference on Innovative Data Systems Research (CIDR 2023), 2023.
[21] J. M. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton,</p>
      <p>X. Feng, K. Li, et al., The MADlib analytics library, Proceedings of the VLDB Endowment 5 (2012).
[22] S. Banerji, S. Mitra, Deep learning in histopathology: A review, WIREs Data Mining and
Knowledge Discovery 12 (2022) e1439. URL: https://wires.onlinelibrary.wiley.com/doi/abs/10.1002/widm.1439.
doi:10.1002/widm.1439.
[23] S. Hemminger, et al., Network emulation with NetEm, in: Linux conf au, volume 5, 2005.
[24] D. Katabi, M. Handley, C. Rohrs, Congestion control for high bandwidth-delay product networks,
SIGCOMM Comput. Commun. Rev. 32 (2002) 89-102. URL: https://doi.org/10.1145/964725.633035.
doi:10.1145/964725.633035.
[25] S. Ha, I. Rhee, L. Xu, CUBIC: a new TCP-friendly high-speed TCP variant, SIGOPS Oper. Syst. Rev. 42
(2008) 64-74. URL: https://doi.org/10.1145/1400097.1400105. doi:10.1145/1400097.1400105.
Listing 1 Example of SQL data model for tumor detection</p>
      <p>CREATE TABLE patches.metadata (
    patient_id text,
    slide_num int,   // patients can have several slides
    x int,           // coordinates
    y int,           // within the slide
    label int,       // Gleason score
    patch_id uuid,
    PRIMARY KEY ((patch_id))
);</p>
      <p>CREATE TABLE patches.data (
    patch_id uuid,
    label int,       // Gleason score
    data blob,       // image/tensor file (JPEG, TIFF, NPY, etc.)
    PRIMARY KEY ((patch_id))
);
Listing 2 Initializing DALI pipeline using DALI standard file reader</p>
      <p>@pipeline_def(batch_size=128, num_threads=4, device_id=device_id)
def get_dali_pipeline():
    images, labels = fn.readers.file(name="Reader",
                                     file_root="/data/imagenet/train")
    labels = labels.gpu()
    images = fn.decoders.image(images, device="mixed",
                               output_type=types.RGB)
    images = fn.resize(images, resize_x=256, resize_y=256)
    images = fn.crop_mirror_normalize(images, dtype=types.FLOAT,
                                      output_layout="CHW", crop=(224, 224),
                                      mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                      std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
    return images, labels</p>
      <p>
Listing 3 Initializing DALI using our Cassandra-DALI plugin</p>
      <p>uuids = list_manager.get_list_of_uuids(...)

@pipeline_def(batch_size=128, num_threads=4, device_id=device_id)
def get_dali_pipeline():
    images, labels = fn.crs4.cassandra(name="Reader",
                                       cassandra_ips=["1.2.3.4", "5.6.7.8"],
                                       username="guest", password="test",
                                       table="imagenet.data_train", uuids=uuids,
                                       prefetch_buffers=16, io_threads=8)
    labels = labels.gpu()
    # [...] same as before
    return images, labels</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Alzubaidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Humaidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Dujaili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Al-Shamma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Santamaría</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Fadhel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Al-Amidie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Farhan</surname>
          </string-name>
          ,
          <article-title>Review of deep learning: concepts, cnn architectures, challenges, applications, future directions</article-title>
          ,
          <source>Journal of big Data</source>
          <volume>8</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kuchnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Klimovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Simsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Amvrosiadis</surname>
          </string-name>
          ,
          <article-title>Plumber: Diagnosing and removing performance bottlenecks in machine learning data pipelines</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Marculescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wu</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of Machine Learning and Systems</source>
          , volume
          <volume>4</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>51</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Phanishayee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raniwala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chidambaram</surname>
          </string-name>
          ,
          <article-title>Analyzing and mitigating data stalls in dnn training</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>771</fpage>
          -
          <lpage>784</lpage>
          . URL: https://doi.org/10.14778/3446095.3446100. doi:10.14778/3446095.3446100.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <article-title>Imagenet: A large-scale hierarchical image database</article-title>
          ,
          <source>in: 2009 IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          . doi:10.1109/CVPR.2009.5206848.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Schuhmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Beaumont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vencu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wightman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cherti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Coombes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Katta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mullis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wortsman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schramowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kundurthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Crowson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaczmarczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jitsev</surname>
          </string-name>
          ,
          <article-title>LAION-5B: An open large-scale dataset for training next generation image-text models</article-title>
          , in:
          <string-name>
            <given-names>S.</given-names>
            <surname>Koyejo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Belgrave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oh</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>35</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>25278</fpage>
          -
          <lpage>25294</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bellinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Corizzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Branco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Krawczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Japkowicz</surname>
          </string-name>
          ,
          <article-title>The class imbalance problem in deep learning</article-title>
          ,
          <source>Machine Learning</source>
          <volume>113</volume>
          (
          <year>2024</year>
          )
          <fpage>4845</fpage>
          -
          <lpage>4901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bahn</surname>
          </string-name>
          ,
          <article-title>Analyzing data reference characteristics of deep learning workloads for improving buffer cache performance</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>13</volume>
          (
          <year>2023</year>
          ). URL: https://www.mdpi.com/2076-3417/13/22/12102. doi:10.3390/app132212102.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Schimmelpfennig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Vef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salkhordeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Miranda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Brinkmann</surname>
          </string-name>
          ,
          <article-title>Streamlining distributed deep learning i/o with ad hoc file systems</article-title>
          ,
          <source>in: 2021 IEEE International Conference on Cluster Computing (CLUSTER)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>169</fpage>
          -
          <lpage>180</lpage>
          . doi:10.1109/Cluster48925.2021.00062.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Renggli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wu</surname>
          </string-name>
          , et al.,
            <article-title>Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systems</article-title>
          ,
          <source>The VLDB Journal</source>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Versaci</surname>
          </string-name>
          , G. Busonera,
          <article-title>Scaling deep learning data management with cassandra db</article-title>
          ,
          <source>in: 2021 IEEE International Conference on Big Data (Big Data)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>5301</fpage>
          -
          <lpage>5310</lpage>
          . doi:10.1109/BigData52589.2021.9672005.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cancilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Canalini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bolelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Allegretti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Carrión</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Paredes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Gómez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Leo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Piras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pireddu</surname>
          </string-name>
          , et al.,
          <article-title>The deephealth toolkit: a unified framework to boost biomedical applications</article-title>
          ,
          <source>in: 2020 25th International Conference on Pattern Recognition (ICPR)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>9881</fpage>
          -
          <lpage>9888</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Guirao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Łęcki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lisiecki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Panev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Szołucha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wolant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zientkiewicz</surname>
          </string-name>
          ,
          <article-title>Fast AI Data Preprocessing with NVIDIA DALI</article-title>
          , https://developer.nvidia.com/blog/fast-ai-data-preprocessing-with-nvidia-dali/,
          <year>2019</year>
          . [Online; accessed August-2024].
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Martinez-Noriega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yokota</surname>
          </string-name>
          ,
          <article-title>High-performance data loader for large-scale data processing</article-title>
          ,
          <source>Electronic Imaging</source>
          <volume>36</volume>
          (
          <year>2024</year>
          )
          <fpage>196</fpage>
          -1
          <article-title>-196-1</article-title>
          . URL: https://library.imaging.org/ei/articles/36/12/HPCI-196. doi:10.2352/EI.2024.36.12.HPCI-196.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Murray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Simsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Klimovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Indyk</surname>
          </string-name>
          ,
          <article-title>tf.data: A machine learning data processing framework</article-title>
          ,
          <source>arXiv preprint arXiv:2101.12127</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G.</given-names>
            <surname>Leclerc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Engstrom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Salman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mądry</surname>
          </string-name>
          ,
          <article-title>FFCV: Accelerating training by removing data bottlenecks</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>12011</fpage>
          -
          <lpage>12020</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>The MosaicML Team</surname>
          </string-name>
          , streaming, https://github.com/mosaicml/streaming/,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Meier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <article-title>Reflections on the compatibility, performance, and scalability of parallel python</article-title>
          ,
          <source>in: Proceedings of the 15th ACM SIGPLAN International Symposium on Dynamic Languages, DLS</source>
          <year>2019</year>
          ,
          <article-title>Association for Computing Machinery</article-title>
          , New York, NY, USA,
          <year>2019</year>
          , p.
          <fpage>91</fpage>
          -
          <lpage>103</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>