<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hiding Latencies in Network-Based Image Loading for Deep Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francesco Versaci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Busonera</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CRS4 - Center for Advanced Studies, Research and Development</institution>
          ,
          <addr-line>Cagliari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <issue>0</issue>
      <abstract>
<p>In the last decades, the computational power of GPUs has grown exponentially, allowing current deep learning (DL) applications to handle increasingly large amounts of data at a progressively higher throughput. However, network and storage latencies cannot decrease at a similar pace due to physical constraints, leading to data stalls and creating a bottleneck for DL tasks. Additionally, managing vast quantities of data and their associated metadata has proven challenging, hampering and slowing the productivity of data scientists. Moreover, existing data loaders have limited network support, necessitating, for maximum performance, that data be stored on local filesystems close to the GPUs, overloading the storage of computing nodes. In this paper we propose a strategy, aimed at DL image applications, to address these challenges by: storing data and metadata in fast, scalable NoSQL databases; connecting the databases to state-of-the-art loaders for DL frameworks; and enabling high-throughput data loading over high-latency networks through our out-of-order, incremental prefetching techniques. To evaluate our approach, we showcase our implementation and assess its data loading capabilities through local, medium- and high-latency (intercontinental) experiments.</p>
      </abstract>
      <kwd-group>
        <kwd>Data loading</kwd>
        <kwd>Deep learning</kwd>
        <kwd>High-Throughput</kwd>
        <kwd>Latency optimizations</kwd>
        <kwd>Scalable storage</kwd>
        <kwd>Image classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Over the last two decades, the rapid increase in GPU computational power has transformed the field
of machine learning. This surge in processing capabilities has allowed deep learning (DL) models to
manage vast datasets and conduct intricate computations with unmatched efficiency. Consequently, DL
techniques have gained widespread traction across various sectors, leading to advancements in areas
such as cybersecurity, natural language processing, bioinformatics, and healthcare, among others [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        However, despite these advancements in computational power, the performance of DL systems is
increasingly constrained by bottlenecks related to data movement and access. While GPUs continue to
achieve remarkable gains in processing throughput, which is matched by an increase in bandwidth for
networking and storage, the latencies in these areas cannot decrease at the same pace due to inherent
physical limitations. These discrepancies result in significant data stalls, as the transfer of data between
storage systems, memory, and processing units becomes a critical bottleneck, leading to inefficient
resource utilization in DL workflows [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
      </p>
      <p>
        Furthermore, managing the vast amounts of data and associated metadata required for DL has
become an increasingly complex challenge, significantly impacting the productivity of data scientists.
As datasets scale from sources such as ImageNet [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which comprises millions of images and spans
hundreds of gigabytes, to more extensive datasets like LAION-5B [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], containing billions of images and
reaching hundreds of terabytes, the exponential growth in data volume places substantial strain on
traditional filesystem-based approaches. These systems are often inflexible and ill-suited for dynamic
data management requirements. The limitations of these approaches are particularly evident when
tasks such as balancing class distributions [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], or adjusting the proportions of training, validation, and
test sets, must be performed, especially when considering metadata to avoid introducing unwanted
biases. The static nature of traditional filesystems hampers the ability to efficiently update or modify
datasets, restricting the flexibility needed for iterative development and fine-tuning of DL models.
      </p>
      <p>
        Moreover, DL applications often involve datasets comprising numerous small inputs that are fully
scanned and randomly permuted at each training epoch. This access pattern diminishes the benefits
of caching since the data is continuously reloaded and reshuffled [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Additionally, high-performance
computing (HPC) storage systems are typically optimized for sequential access to large files, which
contrasts sharply with the small, random access patterns typical of DL workloads. As a result, parallel file
systems such as GPFS or Lustre in HPC environments struggle to handle these workloads efficiently [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
A common mitigation strategy is to maintain local copies of datasets on all compute nodes. However,
this approach introduces substantial storage overhead and capacity constraints. An alternative is to
partition the dataset into shards distributed across nodes, which becomes essential when the dataset
exceeds the storage capacity of individual nodes. Yet, ensuring unbiased data sharding poses significant
challenges [
        <xref ref-type="bibr" rid="ref8">8</xref>
]. This strategy also compromises the ability to perform uniform random shuffling, which
can negatively impact training performance and model accuracy [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>To address these limitations, specialized data loading systems and strategies have been developed, as
elaborated in the next section. However, these methods have critical shortcomings, leaving unresolved
issues such as:
• Decoupling of data and metadata, leading to potential inconsistencies;
• Inflexibility of record file formats, which impose constraints on shuffling;
• Limited support for network-based data loading, resulting in local storage overload on computing
nodes.</p>
      <p>
        In response to these issues, in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] we proposed leveraging scalable NoSQL databases to store both data
and metadata, with preliminary performance evaluations conducted within the DeepHealth Toolkit [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
a DL framework tailored for biomedical applications. In this work, we build upon this concept by
introducing and evaluating a novel data loader. Specifically, our key contributions are as follows:
• We develop an efficient data loader implemented in C++ with a Python API, designed to integrate
seamlessly with Cassandra-compatible NoSQL databases and NVIDIA DALI [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. This loader
supports data loading across the network and is compatible with popular DL frameworks such as
PyTorch and TensorFlow.
• We introduce out-of-order, incremental prefetching techniques that enable high-throughput data
loading, even in high-latency network environments;
• We conduct a comprehensive evaluation of our approach, demonstrating its implementation and
benchmarking its performance through extensive experiments in local, medium and high-latency
settings, comparing it against the state-of-the-art tools.
      </p>
      <p>Note that, as DALI is primarily optimized for image processing, our examples will focus on DL
applications involving images; however, the techniques presented are of general applicability.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>In this section we briefly review record file formats used in DL, followed by a summary of modern data
loading tools and their tradeoffs. Finally, we introduce the NoSQL Cassandra-compatible databases
leveraged by our data loader.</p>
      <sec id="sec-2-1">
        <title>2.1. Record File Formats in DL</title>
        <p>
          Many DL applications, such as image classification, exhibit limited temporal and spatial locality due
to their scan-and-reshuffle data access patterns [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], impeding I/O optimization strategies such as
block reading, which involves retrieving multiple images in a single request. To address this, various
record file formats group data to artificially enforce locality, enabling more efficient prefetching and
reducing file system stress. Examples include TFRecord [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], RecordIO, Beton [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], and MDS (used by
MosaicML [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]).
        </p>
        <p>
          However, optimization algorithms like Stochastic Gradient Descent and Adam require uniform
shuffling of data for optimal convergence [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Block reading conflicts with this requirement, leading
to the implementation of workarounds in data loaders. Furthermore, the primary drawback of the
file-batching approach is that it further rigidifies the dataset. Writing record files is time-consuming,
consumes additional storage space, and makes it even more challenging to modify datasets – an already
cumbersome task when dealing with numerous files in a filesystem.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Data Loading Frameworks</title>
        <p>
          Efficient data loading is critical for DL training performance. Conventional Python-based pipelines suffer
from GIL-induced bottlenecks and underutilized GPUs due to copy overhead between processes [
          <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
          ]
and CPU-bound preprocessing. NVIDIA DALI [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] addresses these issues with GPU-accelerated
loading, decoding, and preprocessing. It integrates with PyTorch and TensorFlow, using an asynchronous
pipeline to reduce idle GPU time. However, networked data access support is still limited and
experimental. TensorFlow tf.data [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] provides an efficient, highly parallel data pipeline API within
TensorFlow, supporting local and Google Cloud-based sources. The experimental tf.data service [19]
further extends this functionality for distributed pipelines by decoupling data processing and training.
It can also be utilized to enable network-based data loading, with the dispatcher and workers supplying
data to the nodes responsible for model training. MosaicML Streaming Dataset [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] is designed
for training on datasets larger than memory by streaming and buffering data dynamically. It uses a
proprietary format (MDS) and metadata to support sharding and determinism. Deep Lake [20] offers a
hybrid between data lakes and vector databases, storing various data types via a proprietary format with
remote access support. Its high-performance data loader is closed-source and cloud-dependent, which
may limit adoption. MADlib [21] takes a database-centric approach, embedding ML operations within
relational database management systems (RDBMS) to reduce data movement. While recent extensions
support DL (e.g., with Keras), its capabilities are nascent and limited to basic image classification with
uncompressed tensors and rigid minibatch layouts.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Cassandra-Compatible Databases</title>
        <p>Our data loader, described in the following section, interfaces with Apache Cassandra or ScyllaDB, two
distributed NoSQL databases designed for scalability and low-latency access. Apache Cassandra is an
open-source, Java-based system known for high availability and geographic distribution. It supports
sub-10 ms response times and is used at scale by companies like Netflix and Spotify. ScyllaDB offers a
C++-based, high-performance alternative compatible with the Cassandra API. It avoids Java GC overhead
and employs a shard-per-core design to achieve lower latency and more predictable performance, with
adopters including Discord, Ticketmaster, and Rakuten.</p>
        <p>3. High-performance network data loading: design and implementation</p>
        <p>Our data loader is designed to address key challenges in deep learning workflows through the following
principles:</p>
        <p>Unified data and metadata storage: Storing data alongside its metadata improves consistency and
reliability, preventing errors during updates.</p>
        <p>Fast, scalable network access: The architecture supports high-throughput, low-latency access across
varied network setups, from local clusters to cloud environments.</p>
        <sec id="sec-2-3-1">
          <title>Figure 1</title>
          <p>[Figure 1: Architecture of the data loader. The original dataset is extracted into data and metadata, which are ingested into a Cassandra-compatible DB. Split generators (via Pandas) produce lists of Cassandra UUIDs; the batch loader retrieves raw files and labels by UUID, and the DALI pipeline (decoding, augmenting, ...) delivers images and labels to PyTorch / TensorFlow.]</p>
        </sec>
        <sec id="sec-2-3-6">
          <p>Full random access: Enables efficient, unbiased shuffling by allowing retrieval of any sample, even
in distributed, large-scale datasets.</p>
          <p>Simplified data partitioning: Decouples storage from partitioning, allowing dynamic splits (e.g.,
cross-validation) and efficient distribution across compute nodes.</p>
          <p>These principles underpin a robust, scalable loader tailored for large-scale DL tasks, especially image
classification. To implement this, we built a plugin integrating NVIDIA DALI with a Cassandra-compatible
NoSQL backend. Written in C++, it avoids Python interprocess communication overhead,
significantly improving throughput and latency.</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>3.1. Data flow and model</title>
        <p>Samples are retrieved by UUID, preprocessed via DALI, and passed to the DL engine
(PyTorch/TensorFlow).</p>
        <p>This architecture enables data loading over the network using TCP, allowing full random access to
datasets. Additionally, by leveraging Cassandra or ScyllaDB for storage, it offers significant advantages
such as easy scalability and secure data access through SSL. It also allows for the definition of roles
with detailed permissions and supports straightforward geographic replication.</p>
        <p>As an example of image classification (which initially motivated our work) that involves a complex set
of metadata, in the appendix we present SQL tables in Listing 1 that are applicable to medical imaging,
specifically for tumor detection in digital pathology [22]. In this context, gigapixel images are divided
into small patches, with each patch being labeled to indicate the degree or severity of cancerous tissue
(Gleason score).</p>
        <p>Our data loader supports tasks beyond standard image classification. It accepts features as generic
BLOBs in any DALI-decodable format (e.g., JPEG, TIFF, PNG) and allows custom decoders. Annotations
are optional and can be integers or BLOBs, supporting use cases like multilabel classification (e.g.,
NumPy tensors) and semantic segmentation (e.g., PNG masks).</p>
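        <p>To make this data model concrete, the following minimal sketch mimics the UUID-keyed layout described above, using an in-memory dict as a stand-in for a Cassandra-compatible table; all names and sample contents are illustrative, not the plugin's actual API:</p>

```python
import uuid

# Stand-in for a Cassandra table: the partition key is a UUID,
# columns hold the encoded feature (BLOB) and its annotation.
table = {}

def insert_sample(data: bytes, label: int) -> uuid.UUID:
    """Ingest one sample; returns the UUID used as its key."""
    key = uuid.uuid4()
    table[key] = {"data": data, "label": label}
    return key

def get_sample(key: uuid.UUID):
    """A single query returns feature and annotation together,
    as the out-of-order prefetching strategy requires."""
    row = table[key]
    return row["data"], row["label"]

# Usage: ingest two samples and retrieve one by UUID.
k1 = insert_sample(b"...jpeg bytes...", 0)
k2 = insert_sample(b"...jpeg bytes...", 3)
blob, lbl = get_sample(k2)
```

        <p>In the real system the BLOB would be any DALI-decodable format (JPEG, TIFF, PNG, ...) and the annotation could itself be a BLOB, e.g. a PNG mask.</p>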
      </sec>
      <sec id="sec-2-5">
        <title>3.2. Automatic Split Creation</title>
        <p>Creating training, validation, and test splits requires metadata-aware strategies to ensure data
independence and class balance. Entity separation – e.g., assigning patients to distinct splits – is essential to
avoid data leakage and ensure generalizability. Adjusting class distributions further complicates the
process.</p>
        <p>Standard approaches, like reorganizing directories or regenerating TFRecord files, are error-prone
and labor-intensive. Our plugin automates split creation and data loading, decoupling storage from
partitioning. This streamlines workflows, eliminates manual overhead, and improves reproducibility,
allowing users to focus on model development.</p>
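        <p>The kind of metadata-aware, entity-separated split described above can be sketched as follows; this is a simplified standard-library illustration with hypothetical field names, not the plugin's actual split generator:</p>

```python
import random

# Each patch carries metadata: its patient and class label.
patches = [{"patch_id": i, "patient": "pat%d" % (i % 10), "label": i % 3}
           for i in range(1000)]

def split_by_patient(patches, fractions=(0.7, 0.15, 0.15), seed=42):
    """Assign whole patients to train/val/test so that no patient
    appears in more than one split (avoiding data leakage)."""
    patients = sorted({p["patient"] for p in patches})
    random.Random(seed).shuffle(patients)
    n = len(patients)
    cut1 = int(fractions[0] * n)
    cut2 = cut1 + int(fractions[1] * n)
    assign = {}
    for i, pat in enumerate(patients):
        if i >= cut2:
            assign[pat] = "test"
        elif i >= cut1:
            assign[pat] = "val"
        else:
            assign[pat] = "train"
    splits = {"train": [], "val": [], "test": []}
    for p in patches:
        splits[assign[p["patient"]]].append(p["patch_id"])
    return splits

splits = split_by_patient(patches)
```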
      </sec>
      <sec id="sec-2-6">
        <title>3.3. Multi-threaded asynchronous data loading</title>
        <p>To optimize data retrieval efficiency, we leverage extensive parallelization. Images are retrieved
asynchronously across multiple threads and TCP connections, thereby minimizing overall latency. Each
TCP connection can handle up to 1024 concurrent requests, with the number of connections being a
tunable parameter. Once the images are retrieved, batches of data are assembled in shared memory,
which eliminates the need for additional copying and accelerates the process.</p>
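        <p>The retrieval scheme just described can be sketched with a thread pool standing in for the parallel TCP connections; here a hypothetical fetch function simulates a single keyed request with a fixed round-trip delay (all names and timings are illustrative):</p>

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(key):
    """Simulate one request/response over a TCP connection."""
    time.sleep(0.01)  # stand-in for the network round-trip time
    return key, b"image-bytes-%d" % key

def load_batch(keys, n_threads=8):
    """Issue all requests at once; requests overlap in flight, so the
    batch takes roughly one round-trip instead of len(keys) of them."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = dict(pool.map(fetch, keys))
    # Assemble the batch in one pass once every response is in.
    return [results[k] for k in keys]

batch = load_batch(list(range(32)))
```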
        <p>In detail, our batch loading workflow starts with the batch loader receiving a list of UUIDs, which it
then uses to send all requests to the Cassandra driver at once. Communications for different batches
are handled concurrently via a thread pool. To manage these requests efficiently, multiple low-level I/O
threads are employed, each utilizing two TCP connections. Results are processed through callbacks,
which minimizes latency by eliminating busy waiting. After all results for a batch have been received,
the output tensor is allocated contiguously in a single operation. Data is then copied into the output
tensor concurrently, again using a thread pool. The batch becomes available for output as soon as the
copying is complete.</p>
        <p>3.4. Prefetching techniques and strategies for high-latency environments</p>
        <p>Efficient data loading is essential for optimizing training throughput, particularly in high-latency
network environments. A common approach leverages the fact that, even if the dataset is shuffled at
the start of each epoch, the permutation is predetermined, so all future requests are known at the
beginning of the epoch. This makes it possible to apply prefetching techniques, enabling subsequent batches
to be retrieved while the GPUs process the current one, thereby minimizing idle time and improving
overall efficiency.</p>
        <p>Our data loader features a prefetching mechanism with a configurable number of batch buffers,
designed to mask latencies of varying magnitudes. Despite this, during preliminary tests over real
high-latency internet connections, we observed significant underperformance compared to internal tests
with artificially induced latency using tc-netem [23]. Traffic analysis indicated significant bandwidth
variability due to multiple TCP connections traversing different network routes, some of which
experienced congestion, resulting in a wide disparity between the best and worst-performing connections.
This directly impacts batch loading times: since images are retrieved in parallel but assembled in order,
the system must wait for the slowest connection before proceeding, which ultimately slows down the
entire process.</p>
        <p>To address this issue, we implemented an out-of-order prefetching strategy. Given that DL training
is robust to uniformly random permutations of the dataset, we can concurrently request multiple
batches (e.g., 8) and reassemble them based on the arrival time of the images they contain. This approach
reduces the impact of slow connections by prioritizing images that arrive first. For this strategy to
function correctly, it is essential that labels are retrieved together with their corresponding features, a
requirement met by our architecture, which retrieves both features and annotations with a single query.</p>
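        <p>The out-of-order reassembly can be illustrated with a toy model: several batches' worth of requests are in flight at once, and output batches are filled purely in arrival order, with each sample carrying its label alongside its image bytes (connection speeds and names are hypothetical):</p>

```python
import queue
import threading
import time

BATCH = 4
IN_FLIGHT = 8  # batches requested concurrently
arrivals = queue.Queue()

def connection(samples, delay):
    """One TCP connection delivering (image, label) pairs at its own pace."""
    for s in samples:
        time.sleep(delay)  # per-item latency on this route
        arrivals.put(s)

# Two connections, one fast and one congested, serving 8 batches' worth.
n = BATCH * IN_FLIGHT
samples = [(b"img-%d" % i, i % 10) for i in range(n)]
threads = [threading.Thread(target=connection, args=(samples[:n // 2], 0.001)),
           threading.Thread(target=connection, args=(samples[n // 2:], 0.01))]
for t in threads:
    t.start()

# Reassemble: a batch ships as soon as BATCH samples have arrived,
# regardless of which connection delivered them.
batches = []
for _ in range(IN_FLIGHT):
    batches.append([arrivals.get() for _ in range(BATCH)])
for t in threads:
    t.join()
```

        <p>Early batches are dominated by the fast connection; the slow route only delays the tail, not every batch, which is exactly the effect exploited by the loader.</p>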
        <p>Further testing over real high-latency internet connections revealed another, second-order, issue:
aggressive filling of the prefetch buffers (e.g., 8 buffers per GPU across 8 GPUs) can cause a burst
of requests that temporarily overwhelms the network capacity, leading to unstable throughput during
buffer filling. To mitigate this, we can stagger the prefetch requests over time. For example, instead of
front-loading all prefetch requests, we can request an extra batch every four regular ones: for every
four batches consumed, five new ones are requested until the buffer is full. This approach limits the
increase in transient throughput to only 25% above the steady-state level.</p>
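        <p>The 25% figure follows directly from the pacing rule: five batches are requested for every four consumed. A minimal simulation of the fill schedule (parameters are illustrative):</p>

```python
def fill_schedule(n_buffers, extra_every=4):
    """Return the number of requests issued per step while the prefetch
    buffers fill: each step consumes one batch and requests a replacement,
    plus one extra request every `extra_every` steps until full."""
    filled, rates, step = 0, [], 0
    while filled != n_buffers:
        step += 1
        requests = 1  # replacement for the consumed batch
        if step % extra_every == 0:
            requests += 1  # extra request grows the buffer by one
            filled += 1
        rates.append(requests)
    return rates

rates = fill_schedule(n_buffers=8)
# Over any 4-step window, 5 requests are issued for 4 batches consumed,
# i.e. the transient rate is 5/4 of steady state: +25%.
```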
        <p>These optimizations collectively enhance throughput and stabilize data loading in high-latency
environments, significantly improving resource utilization, as detailed in Sec. 4.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Software evaluation and results</title>
      <sec id="sec-3-1">
        <title>4.1. Availability and usage</title>
        <p>Our data loader is free software released under the Apache-2.0 license and is accessible on GitHub at
the following URL: https://github.com/crs4/cassandra-dali-plugin/. The repository includes a Docker
container that provides a pre-configured environment with DALI, Cassandra DB, a sample dataset for
experimentation, and comprehensive instructions for conducting further tests. The repository provides
scripts to streamline the ingestion of datasets into Cassandra, either serially or in parallel via Apache
Spark. It also includes examples of multi-GPU DL training in PyTorch, using both plain PyTorch and
PyTorch Lightning. Additionally, it features tools for automatic dataset splitting and high-performance
inference using NVIDIA Triton (https://developer.nvidia.com/triton-inference-server). This setup enables
clients to request inference of images stored on a remote Cassandra server to be processed on a
different GPU-powered remote server.</p>
        <p>Listing 2 in the appendix presents the Python code used to initialize a typical DALI pipeline, which
includes data loading via the standard file reader, decoding, and standard preprocessing steps such as
resizing, cropping, and normalization. To use our custom data loader, which supports data reading from
a Cassandra-compatible DB, one simply needs to replace the standard file reader with our module, as
demonstrated in Listing 3. The modified lines are highlighted in blue.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Comparative analysis at varying latencies</title>
        <p>We evaluated our network data loader alongside two state-of-the-art competitors, leveraging Amazon
EC2 instances located in different geographical regions to introduce varying latencies. Specifically,
we tested data consumption in Oregon using an 8-GPU p4d.24xlarge node, while storing images at
locations characterized by the following latencies:
• Low: Data stored in Oregon, round-trip time (RTT) &lt; 1 ms.
• Medium: Data stored in Northern California, RTT ≃ 20 ms.</p>
        <p>• High: Data stored in Stockholm, Sweden, RTT ≃ 150 ms.</p>
        <p>While storing images on a different continent from the GPUs is not a common practice, the high-latency
scenario was included to highlight the challenges posed by latency-induced bottlenecks. Such challenges
are expected to further intensify in the future, as computational power and bandwidth improve, whereas
latency remains constrained by physical limits.</p>
        <p>For our experiments, we utilized the standard ImageNet-1k dataset (training set: 1,281,167 images,
average image size: 115 kB, total size: 147 GB, batch size: 512 images). The dataset was prepared
in the formats required by each data loader and stored on a single node equipped with four NVMe
SSDs, configured as a single striped logical volume for optimized data access. Specifically, we used
r5dn.24xlarge instances in Oregon and Stockholm, and the similar m6in.24xlarge instance in Northern
California, where the previous instance type was not available. As the RAM capacity of these machines
surpasses the size of our test dataset (i.e., ImageNet-1k), we reserved memory to maintain approximately
only 70 GB of free RAM (i.e., half the size of the dataset). This approach prevents dataset caching in
main memory, ensuring that data is consistently read from the disks during testing. By doing so, we
simulate conditions involving larger datasets that exceed the available memory capacity.</p>
        <p>Three data loaders, summarized in Table 1, were compared in this study: our Cassandra-DALI data
loader (with data stored in high-performance ScyllaDB), MosaicML SD (with data hosted on an S3
MinIO server), and TensorFlow’s tf.data service (with data stored as TFRecords in the filesystem). All
servers were hosted on the same node within Docker containers, with all data residing on the same
logical volume. The test code, including the Dockerfiles, is available in the following GitHub repository,
under the paper branch: https://github.com/fversaci/cassandra-dali-plugin/.</p>
        <p>We conducted two experiments to evaluate performance under varying latencies:
Tight-loop read: This benchmark assesses raw data-loading capabilities by maximizing data reads
without GPU processing or image decoding.</p>
        <p>Training: A standard PyTorch multi-GPU ResNet-50 training workload, including image decoding and
preprocessing steps such as resizing, normalization, and cropping.</p>
        <p>It is important to note that the tight-loop read test utilizes a single data loader, whereas the training
process employs a separate data loader for each GPU. Consequently, the tight-loop read test establishes
an upper bound on the data throughput that can be consumed by a single GPU.</p>
        <p>To minimize AWS usage costs, each data loader was evaluated in a single test run for up to four
epochs. The experimental results presented below compare the performance of the data loaders under
varying latency conditions.</p>
        <p>4.2.1. Tight-loop reading</p>
        <p>As a baseline for comparing network data loaders, we measured the performance of NVIDIA DALI when
reading images stored as TFRecords from the local filesystem, without performing image decoding.
This configuration achieved a data throughput of 7.4 GB/s.</p>
        <p>The p4d.24xlarge instance offers a total network bandwidth of 100 Gb/s; however, only half of this
bandwidth is accessible through its public interface (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-network-bandwidth.html). Thus, the maximum raw bandwidth available
for data transfers in our tests was limited to 50 Gb/s (6.25 GB/s). As shown in Fig. 2a and Tab. 2,
our data loader nearly saturated the available bandwidth when reading from ScyllaDB in both local
and medium-latency settings, achieving a throughput of approximately 6 GB/s in each case. In the
high-latency setting, throughput decreased to around 4 GB/s.</p>
        <p>In contrast, MosaicML SD demonstrated significantly lower performance, with throughput measured
at 326 MB/s in the low-latency setting, 308 MB/s in the medium-latency setting, and 203 MB/s in the
high-latency setting. tf.data service exhibited better performance than MosaicML SD in the low-latency
configuration, achieving a throughput of 437 MB/s. However, its performance degraded substantially in
higher-latency environments, with throughput dropping to 57 MB/s in the medium-latency setting and
just 12 MB/s in the high-latency setting.</p>
        <p>4.2.2. Training</p>
        <p>[Figure 2: (a) Normalized tight-loop reading throughput; (b) normalized train reading throughput, for DALI tfr (no IO), MosaicML SD, tf.data service, and Cassandra-DALI.]</p>
        <p>For a data loader to be effective, its performance must integrate smoothly into the DL pipeline, ensuring
that tensors are efficiently delivered to DL engines without introducing bottlenecks or delays. To
evaluate this, we assessed the performance of data loaders within a standard image classification
training workflow using the ResNet-50 architecture. Due to the differences in training performance
between TensorFlow and PyTorch, we focused exclusively on one framework to ensure a fair comparison.
Given that MosaicML SD outperformed tf.data service in the medium- and high-latency settings in the
previous evaluation, we chose to test it against our data loader in a PyTorch training run.</p>
        <p>To establish a performance upper bound, we first performed training using a fixed input tensor,
thereby eliminating the overhead associated with data loading and preprocessing. This setup enabled
us to measure the maximum achievable data throughput during training. Specifically, we recorded
the number of images processed per second on a single GPU and across all 8 GPUs during multi-GPU
training. Our results indicate that a single NVIDIA A100 GPU consumes about 1450 images/s, while
an 8-GPU configuration reaches 11200 images/s. Given the average ImageNet-1k training image size
of 115 kB, the data loaders must sustain a steady throughput of approximately 1.3 GB/s to meet these
requirements.</p>
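        <p>The 1.3 GB/s requirement is a back-of-the-envelope computation from the figures above:</p>

```python
# Measured GPU consumption rates during training (images/s).
eight_gpu_rate = 11200

# Average ImageNet-1k training image size: 115 kB, in bytes.
img_size = 115 * 1000

# Required steady data-loading throughput for full 8-GPU utilization.
required_gbps = eight_gpu_rate * img_size / 1e9
# roughly 1.3 GB/s, matching the requirement stated in the text
```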
        <p>As demonstrated in Fig. 2b and Tab. 3, the MosaicML SD data loader is unable to sustain the throughput
required to fully utilize all 8 GPUs, achieving 57%, 49%, and 33% of the target throughput under low,
medium, and high-latency conditions, respectively. In contrast, our data loader achieves 94%, 95%, and
96% of the theoretical upper bound in these settings.</p>
        <p>4.3. Impact of prefetching and database choice on data loader performance</p>
        <p>Finally, we conducted further tests to better investigate our data loader’s performance, focusing
specifically on the impact of the proposed out-of-order prefetching optimization. Additionally, we analyzed
how the choice of the underlying database – either Cassandra or ScyllaDB – affects data loading
performance.</p>
        <p>4.3.1. Impact of out-of-order, incremental prefetching</p>
        <p>We evaluated the tight-loop reading performance under high-latency conditions, comparing results
with and without our out-of-order, incremental prefetching optimization.</p>
        <p>High-latency, high-bandwidth internet communications are prone to significant variability in TCP
throughput. In fact, these conditions exacerbate the effects of packet loss, as TCP congestion control
mechanisms respond conservatively to retransmissions and recover slowly due to extended RTTs,
resulting in reduced throughput [24, 25].</p>
        <p>Figure 3 highlights the significant variance in batch loading times between in-order and out-of-order
prefetching. In the in-order case (Figure 3a), when the prefetching queue is exhausted, the system
experiences delays of up to several seconds while waiting for all transfers, including those over congested
routes, to complete. This results in a cyclical pattern: once all transfers are completed, the queue is
refilled, but it is quickly depleted again, triggering a new cycle. In contrast, the out-of-order approach
(Figure 3b) maintains a highly consistent batch loading time, always remaining below 30 ms after the
initial transient period.</p>
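        <p>The core of the out-of-order scheme can be sketched in a few lines: instead of dequeuing prefetched transfers in issue order, a batch is emitted as soon as any batch_size transfers have completed. The Python sketch below is a simplification with a hypothetical fetch() standing in for a database read; it is not the code of our loader, which operates inside the DALI plugin.</p>
```python
# Minimal sketch of out-of-order batch assembly: a batch is dispatched as
# soon as ANY batch_size transfers complete, regardless of which worker
# (connection) served them. fetch() is a hypothetical stand-in.
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(key):
    # Simulate transfers with heterogeneous latencies (congested routes).
    time.sleep(random.uniform(0.001, 0.01))
    return key

def batches_out_of_order(keys, batch_size=4, io_threads=8):
    with ThreadPoolExecutor(max_workers=io_threads) as pool:
        futures = [pool.submit(fetch, k) for k in keys]
        batch = []
        for fut in as_completed(futures):  # completion order, not issue order
            batch.append(fut.result())
            if len(batch) == batch_size:
                yield batch
                batch = []

for b in batches_out_of_order(range(16)):
    print(b)  # items may appear in any order
```
        <p>Because no batch waits for a straggling transfer, batch latency is bounded by the fastest completions still in flight rather than by the slowest connection.</p>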
        <p>Figure 4a illustrates the throughput over time for 16 of the 32 connections utilized by our data loader
when employing the standard in-order prefetching mechanism. The throughput curves exhibit a strong
correlation, highlighting that simultaneous transfers are constrained by the in-order batch assembly
process, since the system must wait for the slowest transfer to finish before dispatching a batch to the
DL pipeline and requesting a new batch from the database. As a result, the throughputs tend to converge
and the aggregated throughput exhibits considerable fluctuations, ranging roughly from 300 MB/s to
1300 MB/s.</p>
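        <p>A toy model makes the straggler effect concrete: when a batch must wait for all of its parallel transfers, its duration is the maximum of the per-connection times, which grows with the number of connections. The numbers below are illustrative, not measurements.</p>
```python
# Toy model of the in-order penalty: each batch waits for the slowest of
# n parallel transfers, so batch time = max of n samples.
import random

random.seed(0)
n_conns, trials = 32, 10_000
mean_t = 0.010                               # assumed 10 ms typical transfer
in_order = 0.0
ideal = 0.0
for _ in range(trials):
    ts = [random.expovariate(1 / mean_t) for _ in range(n_conns)]
    in_order += max(ts) / trials             # wait for the straggler
    ideal += (sum(ts) / n_conns) / trials    # per-transfer average
print(f"avg transfer: {ideal * 1e3:.1f} ms, in-order batch: {in_order * 1e3:.1f} ms")
```
        <p>With 32 exponentially distributed transfer times, the expected maximum is roughly four times the mean, so in-order assembly pays a multiple of the typical transfer time on every batch.</p>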
        <p>In contrast, relaxing this in-order constraint allows transfers to proceed independently, as shown
in Figure 4b. In this approach, batches are formed as soon as a sufficient number of images are
available, irrespective of their originating connections. This optimization significantly enhances overall
throughput, resulting in higher and more consistent performance, with an average throughput
of approximately 4 GB/s.</p>
        <p>4.3.2. Cassandra vs ScyllaDB</p>
        <p>The tight-loop reading test under high-latency conditions was also performed using Cassandra as
the storage backend for images, replacing ScyllaDB. Cassandra achieved a throughput of 1.6 GB/s,
significantly lower than the 4.0 GB/s observed with ScyllaDB, highlighting the superior performance
of the latter. Notably, Cassandra exhibited a disk I/O rate considerably higher than its achieved data
throughput (3.6 GB/s versus 1.6 GB/s), likely attributable to differences in its block-reading strategy
compared to ScyllaDB.</p>
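        <p>From the figures above one can derive Cassandra's effective read amplification, i.e. bytes read from disk per byte delivered:</p>
```python
# Read amplification implied by the measurements reported above.
cassandra_disk_gbps = 3.6   # disk I/O rate observed for Cassandra
cassandra_net_gbps = 1.6    # data throughput achieved by Cassandra
scylla_net_gbps = 4.0       # data throughput achieved by ScyllaDB

amplification = cassandra_disk_gbps / cassandra_net_gbps
print(f"Cassandra read amplification: {amplification:.2f}x")  # 2.25x
ratio = scylla_net_gbps / cassandra_net_gbps
print(f"ScyllaDB/Cassandra throughput ratio: {ratio:.1f}x")   # 2.5x
```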
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>Advances in GPU power have propelled deep learning but also highlighted bottlenecks in data access
and movement, due to the increasing discrepancy between processing throughput and data access
latencies. We address these challenges by integrating scalable NoSQL databases with a high-performance,
image-optimized data loader.</p>
      <p>Our key contribution is a novel loader using advanced prefetching, including out-of-order strategies
to reduce the effects of network latency. By coupling data with metadata in a database-driven design, we
offer a scalable and consistent solution for DL datasets. Experiments under varying latency conditions
show significant gains in throughput and stability over existing methods.</p>
      <p>The source code for our implementation is publicly available, providing a resource for further research
and practical deployment in diverse DL scenarios.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was supported by the Italian Ministry of Health under the program grant H2ub (Hybrid Hub:
Cellular and computation models, micro- and nano-technologies for personalized innovative therapies,
project code T4-AN-10) and by the Regione Autonoma della Sardegna, Sardegna Ricerche, under the
program grant XDATA.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT-4 to check grammar and spelling.
After using this tool, the authors reviewed and edited the content as needed and take full responsibility
for the publication's content.</p>
      <p>[18] A. Ho, [RFC] Polylithic: Enabling multi-threaded dataloading through non-monolithic parallelism,
https://github.com/pytorch/data/issues/1318, 2024.
[19] A. Audibert, Y. Chen, D. Graur, A. Klimovic, J. Šimša, C. A. Thekkath, tf.data service: A case
for disaggregating ML input data processing, in: Proceedings of the 2023 ACM Symposium on
Cloud Computing, SoCC '23, Association for Computing Machinery, New York, NY, USA, 2023,
pp. 358-375. URL: https://doi.org/10.1145/3620678.3624666. doi:10.1145/3620678.3624666.
[20] S. Hambardzumyan, A. Tuli, L. Ghukasyan, F. Rahman, H. Topchyan, D. Isayan, M. McQuade,
M. Harutyunyan, T. Hakobyan, I. Stranic, et al., Deep Lake: A lakehouse for deep learning, in:
Conference on Innovative Data Systems Research (CIDR 2023), 2023.
[21] J. M. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton,</p>
      <p>X. Feng, K. Li, et al., The MADlib analytics library, Proceedings of the VLDB Endowment 5 (2012).
[22] S. Banerji, S. Mitra, Deep learning in histopathology: A review, WIREs Data Mining and
Knowledge Discovery 12 (2022) e1439. URL: https://wires.onlinelibrary.wiley.com/doi/abs/10.1002/widm.1439.
doi:10.1002/widm.1439.
[23] S. Hemminger, et al., Network emulation with NetEm, in: Linux conf au, volume 5, 2005.
[24] D. Katabi, M. Handley, C. Rohrs, Congestion control for high bandwidth-delay product networks,
SIGCOMM Comput. Commun. Rev. 32 (2002) 89-102. URL: https://doi.org/10.1145/964725.633035.
doi:10.1145/964725.633035.
[25] S. Ha, I. Rhee, L. Xu, CUBIC: a new TCP-friendly high-speed TCP variant, SIGOPS Oper. Syst. Rev. 42
(2008) 64-74. URL: https://doi.org/10.1145/1400097.1400105. doi:10.1145/1400097.1400105.
Listing 1 Example of SQL data model for tumor detection</p>
      <p>CREATE TABLE patches.metadata (
    patient_id text,
    slide_num int,   // patients can have several slides
    x int,           // coordinates
    y int,           // within the slide
    label int,       // Gleason score
    patch_id uuid,
    PRIMARY KEY ((patch_id))
);</p>
      <p>CREATE TABLE patches.data (
    patch_id uuid,
    label int,       // Gleason score
    data blob,       // image/tensor file (JPEG, TIFF, NPY, etc.)
    PRIMARY KEY ((patch_id))
);
Listing 2 Initializing DALI pipeline using DALI standard file reader</p>
      <p>@pipeline_def(batch_size=128, num_threads=4, device_id=device_id)
def get_dali_pipeline():
    images, labels = fn.readers.file(name="Reader",
                                     file_root="/data/imagenet/train")
    labels = labels.gpu()
    images = fn.decoders.image(images, device="mixed",
                               output_type=types.RGB)
    images = fn.resize(images, resize_x=256, resize_y=256)
    images = fn.crop_mirror_normalize(images, dtype=types.FLOAT,
                                      output_layout="CHW", crop=(224, 224),
                                      mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                      std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
    return images, labels</p>
      <p>
Listing 3 Initializing DALI using our Cassandra-DALI plugin</p>
      <p>uuids = list_manager.get_list_of_uuids(...)

@pipeline_def(batch_size=128, num_threads=4, device_id=device_id)
def get_dali_pipeline():
    images, labels = fn.crs4.cassandra(name="Reader",
                                       cassandra_ips=["1.2.3.4", "5.6.7.8"],
                                       username="guest", password="test",
                                       table="imagenet.data_train", uuids=uuids,
                                       prefetch_buffers=16, io_threads=8)
    labels = labels.gpu()
    # [...] same as before
    return images, labels</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Alzubaidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Humaidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Dujaili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Al-Shamma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Santamaría</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Fadhel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Al-Amidie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Farhan</surname>
          </string-name>
          ,
          <article-title>Review of deep learning: concepts, cnn architectures, challenges, applications, future directions</article-title>
          ,
          <source>Journal of big Data</source>
          <volume>8</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kuchnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Klimovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Simsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Amvrosiadis</surname>
          </string-name>
          ,
          <article-title>Plumber: Diagnosing and removing performance bottlenecks in machine learning data pipelines</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Marculescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wu</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of Machine Learning and Systems</source>
          , volume
          <volume>4</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>51</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Phanishayee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raniwala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chidambaram</surname>
          </string-name>
          ,
          <article-title>Analyzing and mitigating data stalls in dnn training</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>771</fpage>
          -
          <lpage>784</lpage>
          . URL: https://doi.org/10.14778/3446095.3446100. doi:10.14778/3446095.3446100.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <article-title>Imagenet: A large-scale hierarchical image database</article-title>
          ,
          <source>in: 2009 IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          . doi:10.1109/CVPR.2009.5206848.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Schuhmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Beaumont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vencu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wightman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cherti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Coombes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Katta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mullis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wortsman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schramowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kundurthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Crowson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaczmarczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jitsev</surname>
          </string-name>
          ,
          <article-title>LAION-5B: An open large-scale dataset for training next generation image-text models</article-title>
          , in:
          <string-name>
            <given-names>S.</given-names>
            <surname>Koyejo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Belgrave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oh</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>35</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>25278</fpage>
          -
          <lpage>25294</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bellinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Corizzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Branco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Krawczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Japkowicz</surname>
          </string-name>
          ,
          <article-title>The class imbalance problem in deep learning</article-title>
          ,
          <source>Machine Learning</source>
          <volume>113</volume>
          (
          <year>2024</year>
          )
          <fpage>4845</fpage>
          -
          <lpage>4901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bahn</surname>
          </string-name>
          ,
          <article-title>Analyzing data reference characteristics of deep learning workloads for improving buffer cache performance</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>13</volume>
          (
          <year>2023</year>
          ). URL: https://www.mdpi.com/2076-3417/13/22/12102. doi:10.3390/app132212102.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Schimmelpfennig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Vef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salkhordeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Miranda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Brinkmann</surname>
          </string-name>
          ,
          <article-title>Streamlining distributed deep learning i/o with ad hoc file systems</article-title>
          ,
          <source>in: 2021 IEEE International Conference on Cluster Computing (CLUSTER)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>169</fpage>
          -
          <lpage>180</lpage>
          . doi:10.1109/Cluster48925.2021.00062.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Renggli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wu</surname>
          </string-name>
          , et al.,
            <article-title>Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systems</article-title>
          ,
          <source>The VLDB Journal</source>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Versaci</surname>
          </string-name>
          , G. Busonera,
          <article-title>Scaling deep learning data management with cassandra db</article-title>
          ,
          <source>in: 2021 IEEE International Conference on Big Data (Big Data)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>5301</fpage>
          -
          <lpage>5310</lpage>
          . doi:10.1109/BigData52589.2021.9672005.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cancilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Canalini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bolelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Allegretti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Carrión</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Paredes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Gómez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Leo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Piras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pireddu</surname>
          </string-name>
          , et al.,
          <article-title>The deephealth toolkit: a unified framework to boost biomedical applications</article-title>
          ,
          <source>in: 2020 25th International Conference on Pattern Recognition (ICPR)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>9881</fpage>
          -
          <lpage>9888</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Guirao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Łęcki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lisiecki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Panev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Szołucha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wolant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zientkiewicz</surname>
          </string-name>
          ,
          <article-title>Fast AI Data Preprocessing with NVIDIA DALI</article-title>
          , https://developer.nvidia.com/blog/fast-ai-data-preprocessing-with-nvidia-dali/,
          <year>2019</year>
          . [Online; accessed August-2024].
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Martinez-Noriega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yokota</surname>
          </string-name>
          ,
          <article-title>High-performance data loader for large-scale data processing</article-title>
          ,
          <source>Electronic Imaging</source>
          <volume>36</volume>
          (
          <year>2024</year>
          )
          <fpage>196</fpage>
          -1
          <article-title>-196-1</article-title>
          . URL: https://library.imaging.org/ei/articles/36/12/HPCI-196. doi:10.2352/EI.2024.36.12.HPCI-196.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Murray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Simsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Klimovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Indyk</surname>
          </string-name>
          ,
          <article-title>tf.data: A machine learning data processing framework</article-title>
          ,
          <source>arXiv preprint arXiv:2101.12127</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G.</given-names>
            <surname>Leclerc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Engstrom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Salman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mądry</surname>
          </string-name>
          ,
          <article-title>FFCV: Accelerating training by removing data bottlenecks</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>12011</fpage>
          -
          <lpage>12020</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>The MosaicML Team</surname>
          </string-name>
          , streaming, https://github.com/mosaicml/streaming/,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Meier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <article-title>Reflections on the compatibility, performance, and scalability of parallel python</article-title>
          ,
          <source>in: Proceedings of the 15th ACM SIGPLAN International Symposium on Dynamic Languages, DLS</source>
          <year>2019</year>
          ,
          <article-title>Association for Computing Machinery</article-title>
          , New York, NY, USA,
          <year>2019</year>
          , p.
          <fpage>91</fpage>
          -
          <lpage>103</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>