<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Code Decomposition, Parallelization, and Distribution on Heterogeneous Systems. A Case Study on Training a Neural Network for Image Classification.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Salvatore D'Angelo</string-name>
          <email>salvatore.dangelo@unicampania.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Beniamino Di Martino</string-name>
          <email>beniamino.dimartino@unicampania.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pasquale Vassallo</string-name>
          <email>pasquale.vassallo@unicampania.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vito Alessandro Liccardo</string-name>
          <email>vitoalessandro.liccardo@unicampania.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Esposito</string-name>
          <email>antonio.esposito@unicampania.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Carollo</string-name>
          <email>a.carollo@zerodivision.it</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giacomo Corridori</string-name>
          <email>g.corridori@zerodivision.it</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianmarco Spinatelli</string-name>
          <email>g.spinatelli@zerodivision.it</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Polzella</string-name>
          <email>f.polzella@zerodivision.it</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Information Engineering, Asia University</institution>
          ,
          <country country="TW">Taiwan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Vienna</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Campania Luigi Vanvitelli</institution>
          ,
          <addr-line>Via Roma 29, 81031 Aversa (CE)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Zerodivision Systems S.r.l.</institution>
          ,
          <addr-line>Piazza S. Francesco, 1 - 56127 Pisa (PI)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Modern software architectures are increasingly complex, often distributed across multiple computational and storage nodes. This complexity demands careful attention during both the design and implementation phases. While traditional compilers once automated certain parallelization tasks in simpler environments, today's heterogeneous infrastructures call for new and more sophisticated approaches. We introduce a novel system for automatically decomposing, parallelizing, and distributing Python code across heterogeneous systems. Our approach combines a skeleton-based compiler with lightweight decorators to annotate computational patterns, enabling automated translation into parallel workflows. This system is embedded within a custom Jupyter kernel and frontend, allowing interactive development and execution. The backend supports diverse environments, including Docker, Kubernetes, and Slurm-based HPC clusters. We demonstrate the effectiveness of our method by training a convolutional neural network for image classification, achieving near-linear speedup across multiple GPUs and nodes. Our results highlight the potential of this approach to democratize scalable computing for non-expert developers.</p>
      </abstract>
      <kwd-group>
        <kwd>HPC</kwd>
        <kwd>Distributed Systems</kwd>
        <kwd>Code Decomposition</kwd>
        <kwd>Parallelization</kwd>
        <kwd>Neural Networks</kwd>
        <kwd>Image Classification</kwd>
        <kwd>Heterogeneous Systems</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The growing complexity of modern software systems, often distributed across multiple computing
and storage nodes, presents substantial challenges during both the design and implementation phases.
These systems span heterogeneous architectures, including multi-core CPUs, many-core GPUs, and
geographically distributed environments orchestrated via container-based platforms such as Docker
and Kubernetes, or batch-scheduled clusters using systems like Slurm. Within this context, optimizing
code for performance while preserving developer productivity is an increasingly critical concern.</p>
      <p>Historically, compilers have played a central role in automating aspects of code transformation
and parallelization. However, such tools were primarily designed for homogeneous, shared-memory
environments and focused on local optimization techniques. As computing infrastructures evolve
toward highly heterogeneous and hybrid configurations that combine cloud resources, HPC clusters,
and edge systems, these traditional compiler strategies are no longer suficient.</p>
      <p>Despite the availability of powerful infrastructure, developers and researchers face several persistent
barriers when attempting to harness distributed and high-performance computing (HPC) resources.
Writing parallel or distributed code remains a largely manual and error-prone process, requiring
expertise in low-level synchronization, resource scheduling, and architecture-specific tuning. The
learning curve for widely-used technologies such as MPI, Docker, Slurm, and Kubernetes is steep, and
often misaligned with the priorities of domain experts in fields like data science, AI, or computational
biology.</p>
      <p>Moreover, development and execution environments frequently diverge, leading to issues with code
portability, reproducibility, and deployment consistency. Code that runs locally may require extensive
reconfiguration to operate correctly on a remote cluster or cloud platform, introducing friction into the
software lifecycle and slowing down iterative development.</p>
      <p>To address these challenges, we propose a novel system for the automatic decomposition,
parallelization, and distribution of Python code targeting heterogeneous execution environments. At the core of
our system is a skeleton-based compiler that uses lightweight Python decorators to annotate segments
of code. These annotations encode high-level computational patterns that the compiler interprets to
automatically construct an execution graph. This graph is then mapped to a backend-aware scheduler,
which dispatches computation across available resources.</p>
      <p>The system is tightly integrated into a custom Jupyter kernel and frontend, allowing developers
to remain within a familiar, interactive notebook environment while transparently targeting a wide
range of execution platforms—including Docker containers, Kubernetes clusters, and Slurm-managed
HPC infrastructures. The use of decorators and cell-level metadata enables a clear separation between
domain logic and orchestration semantics, reducing the need for users to manage low-level system
details.</p>
      <p>We evaluate our approach through a case study involving the distributed training of a convolutional
neural network on the Fashion-MNIST dataset. Training is performed within Docker-based containers
leveraging GPU acceleration and TensorFlow’s distribution strategy. Results show that our system
achieves near-linear speedup across multiple GPUs and compute nodes, while maintaining usability
and minimal overhead for developers.</p>
      <p>Unlike prior tools such as Dask, Ray, or Jupyter-Workflow, our system combines automatic
compiler-based transformation with metadata-driven workflow construction and seamless backend abstraction.
This combination enables rapid prototyping and scalable execution without requiring users to adopt
new languages, workflow engines, or runtime frameworks.</p>
      <p>By lowering the barrier to entry for scalable, distributed computing, our work aims to democratize
access to HPC-level performance for a broader class of users—particularly those working in applied
research and machine learning.</p>
      <p>The remainder of this paper is organized as follows: Section 2 reviews related work in compiler-based
parallelization and distributed notebook workflows. Section 3 presents the design and implementation
of our system. Section 4 describes the case study and evaluates the performance of our approach. Finally,
Section 5 discusses conclusions and future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The imperative to harness the full potential of modern computing architectures, from on-premise
multi-core and many-core processors to the vast, distributed resources of the cloud, has positioned advanced
compilers as a critical component in the software development landscape. This section provides a
comprehensive overview of the current state of the art in compiler technology, focusing specifically
on methodologies and tools available for automatic work decomposition. We will explore compilers
designed not only for the traditional parallelization of sequential code on a single machine but also for
the more recent challenge of decomposing applications into tasks that can be offloaded and executed
elsewhere, such as on remote cloud infrastructure.</p>
      <p>As software and hardware systems continue to escalate in complexity, the ability of a compiler
to autonomously identify and exploit concurrency, whether local or distributed, is paramount for
enhancing application performance and reducing the intricate programming burden on developers.</p>
      <p>The fundamental purpose of such a compiler is to analyse sequential source code, identify sections
that can be executed concurrently, and transform them for efficient execution.</p>
      <p>This transformation can manifest in two primary ways: as a parallelized form executed across
multiple local processing units, or as discrete, independent tasks packaged for execution on remote
computing resources. This automated process is crucial for unlocking the performance gains offered
by both tightly coupled parallel hardware and distributed cloud environments. It aims to achieve this
without necessitating that developers engage in the complexities of explicit parallel or distributed
systems programming.</p>
      <sec id="sec-2-1">
        <title>2.1. ROSE Compiler</title>
        <p>
          A significant and distinct tool in the compiler landscape is ROSE [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], an open-source compiler
infrastructure developed at Lawrence Livermore National Laboratory (LLNL). Unlike traditional compilers
that translate source code into machine-executable object code, ROSE is a source-to-source framework.
Its primary function is to build custom tools for program analysis and transformation. It reads source
code, creates a detailed internal representation, allows tools to analyse and modify this representation,
and then generates new, human-readable source code. This approach makes ROSE an exceptionally
flexible platform for compiler research and the development of specialized tools.
        </p>
        <p>The ROSE infrastructure is particularly well-suited for creating custom tools for static analysis,
program optimization, domain-specific optimizations, and performance analysis. A notable tool included
within the framework is AutoPar, an automatic parallelization compiler for C and C++ that inserts
OpenMP directives into serial code.</p>
        <p>
          Through several parsers, its frontend supports many languages, including C, C++, Fortran, Java, and Python. ROSE does not include code generation features specifically designed for the cloud, although it provides all the tools necessary for their implementation. In addition, because ROSE is itself a C++ framework, writing custom analysis tools in the same language enjoys first-class support.
        </p>
      </sec>
      <sec id="sec-2-1a">
        <title>2.2. PIPS</title>
        <p>
          PIPS [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] is an open-source, source-to-source compilation framework whose development was initiated by MINES ParisTech and continued over time by several groups. It is engineered to analyse and transform numerical applications written in C and Fortran 77. The core philosophy of PIPS is to achieve effective automatic parallelization by first building a deep, "global" understanding of the entire program. This emphasis on comprehensive, inter-procedural analysis allows it to perform transformations such as loop optimizations and task parallelization.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.3. OMNI Compiler</title>
        <p>
          The Omni Compiler [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ,
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] is a source-to-source compilation framework designed to transform C and Fortran programs annotated with XcalableMP and OpenACC directives into parallel code optimized for execution on high-performance computing systems. It enables the generation of code compatible with native compilers by linking against the Omni runtime library. Additionally, it supports the XcalableACC programming model, which combines directive-based parallelism with accelerator offloading for use in heterogeneous cluster environments. The project is actively developed by the Programming Environment Research Team at the RIKEN Center for Computational Science in collaboration with the HPCS Laboratory at the University of Tsukuba, Japan.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.4. Cetus Compiler</title>
        <p>
          Cetus [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] is a source-to-source compiler intended primarily for programs written in C. Its focus is automatic code parallelization by annotation with OpenMP directives to take advantage of execution on multicore systems.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.5. Others</title>
        <p>
          Additional tools such as OpenUH [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], and domain-specific languages like Rascal [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and TXL [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], offer extensibility for compiler research, but their usage typically requires specialized language knowledge and is disconnected from mainstream Python-based workflows.
        </p>
      </sec>
      <sec id="sec-2-5">
        <title>2.6. Cloud oriented compilers</title>
        <p>The compilers and languages discussed so far target shared-memory architectures and the
message-passing paradigm, so they do not support, for instance, the generation of microservices, even though
some offer function libraries to implement it.</p>
        <p>
          In more recent times, attempts have been made to distribute the execution of Python code over
multiple nodes and, at the same time, to improve performance by performing optimization steps before
execution [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Generally, two approaches have been followed: the first is to organize the computation
as a workflow [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and make its dependencies explicit, so that a Workflow Management System
can organize the workload [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]; the second approach is to compile a subset of Python instructions,
collected in a kernel function, into a lower-level language but with higher performance [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], or to
simply optimize the code without changing languages [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>Approaches that translate a subset of the language are generally viewed unfavorably, as they require
learning a new variant of the language, thus reducing productivity in the development of software
solutions.</p>
        <p>
          There are tools that provide a more user-friendly interface by integrating compiler functionality into the development environment. One such tool is Jupyter-Workflow [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], on which our work is based: it exposes the compiler as a Jupyter kernel, so that it can be accessed without leaving the familiar notebook environment and, to a certain extent, the user remains unaware of the compiler. Jupyter-Workflow is in turn built on StreamFlow.
        </p>
        <p>
          The StreamFlow framework [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], developed and maintained by the Alpha research group at the
University of Turin (UniTO), is a container-native Workflow Management System written in Python 3
that relies on the Common Workflow Language (CWL) standard. StreamFlow is designed around two
principles: executing tasks in multi-container environments to support concurrent, communicating
tasks and relaxing the requirement for a single, shared data space to enable hybrid executions on
multi-cloud or hybrid cloud/HPC infrastructures.
        </p>
        <p>Based on this foundation, the same research group has developed the Jupyter Workflow extension
for the IPython kernel, which is designed to support distributed literate workflows directly within
Jupyter Notebooks. The Jupyter Workflow kernel facilitates the description of intricate workflows
and their distributed execution on hybrid cloud/HPC infrastructures. In this paradigm, code cells
are regarded as the nodes of a distributed workflow graph, and cell metadata are used to express
data dependencies, control flow and parallel execution patterns, such as scatter/gather, as well as
target execution environments. This reliance on cell metadata offers several key advantages. Firstly, it
maintains a clear separation between the host logic and coordination semantics, which improves the
readability and maintainability of complex applications. Furthermore, it avoids technology lock-in, as
the same metadata format can be interpreted by different Jupyter kernels to support various languages,
execution architectures or commercial software stacks. This approach also eases the transition for users
who are already familiar with Jupyter Notebooks, as they can scale their experiments without having
to learn a completely new framework. Jupyter Workflow leverages the capabilities of the StreamFlow
Workflow Management System for its underlying runtime support.</p>
        <p>As shown in Table 1, the comparison of various tools and frameworks highlights differences in
language support, parallelization strategies, and cloud integration capabilities.</p>
        <table-wrap id="tbl1">
          <label>Table 1</label>
          <caption>
            <p>Comparison of tools and frameworks for automatic parallelization and distribution.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Tool/Framework</th>
                <th>Language Support</th>
                <th>Parallelization Strategy</th>
                <th>Execution Model</th>
                <th>Cloud / HPC Support</th>
                <th>Notebook Integration</th>
                <th>Python Workflow Support</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>ROSE</td><td>C, C++, Fortran, Java, Python</td><td>Source-to-source, OpenMP</td><td>Local shared memory</td><td>No (cloud not natively supported)</td><td>No</td><td>Partial (Python not primary target)</td></tr>
              <tr><td>PIPS</td><td>C, Fortran 77</td><td>Interprocedural, loop/task</td><td>Local shared memory</td><td>No</td><td>No</td><td>No</td></tr>
              <tr><td>Omni Compiler</td><td>C, Fortran</td><td>Directive-based (XcalableMP, OpenACC, XcalableACC)</td><td>HPC cluster + GPU offload</td><td>Yes (HPC support only)</td><td>No</td><td>No</td></tr>
              <tr><td>Cetus</td><td>C</td><td>OpenMP annotations</td><td>Multicore systems</td><td>No</td><td>No</td><td>No</td></tr>
              <tr><td>OpenUH</td><td>Fortran, C/C++ (OpenMP)</td><td>OpenMP compiler optimization</td><td>Local or cluster</td><td>Partial (HPC focused)</td><td>No</td><td>No</td></tr>
              <tr><td>Rascal / TXL</td><td>DSLs for source analysis</td><td>Customizable transformation</td><td>Static/source analysis</td><td>No</td><td>No</td><td>No</td></tr>
              <tr><td>Shirako et al.</td><td>Python</td><td>Automatic parallelization</td><td>Distributed (HPC/Cloud)</td><td>Yes</td><td>No</td><td>No</td></tr>
              <tr><td>Our System</td><td>Python</td><td>Automatic, skeleton-based</td><td>Distributed (Docker, Kubernetes, Slurm)</td><td>Yes</td><td>Yes</td><td>Yes</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Neural Network Training on Heterogeneous Systems Using</title>
    </sec>
    <sec id="sec-4">
      <title>Skeleton-Based Compilation</title>
      <p>from developer-provided decorators.</p>
      <p>units with explicit data-dependency graphs.</p>
      <p>This section presents an in-depth case study illustrating the end-to-end process of training a
convolutional neural network (CNN) for image classification on a heterogeneous distributed execution
environment. The experiment showcases how our skeleton-based compiler—integrated into a custom
Jupyter kernel—can automatically decompose Python code, identify parallelizable segments, and
distribute execution transparently across multiple GPUs hosted in Docker-based environments. The goal
is to validate that compiler-assisted decomposition can not only simplify the development of distributed
machine learning pipelines but also achieve near-optimal hardware utilization without requiring users
to manually manage low-level synchronization, inter-process communication, or GPU affinity.</p>
      <p>As shown in Figure 1, the proposed workflow unifies three tightly coupled components:
1. A custom Jupyter kernel capable of intercepting Python code and extracting structural metadata from developer-provided decorators.
2. A compiler back-end that transforms annotated blocks into optimized, parallelizable execution units with explicit data-dependency graphs.
3. A distributed runtime operating inside Docker containers, enabling GPU-aware execution and resource orchestration across multiple devices.</p>
      <sec id="sec-4-1">
        <title>3.1. Scenario Overview</title>
        <p>
          The case study targets the Fashion-MNIST dataset, consisting of 70,000 grayscale images (28 × 28 pixels) spanning 10 fashion categories. The data is split into 60,000 training and 10,000 test samples. Preprocessing steps include:
• Normalization of pixel intensities to the [0, 1] range.
• Reshaping into tensors compatible with TensorFlow’s convolutional input format (batch, height, width, channels).
        </p>
        <p>• One-hot encoding of labels for multi-class classification.</p>
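        <p>For concreteness, the sketch below expresses these preprocessing steps with standard TensorFlow/Keras utilities; the exact code in our notebooks may differ in minor details:</p>
        <preformat>
import tensorflow as tf

# Load Fashion-MNIST: 60,000 training and 10,000 test images, 28 x 28 grayscale.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

# Normalize pixel intensities to the [0, 1] range.
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# Reshape to TensorFlow's (batch, height, width, channels) convolutional format.
x_train = x_train.reshape(-1, 28, 28, 1)
x_test = x_test.reshape(-1, 28, 28, 1)

# One-hot encode the 10 class labels.
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)
        </preformat>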
        <p>The training runs for 20 epochs with a batch size tuned to saturate available GPU memory
without inducing excessive communication overhead. Distribution is handled using TensorFlow’s
MirroredStrategy, which replicates the model across all visible GPUs, performs synchronous
gradient aggregation, and updates weights identically on each replica.</p>
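        <p>A minimal sketch of this setup follows; the per-replica batch size of 64 is an illustrative value, since the actual batch size is tuned per experiment:</p>
        <preformat>
import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU and keeps the
# replicas synchronized via all-reduce at each gradient step.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Scale the global batch size with the number of replicas so each GPU stays
# saturated without inflating communication overhead.
GLOBAL_BATCH_SIZE = 64 * strategy.num_replicas_in_sync
        </preformat>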
        <p>The custom Jupyter kernel captures standard Python cell code, with developers marking
computationally significant functions or training loops using lightweight decorators (e.g., @parallel_task,
@gpu_task). These annotations are translated into compiler metadata describing:
1. The task’s role in the workflow (data loading, preprocessing, training step, evaluation).
2. Data dependencies and communication requirements.</p>
        <p>3. Preferred execution targets (CPU, single GPU, multi-GPU).</p>
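        <p>The sketch below illustrates how such annotations might appear in a notebook cell. The decorator names come from our kernel, but the parameters and stub implementations shown here are simplified placeholders, not the kernel's actual API:</p>
        <preformat>
# Stub decorators that only attach metadata; the real kernel forwards this
# metadata to the skeleton-based compiler.
def parallel_task(role, depends_on=()):
    def wrap(fn):
        fn.task_meta = {"role": role, "depends_on": tuple(depends_on)}
        return fn
    return wrap

def gpu_task(role, target="single-gpu"):
    def wrap(fn):
        fn.task_meta = {"role": role, "target": target}
        return fn
    return wrap

@parallel_task(role="preprocessing", depends_on=("load_data",))
def preprocess(images, labels):
    ...  # normalization, reshaping, one-hot encoding

@gpu_task(role="training", target="multi-gpu")
def train_model(model, dataset, epochs=20):
    ...  # training loop executed under the distribution strategy
        </preformat>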
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Execution Strategy</title>
        <p>The training notebook is containerized inside a GPU-enabled Docker image. The container:
• Exposes NVIDIA GPUs through the NVIDIA Container Toolkit (--gpus all flag).
• Pre-installs TensorFlow with GPU support, CUDA drivers, and NCCL for efficient multi-GPU collective operations.</p>
        <p>• Contains the compiler runtime and backend scheduler.</p>
        <p>The execution flow is as follows:
1. User interaction remains unchanged—commands are entered in a Jupyter Notebook.
2. Kernel interception parses decorated functions and training loops, passing them to the
skeleton-based compiler.
3. The compiler generates a directed acyclic graph (DAG) of tasks, inserting implicit barriers and
data transfers where needed.
4. The backend scheduler maps tasks to available GPUs, ensuring balanced workload distribution
and overlapping computation with communication.
5. TensorFlow’s strategy scope (with strategy.scope():) wraps model definition and
training, enabling synchronous data-parallel execution.</p>
        <p>This transparency is key: from the user’s perspective, they write “normal” TensorFlow code, but
behind the scenes, the compiler manages placement, communication, and execution.</p>
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Compiler-Based Optimization</title>
        <p>The compiler performs a three-stage optimization pipeline:
1. Code analysis. Detects parallelizable loops and functions, identifies I/O bottlenecks, and
locates potential data prefetch points. Recognizes reusable intermediate tensors to avoid redundant
recomputation.
2. Skeleton mapping. Assigns each computational segment to a predefined “skeleton” pattern (e.g.,
map-reduce, pipeline, data-parallel batch training). For CNN training, the dominant skeleton is data
parallelism: identical model replicas process distinct mini-batches.
3. Task transformation and scheduling. Generates intermediate code units with explicit device
placement hints, produces an execution graph capturing data dependencies, and leverages NCCL’s
ring-allreduce for gradient synchronization. This minimizes inter-GPU latency and ensures efficient
scaling.</p>
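        <p>To make the second and third stages concrete, the toy sketch below shows one way such an execution graph could be represented and ordered; it is purely illustrative and does not reflect the compiler's actual intermediate representation:</p>
        <preformat>
from collections import defaultdict

# Tasks as nodes, data dependencies as directed edges (producer -> consumers).
edges = {
    "load_data": ["preprocess"],
    "preprocess": ["train"],
    "train": ["evaluate"],
}

def topological_order(edges):
    """Kahn's algorithm: emit tasks whose dependencies are all satisfied."""
    indegree = defaultdict(int)
    nodes = set(edges)
    for src, dsts in edges.items():
        for dst in dsts:
            indegree[dst] += 1
            nodes.add(dst)
    ready = [n for n in nodes if indegree[n] == 0]
    order = []
    while ready:
        node = ready.pop()
        order.append(node)
        for dst in edges.get(node, []):
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)
    return order

print(topological_order(edges))  # ['load_data', 'preprocess', 'train', 'evaluate']
        </preformat>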
        <p>The outcome is a compiled training workflow that is both hardware-aware and backend-portable,
capable of running on:
• A single multi-GPU workstation.
• A Kubernetes-managed GPU cluster.</p>
        <p>• A Slurm HPC node with Docker-in-Singularity.</p>
      </sec>
      <sec id="sec-4-4">
        <title>3.4. Training Behavior and Observations</title>
        <p>The CNN architecture is intentionally simple, keeping the focus on distribution behavior:
• Conv2D layer (32 filters, 3 × 3 kernel, ReLU activation).
• MaxPooling2D (2 × 2 pool size).
• Flatten layer to transition to dense layers.
• Dense layer with 128 units (ReLU).
• Output Dense layer with softmax activation for 10-class classification.</p>
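        <p>A Keras sketch of this model under the distribution strategy is given below; the Adam optimizer is an illustrative choice, as the exact training configuration is not prescribed here:</p>
        <preformat>
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Creating variables inside strategy.scope() lets MirroredStrategy mirror
# them across all visible GPUs.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu",
                               input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    # categorical_crossentropy matches the one-hot encoded labels.
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
        </preformat>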
        <p>Several observations emerged during training:
• No manual synchronization was required: the compiler-generated execution graph and MirroredStrategy handled all gradient aggregation.
• GPU utilization stayed consistently above 90% on all devices, confirmed by execution logs and nvidia-smi profiling.
• Near-linear speedup was observed when scaling from 1 to 4 GPUs, with only minor diminishing returns due to synchronization overhead.
• Execution artifacts (trained model weights, logs, and performance metrics) were stored in a structured directory hierarchy, automatically versioned by the runtime.</p>
        <p>Table 2 summarizes the main differences observed between sequential and distributed execution in
our experiments. As expected, distributing training across four GPUs reduced the total runtime from
approximately 30 minutes to just under 8 minutes, yielding a speedup factor of 3.75×. This near-linear
scaling confirms that the compiler-generated execution graph and TensorFlow’s MirroredStrategy
effectively minimize idle time and communication overhead. The average GPU utilization remained
consistently above 92%, indicating that the scheduler maintained a balanced workload across devices. A
small overhead (6–8%) was measured for gradient synchronization via NCCL’s ring-allreduce, but this
was largely offset by the increase in effective batch size and the ability to overlap communication with
computation.</p>
        <p>Perhaps most notably, the transition from sequential to distributed execution required minimal code
changes, limited to the addition of lightweight decorators and kernel metadata, whereas a traditional
manual parallelization approach would require substantial boilerplate for synchronization, device
placement, and inter-process communication. The high reproducibility of results across runs further
supports the robustness of the proposed methodology.</p>
        <p>This experiment confirms that compiler-level decomposition, combined with distributed runtime
integration in Jupyter, enables scalable training without requiring HPC-level coding skills from the
developer, paving the way for democratizing access to heterogeneous computing in AI research.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Conclusion and Future Work</title>
      <p>This work has presented a compiler-assisted approach for the distributed training of convolutional neural
networks on heterogeneous systems. By integrating a skeleton-based compilation mechanism directly
into a custom Jupyter kernel, we demonstrated that it is possible to automatically decompose Python
code, identify parallelizable segments, and schedule execution across multiple GPUs in a containerized
environment. The case study on the Fashion-MNIST dataset confirmed that the proposed system can
deliver near-linear scaling while preserving a high-level, interactive development experience for
end users. Importantly, the approach eliminates the need for manual synchronization, device management,
or low-level configuration, thereby lowering the barrier to entry for leveraging HPC-class resources in
machine learning workflows.</p>
      <p>The proposed methodology not only achieves efficient hardware utilization but also ensures
portability across diverse execution backends, including Docker, Kubernetes, and HPC environments managed
by Slurm. This capability is particularly valuable for researchers and practitioners who require
reproducibility, scalability, and minimal friction in transitioning between local prototyping and large-scale
deployment.</p>
      <sec id="sec-5-1">
        <title>Future Work</title>
        <p>Building on the encouraging results obtained, several directions for further research and development
are envisaged:
• Support for more complex models and workflows. Extending the compiler patterns to cover
deeper neural networks, transformer-based architectures, and multi-modal learning pipelines.
• Dynamic resource adaptation. Incorporating runtime feedback mechanisms to dynamically
adjust task scheduling, batch sizes, and data distribution based on current load, network latency,
and GPU utilization.
• Hybrid execution strategies. Enabling mixed CPU–GPU and multi-node training with
optimized data transfer strategies, including zero-copy memory sharing and gradient compression.
• Integration with additional frameworks. Extending support beyond TensorFlow to PyTorch
and JAX, while retaining transparent user interaction in Jupyter.
• Fault tolerance and checkpointing. Implementing automatic recovery from hardware or
network failures through fine-grained checkpointing and workflow resumption.
• Broader heterogeneous support. Adapting the compiler for emerging accelerators such as TPUs, IPUs, and FPGA-based inference engines.</p>
        <p>By pursuing these enhancements, we aim to transform the proposed system into a fully
general-purpose platform for distributed computing in AI and scientific applications, capable of bridging the
gap between high-performance computing and accessible, notebook-based development environments.</p>
        <p>This work was supported by the FLUENDO project, a subgrantee of the National HPC, Big Data, and
Quantum Computing Center - ICSC, under the Italian NRRP MUR program, funded by the European
Union - Next Generation EU, Mission 4, Component 1, CUP J33C22001170001.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Quinlan</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          <article-title>Liao, The ROSE source-to-source compiler infrastructure, in: Cetus users and compiler infrastructure workshop</article-title>
          ,
          <source>in conjunction with PACT</source>
          , volume
          <year>2011</year>
          , Citeseer,
          <year>2011</year>
          , p.
          <fpage>1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Keryell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ancourt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Coelho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Eatrice</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Frann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Irigoin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jouvelot</surname>
          </string-name>
          ,
          <article-title>Pips: a workbench for building interprocedural parallelizers, compilers and optimizers technical paper (</article-title>
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Murai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nakao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Metaprogramming framework for existing hpc languages based on the omni compiler infrastructure</article-title>
          ,
          <source>in: 2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>250</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Omni</given-names>
            <surname>Compiler Project</surname>
          </string-name>
          , https://omni-compiler.org/, ???? Accessed:
          <fpage>2025</fpage>
          -08-07.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-J.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Eigenmann</surname>
          </string-name>
          , S. Midkif,
          <article-title>CETUS: a Source-to-Source compiler infrastructure for multicores</article-title>
          ,
          <source>Computer</source>
          <volume>42</volume>
          (
          <year>2009</year>
          )
          <fpage>36</fpage>
          -
          <lpage>42</lpage>
          . URL: https://doi.org/10.1109/
          <string-name>
            <surname>mc</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <volume>385</volume>
          . doi:
          <volume>10</volume>
          .1109/
          <string-name>
            <surname>mc</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <volume>385</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chapman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          , W. Zheng,
          <article-title>Openuh: an optimizing, portable openmp compiler</article-title>
          ,
          <source>Concurrency and Computation: Practice and Experience</source>
          <volume>19</volume>
          (
          <year>2007</year>
          )
          <fpage>2317</fpage>
          -
          <lpage>2332</lpage>
          . URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.1174. doi:https://doi.org/10.1002/cpe. 1174. arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpe.1174.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Klint</surname>
          </string-name>
          , T. van der Storm,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Vinju</surname>
          </string-name>
          ,
          <article-title>Rascal: A domain specific language for source code analysis and manipulation</article-title>
          ,
          <source>in: Ninth IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM)</source>
          ,
          <source>IEEE Computer Society</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>168</fpage>
          -
          <lpage>177</lpage>
          . doi:http://doi. ieeecomputersociety.
          <source>org/10</source>
          .1109/SCAM.
          <year>2009</year>
          .
          <volume>28</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Cordy</surname>
          </string-name>
          ,
          <article-title>Txl - a language for programming language tools and applications</article-title>
          ,
          <source>Electron. Notes Theor. Comput. Sci</source>
          .
          <volume>110</volume>
          (
          <year>2005</year>
          )
          <fpage>3</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Shirako</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hayashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Paul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tumanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          ,
          <article-title>Automatic parallelization of python programs for distributed heterogeneous computing</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2203.06233. arXiv:
          <volume>2203</volume>
          .
          <fpage>06233</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Köster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rahmann</surname>
          </string-name>
          ,
          <article-title>Snakemake-a scalable bioinformatics workflow engine</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>28</volume>
          (
          <year>2012</year>
          )
          <fpage>2520</fpage>
          -
          <lpage>2522</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bux</surname>
          </string-name>
          , U. Leser,
          <article-title>Parallelization in scientific workflow management systems</article-title>
          ,
          <source>CoRR abs/1303</source>
          .7195 (
          <year>2013</year>
          ). URL: http://arxiv.org/abs/1303.7195. arXiv:
          <volume>1303</volume>
          .
          <fpage>7195</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>O.</given-names>
            <surname>Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bruneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-S.</given-names>
            <surname>Sottet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Torregrossa</surname>
          </string-name>
          ,
          <article-title>Landscape of high-performance python to develop data science and machine learning applications</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>56</volume>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.1145/3617588. doi:
          <volume>10</volume>
          .1145/3617588.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>I.</given-names>
            <surname>Colonnelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Aldinucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cantalupo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Padovani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rabellino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Spampinato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Di</given-names>
            <surname>Carlo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Magini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cavazzoni</surname>
          </string-name>
          ,
          <article-title>Distributed workflows with jupyter</article-title>
          ,
          <source>Future Generation Computer Systems</source>
          <volume>128</volume>
          (
          <year>2022</year>
          )
          <fpage>282</fpage>
          -
          <lpage>298</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/ S0167739X21003976. doi:https://doi.org/10.1016/j.future.
          <year>2021</year>
          .
          <volume>10</volume>
          .007.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>I.</given-names>
            <surname>Colonnelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cantalupo</surname>
          </string-name>
          , I. Merelli,
          <string-name>
            <given-names>M.</given-names>
            <surname>Aldinucci</surname>
          </string-name>
          , Streamflow:
          <article-title>Cross-breeding cloud with hpc</article-title>
          ,
          <source>IEEE Transactions on Emerging Topics in Computing</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>1723</fpage>
          -
          <lpage>1737</lpage>
          . doi:
          <volume>10</volume>
          .1109/TETC.
          <year>2020</year>
          .
          <volume>3019202</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>