<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>miniLB: A Performance Portability Study of Lattice-Boltzmann Simulations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luigi Crisci</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Biagio Cosenza</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio Amati</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Turisini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CINECA</institution>
          ,
          <addr-line>Via dei Tizii, 6b, 00185 Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Salerno</institution>
          ,
          <addr-line>Via Giovanni Paolo II 132, 80084, Fisciano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The Lattice Boltzmann Method (LBM) is a computational technique of Computational Fluid Dynamics (CFD) that has gained popularity due to its high parallelism and ability to handle complex geometries with minimal effort. Although LBM frameworks are increasingly important in various industries and research fields, their complexity makes them difficult to modify and can lead to suboptimal performance. This paper presents miniLB, the first, to the best of our knowledge, SYCL-based LBM mini-app. miniLB addresses the need for a performance-portable LBM proxy app capable of abstracting complex fluid dynamics simulations across heterogeneous computing systems. We analyze SYCL semantics for performance portability and evaluate miniLB on multiple GPU architectures using various SYCL implementations. Our results, compared against a manually-tuned FORTRAN version, demonstrate the effectiveness of miniLB in assessing LBM performance across diverse hardware, offering valuable insights for optimizing large-scale LBM frameworks in modern computing environments.</p>
      </abstract>
      <kwd-group>
<kwd>Lattice Boltzmann Methods</kwd>
        <kwd>GPU</kwd>
        <kwd>heterogeneous computing</kwd>
        <kwd>SYCL</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In High-Performance Computing (HPC), mini-apps (or proxy-apps) are simplified codes that allow
application developers to share and analyze important key features of large applications without forcing
users to assimilate large and complex code bases. Mini-apps are often used as abstract models to
evaluate and assess performance, portability, and performance portability (PP). Mini-apps
can also capture programming methods and styles that drive requirements for algorithms, compilers,
and other toolchain elements. Developing mini-apps for relevant use cases is an important challenge in
pushing the boundaries of HPC application performance. Important projects such as the Exascale Proxy
Applications Project [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] aim to improve the quality of proxies produced by the Exascale Computing
Project by defining standards for documentation, building and testing systems, performance models
and evaluations, and templates and best practices for proxy developers to help meet these standards.
      </p>
      <p>
        In recent years, the Lattice Boltzmann Method (LBM) has further strengthened its position as a
valuable tool in the field of computational fluid dynamics [
        <xref ref-type="bibr" rid="ref2">2</xref>
]. LBM has attracted increasing interest
in many industries and research organizations due to its high parallel efficiency and its ability to
discretize complex geometries with little effort. This has led to the development of large frameworks,
typically focused on specific LBM domains, with extremely complex and large code bases [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ].
However, LBM frameworks have not kept pace with the evolution of (massively parallel and distributed)
computing systems, resulting in complex codebases that are very difficult to modify. Unfortunately, no
existing mini-app for LBM efficiently abstracts the problem while providing hints for performance tuning
and optimization.
      </p>
      <p>This paper proposes the first mini-application for LBM, with an implementation in SYCL that allows
not only performance evaluation but also performance portability on modern heterogeneous computing
systems. Specifically, this paper makes the following contributions:
• The, to the best of our knowledge, first performance-portable, tunable, SYCL-based
Lattice-Boltzmann mini-app, translated from an original FORTRAN implementation and capable of targeting
a wide range of multicore CPUs, GPUs, and accelerators.
• A performance portability analysis of SYCL semantics, including Unified Shared Memory, range,
and ND-range kernels, using a performance portability metric.
• An experimental evaluation of miniLB on NVIDIA V100S, AMD MI100, and Intel Max 1100 GPUs,
using multiple SYCL implementations and different SYCL semantic combinations, compared to
the manually-tuned FORTRAN version using different parallelism paradigms and compilers.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>Lattice Boltzmann Method The Lattice Boltzmann Method (LBM) emerged in the late 1980s as
an evolution of lattice gas cellular automata. It has since found numerous applications across various
complex flow problems, from fully developed turbulence to micro- and nanofluidics, and even
quark-gluon plasmas.</p>
      <p>The core concept of LBM is to solve a simplified Boltzmann kinetic equation for a set of discrete
distribution functions, known as populations, f_i(x⃗, t). These functions represent the probability of
finding a particle at position x⃗ and time t with a discrete velocity v⃗ = c⃗_i. The discrete velocities are
selected to ensure sufficient symmetry, thereby satisfying the mass, momentum, and energy conservation
laws of macroscopic hydrodynamics and maintaining rotational symmetry. Figure 1 illustrates the
lattice used for 2D LB simulations, featuring a set of nine discrete velocities (D2Q9). Instead of directly
solving the Navier-Stokes equations, LBM solves a kinetic equation storing nine populations for
each grid point, corresponding to the different velocity directions c⃗_i, including c⃗_0 = (0, 0). It is not necessary
to store derived quantities like velocity and density.</p>
      <p>In its compact form, the main LB equation is as follows:
f_i(x⃗ + c⃗_i, t + 1) − f_i(x⃗, t) = −ω (f_i(x⃗, t) − f_i^eq(x⃗, t)) + S_i,   i ∈ [0, 8]   (1)
where x⃗ and c⃗_i are the position and velocity vectors in ordinary space, f_i^eq is the equilibrium distribution
function, and S_i is a source term for the interaction of the fluid with external (or internal) sources. The local
equilibria are provided by a lattice truncation, to second order in the Mach number Ma = u/c_s, of
the Maxwell-Boltzmann distribution, where c_s is the lattice sound speed and w_i is a set of weights
normalized to unity:</p>
      <p>
f_i^eq(x⃗, t) = w_i ρ (1 + A_i + B_i)   (2)
A_i = 3 (c⃗_i · u⃗) / c²,   B_i = 9 (c⃗_i · u⃗)² / (2c⁴) − 3 u² / (2c²)   (3)
where the A_i term is linear in the velocity and the B_i term is quadratic. Equation 1 represents two key
processes: the collision step (right-hand side), where the populations locally relax towards equilibrium,
and the streaming step (left-hand side), where the populations are propagated to the neighboring locations
x⃗ + c⃗_i at time t + 1. This scheme can be demonstrated to reproduce the Navier-Stokes equations for
an isothermal, quasi-incompressible fluid in terms of density and velocity [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
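<p>A direct transcription of the equilibrium above can serve as a reference point. The following sketch (our own illustration in C++, lattice units with c = 1) computes the nine D2Q9 equilibria; by construction it reproduces the conserved moments ∑_i f_i^eq = ρ and ∑_i c⃗_i f_i^eq = ρu⃗:</p>

```cpp
#include <array>

// D2Q9 discrete velocities and weights (lattice units, c = 1).
constexpr double CX9[9] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
constexpr double CY9[9] = {0, 0, 1, 0, -1, 1, 1, -1, -1};
constexpr double W9[9] = {4.0 / 9, 1.0 / 9, 1.0 / 9, 1.0 / 9, 1.0 / 9,
                          1.0 / 36, 1.0 / 36, 1.0 / 36, 1.0 / 36};

// Second-order equilibrium: f_i^eq = w_i * rho * (1 + A_i + B_i),
// with A_i linear and B_i quadratic in the macroscopic velocity u.
std::array<double, 9> equilibrium(double rho, double ux, double uy) {
    std::array<double, 9> feq{};
    const double u2 = ux * ux + uy * uy;
    for (int i = 0; i < 9; ++i) {
        const double cu = CX9[i] * ux + CY9[i] * uy;  // c_i . u
        const double A = 3.0 * cu;                    // linear term
        const double B = 4.5 * cu * cu - 1.5 * u2;    // quadratic term
        feq[i] = W9[i] * rho * (1.0 + A + B);
    }
    return feq;
}
```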
      <p>
        In the literature, several large Lattice Boltzmann frameworks exist, e.g. OpenLB [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], waLBerla [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and
Palabos [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, such frameworks usually present very large and complex code bases, which makes
them difficult to experiment with for research purposes. Our work aims at staying as simple as possible,
providing both a playground for experimenting with SYCL-specific or LB-specific optimizations and
good performance on multiple hardware platforms. Other attempts to verify the performance of
heterogeneous programming models on LB methods have been proposed in the literature: Ding et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
explore the performance of the SYCL and Kokkos programming models on an LB application, showing
performance pitfalls of both implementations. However, their work does not focus on evaluating single
SYCL features like miniLB does, but rather on raw application performance. Moreover, their analysis only
covers NVIDIA GPUs, while we examine performance portability across multiple vendors.
      </p>
      <p>
SYCL SYCL [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is a royalty-free, cross-platform C++ abstraction layer that enables developers to
write code for multiple heterogeneous devices, such as CPUs, GPUs, and FPGAs, in a convenient and
performance-portable way. SYCL enhances the C++ programming language by adding abstractions for
managing heterogeneous computing within ISO C++, aiming to align closely with the core language
specifications. Although originally designed to map onto OpenCL, since its third revision (SYCL 2020) the
specification allows for custom backends, such as NVIDIA CUDA, AMD HIP, and OpenMP. Key implementations
of SYCL include Intel’s OneAPI Data-Parallel C++ [
        <xref ref-type="bibr" rid="ref9">9</xref>
] and AdaptiveCpp [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], along with several other
smaller-size implementations [
        <xref ref-type="bibr" rid="ref11 ref12 ref13">11, 12, 13</xref>
        ]. The versatility of SYCL has led to various extensions for
specific heterogeneous computing scenarios, including distributed computing [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], real-time energy
optimization [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], and approximate computing [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>Performance Portability As HPC systems evolve with diverse hardware architectures, developing
efficient, cross-device application code becomes crucial. This has led to the rise of "performance
portability" in academic circles, which measures both an application’s ability to meet performance
benchmarks on specific platforms and its capacity to run across various hardware configurations.</p>
      <p>
        However, a universally accepted definition is absent. In our research, we adopt the definition
of performance portability by Pennycook et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]: "A measurement of an application’s performance
efficiency for a given task that can be successfully executed on all platforms within a specified set." The
formula to quantify performance portability is presented in Equation 4:
      </p>
      <p>PP(a, p, H) = |H| / ( ∑_{i ∈ H} 1 / e_i(a, p) ),  if application a is supported on every platform i ∈ H;  PP(a, p, H) = 0 otherwise.   (4)</p>
      <p>Here, a represents the application, p denotes the problem addressed by a, and H signifies the set
of target hardware. The performance portability metric PP is defined as the harmonic mean of the
application’s performance efficiency e_i(a, p) over the set of hardware H.</p>
      <p>
        Pennycook et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] highlight various methods for calculating application performance efficiency,
specifically: architectural efficiency, which measures achieved performance as a fraction of peak hardware
performance; and application efficiency, which measures achieved performance as a fraction of the
best-observed performance of the most optimized native implementation.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. miniLB Overview</title>
      <sec id="sec-3-1">
        <title>3.1. Computational Description</title>
        <p>
          miniLB is a bidimensional computational fluid dynamics code for single-phase incompressible flows,
with nine discrete velocities (D2Q9 in CFD jargon). It is a downsized version of a 3D FORTRAN 90
MPI+OpenACC full application developed by CINECA [
          <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
          ]. miniLB is written in C++20 and SYCL,
a single-source abstraction layer for heterogeneous computing. miniLB has no external dependencies
and uses no SYCL compiler-specific features, allowing it to run out-of-the-box on every platform with
any SYCL compiler. The code is open-source and available on GitHub.
miniLB implements a fused approach [
          <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
          ], where the collision and streaming operations are performed
in a single kernel. In this approach, the app holds a pre-collision population f_old and a post-collision
population f_new: at time t, input values are read through a scattered read from f_old, and post-collision
results are written to f_new. Finally, the two populations are swapped for time t + 1. Populations are
stored using a Structure-of-Arrays (SoA) layout, with a unit-strided vector for each population f_i. miniLB
has been designed to be highly and easily tunable to measure SYCL performance in a wide range of
scenarios, with a total of 96 possible configurations, described in section 4.
(Figure 1: (a) Lid-Driven Cavity; (b) Von Karman Street.)</p>
        <p>3.1.1. Numerical precision. miniLB supports four different numerical precisions to control how data is computed and stored: (i)
Single: quantities are stored in single precision and all floating point operations are performed in
single precision; (ii) Double: quantities are stored in double precision and all floating point operations
are performed in double precision; (iii) Mixed1: quantities are stored in half precision and all floating
point operations are performed in single precision; (iv) Mixed2: quantities are stored in single precision
and all floating point operations are performed in double precision.
        </p>
        <p>
          The Single precision option is the default one. A more comprehensive analysis of LBM numerical
precision has been explored in [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].
        </p>
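<p>As an illustration of both the fused collide-and-stream scheme of section 3.1 and the precision modes above, the following minimal C++ sketch (the names and the sequential loop are ours, not miniLB's actual SYCL kernels) performs one fused BGK step on a periodic D2Q9 grid stored in SoA layout, templated on the storage and compute types (e.g. StoreT = float with ComputeT = double mirrors the Mixed2 mode):</p>

```cpp
#include <array>
#include <vector>

// D2Q9 velocities and weights (lattice units).
constexpr int Q = 9;
constexpr int CX[Q] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
constexpr int CY[Q] = {0, 0, 1, 0, -1, 1, 1, -1, -1};
constexpr double W[Q] = {4.0 / 9, 1.0 / 9, 1.0 / 9, 1.0 / 9, 1.0 / 9,
                         1.0 / 36, 1.0 / 36, 1.0 / 36, 1.0 / 36};

// SoA populations: one unit-strided vector per discrete velocity.
template <class StoreT>
using Pop = std::array<std::vector<StoreT>, Q>;

// One fused collide-and-stream step on a periodic nx*ny grid: read f_old,
// relax locally toward equilibrium (BGK, relaxation rate omega), and write
// the post-collision value directly to the streamed site in f_new.
// StoreT is the storage precision, ComputeT the arithmetic precision.
template <class StoreT, class ComputeT>
void collide_stream(const Pop<StoreT>& f_old, Pop<StoreT>& f_new,
                    int nx, int ny, ComputeT omega) {
    for (int y = 0; y < ny; ++y)
        for (int x = 0; x < nx; ++x) {
            const int idx = x + nx * y;
            ComputeT rho = 0, ux = 0, uy = 0;
            for (int i = 0; i < Q; ++i) {   // macroscopic moments
                const ComputeT f = f_old[i][idx];
                rho += f; ux += CX[i] * f; uy += CY[i] * f;
            }
            ux /= rho; uy /= rho;
            const ComputeT u2 = ux * ux + uy * uy;
            for (int i = 0; i < Q; ++i) {
                const ComputeT cu = CX[i] * ux + CY[i] * uy;
                const ComputeT feq = W[i] * rho *
                    (1 + 3 * cu + ComputeT(4.5) * cu * cu - ComputeT(1.5) * u2);
                const int xd = (x + CX[i] + nx) % nx;  // periodic streaming
                const int yd = (y + CY[i] + ny) % ny;
                f_new[i][xd + nx * yd] =
                    StoreT(f_old[i][idx] - omega * (f_old[i][idx] - feq));
            }
        }
}
```

<p>At rest (f_i = w_i everywhere) the step is a fixed point, and total mass is conserved by construction.</p>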
      </sec>
      <sec id="sec-3-2">
        <title>3.2. FORTRAN-based Parallelization</title>
        <p>The original FORTRAN app uses a directive-based parallelization that minimizes code refactoring
when moving from CPU to GPU implementations. It supports several programming models: OpenACC pragmas,
OpenMP offload, and the FORTRAN built-in DO CONCURRENT construct.</p>
        <p>While GPU vendors provide native compilation toolchains for some of those programming models,
not all of them are supported by every vendor: for example, OpenACC supports NVIDIA GPUs natively and
AMD GPUs only through the Cray compiler, while DO CONCURRENT is not supported on AMD discrete GPUs.
Furthermore, to achieve optimal performance on specific hardware, users must compile their programs
with the proprietary vendor compiler (e.g. nvFORTRAN for NVIDIA, amdclang for AMD, ifx for Intel).
This requirement increases fragmentation and adds complexity, as it necessitates testing with additional
toolchains.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Use Cases</title>
        <p>
          miniLB supports three classic CFD benchmarks: Lid-Driven Cavity (LDC), Taylor-Green Vortex, and
Von Karman Street (VKS). The app also produces VTK output files for offline visualization with tools
like ParaView [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. In this paper, we focus on two of them: LDC and VKS.
Lid-Driven Cavity The LDC problem involves a square or rectangular cavity closed on all sides. The
top lid of the cavity moves at a constant velocity, while the other three walls remain stationary. This
setup generates a complex flow pattern within the cavity, characterized by the following: (a) no-slip
boundary conditions: all walls, including the moving lid, have a no-slip boundary condition, meaning
the fluid velocity relative to the wall is zero; (b) driven flow: the movement of the top lid at a constant
tangential velocity u_0 drives the flow within the cavity. Figure 1a shows an example output.
Von Karman Street The VKS occurs when fluid flows past a cylindrical object and the flow separates
alternately from either side of the object, creating a pattern of vortices in the wake. This phenomenon
is characterized by: (a) periodic vortex shedding: alternating vortices are shed from opposite sides of
the cylinder, creating a staggered array of vortices downstream; (b) flow regimes: the flow pattern
depends on the Reynolds number, a dimensionless number representing the ratio of inertial
to viscous forces in the flow. A visual example is given in Figure 1b.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. SYCL Porting</title>
      <sec id="sec-4-1">
        <title>4.1. Kernel Parallelism</title>
        <p>In this section, we analyze the principal SYCL features used during the porting and the available miniLB
configurations.</p>
        <p>
          SYCL uses parallel_for to declare parallel code regions. In particular, SYCL offers two variants of it:
range and NDrange. The range variant is the simplest, as it requires the user to specify only the iteration
space. This allows the runtime to select the most appropriate number of threads depending on the
target device without user intervention. NDrange, on the other hand, allows the user to manually tweak the
local iteration space (e.g. the local work group size in OpenCL), allowing more fine-grained optimization
but also requiring the user to manually tune the local size to match the current device. miniLB
kernels support both range and NDrange. The app defaults to range parallel_for, but the latter can
be activated by setting the -DBGK_SYCL_ND_RANGE compile-time parameter. Currently, the app only
supports tweaking the collide-and-stream kernel size, through the -DBGK_SYCL_ND_RANGE_[X,Y]_SIZE
compile-time parameter.
        </p>
      </sec>
      <sec id="sec-4-2a">
        <title>4.2. Data Management and Access</title>
        <p>
          miniLB uses Unified Shared Memory (USM) for data management instead of the Buffer-Accessor (BA)
paradigm, as it shows more reliable and stable performance [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. USM provides three allocation kinds:
malloc_device allocations are placed directly on the target device, malloc_host allocates host page-locked
memory accessible from both host and device, and malloc_shared allocations are shared between devices
through an automatic memory migration system.
miniLB implements two memory management backends, one based on malloc_device and
malloc_host and one using malloc_shared, controllable via the compile-time parameter
-DBGK_SYCL_MALLOC_SHARED. In the former, miniLB stores a device and a host pointer for each
population and manually migrates memory between host and device. The host memory is pinned
because this increases the bandwidth on some GPU architectures [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]; in the latter, a single
pointer is stored and the SYCL runtime handles the memory migration. In addition, the parameter
-DBGK_SYCL_ENABLE_PREFETCH enables hints to the SYCL runtime to prefetch memory on the host/device.
        </p>
        <p>
          Multi-dimensional data are defined with MDspan [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. It is a lightweight, non-owning view that
allows a piece of memory to be treated as a multi-dimensional entity. MDspan allows us to define the
extents (i.e. the number of dimensions and sizes), the layout (e.g. row-major, column-major, etc.), and
the data accessor (i.e. how to translate the pointer/index pair to a memory location). miniLB stores
a two-dimensional MDspan for both host and device to reduce the view construction overhead. In
addition, a compile-time parameter -DBGK_SYCL_LAYOUT_[RIGHT|LEFT] switches the data layout to
row-major or column-major respectively. miniLB defaults to column-major as it is the default layout in
the original FORTRAN application.
        </p>
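<p>The effect of the two layouts can be summarized by their index mappings. The following sketch (helper names are ours) mirrors MDspan's layout_right and layout_left for a 2-D view over a flat allocation:</p>

```cpp
#include <cstddef>

// layout_right (row-major): consecutive elements of a row are contiguous.
std::size_t index_right(std::size_t row, std::size_t col,
                        std::size_t rows, std::size_t cols) {
    (void)rows;  // unused: row-major strides depend only on cols
    return row * cols + col;
}

// layout_left (column-major, miniLB's default, matching FORTRAN):
// consecutive elements of a column are contiguous.
std::size_t index_left(std::size_t row, std::size_t col,
                       std::size_t rows, std::size_t cols) {
    (void)cols;  // unused: column-major strides depend only on rows
    return col * rows + row;
}
```

<p>Which layout yields contiguous (and hence coalesced) accesses depends on which index varies fastest across neighboring work items, a point that becomes relevant in the evaluation of section 5.3.</p>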
        <p>(Table 1: miniLB tunable features, covering precision, queue, and USM configuration.)</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.3. Task Scheduling</title>
        <p>To submit tasks, a SYCL user needs to create a queue bound to a specific device. SYCL supports
both out-of-order and in-order submission. With the former, submitted tasks are executed without a
defined order, allowing for parallel kernel execution; with USM, data dependencies between kernels
must then be explicitly tracked. With in-order submission, instead, kernels are executed in FIFO order. This
hinders the ability to parallelize kernel executions, but it removes the overhead of dependency tracking.
miniLB supports both queue configurations, controlled by the parameter -DBGK_SYCL_IN_ORDER_QUEUE.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental evaluation</title>
      <sec id="sec-5-1">
        <title>5.1. Experimental Setup</title>
        <p>We evaluated miniLB on three GPUs from the three principal GPU vendors: NVIDIA Tesla V100S, AMD
MI100, and Intel Max 1100. We tested the app on two use cases, Lid-Driven Cavity (LDC) and Von
Karman Street (VKS), using a 4096 × 4096 grid with Reynolds number Re = 10000 and 100,000
timesteps. As performance metric, we use MLUPs (Mega Lattice Updates per second), which indicates how
many millions of grid points are updated each second.</p>
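<p>For reference, the MLUPs metric is computed as follows (a small sketch of ours, not miniLB code):</p>

```cpp
#include <cstddef>

// MLUPs (Mega Lattice Updates per second): millions of grid points updated
// per second of wall-clock time, for an nx*ny grid advanced `steps` times.
double mlups(std::size_t nx, std::size_t ny, std::size_t steps, double seconds) {
    return static_cast<double>(nx) * ny * steps / (seconds * 1e6);
}
```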
        <p>For the SYCL implementations, we chose AdaptiveCpp (commit sha a3c5c9d) and DPC++ (commit sha
ea0c067), both of which support all three architectures. For AdaptiveCpp, we target the generic backend,
which can target every hardware through an integrated JIT compiler. For the FORTRAN version, we
used NVHPC 24.5 for NVIDIA, amdclang ROCm 6.0 for AMD, and IFX 2024.0.0 for Intel. To reduce the
number of combinations in the tuning space, this paper explores all combinations except Mixed1
and Mixed2 precision, SYCL out-of-order queues, row-major layout, and the Shared+Prefetch
configuration.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Use Case Evaluation</title>
        <p>(Figure: speedup of miniLB over the FORTRAN baseline for the LDC and VKS use cases on the NVIDIA V100S, AMD MI100, and Intel Max 1100 GPUs.)</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. SYCL Feature Evaluation</title>
        <p>
          SYCL provides a great variety of built-in constructs to parallelize an application on heterogeneous
hardware. However, the performance of each construct heavily depends on the target use case [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ].
Furthermore, different SYCL platforms may implement the same feature in different ways, adding
further complexity. Figure 4 shows the performance of the Lid-Driven Cavity use case for each
combination of USM allocation and SYCL kernel type, on every hardware and precision, using both
AdaptiveCpp and Intel DPC++. For NDrange kernels, the work group size is the largest possible
on the target GPU (i.e. 1024 threads), organized as a block of 1x1024 threads. The values are normalized
against the FORTRAN OpenMP offload backend, as it is the only one available on every hardware. In
those benchmarks, we add a checkpoint at T = 50,000 to force data movement between host and device.
        </p>
        <p>From the figure, it is clear that kernel performance in SYCL heavily depends on the adopted
parallelization type. In particular, SYCL range kernels are above the baseline in 33% of the cases, while
NDrange kernels beat the baseline in 85% of the configurations. However, some discrepancies
between SYCL implementations arise: AdaptiveCpp range uses a work group size of 128 threads,
organized as a 16x16 grid for 2-dimensional kernels. On NVIDIA GPUs, the grid is divided into 256x256
blocks. This results in 15% more uncoalesced global memory accesses compared to the NDrange
version, where the work group size is unrolled along the y-axis (1x1024, or 1024x1 in row-major).
Similar considerations also apply to the other architectures. On the other hand, DPC++ range kernels
always select the largest possible work group size on GPUs and place all the threads in one dimension
(e.g. 1x1024 if the kernel is 2-dimensional). This heuristic performs well on both AMD and Intel, where
range picks the same size as the one manually specified for NDrange. On NVIDIA GPUs, however,
range kernel performance is poor, achieving only 15% of the NDrange performance. The difference
in performance between hardware is due to a small but significant change in the work-group-size
heuristic on NVIDIA hardware: while on AMD and Intel hardware threads are placed on the first
dimension (e.g. 1024x1x1), on NVIDIA hardware DPC++ places the threads on the second dimension
(1x1024x1). This results in 93% more uncoalesced accesses on the Tesla V100S compared to the other
hardware. Switching to a row-major layout improves NVIDIA performance but degrades performance
on AMD and Intel GPUs for the same reason. This discrepancy between work group size definitions
severely limits range performance portability across hardware.</p>
        <p>Regarding data management, shared allocations are not shown on AMD hardware: on AMD
GPUs, on-demand page migration between host and device memory relies on the XNACK feature,
which is disabled by default. XNACK is known to be experimental and unstable: when we enabled it,
we encountered random kernel failures and GPU hangs, so we disabled it for this analysis. Without
XNACK, shared allocations behave like host allocations, meaning the data is allocated on the host and
transferred to the device at each memory access, generating up to a 1000x slowdown compared to the
other implementation. Therefore, for this evaluation, we consider the AMD shared backend as not
available. On average, device allocations beat the baseline in 60% of the configurations, while shared
allocations only in 43% of the cases. Moreover, shared allocation performance depends on hardware
support: on the NVIDIA GPU they beat the baseline 50% of the time, while on the Intel GPU they only
achieve better performance than the baseline in 37% of the cases. This variability, together with the
unreliability of UVM on AMD, raises questions about its applicability to production scenarios.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Performance Portability evaluation</title>
        <p>(Table 2 (Intel DPC++): arithmetic intensity I (FLOP/Byte), FLOP rate (GFlop/s), roofline peak (TFlop/s), and roofline efficiency e′ of the col_MC kernel, per hardware, USM allocation, and kernel type. NVIDIA V100S — Shared/Range: 0.38, 125, 0.43, 29%; Shared/NDrange: 1.37, 976, 1.55, 63%; Device/Range: 0.38, 124, 0.43, 29%; Device/NDrange: 1.27, 934, 1.55, 60%. Intel Max 1100 — Shared/Range: 1.22, 781, 0.976, 80%; Device/NDrange: 1.21, 775, 0.968, 80%. The AMD MI100 values are estimated as described below.)</p>
        <p>
To measure the application performance portability, we employ the Pennycook PP metric [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
However, we do not have a native optimized application version for each target hardware. In addition,
calculating the application architectural efficiency can be challenging, as it requires identifying the
relevant bottlenecks on each hardware. For those reasons, to calculate the performance portability we
employ the roofline efficiency e′, which measures the distance of the application FLOP/s from the top
of the roofline. Roofline efficiency has been demonstrated to successfully approximate architectural
efficiency [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. To calculate the roofline efficiency, one needs the device peak performance
P = min(F_peak, B × I_a)
where F_peak is the device floating point peak, B is the device peak bandwidth, and I_a is the
arithmetic intensity of the application a, measured as the ratio between the application FLOP count and the
memory transferred. To capture those values, we used Nsight Compute on the NVIDIA platform [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] and
Intel Advisor on Intel [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. However, we encountered two difficulties.
        </p>
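<p>The roofline quantities above can be sketched as follows (symbol and function names are ours):</p>

```cpp
#include <algorithm>

// Roofline device peak for an application a with arithmetic intensity I_a
// (FLOP/Byte): P = min(F_peak, B * I_a). Units must be consistent
// (e.g. GFlop/s for F_peak and GB/s for bandwidth).
double roofline_peak(double f_peak, double bandwidth, double intensity) {
    return std::min(f_peak, bandwidth * intensity);
}

// Roofline efficiency e': achieved FLOP rate as a fraction of that peak.
double roofline_efficiency(double achieved, double f_peak,
                           double bandwidth, double intensity) {
    return achieved / roofline_peak(f_peak, bandwidth, intensity);
}
```

<p>In the bandwidth-bound regime (B × I_a &lt; F_peak), which applies to miniLB, the peak reduces to B × I_a, so e′ directly reflects how well the kernel uses the available memory bandwidth.</p>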
        <p>
          First, the AMD MI100 does not provide FLOP counters, therefore it is not possible to measure the
application FLOP/s directly. However, miniLB kernels do not have any device-dependent branches, therefore we
expect the number of floating point operations to be the same on all three platforms. To estimate the
application FLOP/s, we use a methodology similar to the one defined in [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]:
        </p>
        <p>F_AMD = F_NVIDIA × R_kernel, where F_NVIDIA is the FLOP rate of the corresponding
implementation on NVIDIA hardware, and R_kernel is the ratio between the kernel performance of the
application on the two platforms.</p>
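<p>The estimate can be sketched as follows (function and symbol names are ours):</p>

```cpp
// Estimated FLOP rate for a device without FLOP counters (here the MI100):
// the FLOP rate measured on NVIDIA hardware, scaled by the ratio between
// the kernel performance on the two platforms. Valid because the kernels
// execute the same number of floating point operations on both devices.
double estimate_flop_rate(double flop_rate_nvidia, double kernel_perf_target,
                          double kernel_perf_nvidia) {
    return flop_rate_nvidia * (kernel_perf_target / kernel_perf_nvidia);
}
```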
        <p>The second problem is related to AdaptiveCpp on Intel. Profiling an AdaptiveCpp-compiled
application with Intel Advisor results in an internal profiler exception, therefore we could not measure the
FLOP rate on the Intel Max. Depending on the kernel type, we followed different procedures:
• NDrange: both DPC++ and AdaptiveCpp use the same work group size, therefore we
approximate the FLOP rate by multiplying the DPC++ FLOP rate by the ratio between the AdaptiveCpp and
DPC++ performance.
• range: because AdaptiveCpp uses a different work group size than DPC++, we cannot
approximate the kernel bandwidth with high precision, therefore we skip this configuration.</p>
        <p>We measured the memory bandwidth of each device: our results showed a bandwidth of 1.1 TB/s
on the NVIDIA V100S, 0.89 TB/s on the AMD MI100, and 0.8 TB/s on the Intel Max 1100. Tables 2 and 3
show the performance values collected for the fused collide-and-stream kernel, called the col_MC kernel; the ′
value indicates the distance from the roofline peak. The roofline results for both precisions are shown in figure
5. As with every LB application, miniLB is bandwidth-bound, therefore the device peak depends on the
device bandwidth and the arithmetic intensity. Due to space constraints, we only show the results for single
precision. Interestingly, miniLB achieves at least 62% of the device peak on every target hardware. We
achieve the highest roofline efficiency on the Intel Max, reaching up to 91% of the peak with
AdaptiveCpp and NDrange. While DPC++ and AdaptiveCpp show similar roofline efficiency for the
NDrange kernel, AdaptiveCpp range is 37% and 8% slower than DPC++ on AMD and
Intel with device allocation, respectively. However, because of the previously mentioned uncoalesced access issue,
DPC++ is 46% and 42% slower than AdaptiveCpp with shared and device allocation, respectively,
on the NVIDIA V100S. This means that, while DPC++ could achieve better performance, on average
the AdaptiveCpp range heuristic is more portable among devices.</p>
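        <p>For a bandwidth-bound kernel, the attainable roofline peak and the efficiency figures quoted above follow directly from the classic roofline model. A small sketch (illustrative names; units are FLOP/s, FLOP/Byte, and Byte/s):</p>

```cpp
#include <algorithm>
#include <cassert>

// Attainable roofline performance: the kernel is limited either by the
// device's compute peak or by (arithmetic intensity x memory bandwidth),
// whichever is lower. For a bandwidth-bound kernel the second term wins.
double roofline_peak(double peak_flops, double ai, double bandwidth) {
    return std::min(peak_flops, ai * bandwidth);
}

// Roofline efficiency: measured FLOP rate relative to the attainable peak.
double roofline_efficiency(double measured_flops, double peak_flops,
                           double ai, double bandwidth) {
    return measured_flops / roofline_peak(peak_flops, ai, bandwidth);
}
```

        <p>For instance, with the Intel Max 1100 figures above (FP64 peak 9.15 TF/s, 0.8 TB/s measured bandwidth) and a hypothetical arithmetic intensity of 2 FLOPs/Byte, the attainable peak is the bandwidth-bound 1.6 TF/s, not the compute peak.</p>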
        <p>Finally, Tables 4 and 5 show the performance portability (PP) metric results for each precision. PP′
represents the value of performance portability considering only the current combination of data
management backend and kernel type, while PP is the maximum PP′ across all data management
backends. Because we treated shared allocation as not available on AMD hardware, PP is 0 for each
shared configuration. miniLB achieves a minimum of 60% performance portability for every
precision, showing how SYCL can efficiently target any of the major vendor GPUs. NDrange achieves
a median portability of 78% across all precisions, while range reaches a median portability of 62%. It is
worth noting that, while NDrange required a tuning phase to find the best work group size for each
hardware, range achieved these results without any user intervention.
</p>
        <p>[Figure 5: Roofline plots of FLOP rate (GFLOP/s) versus arithmetic intensity (FLOPs/Byte) for single and double precision on the Intel Max 1100 (FP64 peak: 9.15 TF/s, HBM2 bandwidth: 0.8 TB/s), comparing AdaptiveCpp and Intel DPC++ with range and NDrange kernels.]</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <sec id="sec-6-7">
        <p>
We presented miniLB, the first highly tunable, SYCL-based lattice Boltzmann mini-app. We successfully
ported the original FORTRAN application to C++ and SYCL, achieving a considerable speedup on every
platform. We analyzed a subset of the 96 possible miniLB configuration settings to evaluate multiple
combinations of SYCL features. We found that AdaptiveCpp and DPC++ portability can depend heavily
on the targeted SYCL feature, e.g. the DPC++ range heuristic is less performance portable than the
AdaptiveCpp one. Finally, we analyzed miniLB performance portability using the well-known PP metric. Our
results show that miniLB achieves high performance portability, with a PP value of up to 78%.
As future work, we plan to implement more SYCL features and optimizations, e.g. local memory,
buffer-accessors, and specialization constants, as well as extending the app to multi-GPU
systems using both low-level MPI calls and high-level SYCL frameworks like Celerity. Furthermore, we
would like to extend miniLB to the 3D case and measure the impact of mixed-precision computation in
SYCL on both numerical stability and energy consumption.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This project has received funding from the Italian Ministry of University and Research under PRIN 2022
grant No. 2022CC57PY (LibreRT project).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>Exascale proxy applications project</article-title>
          , https://proxyapps.exascaleproject.org/,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K. V.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Straka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. W.</given-names>
            <surname>Tavares</surname>
          </string-name>
          ,
          <article-title>Lattice boltzmann methods for industrial applications</article-title>
          ,
          <source>Industrial &amp; Engineering Chemistry Research</source>
          <volume>58</volume>
          (
          <year>2019</year>
          )
          <fpage>16205</fpage>
          -
          <lpage>16234</lpage>
          . URL: https://doi.org/10.1021/ acs.iecr.9b02008. doi:
          <volume>10</volume>
          .1021/acs.iecr.9b02008.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Krause</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kummerländer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Avis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kusumaatmaja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dapelo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Klemens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaedtke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hafen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Trunk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Marquardt</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.-L. Maier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Haussmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Simonis</surname>
          </string-name>
          ,
          <article-title>Openlb-open source lattice boltzmann code</article-title>
          ,
          <source>Computers and Mathematics with Applications</source>
          <volume>81</volume>
          (
          <year>2021</year>
          )
          <fpage>258</fpage>
          -
          <lpage>288</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0898122120301875. doi:https://doi. org/10.1016/j.camwa.
          <year>2020</year>
          .
          <volume>04</volume>
          .033,
          <article-title>development and Application of Open-source Software for Problems with Numerical PDEs</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Eibl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Godenschwager</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kohl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kuron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rettinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schornbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schwarzmeier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Thönnes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Köstler</surname>
          </string-name>
          , U. Rüde,
          <article-title>walberla: A block-structured high-performance framework for multiphysics simulations</article-title>
          ,
          <source>Computers and Mathematics with Applications</source>
          <volume>81</volume>
          (
          <year>2021</year>
          )
          <fpage>478</fpage>
          -
          <lpage>501</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0898122120300146. doi:https: //doi.org/10.1016/j.camwa.
          <year>2020</year>
          .
          <volume>01</volume>
          .007,
          <article-title>development and Application of Opensource Software for Problems with Numerical PDEs</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Latt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Malaspinas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontaxakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parmigiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lagrava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Brogi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Belgacem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Thorimbert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Leclaire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Marson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lemus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kotsalos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Conradin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Coreixas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Petkantchin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Raynaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Beny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chopard</surname>
          </string-name>
          ,
          <article-title>Palabos: Parallel lattice boltzmann solver</article-title>
          ,
          <source>Computers and Mathematics with Applications</source>
          <volume>81</volume>
          (
          <year>2021</year>
          )
          <fpage>334</fpage>
          -
          <lpage>350</lpage>
          . URL: https://www.sciencedirect.com/science/article/ pii/S0898122120301267. doi:https://doi.org/10.1016/j.camwa.
          <year>2020</year>
          .
          <volume>03</volume>
          .022,
          <article-title>development and Application of Open-source Software for Problems with Numerical PDEs</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kusumaatmaja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kuzmin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Shardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silva</surname>
          </string-name>
          , E. Viggen,
          <source>The Lattice Boltzmann Method: Principles and Practice</source>
          , Graduate Texts in Physics, Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y. Che,</surname>
          </string-name>
          <article-title>Evaluating performance portability of sycl and kokkos: A case study on lbm simulations, in: 2023 IEEE Intl Conf on Parallel and Distributed Processing with Applications, Big Data and Cloud Computing, Sustainable Computing and Communications, Social Computing and Networking</article-title>
          (ISPA/BDCloud/SocialCom/SustainCom),
          <year>2023</year>
          , pp.
          <fpage>328</fpage>
          -
          <lpage>335</lpage>
          . doi:
          <volume>10</volume>
          .1109/
          <string-name>
            <surname>ISPA-BDCloud-SocialCom-SustainCom59178</surname>
          </string-name>
          .
          <year>2023</year>
          .
          <volume>00075</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <source>SYCL 2020 specification</source>
          , https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ashbaugh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bader</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Brodman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hammond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kinsner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennycook</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schulz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sewall</surname>
          </string-name>
          ,
          <article-title>Data parallel c++: Enhancing sycl through extensions for productivity and performance</article-title>
          , in: Int. Workshop on OpenCL,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>2</lpage>
          . doi:
          <volume>10</volume>
          .1145/3388333.3388653.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Alpay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Heuveline</surname>
          </string-name>
          ,
          <article-title>Sycl beyond opencl: The architecture, current state and future direction of hipsycl</article-title>
          , in: International Workshop on OpenCL,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>1</lpage>
          . doi:
          <volume>10</volume>
          .1145/3388333. 3388658.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gozillon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Keryell</surname>
          </string-name>
          , L.-
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          , G. Harnisch, P. Keir,
          <article-title>trisycl for xilinx fpga</article-title>
          ,
          <source>in: Int. Conference on High Performance Computing and Simulation (HPCS)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Agung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Takizawa</surname>
          </string-name>
          ,
          <article-title>Neosycl: A sycl implementation for sx-aurora tsubasa</article-title>
          ,
          <source>in: The International Conference on High Performance Computing in Asia-Pacific Region, HPC Asia</source>
          <year>2021</year>
          ,
          <year>2021</year>
          , p.
          <fpage>50</fpage>
          -
          <lpage>57</lpage>
          . doi:
          <volume>10</volume>
          .1145/3432261.3432268.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Thoman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Knorr</surname>
          </string-name>
          , L. Crisci,
          <article-title>Simsycl: A sycl implementation targeting development, debugging, simulation and conformance</article-title>
          ,
          <source>in: Proceedings of the 12th International Workshop on OpenCL and SYCL</source>
          , IWOCL '24,
          Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          . URL: https://doi.org/10.1145/3648115.3648136. doi:
          <volume>10</volume>
          .1145/3648115.3648136.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>Salzmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Knorr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Thoman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gschwandtner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cosenza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fahringer</surname>
          </string-name>
          ,
          <article-title>An asynchronous dataflow-driven execution model for distributed accelerator computing</article-title>
          ,
          <source>in: IEEE 23rd Int. Symposium on Cluster, Cloud and Internet Computing (CCGrid)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>93</lpage>
          . doi:
          <volume>10</volume>
          .1109/CCGrid57682.
          <year>2023</year>
          .
          <volume>00018</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. D'Antonio</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Carpentieri</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Cosenza</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Ficarelli</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Cesarini</surname>
          </string-name>
          , Synergy:
          <article-title>Fine-grained energy-eficient heterogeneous computing for scalable energy saving</article-title>
          ,
          <source>in: International Conference for High Performance Computing, Networking, Storage and Analysis (SC)</source>
          ,
          <year>2023</year>
          . doi:
          <volume>10</volume>
          .1145/ 3581784.3607055.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Carpentieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cosenza</surname>
          </string-name>
          ,
          <article-title>Towards a sycl api for approximate computing</article-title>
          , in: International Workshop on OpenCL,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>2</lpage>
          . doi:
          <volume>10</volume>
          .1145/3585341.3585374.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Pennycook</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Sewall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. W.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>A metric for performance portability, arXiv preprint (</article-title>
          <year>2016</year>
          ). doi:arXiv:
          <fpage>1611</fpage>
          .
          <fpage>07409</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>G.</given-names>
            <surname>Falcucci</surname>
          </string-name>
          , G. Amati,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fanelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Krastev</surname>
          </string-name>
          , G. Polverino,
          <string-name>
            <given-names>M.</given-names>
            <surname>Porfiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Succi</surname>
          </string-name>
          ,
          <article-title>Extreme flow simulations reveal skeletal adaptations of deep-sea sponges</article-title>
          ,
          <source>Nature</source>
          <volume>595</volume>
          (
          <year>2021</year>
          )
          <fpage>537</fpage>
          -
          <lpage>541</lpage>
          . URL: https://doi.org/10.1038/s41586-021-03658-1. doi:
          <volume>10</volume>
          .1038/s41586-021-03658-1.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>G.</given-names>
            <surname>Amati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Succi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fanelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Krastev</surname>
          </string-name>
          , G. Falcucci,
          <article-title>Projecting lbm performance on exascale class architectures: A tentative outlook</article-title>
          ,
          <source>Journal of Computational Science</source>
          <volume>55</volume>
          (
          <year>2021</year>
          )
          <article-title>101447</article-title>
          . URL: https://www.sciencedirect.com/science/article/pii/S1877750321001289. doi:https://doi.org/ 10.1016/j.jocs.
          <year>2021</year>
          .
          <volume>101447</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>K.</given-names>
            <surname>Mattila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hyväluoma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Timonen</surname>
          </string-name>
          , T. Rossi,
          <article-title>Comparison of implementations of the latticeboltzmann method</article-title>
          ,
          <source>Computers and Mathematics with Applications</source>
          <volume>55</volume>
          (
          <year>2008</year>
          )
          <fpage>1514</fpage>
          -
          <lpage>1524</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0898122107006232. doi:https://doi.org/ 10.1016/j.camwa.
          <year>2007</year>
          .
          <volume>08</volume>
          .001, mesoscopic Methods in Engineering and Science.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Latt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Coreixas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Beny</surname>
          </string-name>
          ,
          <article-title>Cross-platform programming model for many-core lattice boltzmann simulations</article-title>
          ,
          <source>PloS One</source>
          <volume>16</volume>
          (
          <year>2021</year>
          )
          <article-title>e0250306</article-title>
          . URL: https://doi.org/10.1371/journal.pone.0250306. doi:
          <volume>10</volume>
          .1371/journal.pone.
          <volume>0250306</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Krause</surname>
          </string-name>
          , G. Amati,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Harting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gekle</surname>
          </string-name>
          ,
          <article-title>Accuracy and performance of the lattice Boltzmann method with 64-bit, 32-bit, and customized 16-bit number formats</article-title>
          ,
          <source>Phys. Rev. E</source>
          <volume>106</volume>
          (
          <year>2022</year>
          )
          <fpage>015308</fpage>
          . URL: https://link.aps.org/doi/10.1103/PhysRevE.106.015308. doi:10.1103/PhysRevE.106.015308.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ahrens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Geveci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Law</surname>
          </string-name>
          ,
          <article-title>ParaView: An end-user tool for large data visualization</article-title>
          , in: Visualization Handbook, Elsevier,
          <year>2005</year>
          . ISBN 978-0123875822.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>L.</given-names>
            <surname>Crisci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Carpentieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Thoman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alpay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Heuveline</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cosenza</surname>
          </string-name>
          ,
          <article-title>SYCL-Bench 2020: Benchmarking SYCL 2020 on AMD, Intel, and NVIDIA GPUs</article-title>
          ,
          <source>in: Proceedings of the 12th International Workshop on OpenCL and SYCL</source>
          , IWOCL '24, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          . URL: https://doi.org/10.1145/3648115.3648120. doi:10.1145/3648115.3648120.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <article-title>How to optimize data transfers in CUDA</article-title>
          , https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Hollman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Adelstein-Lelbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. C.</given-names>
            <surname>Edwards</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hoemmen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sunderland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Trott</surname>
          </string-name>
          ,
          <article-title>mdspan in C++: A case study in the integration of performance portable features into international language standards</article-title>
          ,
          <source>CoRR abs/2010.06474</source>
          (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/2010.06474. arXiv:2010.06474.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>X.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ramos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. W.</given-names>
            <surname>Schulz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lowery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Curtis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Pietrantonio</surname>
          </string-name>
          ,
          <source>AMDResearch/omniperf: v2.0.1 (03 June 2024)</source>
          ,
          <year>2024</year>
          . URL: https://doi.org/10.5281/zenodo.7314631. doi:10.5281/zenodo.7314631.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kwack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tramm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bertoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ghadar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Homerding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Knight</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Parker</surname>
          </string-name>
          ,
          <article-title>Evaluation of performance portability of applications and mini-apps across AMD, Intel and NVIDIA GPUs</article-title>
          ,
          <source>in: Int. Workshop on Performance, Portability and Productivity in HPC (P3HPC)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>45</fpage>
          -
          <lpage>56</lpage>
          . doi:10.1109/P3HPC54578.2021.00008.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <article-title>NVIDIA profiling tools</article-title>
          , https://developer.nvidia.com/tools-overview,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <article-title>Intel Advisor homepage</article-title>
          , https://www.intel.com/content/www/us/en/developer/tools/oneapi/advisor.html,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>