<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring Trade-offs of Compiler Optimizations to Enable Performance Portability for Multi-level Memory Hierarchies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aleksei Levchenko</string-name>
          <email>a@expx.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Peter the Great St.Petersburg Polytechnic University</institution>
          ,
          <addr-line>Saint Petersburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The performance portability problem manifests itself on architectures with deep memory hierarchies, in particular as a result of insufficient spatial locality support in compiler infrastructures. A polyhedral optimization approach can target spatial locality, but it faces a number of challenges, such as ambiguous compatibility with other optimizations, a lack of polyhedral-ready benchmarks, and the effects of non-uniformity in real-world systems with multi-level memory. Complementing prior research on selecting optimizations, this paper focuses on the experimental characterization of loop tiling and vectorization using a full proxy application. The presented approach makes the portability of performance provable for target architectures with deep memory hierarchies. To this end, large-scale ccNUMA macronodes are considered as experimental prototypes for hypothetical HPC designs of a capacity-bandwidth type, capable of posing singular challenges for performance portability.</p>
      </abstract>
      <kwd-group>
        <kwd>Benchmarking</kwd>
        <kwd>Locality</kwd>
        <kwd>Loop tiling</kwd>
        <kwd>Multi-level memory hierarchy</kwd>
        <kwd>Performance portability</kwd>
        <kwd>Polyhedral model</kwd>
        <kwd>Program performance</kwd>
        <kwd>ccNUMA</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>In the run-up to the era of exascale computing, the forthcoming migration to
advanced future systems requires a joint consideration of the notorious performance
problem and the performance portability issue. The performance portability
challenge manifests itself as an architecture-dependent performance degradation of a
specialized version of code when it is compiled and run across multi-core systems
built on architectural principles different from the original ones. One of the
predicted features of next-generation systems of the capacity-bandwidth type is the
existence of logically indivisible, globally addressable petabytes of RAM obtained
by combining the memories of a set of computing nodes [ ]. The core of the
performance portability issue for this class of systems remains spatial locality,
which would most likely be suboptimal for contemporary scientific applications.
The polyhedral model is a promising and verifiable approach to achieving portability
of performance over architectures with hardly predictable spatial memory access
behavior. An important piece of the puzzle is to evaluate the effect of a complex
combination of polyhedral compilation techniques against the background of the
syntactic-level loop transformations already available [ , , ]. Despite the growing
number of papers on the development of polyhedral frameworks, there is, however, a
lack of methods to achieve and estimate performance portability, even for well-known
contemporary hardware. Accordingly, a number of steps might be proposed to clarify
the methodology of performance improvement predictions based on compiler
transformations.</p>
<p>Here, this paper makes the following contributions. The presented
approach makes it possible to evaluate the performance portability of code compiled
using the relevant compilation algorithm for systems with different depths of
hierarchical memory. The first step is to define the set of target systems, i.e.,
ccNUMA macronodes. At the second stage, ad hoc models are formulated
for the core parameters of performance portability, specifically spatial locality in
the case at hand. The third stage involves the selection of the most promising
compiler optimizations in conjunction with relevant HPC-ready benchmarks.
Further, at the fourth, experimental stage, the results of the optimizations are
compared with a reference model for architecturally affined systems with deepening
memory hierarchies. In the context of performance portability, a distinctive
feature of the proposed approach is a more reliable estimation of the trade-offs
of compilation algorithms for current large-scale systems and the possibility of
extrapolating these results to hypothetical exascale designs.</p>
<p>The rest of this paper is organized as follows. Section defines the set of target
systems and dissects the existing background on achieving performance
portability via compiler optimizations. Section presents the core part of the approach,
which includes model considerations for evaluating the effect of optimizations.
Section is about selecting transformations for performance portability. The results
of the experimental evaluation and discussions are reported in Section . Further,
Section refers to the previous research supporting the interim findings of this
paper and concerning the issues of locality transformations for performance
portability towards multi-level memory and of targeting spatial locality using polyhedral
optimizations. Finally, Section contains final remarks and discusses future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Background and Notation</title>
<p>At the first stage of the proposed approach, this section gives an idea of the
targets over which portability of performance is investigated. In the context
of this work, a special subset of performance portability is implied, where the
performance of unoptimized code is ported to the multi-level memory hierarchies
of macronodes. Macronodes are shared memory multi-machine nodes with a deep
memory hierarchy based on the Cache-Coherent Non-Uniform Memory Access
(ccNUMA) architecture. Table gives an overview of the available
macronode configurations. Since the hypothetical systems of the capacity-bandwidth
type will have a depth complexity of globally addressable memory that is not
comparable with the currently available one, the prototype of such a system should
have approximately extreme characteristics. Being shared memory clusters,
current ccNUMAs are equipped with a larger amount of RAM and more on-line
CPUs than are available on a typical general-purpose cluster node. The largest
available multi-machine node, the so-called jumbonode defined in paper [ ], is capable
of providing more than Tb of RAM at a time, with K CPUs running a single
operating system instance. At present, these features allow extrapolating its
performance to the hypothetical characteristics of future machines [ ]. Obviously,
the main difference between macronodes and the other get-at-able
architectures is the possibility of a much larger indivisible amount of globally addressable
memory controlled by one OS instance with a record number of on-line CPUs. It
is this criterion that has formed the basis for considering the ccNUMA architecture
as an affordable prototype and a primary target, given the lack of hypothetical
machines that have not been developed yet. This viewpoint is grounded in a
number of the previous works mentioned in Section .</p>
<p>A deep memory hierarchy causes significant challenges for loop-intensive
applications that function well in a general-purpose HPC environment but fail
frequently in circumstances where spatial locality degrades [ ]. In comparison
with other software levels (e.g., performance portability libraries), the path of
compiler analysis and optimization is evidently more promising in terms of
software/architecture codesign, due to better awareness of both the program and
the underlying hardware. In particular, the polyhedral model, a mathematical
framework for transforming affine loop nests, is a promising way to achieve
performance portability over targets with deep memory hierarchies via the
implementation of transformations as a single algebraic operation [ ]. Compilers based on the
polyhedral model provide alternative ways to use the available resources of parallelism
through analysis and transformation of loop-based code. Properly
implemented tiling, which reorganizes the computation iteration space to improve cache
reuse, can thus improve data locality and, as a consequence, the performance of
iterative algorithms for a class of numeric programs. In the case of a macronode,
the tile size is critical, and it is prescribed by the cache/TLB/NUMA node size,
which requires a multi-aspect automatic determination of the optimal boundaries
for the loop nest.</p>
<p>Therefore, it is argued that the deep memory hierarchy of the macronodes specified
above can pose singular performance portability challenges that are
difficult to meet in traditional computing clusters. While porting performance to
macronodes, the estimation of improvements and drawdowns should be mapped
to the predictions of performance models developed in advance for this context.</p>
    </sec>
    <sec id="sec-3">
      <title>Ad-Hoc Models</title>
<p>At the second stage, it is necessary to examine performance portability models for
the considered targets (macronodes). Strict performance models can be efficient for
characterizing the improvements provided by optimizers via several techniques. In this
respect, the challenge lies in a number of limitations of the disparate models. A
drawback here is that the cost models of compiler transformations often do not
go beyond the transformations themselves, as opposed to an external, memory-wide view
of the entire memory subsystem. Meanwhile, it is possible to consider the main
characteristics of the macronode that will affect the performance model of the
proxy application. In this context, a proxy application is a surrogate, representative
scientific code for which there is a number of equivalent implementations using
alternative parallel programming models and targeting multiple architectures.
The aspects of using proxy applications implied in this paper are within the
framework of the concept proposed in [ ]. The first equation associates the
computational proxy kernels, of which the proxy application consists, with a set of
code improvements applied by the compiler simultaneously or in turn [ ]. Let n
denote the number of optimized proxy kernels and Ti,m-node,opt the execution
time of proxy kernel i on the macronode after applying every optimization opt.
Under this assumption, the total execution time of the proxy application is</p>
      <p>T = ∑ (i = 1 .. n) Ti,m-node,opt.</p>
      <p>Next, the performance portability of a proxy kernel can be determined according
to the formula proposed by Pennycook et al. [ ] and reinterpreted here for
the case of macronodes:</p>
      <p>PPm-node,opt,proxy = |M| / ∑ (m ∈ M) (1 / T(m, proxy)), if the optimized proxy kernel runs for every m ∈ M, and 0 otherwise,</p>
      <p>where T(m, proxy) denotes the execution time of the optimized proxy kernel on
macronode m, and M is the set of macronodes. Here, spatial locality is considered
a key parameter of performance portability for overcoming the notorious memory
wall problem. Spatial locality reflects the tendency of application behavior to
access memory regions near regions that have been recently accessed.
To consider the impact of compiler optimizations on spatial locality, it is necessary
to analyze accesses to neighboring memory regions, which is the responsibility of the
compiler. The results of proxy kernels can be compared using the locality measure
previously defined by Dümmler et al. [ ] as the logarithmic geometric mean of
access distances, reinterpreted here for macronodes as</p>
      <p>Locm-node := ∑ (v ∈ V) ((1/Nv) ∑ (k = 1 .. Nv) log2 dv(k)),</p>
      <p>where dv(k) is the spatial access distance of a variable v ∈ V (in the multiset
of variables V), and Nv is the total number of accesses to the variable v.
The idea is to deduce the locality of computationally equivalent versions of the
proxy application with different compiler optimizations for target macronodes.</p>
<p>As a result, the components of the performance portability equation,
particularly the set of target systems and the main performance parameter (i.e., spatial
locality), were identified. Additionally, the definition of the proxy kernel/application
and the general formula for its execution time were given.</p>
    </sec>
    <sec id="sec-4">
      <title>Selecting Transformations for Performance Portability</title>
<p>The next stage of the proposed approach involves the selection of the most
promising transformations for the proxy application. Currently, the challenge
is a lack of applications that may be considered polyhedral benchmarks as
such. The Polybench suite, for instance, is commonly used for this purpose [ , ,
]. However, systems with deep memory can achieve deceptively high levels
of performance on small benchmarks but lose performance on tasks of more
realistic sizes. Therefore, of particular interest is a study of the effects of polyhedral
optimizations on the behavior of a full-fledged scientific application on a
real-world architecture, or at least of a proxy application that can be divided into
multiple proxy kernels. The experimental evaluation of optimized programs may
be limited by the availability of a suitable benchmark code and the ability of a
polyhedral optimizer to perform its transformations. As an example, the High
Performance Conjugate Gradients (HPCG) Benchmark provides SpMV and a
symmetric Gauss-Seidel preconditioner with loop-carried dependencies. HPCG
can be considered a proxy application, since it already has versions for different
parallel programming models, namely MPI, OpenMP, SHMEM [ ], etc. Although
in this paper HPCG remains under consideration as a proxy application, run-time
reordering transformations like sparse tiling, which improve data locality in general,
have high potential for the symmetric Gauss-Seidel kernel that dominates the
HPCG runtime.</p>
<p>Instead, a thoroughly studied proxy application, Livermore Unstructured
Lagrange Explicit Shock Hydrodynamics (LULESH), has been used so far to
evaluate the effects of the optimizations. LULESH solves one octant of the spherical
Sedov blast problem using Lagrange hydrodynamics [ ]; it is representative
of existing HPC codes and is able to demonstrate the complexity of the poor spatial
locality problem. This paper focuses on the traditional OpenMP implementation
of LULESH as a consequence of the core architectural characteristics of the
macronodes. In this respect, the OpenMP programming model, in which the base version
of LULESH is written alongside serial and MPI code, is considered to be widely
used mostly at the intra-node level. At the same time, it can hypothetically be
used in a hybrid MPI+OpenMP+X fashion, which is considered promising for
exascale supercomputer designs. The presence of more than K threads living
within terabytes of shared memory is a great stress test, leastwise against
the background of known benchmarking efforts.† The performance of subsequent
LULESH implementations for emerging programming models is often compared
by researchers with the characteristics of the OpenMP code. Currently, LULESH
results are known for target architectures like BG/Q [ ], Cray XE [ ], Power
, AMC [ ], etc. Another advantage of focusing on the traditional OpenMP
programming model is the support by a number of polyhedral infrastructures and
corresponding libraries.</p>
<p>As mentioned above, while the Polybench suite is specially designed to contain
predefined static control parts (SCoPs), LULESH is a more realistic, full proxy
application, but it is not a polyhedral benchmark from the cradle. It provides relatively
more complex SCoPs; accordingly, it has to be prepared to become polyhedrally
optimizable. The expansion of the range of traditional polyhedral benchmarks here is
a consequence of the search for applications like HPCG or LULESH containing
important computational proxy kernels, which ( ) would be representative of a
wide range of important scientific applications, and ( ) would allow selecting
simplified tasks from the full proxy application, enabling the semi-automatic
iterative selection of compiler optimizations. The most significant modifications
of the OpenMP implementation of LULESH include resolving indirect array accesses.
Potential SCoPs can be formed by converting the most time-consuming large
parallel OpenMP regions. The limitation here is that the SCoPs contain multiple
redundant dependencies between various statements, which must be eliminated.
This approach was proposed by Wang et al. [ ], where the variants of LULESH
code were generated using PoCC (the Polyhedral Compiler Collection) [ ].</p>
      <p>†Up to threads supported using a customized OpenMP implementation.</p>
      <p>The list of optimizations applied in this paper, or considered for LULESH in the
well-known studies beyond the scope of this work, is shown in Table :
Array contraction [ ]: considered, applicable
Global allocation [ ]: considered, applicable
Loop fusion [ ]: considered, applicable
Loop distribution [ ]: considered, applicable
(+) Tiling: applied
(+) Vectorization: applied</p>
      <sec id="sec-4-1">
        <title>Loop Tiling for Data Locality</title>
<p>Loop tiling for data locality is an important addition to the
transformations already reviewed, because the right tile size/shape can take
into account the sophisticated characteristics of a non-uniform deep memory.
For example, when optimizing the performance of a local memory, a kernel loop
can be produced exclusively for local memory accesses, and a bounding loop
can be produced to move the DMA operations outside the kernel loop under
conditions of insufficient hardware coherence support for local and global
memory. Loop tiling requires the development of models of its compatibility with
other loop transformations.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental Evaluation and Discussions</title>
<p>The fourth stage includes the optimization of the proxy application and the
measurement of a number of metrics, the most important of which is spatial locality.
To this end, this paper uses Polly, a tool for the polyhedral optimization of
LLVM-IR for data locality and parallelism [ ]. Polly was used to detect, in code
canonicalized in the front end, SCoPs that can be translated into a polyhedral
representation and subjected to optimization. The optimizations, namely loop tiling
and vectorization, were described manually in the JSCoP format, which is specific
to Polly, and applied through the import/reimport mechanism of the polyhedral
representation with modified schedules of the statements (Figure ). Finally, the
transformed polyhedral representation is used for the OpenMP code generation.</p>
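      <p>For illustration only (the statement names, bounds, and relations are hypothetical and not taken from the LULESH SCoPs), a JSCoP file edited for reimport follows Polly's JSON layout of a named SCoP with per-statement domain, schedule, and access relations; a tiled schedule is expressed by rewriting the schedule relation, e.g.:</p>
```json
{
  "name": "hypothetical_scop",
  "context": "[n] -> {  : n >= 0 }",
  "statements": [
    {
      "name": "Stmt0",
      "domain": "[n] -> { Stmt0[i] : 0 <= i < n }",
      "schedule": "[n] -> { Stmt0[i] -> [i - i % 32, i] }",
      "accesses": [
        { "kind": "read",  "relation": "[n] -> { Stmt0[i] -> MemRef_A[i] }" },
        { "kind": "write", "relation": "[n] -> { Stmt0[i] -> MemRef_B[i] }" }
      ]
    }
  ]
}
```
      <p>The schedule above groups iterations into blocks of 32 (an arbitrary width chosen for the sketch); on reimport, Polly regenerates LLVM-IR from the modified schedule while the domain and accesses stay untouched.</p>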
      <p>Fig. . The Polly pipeline: IR, canonicalization passes, SCoPs detection, polyhedral representation, JSCoPs (re)import of a modified schedule ((+) tiled, (+) tiled-vectorized), and LLVM-IR regeneration</p>
      <sec id="sec-5-4">
        <title>Experimental Runs</title>
<p>Experimental runs were carried out using the available ccNUMA macronode
configurations (Table ), and the average results are reported. The Grind Time
metric, a measurement of the per-element compute time reported by LULESH,
was used to compare the early-stage results with the reference baseline OpenMP code,
for which no SCoP detection and code generation were used. Lower values of the
Grind Time metric indicate an increase in performance. Table compares the
preliminary results of the optimized code with the unoptimized version (NoOpt)
and demonstrates that the transformed LULESH exhibits superior Grind Time to the
reference code. As shown in Table , in the case of the TB macronode, the stated
optimizations reduced LLC and TLB misses, and the percentage of vectorized
floating point operations was increased.</p>
<p>Figure shows the results for the optimized and reference implementations
of LULESH for a 90³ problem. When applying tiling+vectorization, there is
an improvement in performance due to NUMA-aware allocation at the
intra-macronode level when switching from Minimal ( Gb) to Medium ( Tb). Although
no special compiler optimizations were applied to improve locality for HPCG,
the results of HPCG obtained in paper [ ], with some HPCG optimizations first
described in [ ], were used along with models that take into account the
characteristics of the macronode.</p>
      </sec>
      <sec id="sec-5-5">
        <title>Target system characteristics</title>
        <p>Single OS instance macronodes (architecture details): Minimal, Medium, Jumbonode.
RAM: Gb (Minimal), Tb (Medium), Tb (Jumbonode);
NUMA node(s): ;
Board/Socket/Core(s): / / .</p>
      </sec>
      <sec id="sec-5-6">
        <title>Description of improvements made</title>
        <p>FLOP vectorization: + %, + %;
LLC cache misses: - %, - %;
TLB misses: - %, - %.</p>
      </sec>
      <sec id="sec-5-7">
      <title>Discussion and Related Works</title>
        <p>These results and models were additionally used to predict the approximate scaling
of LULESH during the aggregation of macronode memory up to Tb of RAM.</p>
        <p>[Figure: LULESH results for the NoOpt, Tiling+Vectorization, and Predicted variants on the Gb and Tb macronodes.]</p>
        <p>Figure illustrates the results of the spatial locality measurement for
multithreaded LULESH with a 30³ . . . 90³ sized mesh on the macronodes with Gb and
Tb of globally addressable memory. Using the previously presented model, the tiled
version shows the expected better spatial locality compared to the NoOpt version,
which approximately matches the reported values of Grind Time, as well as
the Elapsed Time results. The main interest is the measurement of locality for
the TB macronode, which is the maximum value in these experiments. For all
problems, the spatial locality is better (lower) for the optimized code. The surface
bundle is approximately the same for the minimum and maximum problem sizes.
Hence, in accordance with the previously considered definitions, performance
portability was achieved for the macronodes under consideration.††</p>
        <p>Fig. . Spatial locality (relative to NoOpt) for multithreaded LULESH with a 30³ . . . 90³
sized mesh on the macronodes with Gb and Tb of shared memory (lower is better)</p>
        <p>Regarding the trade-offs between the optimizations under consideration in
terms of spatial locality and parallelism, loop fusion is also widely considered to
improve locality due to a reduction of data motion. Fusion reduces the number of
loops, and thus the number of OpenMP parallel regions is reduced [ ]. On the
other hand, fused code may not properly use hardware prefetching, and spatial
locality will degrade. At the same time, loop fusion can prevent proper
parallelization of the OpenMP code, because the number of dependencies that need to be
satisfied in the fused loops grows. This factor should be taken into account when a
fused-tiled implementation of LULESH is considered.</p>
        <p>Related Works</p>
        <p>The fundamental concepts of performance portability implied in this paper
are closely adjacent to the terms used in dissertation [ ], where the special
optimization issues are considered in the context of massively threaded systems.</p>
        <p>††Studies for a hybrid OpenMP/MPI version of LULESH on the Tb Jumbonode are
left for near-future work.</p>
        <p>At the same time, work [ ] proposed profitable compiler optimizations for the
performance portability of CFD applications on multiple HPC systems. The
subsequent work [ ] clearly concludes that solving the problem of analytical
determination of tile size limits for loop tiling will help improve performance for a
wide range of systems. The most detailed analysis of the concept of performance
portability as such was proposed in work [ ] and confirmed by the results of the
study [ ]. Papers [ ] and [ ] demonstrate the potential of the polyhedral model
to achieve portability of performance, in particular due to the architectural
modeling of spatial effects in the latter research report.</p>
        <p>The conventional consideration of NUMA-equipped deep memory systems
as a prototype during the adaptation to exascale supercomputer designs is an
inherent continuation of several recent works [ , , , , ], and particularly of
the paper [ ] that directly uses proxy applications for the same purpose. In
supercomputing circles, LULESH as a proxy application has been used in the research [ ] on
the codesign of a compiler and the Active Memory Cube (AMC), a recently developed
Processing-in-Memory (PIM) architecture for exascale computing, and in a
study on the selection of compiler optimizations, particularly for BG/Q, by León et
al. [ ]. As for the currently known studies on polyhedral-related optimizations,
LULESH was used in [ ] for the study of a tiled Concurrent Collections (CnC)
implementation, superior in performance to the traditional OpenMP implementation.</p>
        <p>Wang et al. [ ] used LULESH to evaluate various optimizations, including
loop fusion and auto-parallelization of the OpenMP baseline implementation, and
the results demonstrate the possibility of using LULESH to characterize the
performance improvement provided by newly developed polyhedral techniques.</p>
        <p>Last but not least, Verdoolaege et al. [ ] provide an insight into targeting spatial
locality via polyhedral scheduling using Pluto, and Zinenko et al. [ ] propose
an algorithmic template capable of modeling the temporal/spatial locality of
multiprocessors.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and Future Work</title>
<p>The proposed approach allows achieving performance portability over a set
of affined targets that differ significantly in the number of levels of hierarchical
memory. At the same time, this paper does not address the issue of performance
portability from traditionally used massively parallel architectures to ccNUMA
macronodes, which will almost certainly be suboptimal for a class of numerical
software developed on traditional concepts of spatial locality. The
above-mentioned fundamental works [ , , ] give a sense of the conceivable complexity of a
solution. One aspect of this is that the issue of backward performance
portability from the ccNUMA architecture to traditional symmetric multiprocessor
and massively parallel architectures will not be as acute as in the case of direct
migration of performance from a standard cluster to architectures with a large
number of levels of memory hierarchy (i.e., to a ccNUMA macronode).</p>
<p>Future work will investigate the opportunities for developing loop tiling
performance models, with the aim of improving fast algorithms for performance
prediction and automatic tile size selection. The computational kernels of particular
significance include important stationary iterative methods such as Jacobi,
Gauss-Seidel, and SOR-like methods, which are used as subroutines by other
algorithms, e.g., in symmetric multigrid. Another idea being explored is to model
fusion+tiling effects for these complex algorithms when porting performance to
the largest macronodes.</p>
      <p>Acknowledgments. This work was financially supported by the Ministry of
Education and Science of the Russian Federation in the framework of the state
assignment No. . . / . (the project theme “Methods and technologies
for verification and development of software for modeling and calculations using
HPC platform with extramassive parallelism”).</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>1. Eisymont, L.: Hybrid strategy development of the supercomputer components. Open Systems. DBMS ( ), – (Jun )</p>
      <p>2. Goglin, B.: Memory footprint of locality information on many-core platforms. In: IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). pp. – (May )</p>
      <p>3. Grosser, T., Groesslinger, A., Lengauer, C.: Polly — performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters ( ) ( )</p>
      <p>4. Jacob, A., Nair, R., Chen, T., Sura, Z., Kim, C., Bertolli, C., Antao, S., O’Brien, K.: Progressive codesign of an architecture and compiler using a proxy application. In: th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). pp. – (Oct )</p>
      <p>5. Jing, M., Kong, F., Jin, X., Zeng, X.: An improved automatic parallelizing algorithm based on polyhedral model. In: th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT). pp. – (Oct )</p>
      <p>6. Karlin, I., Bhatele, A., Keasler, J., Chamberlain, B.L., Cohen, J., Devito, Z., Haque, R., Laney, D., Luke, E., Wang, F., Richards, D., Schulz, M., Still, C.H.: Exploring traditional and emerging parallel programming models using a proxy application. In: IEEE th International Symposium on Parallel and Distributed Processing. pp. – (May )</p>
      <p>7. Karlin, I., McGraw, J., Gallardo, E., Keasler, J., León, E.A., Still, B.: Memory and parallelism tuning exploration using the LULESH proxy application. In: SC Companion: High Performance Computing, Networking Storage and Analysis. pp. – (Nov )</p>
      <p>8. Karlin, I., et al.: LULESH programming model and performance ports overview. Tech. Rep. LLNL-TR- , Lawrence Livermore National Laboratory (Dec )</p>
      <p>9. Kirk, R.O., Mudalige, G.R., Reguly, I.Z., Wright, S.A., Martineau, M.J., Jarvis, S.A.: Achieving performance portability for a heat conduction solver mini-application on modern multi-core systems. In: IEEE International Conference on Cluster Computing (CLUSTER). pp. – (Sep )</p>
      <p>10. Kotlyarov, V., Drobintsev, P., Levchenko, A., Voinov, N.: Adapting software applications to hybrid supercomputer. In: Proceedings of the th Central &amp; Eastern European Software Engineering Conference in Russia. pp. : – : . CEE-SECR ’ , St. Petersburg, Russia (Oct )</p>
      <p>11. León, E.A., Karlin, I., Grant, R.E.: Optimizing explicit hydrodynamics for power, energy, and performance. In: IEEE International Conference on Cluster Computing. pp. – (Sep )</p>
      <p>12. Lin, P.H.: Performance portability strategies for computational fluid dynamics (CFD) applications on HPC systems. Ph.D. thesis, University of Minnesota (Jun )</p>
      <p>13. Liu, C., Kulkarni, M.: Evaluating performance of task and data coarsening in concurrent collections. In: Ding, C., Criswell, J., Wu, P. (eds.) Languages and Compilers for Parallel Computing. pp. – . Springer International Publishing, Cham ( )</p>
      <p>14. Meeus, W., Stroobandt, D.: Data reuse buffer synthesis using the polyhedral model. IEEE Transactions on Very Large Scale Integration (VLSI) Systems ( ), – (Jul )</p>
      <p>15. Milthorpe, J., Grove, D., Herta, B., Tardieu, O.: Exploring the APGAS programming model using the LULESH proxy application. Tech. Rep. RC (WAT - ), IBM Research Division, Thomas J. Watson Research Center (Sep )</p>
      <p>16. Padoin, E.L., Pilla, L.L., Castro, M., Navaux, P.O.A., Méhaut, J.F.: Exploration of load balancing thresholds to save energy on iterative applications. In: Barrios Hernández, C.J., Gitler, I., Klapp, J. (eds.) High Performance Computing. pp. – . Springer International Publishing, Cham ( )</p>
      <p>17. Pennycook, S., Sewall, J., Lee, V.: Implications of a metric for performance portability. Future Generation Computer Systems ( ), https://doi.org/ . /j.future. . .</p>
      <p>18. Perarnau, S., Zounmevo, J.A., Dreher, M., Essen, B.C.V., Gioiosa, R., Iskra, K., Gokhale, M.B., Yoshii, K., Beckman, P.: Argo NodeOS: Toward unified resource management for exascale. In: IEEE International Parallel and Distributed Processing Symposium (IPDPS). pp. – (May )</p>
      <p>19. Saavedra, R.H., Smith, A.J.: Performance characterization of optimizing compilers. IEEE Transactions on Software Engineering ( ), – (Jul )</p>
      <p>20. Sharma, K.: Locality transformations of computation and data for portable performance. Ph.D. thesis, Rice University (Aug )</p>
      <p>21. Shirako, J., Pouchet, L.N., Sarkar, V.: Oil and water can mix: An integration of polyhedral and AST-based transformations. In: SC : International Conference for High Performance Computing, Networking, Storage and Analysis. pp. – (Nov )</p>
      <p>22. Stratton, J.: Performance portability of parallel kernels on shared-memory systems. Ph.D. thesis, University of Illinois at Urbana-Champaign ( )</p>
      <p>23. Verdoolaege, S., Isoard, A.: Extending Pluto-style polyhedral scheduling with consecutivity. In: Proceedings of the th International Workshop on Polyhedral Compilation Techniques, IMPACT . Manchester, United Kingdom (Jan )</p>
      <p>24. Wang, W., Cavazos, J., Porterfield, A.: Energy auto-tuning using the polyhedral approach. In: Proceedings of the th International Workshop on Polyhedral Compilation Techniques, IMPACT . Vienna, Austria (Jan )</p>
      <p>25. Zinenko, O., Verdoolaege, S., Reddy, C., Shirako, J., Grosser, T., Sarkar, V., Cohen, A.: Unified polyhedral modeling of temporal and spatial locality. Research Report RR- , Inria Paris (Nov ), https://hal.inria.fr/hal</p>
      <p>26. Zinenko, O., Verdoolaege, S., Reddy, C., Shirako, J., Grosser, T., Sarkar, V., Cohen, A.: Modeling the conflicting demands of parallelism and temporal/spatial locality in affine scheduling. In: Proceedings of the th International Conference on Compiler Construction. pp. – . CC , Vienna, Austria (Feb )</p>
      <p>27. Zou, P., Allen, T., Davis IV, C.H., Feng, X., Ge, R.: CLIP: Cluster-level intelligent power coordination for power-bounded systems. In: IEEE International Conference on Cluster Computing (CLUSTER). pp. – (Sep )</p>
      <p>28. Zounmevo, J.A., Perarnau, S., Iskra, K., Yoshii, K., Gioiosa, R., Van Essen, B.C., Gokhale, M.B., Leon, E.A.: A container-based approach to OS specialization for exascale computing. In: IEEE International Conference on Cloud Engineering. pp. – (Mar )</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>