<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring Trade-offs of Compiler Optimizations to Enable Performance Portability for Multi-level Memory Hierarchies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aleksei Levchenko</string-name>
          <email>a@expx.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Peter the Great St.Petersburg Polytechnic University</institution>
          ,
          <addr-line>Saint Petersburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The performance portability problem manifests itself on architectures with deep memory hierarchies, in particular as a result of insufficient spatial locality support in compiler infrastructures. A polyhedral optimization approach can target spatial locality, but it faces a number of challenges, such as ambiguous compatibility with other optimizations, a lack of polyhedral-ready benchmarks, and the effects of non-uniformity in real-world systems with multi-level memory. Complementing prior research on selecting optimizations, this paper focuses on the experimental characterization of loop tiling and vectorization using a full proxy application. The presented approach makes the portability of performance provable for target architectures with deep memory hierarchies. To this end, large-scale ccNUMA macronodes are considered as experimental prototypes for hypothetical HPC designs of a capacity-bandwidth type, capable of posing singular challenges for performance portability.</p>
      </abstract>
      <kwd-group>
        <kwd>Benchmarking</kwd>
        <kwd>Locality</kwd>
        <kwd>Loop tiling</kwd>
        <kwd>Multi-level memory hierarchy</kwd>
        <kwd>Performance portability</kwd>
        <kwd>Polyhedral model</kwd>
        <kwd>Program performance</kwd>
        <kwd>ccNUMA</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>In the run-up to the era of exascale computing, the forthcoming migration to
advanced future systems requires a joint consideration of the notorious performance
problem and the performance portability issue. The performance portability
challenge manifests itself as an architecture-dependent performance degradation of a
specialized version of code when it is compiled and run across multi-core systems
built on architectural principles different from the original ones. One of the
predicted features of next-generation systems of the capacity-bandwidth type is the
existence of logically indivisible, globally addressable petabytes of RAM obtained
by combining the memories of a set of computing nodes [ ]. The core of the
performance portability issue for this class of systems remains spatial locality,
which would most likely be suboptimal for contemporary scientific applications.
The polyhedral model is a promising and verifiable approach to achieving portability
of performance over architectures with hardly predictable spatial memory access
behavior. An important piece of the puzzle is to evaluate the effect of a complex
combination of polyhedral compilation techniques against the background of the
syntactic-level loop transformations already available [ , , ]. Despite the growing
number of papers on the development of polyhedral frameworks, there is, however, a
lack of methods to achieve and estimate performance portability, even for well-known
contemporary hardware. Accordingly, a number of steps might be proposed to clarify
the methodology of performance improvement predictions based on compiler
transformations.</p>
<p>Here, this paper makes the following contributions. The presented
approach makes it possible to evaluate the performance portability of code compiled
using the relevant compilation algorithm for systems with different depths of
hierarchical memory. The first step is to define the set of target systems, i.e.,
ccNUMA macronodes. At the second stage, ad hoc models are formulated
for the core parameters of performance portability, specifically spatial locality in
the case at hand. The third stage involves the selection of the most promising
compiler optimizations in conjunction with relevant HPC-ready benchmarks.
Further, at the fourth, experimental stage, the results of the optimizations are
compared with a reference model for architecturally affined systems with deepening
memory hierarchies. In the context of performance portability, a distinctive
feature of the proposed approach is a more reliable estimation of the trade-offs
of compilation algorithms for current large-scale systems and the possibility of
extrapolating these results to hypothetical exascale designs.</p>
<p>The rest of this paper is organized as follows. Section defines the set of target
systems and dissects the existing background on achieving performance
portability via compiler optimizations. Section presents the core part of the approach,
which includes model considerations for evaluating the effect of optimizations.
Section is about selecting transformations for performance portability. The results
of the experimental evaluation and discussions are reported in Section . Further,
Section refers to the previous research supporting the interim findings of this
paper and concerning the issues of locality transformations for performance
portability towards multi-level memory and of targeting spatial locality using polyhedral
optimizations. Finally, Section contains final remarks and discusses future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Background and Notation</title>
<p>At the first stage of the proposed approach, this section gives an idea of the
targets over which portability of performance is investigated. In the context
of this work, a special subset of performance portability is implied, where the
performance of unoptimized code is ported to the multi-level memory hierarchies
of macronodes. Macronodes are shared memory multi-machine nodes with a deep
memory hierarchy based on the Cache-Coherent Non-Uniform Memory Access
(ccNUMA) architecture. Table gives an overview of the available
macronode configurations. Since the hypothetical systems of the capacity-bandwidth
type will have a depth complexity of globally addressable memory that is not
comparable with the currently available one, the prototype of such a system should
have approximately extreme characteristics. Being shared memory clusters,
current ccNUMAs are equipped with a larger amount of RAM and more on-line
CPUs than are available on a typical general-purpose cluster node. The largest
available multi-machine node, the so-called jumbonode defined in paper [ ], is capable
of providing more than Tb of RAM at a time, with K CPUs running a single
operating system instance. At present, these features allow extrapolating its
performance to the hypothetical characteristics of future machines [ ]. Obviously,
the main difference between macronodes and the other get-at-able
architectures is the possibility of a much larger indivisible amount of globally addressable
memory controlled by one OS instance with a record number of on-line CPUs. It
is this criterion that has formed the basis for considering the ccNUMA architecture
as an affordable prototype and a primary target, given the lack of hypothetical
machines that have not been developed yet. This viewpoint is grounded in a
number of the previous works mentioned in Section .</p>
<p>A deep memory hierarchy causes significant challenges for loop-intensive
applications that function well in a general-purpose HPC environment but fail
frequently in circumstances where spatial locality degrades [ ]. In comparison
with other software levels (e.g., performance portability libraries), the path of
compiler analysis and optimization is evidently more promising in terms of
software/architecture codesign, due to better awareness of both the program and
the underlying hardware. In particular, the polyhedral model, a mathematical
framework for transforming affine loop nests, is a promising way to achieve
performance portability over targets with deep memory hierarchies via the
implementation of transformations as a single algebraic operation [ ]. Compilers based on the
polyhedral model provide alternative ways to use the available resources of parallelism
through analysis and transformation of loop-based code. Properly
implemented tiling, which reorganizes the computation iteration space to improve cache
reuse, can thus improve data locality and, as a consequence, the performance of
iterative algorithms for a class of numeric programs. In the case of a macronode,
the tile size is critical, and it is prescribed by the cache/TLB/NUMA node size,
which requires a multi-aspect automatic determination of the optimal boundaries
for the loop nest.</p>
<p>Therefore, it is argued that the deep memory hierarchy of the macronodes specified
above can pose singular performance portability challenges that are
difficult to meet in traditional computing clusters. While porting performance to
macronodes, the estimation of improvements and drawdowns should be mapped
to the predictions of performance models developed in advance for this context.</p>
    </sec>
    <sec id="sec-3">
      <title>Ad-Hoc Models</title>
<p>At the second stage, it is necessary to examine performance portability models for
the considered targets (macronodes). Strict performance models can be efficient for
characterizing the improvements provided by optimizers via several techniques. In this
respect, the challenge lies in a number of limitations of the disparate models. A
drawback here is that the cost models of compiler transformations often do not
go beyond the transformations themselves, as opposed to an external, memory-wide view
of the entire memory subsystem. Meanwhile, it is possible to consider the main
characteristics of the macronode that will affect the performance model of the
proxy application. In this context, a proxy application is a surrogate, representative
scientific code for which there is a number of equivalent implementations using
alternative parallel programming models and targeting multiple architectures.
The aspects of using proxy applications implied in this paper are within the
framework of the concept proposed in [ ]. The first equation associates the
computational proxy kernels, of which the proxy application consists, with a set of
code improvements applied by the compiler simultaneously or in turn [ ]. Let n
denote the number of optimized proxy kernels and Ti,m-node,opt the execution
time of proxy kernel i on the macronode after applying every optimization opt.
Under this assumption, the total execution time of the proxy application is</p>
      <p>T = ∑ (i = 1 .. n) Ti,m-node,opt.</p>
      <p>Next, the performance portability of a proxy kernel can be determined according
to the formula proposed by Pennycook et al. [ ] and reinterpreted here for
the case of macronodes:</p>
      <p>PPm-node,opt,proxy = |M| / ∑ (m ∈ M) (1 / T(m, proxy)), if the optimized proxy kernel runs for every m ∈ M, and 0 otherwise,</p>
      <p>where T(m, proxy) denotes the execution time of the optimized proxy kernel on
macronode m, and M is the set of macronodes. Here, spatial locality is considered
a key parameter of performance portability for overcoming the notorious memory
wall problem. Spatial locality reflects the tendency of application behavior to
access memory regions near regions that have been recently accessed.
To consider the impact of compiler optimizations on spatial locality, it is necessary
to analyze accesses to neighboring memory regions, which is the responsibility of the
compiler. The results of proxy kernels can be compared using the locality measure
previously defined by Dümmler et al. [ ] as the logarithmic geometric mean of
access distances, reinterpreted here for macronodes as</p>
      <p>Locm-node := ∑ (v ∈ V) ((1/Nv) ∑ (k = 1 .. Nv) log2 dv(k)),</p>
      <p>where dv(k) is the spatial access distance of a variable v ∈ V (in the multiset
of variables V), and Nv is the total number of accesses to the variable v.
The idea is to deduce the locality of computationally equivalent versions of the
proxy application with different compiler optimizations for target macronodes.</p>
<p>As a result, the components of the performance portability equation,
particularly the set of target systems and the main performance parameter (i.e., spatial
locality), were identified. Additionally, the definition of the proxy kernel/application
and the general formula for its execution time were given.</p>
    </sec>
    <sec id="sec-4">
      <title>Selecting Transformations for Performance Portability</title>
<p>The next stage of the proposed approach involves the selection of the most
promising transformations for the proxy application. Currently, the challenge
is a lack of applications that may be considered polyhedral benchmarks as
such. The Polybench suite, for instance, is commonly used for this purpose [ , ,
]. However, systems with deep memory can achieve deceptively high levels
of performance on small benchmarks but lose performance on tasks of more
realistic sizes. Therefore, of particular interest is a study of the effects of polyhedral
optimizations on the behavior of a full-fledged scientific application on a
real-world architecture, or at least of a proxy application that can be divided into
multiple proxy kernels. The experimental evaluation of optimized programs may
be limited by the availability of a suitable benchmark code and the ability of a
polyhedral optimizer to perform its transformations. As an example, the High
Performance Conjugate Gradients (HPCG) Benchmark provides SpMV and a
symmetric Gauss-Seidel preconditioner with loop-carried dependencies. HPCG
can be considered a proxy application, since it already has versions for different
parallel programming models, namely MPI, OpenMP, SHMEM [ ], etc. Although
in this paper HPCG remains under consideration as a proxy application, run-time
reordering transformations like sparse tiling, which improve data locality in general,
have high potential for the symmetric Gauss-Seidel kernel that dominates the
HPCG runtime.</p>
<p>Instead, a thoroughly studied proxy application, Livermore Unstructured
Lagrange Explicit Shock Hydrodynamics (LULESH), has been used so far to
evaluate the effects of the optimizations. LULESH solves one octant of the spherical
Sedov blast problem using Lagrange hydrodynamics [ ]; it is representative
of existing HPC codes and is able to demonstrate the complexity of the poor spatial
locality problem. This paper focuses on the traditional OpenMP implementation
of LULESH as a consequence of the core architectural characteristics of the
macronodes. In this respect, the OpenMP programming model, in which the base version
of LULESH is written alongside serial and MPI code, is considered to be widely
used mostly at the intra-node level. At the same time, it can hypothetically be
used in a hybrid MPI+OpenMP+X fashion, which is considered promising for
exascale supercomputer designs. The presence of more than K threads living
within terabytes of shared memory is a great stress test, leastwise against
the background of known benchmarking efforts.† The performance of subsequent
LULESH implementations for emerging programming models is often compared
by researchers with the characteristics of the OpenMP code. Currently, LULESH
results are known for target architectures like BG/Q [ ], Cray XE [ ], Power
, AMC [ ], etc. Another advantage of focusing on the traditional OpenMP
programming model is the support by a number of polyhedral infrastructures and
corresponding libraries.</p>
<p>As mentioned above, while the Polybench suite is specially designed to contain
predefined static control parts (SCoPs), LULESH is a more realistic, full proxy
application, but it is not a polyhedral benchmark from the cradle. It provides relatively
more complex SCoPs; accordingly, it has to be prepared to become polyhedrally
optimizable. The expansion of the range of traditional polyhedral benchmarks here is
a consequence of the search for applications like HPCG or LULESH containing
important computational proxy kernels, which ( ) would be representative of a
wide range of important scientific applications, and ( ) would allow selecting
simplified tasks from the full proxy application, enabling the semi-automatic
iterative selection of compiler optimizations. The most significant modifications
of the OpenMP implementation of LULESH include resolving indirect array accesses.
Potential SCoPs can be formed by converting the most time-consuming large
parallel OpenMP regions. The limitation here is that the SCoPs contain multiple
redundant dependencies between various statements, which must be eliminated.
This approach was proposed by Wang et al. [ ], where the variants of LULESH
code were generated using PoCC (the Polyhedral Compiler Collection) [ ].</p>
      <p>†Up to threads supported using a customized OpenMP implementation.</p>
      <p>The list of optimizations applied in this paper, or considered for LULESH in the
well-known studies beyond the scope of this work, is shown in Table :
Array contraction [ ]: considered, applicable
Global allocation [ ]: considered, applicable
Loop fusion [ ]: considered, applicable
Loop distribution [ ]: considered, applicable
(+) Tiling: applied
(+) Vectorization: applied</p>
      <sec id="sec-4-1">
        <title>Loop Tiling for Data Locality</title>
<p>Loop tiling for data locality is an important addition to the
transformations already reviewed, because the right tile size/shape can take
into account the sophisticated characteristics of a non-uniform deep memory.
For example, when optimizing the performance of a local memory, a kernel loop
can be produced exclusively for local memory accesses, and a bounding loop
can be produced to move the DMA operations outside the kernel loop under
conditions of insufficient hardware coherence support for local and global
memory. Loop tiling requires the development of models of its compatibility with
other loop transformations.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental Evaluation and Discussions</title>
<p>The fourth stage includes the optimization of the proxy application and the
measurement of a number of metrics, the most important of which is spatial locality.
To this end, this paper uses Polly, a tool for the polyhedral optimization of
LLVM-IR for data locality and parallelism [ ]. Polly was used to detect, in code
canonicalized in the front end, SCoPs that can be translated into a polyhedral
representation and subjected to optimization. The optimizations, namely loop tiling
and vectorization, were described manually in the JSCoP format, which is specific
to Polly, and applied through the import/reimport mechanism of the polyhedral
representation with modified schedules of the statements (Figure ). Finally, the
transformed polyhedral representation is used for the OpenMP code generation.</p>
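      <p>For illustration only (the statement names, bounds, and relations are hypothetical and not taken from the LULESH SCoPs), a JSCoP file edited for reimport follows Polly's JSON layout of a named SCoP with per-statement domain, schedule, and access relations; a tiled schedule is expressed by rewriting the schedule relation, e.g.:</p>
```json
{
  "name": "hypothetical_scop",
  "context": "[n] -> {  : n >= 0 }",
  "statements": [
    {
      "name": "Stmt0",
      "domain": "[n] -> { Stmt0[i] : 0 <= i < n }",
      "schedule": "[n] -> { Stmt0[i] -> [i - i % 32, i] }",
      "accesses": [
        { "kind": "read",  "relation": "[n] -> { Stmt0[i] -> MemRef_A[i] }" },
        { "kind": "write", "relation": "[n] -> { Stmt0[i] -> MemRef_B[i] }" }
      ]
    }
  ]
}
```
      <p>The schedule above groups iterations into blocks of 32 (an arbitrary width chosen for the sketch); on reimport, Polly regenerates LLVM-IR from the modified schedule while the domain and accesses stay untouched.</p>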
      <p>Fig. . The Polly pipeline: IR, canonicalization passes, SCoPs detection, polyhedral representation, JSCoPs (re)import of a modified schedule ((+) tiled, (+) tiled-vectorized), and LLVM-IR regeneration</p>
      <sec id="sec-5-4">
        <title>Experimental Runs</title>
<p>Experimental runs were carried out using the available ccNUMA macronode
configurations (Table ), and the average results are reported. The Grind Time
metric, a measurement of the per-element compute time reported by LULESH,
was used to compare the early-stage results with the reference baseline OpenMP code,
for which no SCoP detection and code generation were used. Lower values of the
Grind Time metric indicate an increase in performance. Table compares the
preliminary results of the optimized code with the unoptimized version (NoOpt)
and demonstrates that the transformed LULESH exhibits superior Grind Time to the
reference code. As shown in Table , in the case of the TB macronode, the stated
optimizations reduced LLC and TLB misses, and the percentage of vectorized
floating point operations was increased.</p>
<p>Figure shows the results for the optimized and reference implementations
of LULESH for a 90³ problem. When applying tiling+vectorization, there is
an improvement in performance due to NUMA-aware allocation at the
intra-macronode level when switching from Minimal ( Gb) to Medium ( Tb). Although
no special compiler optimizations were applied to improve locality for HPCG,
the results of HPCG obtained in paper [ ], with some HPCG optimizations first
described in [ ], were used along with models that take into account the
characteristics of the macronode.</p>
      </sec>
      <sec id="sec-5-5">
        <title>Target system characteristics</title>
        <p>Single OS instance macronodes (architecture details): Minimal, Medium, Jumbonode.
RAM: Gb (Minimal), Tb (Medium), Tb (Jumbonode);
NUMA node(s): ;
Board/Socket/Core(s): / / .</p>
      </sec>
      <sec id="sec-5-6">
        <title>Description of improvements made</title>
        <p>FLOP vectorization: + %, + %;
LLC cache misses: - %, - %;
TLB misses: - %, - %.</p>
      </sec>
      <sec id="sec-5-7">
      <title>Discussion and Related Works</title>
        <p>These results and models were additionally used to predict the approximate scaling
of LULESH during the aggregation of macronode memory up to Tb of RAM.</p>
        <p>[Figure: LULESH results for the NoOpt, Tiling+Vectorization, and Predicted variants on the Gb and Tb macronodes.]</p>
        <p>Figure illustrates the results of the spatial locality measurement for
multithreaded LULESH with a 30³ . . . 90³ sized mesh on the macronodes with Gb and
Tb of globally addressable memory. Using the previously presented model, the tiled
version shows the expected better spatial locality compared to the NoOpt version,
which approximately matches the reported values of Grind Time, as well as
the Elapsed Time results. The main interest is the measurement of locality for
the TB macronode, which is the maximum value in these experiments. For all
problems, the spatial locality is better (lower) for the optimized code. The surface
bundle is approximately the same for the minimum and maximum problem sizes.
Hence, in accordance with the previously considered definitions, performance
portability was achieved for the macronodes under consideration.††</p>
        <p>Fig. . Spatial locality (relative to NoOpt) for multithreaded LULESH with a 30³ . . . 90³
sized mesh on the macronodes with Gb and Tb of shared memory (lower is better)</p>
        <p>Regarding the trade-offs between the optimizations under consideration in
terms of spatial locality and parallelism, loop fusion is also widely considered to
improve locality due to a reduction of data motion. Fusion reduces the number of
loops, and thus the number of OpenMP parallel regions is reduced [ ]. On the
other hand, fused code may not properly use hardware prefetching, and spatial
locality will degrade. At the same time, loop fusion can prevent proper
parallelization of the OpenMP code, because the number of dependencies that need to be
satisfied in the fused loops grows. This factor should be taken into account when a
fused-tiled implementation of LULESH is considered.</p>
        <p>Related Works</p>
        <p>The fundamental concepts of performance portability implied in this paper
are closely adjacent to the terms used in dissertation [ ], where the special
optimization issues are considered in the context of massively threaded systems.</p>
        <p>††Studies for a hybrid OpenMP/MPI version of LULESH on the Tb Jumbonode are
left for near-future work.</p>
        <p>At the same time, work [ ] proposed profitable compiler optimizations for the
performance portability of CFD applications on multiple HPC systems. The
subsequent work [ ] clearly concludes that solving the problem of analytical
determination of tile size limits for loop tiling will help improve performance for a
wide range of systems. The most detailed analysis of the concept of performance
portability as such was proposed in work [ ] and confirmed by the results of the
study [ ]. Papers [ ] and [ ] demonstrate the potential of the polyhedral model
to achieve portability of performance, in particular due to the architectural
modeling of spatial effects in the latter research report.</p>
        <p>The conventional consideration of NUMA-equipped deep memory systems
as a prototype during the adaptation to exascale supercomputer designs is an
inherent continuation of several recent works [ , , , , ], and particularly of
the paper [ ] that directly uses proxy applications for the same purpose. In
supercomputing circles, LULESH as a proxy application has been used in the research [ ] on
the codesign of a compiler and the Active Memory Cube (AMC), a recently developed
Processing-in-Memory (PIM) architecture for exascale computing, and in a
study on the selection of compiler optimizations, particularly for BG/Q, by León et
al. [ ]. As for the currently known studies on polyhedral-related optimizations,
LULESH was used in [ ] for the study of a tiled Concurrent Collections (CnC)
implementation, superior in performance to the traditional OpenMP implementation.</p>
        <p>Wang et al. [ ] used LULESH to evaluate various optimizations, including
loop fusion and auto-parallelization of the OpenMP baseline implementation, and
the results demonstrate the possibility of using LULESH to characterize the
performance improvement provided by newly developed polyhedral techniques.</p>
        <p>Last but not least, Verdoolaege et al. [ ] provide an insight into targeting spatial
locality via polyhedral scheduling using Pluto, and Zinenko et al. [ ] propose
an algorithmic template capable of modeling the temporal/spatial locality of
multiprocessors.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and Future Work</title>
<p>The proposed approach allows achieving performance portability over a set
of affined targets that differ significantly in the number of levels of hierarchical
memory. At the same time, this paper does not address the issue of performance
portability from traditionally used massively parallel architectures to ccNUMA
macronodes, which will almost certainly be suboptimal for a class of numerical
software developed on traditional concepts of spatial locality. The
above-mentioned fundamental works [ , , ] give a sense of the conceivable complexity of a
solution. One aspect of this is that the issue of backward performance
portability from the ccNUMA architecture to traditional symmetric multiprocessor
and massively parallel architectures will not be as acute as in the case of direct
migration of performance from a standard cluster to architectures with a large
number of levels of memory hierarchy (i.e., to a ccNUMA macronode).</p>
<p>Future work will investigate the opportunities for developing loop tiling
performance models, with the aim of improving fast algorithms for performance
prediction and automatic tile size selection. The computational kernels of particular
significance include important stationary iterative methods such as Jacobi,
Gauss-Seidel, and SOR-like methods, which are used as subroutines by other
algorithms, e.g., in symmetric multigrid. Another idea being explored is to model
fusion+tiling effects for these complex algorithms when porting performance to
the largest macronodes.</p>
      <p>Acknowledgments. This work was financially supported by the Ministry of
Education and Science of the Russian Federation in the framework of the state
assignment No. . . / . (the project theme “Methods and technologies
for verification and development of software for modeling and calculations using
HPC platform with extramassive parallelism”).</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>1. Eisymont, L.: Hybrid strategy development of the supercomputer components. Open Systems. DBMS ( ), – (Jun )</p>
      <p>2. Goglin, B.: Memory footprint of locality information on many-core platforms. In: IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). pp. – (May )</p>
      <p>3. Grosser, T., Groesslinger, A., Lengauer, C.: Polly — performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters ( ) ( )</p>
      <p>4. Jacob, A., Nair, R., Chen, T., Sura, Z., Kim, C., Bertolli, C., Antao, S., O’Brien, K.: Progressive codesign of an architecture and compiler using a proxy application. In: th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). pp. – (Oct )</p>
      <p>5. Jing, M., Kong, F., Jin, X., Zeng, X.: An improved automatic parallelizing algorithm based on polyhedral model. In: th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT). pp. – (Oct )</p>
      <p>6. Karlin, I., Bhatele, A., Keasler, J., Chamberlain, B.L., Cohen, J., Devito, Z., Haque, R., Laney, D., Luke, E., Wang, F., Richards, D., Schulz, M., Still, C.H.: Exploring traditional and emerging parallel programming models using a proxy application. In: IEEE th International Symposium on Parallel and Distributed Processing. pp. – (May )</p>
      <p>7. Karlin, I., McGraw, J., Gallardo, E., Keasler, J., León, E.A., Still, B.: Memory and parallelism tuning exploration using the LULESH proxy application. In: SC Companion: High Performance Computing, Networking Storage and Analysis. pp. – (Nov )</p>
      <p>8. Karlin, I., et al.: LULESH programming model and performance ports overview. Tech. Rep. LLNL-TR- , Lawrence Livermore National Laboratory (Dec )</p>
      <p>9. Kirk, R.O., Mudalige, G.R., Reguly, I.Z., Wright, S.A., Martineau, M.J., Jarvis, S.A.: Achieving performance portability for a heat conduction solver mini-application on modern multi-core systems. In: IEEE International Conference on Cluster Computing (CLUSTER). pp. – (Sep )</p>
      <p>10. Kotlyarov, V., Drobintsev, P., Levchenko, A., Voinov, N.: Adapting software applications to hybrid supercomputer. In: Proceedings of the th Central &amp; Eastern European Software Engineering Conference in Russia. pp. : – : . CEE-SECR ’ , St. Petersburg, Russia (Oct )</p>
      <p>11. León, E.A., Karlin, I., Grant, R.E.: Optimizing explicit hydrodynamics for power, energy, and performance. In: IEEE International Conference on Cluster Computing. pp. – (Sep )</p>
      <p>12. Lin, P.H.: Performance portability strategies for computational fluid dynamics (CFD) applications on HPC systems. Ph.D. thesis, University of Minnesota (Jun )</p>
      <p>13. Liu, C., Kulkarni, M.: Evaluating performance of task and data coarsening in concurrent collections. In: Ding, C., Criswell, J., Wu, P. (eds.) Languages and Compilers for Parallel Computing. pp. – . Springer International Publishing, Cham ( )</p>
      <p>14. Meeus, W., Stroobandt, D.: Data reuse buffer synthesis using the polyhedral model. IEEE Transactions on Very Large Scale Integration (VLSI) Systems ( ), – (Jul )</p>
      <p>15. Milthorpe, J., Grove, D., Herta, B., Tardieu, O.: Exploring the APGAS programming model using the LULESH proxy application. Tech. Rep. RC (WAT - ), IBM Research Division, Thomas J. Watson Research Center (Sep )</p>
      <p>16. Padoin, E.L., Pilla, L.L., Castro, M., Navaux, P.O.A., Méhaut, J.F.: Exploration of load balancing thresholds to save energy on iterative applications. In: Barrios Hernández, C.J., Gitler, I., Klapp, J. (eds.) High Performance Computing. pp. – . Springer International Publishing, Cham ( )</p>
      <p>17. Pennycook, S., Sewall, J., Lee, V.: Implications of a metric for performance portability. Future Generation Computer Systems ( ), https://doi.org/ . /j.future. . .</p>
      <p>18. Perarnau, S., Zounmevo, J.A., Dreher, M., Essen, B.C.V., Gioiosa, R., Iskra, K., Gokhale, M.B., Yoshii, K., Beckman, P.: Argo NodeOS: Toward unified resource management for exascale. In: IEEE International Parallel and Distributed Processing Symposium (IPDPS). pp. – (May )</p>
      <p>19. Saavedra, R.H., Smith, A.J.: Performance characterization of optimizing compilers. IEEE Transactions on Software Engineering ( ), – (Jul )</p>
      <p>20. Sharma, K.: Locality transformations of computation and data for portable performance. Ph.D. thesis, Rice University (Aug )</p>
      <p>21. Shirako, J., Pouchet, L.N., Sarkar, V.: Oil and water can mix: An integration of polyhedral and AST-based transformations. In: SC : International Conference for High Performance Computing, Networking, Storage and Analysis. pp. – (Nov )</p>
      <p>22. Stratton, J.: Performance portability of parallel kernels on shared-memory systems. Ph.D. thesis, University of Illinois at Urbana-Champaign ( )</p>
      <p>23. Verdoolaege, S., Isoard, A.: Extending Pluto-style polyhedral scheduling with consecutivity. In: Proceedings of the th International Workshop on Polyhedral Compilation Techniques, IMPACT . Manchester, United Kingdom (Jan )</p>
      <p>24. Wang, W., Cavazos, J., Porterfield, A.: Energy auto-tuning using the polyhedral approach. In: Proceedings of the th International Workshop on Polyhedral Compilation Techniques, IMPACT . Vienna, Austria (Jan )</p>
      <p>25. Zinenko, O., Verdoolaege, S., Reddy, C., Shirako, J., Grosser, T., Sarkar, V., Cohen, A.: Unified polyhedral modeling of temporal and spatial locality. Research Report RR- , Inria Paris (Nov ), https://hal.inria.fr/hal</p>
      <p>26. Zinenko, O., Verdoolaege, S., Reddy, C., Shirako, J., Grosser, T., Sarkar, V., Cohen, A.: Modeling the conflicting demands of parallelism and temporal/spatial locality in affine scheduling. In: Proceedings of the th International Conference on Compiler Construction. pp. – . CC , Vienna, Austria (Feb )</p>
      <p>27. Zou, P., Allen, T., Davis IV, C.H., Feng, X., Ge, R.: CLIP: Cluster-level intelligent power coordination for power-bounded systems. In: IEEE International Conference on Cluster Computing (CLUSTER). pp. – (Sep )</p>
      <p>28. Zounmevo, J.A., Perarnau, S., Iskra, K., Yoshii, K., Gioiosa, R., Van Essen, B.C., Gokhale, M.B., Leon, E.A.: A container-based approach to OS specialization for exascale computing. In: IEEE International Conference on Cloud Engineering. pp. – (Mar )</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>