=Paper=
{{Paper
|id=Vol-3027/paper14
|storemode=property
|title=An Auto-Programming Approach to Vulkan
|pdfUrl=https://ceur-ws.org/Vol-3027/paper14.pdf
|volume=Vol-3027
|authors=Vladimir Frolov,Vadim Sanzharov,Vladimir Galaktionov,Alexandr Scherbakov
}}
==An Auto-Programming Approach to Vulkan==
Vladimir Frolov 1,2, Vadim Sanzharov 2, Vladimir Galaktionov 1 and Alexandr Scherbakov 2

1 Keldysh Institute of Applied Mathematics, Miusskaya sq., 4, Moscow, 125047, Russia
2 Lomonosov Moscow State University, GSP-1, Leninskie Gory, Moscow, 119991, Russia

Abstract

We propose a novel high-level approach to software development on GPUs using the Vulkan API. Our goal is to speed up development and performance studies of complex algorithms on GPUs, which is quite difficult and laborious in Vulkan due to the large number of hardware features and low-level details. The proposed approach uses auto-programming to translate ordinary C++ into an optimized Vulkan implementation with automatic shader generation, resource binding and fine-grained barrier placement. Our programming model is not general-purpose, but it is extendible and customer-focused. From a single C++ input, our tool can generate multiple different Vulkan implementations of an algorithm for different use cases or types of hardware. For example, we automatically detect reduction in C++ source code and then generate several variants of parallel reduction on the GPU: optimized for different warp sizes, with or without atomics, and with or without subgroup operations. Another example is GPU ray tracing applications, for which we can generate several variants: a pure software implementation in a compute shader, a version using hardware-accelerated ray queries, and a version using the full RTX pipeline. The goal of our work is to increase the productivity of developers who are forced to use Vulkan because their software requires certain hardware features, but who still care about the cross-platform ability of the developed software and want to debug their algorithm logic on the CPU. Therefore, we assume that the user will take the generated code and integrate it with hand-written Vulkan code.

Keywords

GPGPU, Vulkan API, ray tracing, code generation

1. Introduction

Over the past 5 years, dramatic changes have taken place in the field of GPU programming, which have put researchers and developers of deep learning, computer vision and computer graphics algorithms around the world in a difficult situation: widely used and convenient cross-platform technologies such as OpenCL and OpenMP do not provide access to the latest capabilities of modern GPUs (indirect dispatch, command buffers, ray tracing, texture compression, and many others). Technologies which do provide enough hardware features are proprietary (CUDA, OptiX) or difficult and laborious to use (Vulkan, DX12, Metal).

In practice, developers have to support several variants of the same algorithm for different GPUs to achieve the desired level of performance and compatibility. Moreover, the differences in the source code (and in performance) can be dramatic. For example, image processing can be implemented via compute shaders, via the graphics pipeline with hardware alpha blending, or via the graphics pipeline with sub-passes on mobile hardware. The essence of an algorithm does not depend on how exactly it is implemented on the GPU, but developers have to maintain different versions and work with low-level details, which usually differ between GPUs. Our goal is to preserve a pure algorithmic description of the software, but at the same time retain the ability to use any existing or future Vulkan hardware features.

GraphiCon 2021: 31st International Conference on Computer Graphics and Vision, September 27-30, 2021, Nizhny Novgorod, Russia
EMAIL: vova@frolov.pp.ru (V. Frolov); vadim.sanzharov@graphics.cs.msu.ru (V. Sanzharov); vlgal@gin.keldysh.ru (V. Galaktionov); alex.shcherbakov@graphics.cs.msu.ru (A. Scherbakov)
ORCID: 0000-0001-8829-9884 (V. Frolov); 0000-0001-6455-6444 (V. Sanzharov); 0000-0001-6460-7539 (V. Galaktionov); 0000-0002-5360-4479 (A. Scherbakov)
©️ 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
2. Existing Solutions

The number of existing GPU and FPGA programming technologies is significant [1] (and we believe this indicates the high relevance of the problems outlined in the previous section). These technologies can be classified into several groups.

Libraries. Libraries are usually aimed at a specific class of developers who need implementations of specific tools or algorithms. This category includes such well-known libraries as TBB, Thrust [2], CUBLAS, IPP, NPP, MAGMA, SafeGPU [3], RAJA [4], HPX [5] and many others. Some of them have grown into rather powerful software solutions (for example, PyTorch [6], TensorFlow [7] and MXNet [8]). The main drawback of libraries is their narrow focus and limited capabilities. In addition, many libraries are deliberately deprived of cross-platform ability (for example, TBB, Thrust, IPP, Intel Embree), since they are released by a certain CPU or GPU manufacturer to promote its own hardware, and that is why they are made non-portable.

Directive-based programming models. These include technologies such as OpenMP, OpenACC [9], C++ AMP (Microsoft), Spearmint [10], OP2 [11], the approaches of [12, 13] and DVM/DVMH [14-16]. This approach has two key disadvantages. The first is the absence of control over copying and distributing data between the CPU and GPU, which is done automatically; this becomes a bottleneck and results in poor performance [17]. In addressing this problem, interesting results were achieved in [12], where a software pipeline was used that combines the stages of execution on the CPU and GPU. However, the pipeline cannot always help, and in certain situations memory management and copying must be strictly controlled. The second disadvantage of this group of technologies is the extremely weak support for GPU hardware capabilities. For example, in OpenACC it is impossible to directly use shared memory or warp voting functions, and there is no support for textures (which is unacceptable for high-performance and energy-efficient computer vision and computer graphics applications). Other technologies of this group have similar problems.

Skeletal programming. This group includes such works as SkePU [18, 19], SkelCL [20], SYCL [21] and its derivatives (Intel oneAPI). These technologies target developers who actively use STL-like containers in C++. The goal is to have one implementation of an algorithm in C++ which can work efficiently both on the CPU and on the GPU, or support hybrid execution (simultaneously on the CPU and GPU). The main advantage over the previous group of technologies is higher efficiency for some algorithms (those which actively use basic skeletons or their combinations). The disadvantage is a significant deterioration in readability, code maintainability and flexibility, since the implementation of an algorithm has to fit the skeleton, which leads to unnatural-looking implementations [22]. In addition, skeletal programming inherits almost all the disadvantages of the previous group. Nevertheless, it took an important step forward, since it was the first approach to apply algorithmic optimizations on the GPU.
GPGPU. This group includes CUDA and OpenCL, technologies which are employed by a wide class of professional GPU developers and provide access to a wide range of hardware capabilities of modern GPUs. Nowadays CUDA is the dominant technology thanks to Nvidia's technological leadership. However, in addition to the lack of cross-platform ability, it has many drawbacks: CUDA is a very heavy technology with a huge number of features and versions (11 versions at the moment), which do not always work equally well on different Nvidia GPUs. Porting a software system from one version of CUDA to another is often time-consuming. To overcome the dependence on a single platform, several open-source implementations of CUDA were developed on top of OpenCL or SIMD CPU instructions [23-25], as well as AMD HIP [26]. But this approach has a critical problem: the lack of hardware support for certain functionality in OpenCL (or whatever technology is used as a back-end) means that the algorithm either won't work at all, or will work slowly (due to software emulation of this functionality), which is unacceptable for most GPGPU applications.

As for OpenCL itself, the main drawback of this technology is the lack of support for the hardware functionality of modern GPUs. One example is indirect dispatch [27], which was not included even in the recently released OpenCL 3.0 standard. This functionality is critical for complex algorithms where control flow depends on the results of GPU computations [28].

Graphics APIs, Vulkan. This group includes technologies such as OpenGL 4+, DirectX 10-12, Metal and Vulkan. GPU programming has changed a lot over the past 5 years with the new hardware features exposed by APIs such as Vulkan, DirectX12 and Metal. In the following we consider only Vulkan, because it is a cross-platform technology, unlike DirectX12 and Metal. Unfortunately, the complexity of developing programs using Vulkan exceeds that of CUDA or OpenCL by up to 10 times [29], which is a consequence of manual control over many hardware capabilities. The traditional way of reducing code complexity is to create lightweight helper libraries, which can be general-purpose or specific to each application. However, in this case such an approach does not really help much, because the user still has to work with the same Vulkan entities, setting up and connecting the relationships between them. An alternative solution is to create a heavier library or an engine that encapsulates Vulkan entities and tries to handle things automatically (V-EZ, VUH, Kompute). But then such an engine largely repeats the work that, for example, an OpenGL driver does, which usually leads to suboptimal implementations and bugs, as the developer duplicates the work of the driver (or of the other API). The problem is fundamental: if someone is going to use Vulkan, they need very specific low-level control over certain functionality, which has to be implemented and carefully configured in some places, but not everywhere.
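To make the point about indirect dispatch concrete, the sketch below shows what it looks like in Vulkan: a later dispatch reads its grid size from a buffer written by an earlier kernel, so control flow can depend on GPU results without a CPU round trip. This is our minimal illustration, not code from the paper; cmdBuf and indirectBuf are placeholder handles whose creation is omitted.

// A first kernel writes a VkDispatchIndirectCommand {x, y, z} into
// indirectBuf; the second dispatch then takes its grid size from the GPU.
vkCmdDispatch(cmdBuf, 256, 1, 1);  // kernel that decides how much work is left

VkMemoryBarrier barrier = {};
barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
vkCmdPipelineBarrier(cmdBuf, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                     VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT, 0,
                     1, &barrier, 0, nullptr, 0, nullptr);

vkCmdDispatchIndirect(cmdBuf, indirectBuf, 0);  // grid size is read on the GPU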
Domain-specific APIs. This group comprises technologies such as Nvidia OptiX [30], Microsoft DirectML [31] and AMD MIOpen. OptiX is a specialized technology for accelerating ray tracing applications on GPUs, developed in C++. OptiX includes two key technologies: (1) hardware-accelerated ray tracing and (2) accelerated virtual function calls using shader tables. These two features are also available in Vulkan and DirectX12. Since it is much easier to program in OptiX (and it is the only industrial-level technology of this kind that supports C++), almost all industrial rendering systems have switched to OptiX: Octane, VRay, IRay, RedShift, Cycles, Thea and many others. Although the latest AMD GPUs have hardware support for ray tracing, in practice users have no alternative to Nvidia: analogues (such as [32, 33]) significantly lag behind OptiX-based solutions. This is an example of how a programming technology combined with a hardware implementation established the monopoly of one GPU manufacturer, depriving the developers of rendering systems of the ability to create cross-platform solutions, which is a significant drawback.

Domain-specific languages (DSL). This group targets developers working in specific fields, for whom it is important to achieve maximum simplification and abstraction from the details of the algorithm's implementation on the GPU while keeping a good level of performance. For example, the languages Darkroom [34] and Halide [35-37] are designed for creating image processing algorithms. Cross-platform ability in Halide is achieved by implementing a large number of low-level layers, and high efficiency is achieved through optimized filter sequences. One of the key optimizations is based on the idea of reordering operations: if two filters are applied to an image sequentially, the image can often be divided into regions and the entire chain of filters applied to each region at once, which decreases L2 cache misses. The disadvantage of DSLs is that algorithms and knowledge written in them are difficult to transfer to other areas and difficult to integrate with the rest of the software that does not use the DSL. In addition, the more areas a software solution targets, the more different DSLs have to be used and stacked, which significantly complicates development. For example, in modern computer graphics applications shaders are essentially a domain-specific language: currently, in graphics and ray tracing pipelines an algorithm is distributed over several programs (from 2 to 5), supplemented by pipeline state setup in C++ code. This makes writing a program extremely difficult, since there is no single description of the algorithm, but instead many disparate programs and configurable links between them in different places.

Various high-level approaches. This group includes different unique solutions. The work [38] focuses on achieving cross-platform high performance. The goal is to have a single representation of an algorithm that can be translated into an efficient implementation on various computing systems. The means of achieving this are the so-called "multidimensional homomorphisms", a formal high-level description of a problem that allows expressing computations with parallel patterns which can be implemented in different ways on different hardware. A significant limitation of [38] is the use of OpenCL as the low-level layer, which does not provide access to many of the new capabilities of modern GPUs (including mobile ones). In addition, [38] uses an extremely limited DSL, which is a significant drawback. TaskFlow [39] proposes a model for expressing computations in the form of static and dynamic task graphs for heterogeneous computing systems. Tasks communicate with each other by streams of data along the edges of the graph through queues, using the producer-consumer scheme.
TaskFlow can build heterogeneous graphs (using both CPU and GPU) and then execute them in a pipelined fashion similar to [13]. Such a computation model is promising, since it can not only be executed efficiently on CPUs and GPUs, but can also be used for prototyping hardware implementations on FPGA or VLSI. The disadvantage of TaskFlow is that the algorithm must be explicitly described in terms of task graphs, which is not very convenient for development and debugging.

The PACXX compiler [40, 41] uses skeletal programming ideas but is more convenient for use in existing code. PACXX directly implements modern C++ constructs and translates the use of STL containers into GPU buffers. Unfortunately, PACXX does not provide access to many hardware features (for example, textures) which are available even in CUDA and OpenCL. Therefore, from a practical point of view, its advantages over pragmas (for example, OpenACC) are not significant.

The main purpose of Chapel [42] is cross-platform ability, so that a single description of an algorithm can work efficiently on both the CPU and the GPU. In addition, it is positioned as easier to use than CUDA. To this end, the authors propose a new language oriented towards parallel programming. Unlike Halide, however, it is a general-purpose language. The key disadvantage of this approach is that users have to port a significant part of their algorithms to the new language, which is usually met with resistance in industry because it implies a lack of cross-platform ability.

Taichi [43] is a programming language and an optimizing compiler oriented towards applications dealing with sparse data structures, including physical simulation, ray tracing and neural networks. Taichi allows users to write high-level code in the proposed language (whose frontend is embedded in C++) as if they were dealing with ordinary dense multidimensional arrays. The compiler then generates an intermediate representation, optimizes it and generates C++ or CUDA code. Taichi also handles memory management and uses the unified memory access feature available in CUDA. The ability to target both CPUs (using techniques such as loop vectorization) and GPUs (although only via CUDA) is a strong point of this solution. The fact that Taichi targets operations on specific data structures allows it to produce highly optimized and efficient code, but at the same time limits its potential applications. Being a DSL, Taichi shares the same drawbacks, although its close integration with C++ somewhat alleviates them.

Tiramisu [44] is a polyhedral compiler (that is, one considering different optimization options for the same algorithm) that specializes in high-performance code generation for different platforms. At the same time, this compiler has several limitations. First, it only supports a specialized high-level language, which makes it difficult to adopt in applications written in other languages and increases the time spent on porting algorithms. Second, it targets cluster computing and has significant limitations in terms of hardware support.

The clspv compiler [45] translates OpenCL kernels into SPIR-V (the intermediate GPU representation used by Vulkan to define shader programs) and thus can be used for Vulkan development. Unfortunately, it only supports compute shaders and is officially still a prototype (although it is already relatively stable, having been in development since 2017).
The Circle compiler appeared about two years ago, but it was not until 2020 that it became focused on GPU programming [46]. It is currently the only C++ compiler that supports the graphics functionality of modern GPUs (graphics pipeline, ray tracing pipeline, mesh shaders), although its development is still at an early stage. Circle is a traditional compiler and has all the drawbacks of the traditional approach: if developers adopt Circle as their main tool (that is, not only for shaders but for the entire description of the algorithm), they become dependent on it, and building for any platform (including mobile systems) is no longer possible without Circle.

2.1. Conclusion on existing solutions

General-purpose technologies do not support enough hardware features, which hinders performance and energy efficiency (for example, PACXX, OpenACC, DVMH, OpenCL, CUDA, TaskFlow). Low-level industrial APIs provide such support, but development with them is extremely laborious (Vulkan, Metal, DirectX12). Domain-specific technologies and languages (Halide, OptiX) are a good solution for GPUs and even for other computing systems (FPGA or ASIC). However, their key disadvantage is that algorithms and knowledge implemented with them are difficult to transfer to other areas and difficult to integrate with the rest of the software that does not use the DSL. There are no technologies that achieve two goals simultaneously: (1) cross-platform ability and (2) access to specific hardware features, because existing solutions lack an intermediate layer between the high-level algorithm description and its actual implementation. The best results in this direction have been achieved in [38, 39, 43, 44].

3. Proposed solution

The proposed programming technology is not general-purpose: it considers several different fields of application and tends to be customer-oriented. At the same time, unlike, for example, Halide [35-37], it does not use a domain-specific language (DSL), but instead extracts the necessary knowledge from ordinary C++ code. One of the main advantages of our technology is that the input source code is not extended with any new language constructs or directives. It is assumed that the input is normal, hardware-agnostic C++ source code which in general can be compiled by any appropriate C++ compiler (with some limitations). This significantly improves the cross-platform ability of development using the suggested technology.

To achieve this, we turn the concept of programming technology upside down: instead of building a general-purpose programming technology for constructing various software systems, we propose an extendable technology that can be customized for a specific class of applications. We use pattern matching to find patterns in C++ code and transform them into efficient GPU code in Vulkan. Patterns are expressed through C++ classes, member functions and relationships between them.

Before we proceed, it is important to note the difference between our technology and most existing parallel programming technologies such as CUDA, Taichi and Halide: they extend the programming language with new constructs, or propose new languages, in which parallel constructs map to some efficient implementation in the hardware. Our approach is the opposite. First, we do not extend the programming language, but rather limit it. Second, our patterns do not express hardware features. Instead, they express algorithmic and architectural knowledge; hardware features are thus the responsibility of the translator, not the user. There could be many patterns in total, but in this work we have implemented a limited number of them. Therefore, we describe the implemented patterns by example.
3.1. Patterns

Patterns are divided into architectural and algorithmic. An architectural pattern expresses architectural knowledge about some part of the software. It determines the behavior of the translator as a whole and is thus responsible for an application area (for example, image processing, ray tracing, neural networks, fluid simulation, etc.). Algorithmic patterns express algorithmic knowledge and define a narrower class of algorithms that have efficient hardware implementations and can be found inside architectural patterns: for example, parallel reduction, data compaction (parallel append to a buffer), sorting, scan (prefix sum), histogram building, map-reduce, etc. Let us now consider the patterns implemented in the current version of our translator.

Architectural pattern for image processing. This pattern provides basic GPGPU capabilities. The input source code looks like ordinary C++ code with loops in OpenMP style, except that we do not actually use directives (pragmas). Instead, we assume that there is a certain class with control and kernel functions (Listing 1). Kernel functions contain the code that is ported to the GPU almost "as is". Control functions are the functions which call kernel functions; they thus define the logic of kernel launches and resource bindings.

Architectural pattern for ray tracing. The goal of this pattern is to provide access to hardware-accelerated ray tracing and to call virtual functions efficiently on the GPU. It can therefore be considered a cross-platform OptiX analogue. The significant difference from the previous pattern is that in the image processing pattern loops are assumed to be placed inside kernel functions, while in the ray tracing pattern they are assumed to be placed outside control functions (and thus outside kernels too). The ray tracing pattern is therefore convenient when complex and heavy code is used for each thread or each processed data element, but the data processing loop itself is straightforward. The image processing pattern is convenient when the number of threads (processed data elements) changes during the algorithm and inter-thread communication on the GPU is actually needed: for example, if we need to downsample an image, we can process a small version of the image and then upscale it back.

Algorithmic pattern for parallel reduction. For this pattern, we detect accesses to class data variables (members of the input class, Listing 1) and generate code for parallel reduction on the GPU.

Algorithmic pattern for parallel append of elements to a buffer, with the related pattern of subsequent indirect dispatching. Imagine that you have a member function which processes input data and appends some of it to a vector via std::vector::push_back(...), and that you then process the selected data in another function. This time the loop counter depends on vector.size(), so the actual number of threads on the GPU should be different: it becomes known only after the first kernel finishes, and therefore we have to insert an indirect dispatch here.
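For illustration, below is a hypothetical input sketch of this append pattern, written in the style of Listing 1; the class and member names are ours and are not taken from the actual test suite.

class Selector {
public:
  void ProcessPositive(const int* a_data, uint a_dataSize) {
    kernel1D_Select(a_data, a_dataSize);   // appends to m_selected
    kernel1D_Process(m_selected.size());   // size known only after the first kernel
  }
  void kernel1D_Select(const int* a_data, uint a_dataSize) {
    for(uint i = 0; i < a_dataSize; i++)
      if(a_data[i] > 0)
        m_selected.push_back(a_data[i]);   // parallel append (data compaction) pattern
  }
  void kernel1D_Process(size_t a_size) {
    for(uint i = 0; i < a_size; i++)       // loop bound depends on vector.size(), so
      m_result += m_selected[i];           // the translator emits an indirect dispatch
  }
  std::vector<int> m_selected;
  int m_result = 0;
};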
Listing 1 shows an input code example; Listings 2 and 3 show simplified output examples.

1. class Numbers {
2. public:
3.   void CalcArraySumm(const int* a_data, uint a_dataSize) {
4.     kernel1D_ArraySumm(a_data, a_dataSize);
5.   }
6.   void kernel1D_ArraySumm(const int* a_data, size_t a_dataSize) {
7.     m_summ = 0;
8.     for(uint i=0; i<a_dataSize; i++) {
9.       int number = a_data[i];
10.      if(number > 0)
11.        m_summ += number;
12.    }
13.  }
14.  int m_summ;
15. };

Listing 1: Example of input source code for the image processing architectural pattern: calculation of the sum of all positive numbers in an array. A kernel function and its dimension (1D, 2D or 3D) are extracted by analyzing the function name (kernel1D_ArraySumm, lines 6-12). A control function is extracted by analyzing its code: if at least one of the kernel functions is called from a function, it is a control function (CalcArraySumm, lines 3-5). All class data members accessed by kernels are placed in a single "class data buffer" (m_summ, line 14). Access to such variables is further analyzed: if a kernel writes a single variable on different loop iterations (line 11), we generate parallel reduction code at the end of the shader for this variable (Listing 3).

1. class Numbers_Generated : public Numbers {
2. public:
3.   virtual void SetInOutFor_CalcArraySumm(VkBuffer a_dataBuffer) { ... }
4.   virtual void CalcArraySummCmd(VkCommandBuffer a_commandBuffer, uint a_dataSize) {
5.     m_currCmdBuffer = a_commandBuffer;
6.     vkCmdBindDescriptorSets(a_commandBuffer, ... , ArraySummLayout, &allGeneratedDS[0], ... );
7.     ArraySummCmd(a_dataSize);
8.   }
9. protected:
10.  virtual void ArraySummCmd(size_t a_dataSize) {
11.    ...
12.    vkCmdBindPipeline   (m_currCmdBuffer, ... , ArraySummInitPipeline);
13.    vkCmdDispatch       (m_currCmdBuffer, 1, 1, 1);
14.    vkCmdPipelineBarrier(m_currCmdBuffer, ... );
15.    vkCmdBindPipeline   (m_currCmdBuffer, ... , ArraySummPipeline);
16.    vkCmdDispatch       (m_currCmdBuffer, a_dataSize/256, 1, 1);
17.    vkCmdPipelineBarrier(m_currCmdBuffer, ... );
18.  }
19.  ...
20. }

Listing 2: Given some class as input, our solution generates an interface and an implementation of the GPU version of the algorithms implemented in control functions. Due to the peculiarities of Vulkan, we have to generate two functions for each input control function; thus, for CalcArraySumm we generate SetInOutFor_CalcArraySumm and CalcArraySummCmd. When the first one is called, it creates a descriptor set for the input buffer (a_dataBuffer in the example) and saves it to allGeneratedDS[0]. The input pointer parameter a_data has been removed: in the generated code, pointer parameters of control and kernel functions are not used, because all data now resides on the GPU and is accessed via descriptor sets in shaders. In this example two different shaders were generated: ArraySummInitPipeline, which executes the loop initialization (zeroing the sum), and ArraySummPipeline, which executes the loop body (Listing 3).

1. __kernel void kernel1D_ArraySumm(__global const int* a_data, __global NumbersData* ubo, ...) {
2.   __local int m_summShared[256*1*1];
3.   ...
4.   int number = a_data[i];
5.   if(number > 0)
6.     m_summShared[localId] += number;
7.   ...
8.   barrier(CLK_LOCAL_MEM_FENCE);
9.   m_summShared[localId] += m_summShared[localId + 128];
10.  barrier(CLK_LOCAL_MEM_FENCE);
11.  m_summShared[localId] += m_summShared[localId + 64];
12.  barrier(CLK_LOCAL_MEM_FENCE);
13.  m_summShared[localId] += m_summShared[localId + 32];
14.  m_summShared[localId] += m_summShared[localId + 16];
15.  m_summShared[localId] += m_summShared[localId + 8];
16.  m_summShared[localId] += m_summShared[localId + 4];
17.  m_summShared[localId] += m_summShared[localId + 2];
18.  m_summShared[localId] += m_summShared[localId + 1];
19.  if(localId == 0)
20.    atomic_add(&ubo->m_summ, m_summShared[0]);
21. }

Listing 3: Example of a generated shader for the clspv compiler. The original loop body transforms into lines 4-6. It can be seen that the access to m_summ has been rewritten to m_summShared[localId], which is further used in the parallel reduction at the end of the shader. Lines 13-18 implement a variant of parallel reduction optimized for Nvidia hardware, assuming a warp size of 32 threads. This part changes depending on the input parameters of our translator: for example, we can turn the optimization off, assume a smaller warp size (8 for mobile GPUs), or use subgroupAdd (available only in GLSL) instead of lines 13-18.
3.2. Code generation

The proposed generator works on the principle of code morphing [47]. The essence of this approach is that, given a certain class in a program and a set of transformation rules, we can automatically generate another class with the desired properties (for example, an implementation of the algorithm on the GPU). The transformation rules are defined by the patterns described above, within which the processing and translation of the current code is carried out. The generated class is inherited from the input class, and thus has access to all data and functions of the input class.

The input source code is processed via clang and LibTooling [48]. Almost all tasks in our translator are done in two passes. On the first pass we detect patterns and their parts via LibTooling: nested loops inside kernel functions, reduction accesses, accesses to class data members, etc. On the second pass we rewrite the source code using clang AST visitors. The final source code is produced via a templated text rendering approach [49].

Thus, our solution is implemented via pure source-to-source transformations and, unlike Circle, for example, we do not work with the LLVM code representation. While this approach has certain limitations (we cannot change the input programming language, for example, to Rust or Ada, which is easily achieved with LLVM in general), it also has significant advantages:
1. The generated source code for both shaders (OpenCL C for clspv [45] or GLSL) and host C++ with Vulkan calls looks like normal code written by hand. It can be debugged, changed or combined with other (usually hand-written) code in any way. Thus, unlike many existing programming technologies, it is easy to distinguish errors of the generator/translator from user errors. This is a problem for OptiX or Circle, for example, because we cannot see what the programming technology actually does with the input code.
2. Generating source code for shaders gives us huge flexibility with respect to subsequent shader compiler features, because we can easily add support for different hardware features. The early version of our tool used only clspv [45] for shaders. However, we quickly found that the capabilities of clspv are not enough for ray queries, virtual functions and many other things. It is possible to add such support to clspv to get the desired features in SPIR-V from OpenCL C shader source code, but this is an expensive and hard way, because working with both SPIR-V and the clspv source code requires special knowledge and significant effort. At the same time, adding support for a new hardware feature directly to the generated GLSL source code is relatively easy.
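Returning to the two passes described above, the following is a minimal sketch of the first (detection) pass using clang's ASTMatchers; it is our illustration of the approach under the naming convention of Listing 1, not the translator's actual source code.

#include "clang/ASTMatchers/ASTMatchers.h"
#include "clang/ASTMatchers/ASTMatchFinder.h"
#include "llvm/Support/raw_ostream.h"
using namespace clang;
using namespace clang::ast_matchers;

// Detect member functions that follow the kernel1D_/kernel2D_/kernel3D_
// naming convention from Listing 1.
static const DeclarationMatcher kernelMatcher =
    cxxMethodDecl(matchesName("::kernel[123]D_")).bind("kernelFn");

struct KernelFinder : public MatchFinder::MatchCallback {
  void run(const MatchFinder::MatchResult &result) override {
    if (const auto *fn = result.Nodes.getNodeAs<CXXMethodDecl>("kernelFn"))
      llvm::outs() << "kernel found: " << fn->getQualifiedNameAsString() << "\n";
  }
};

A second callback of the same kind can record reduction-style accesses to class data members; the second pass then rewrites the matched code with AST visitors.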
CPU <=> GPU data transfer. As mentioned in the review of related work, many existing solutions handle data copying automatically. For software that uses Vulkan, in most cases this is not satisfactory, for many reasons. In the proposed approach, we generate code that executes the algorithm on the GPU and performs copying, and then let the user call this code independently. A generated function called UpdateAll performs this task. If the user needs data back on the CPU from some internal data of the generated class, they can create a new class inherited from the generated one and implement any additional algorithm or copying functionality there.

Our solution maps the entire generated algorithm to the GPU; therefore, in general, all generated variables and buffers are located on the GPU. However, since the generated class is inherited from the original one, it also contains all the original variables and vectors on the CPU under their own names. The user either provides their own copying implementation to the UpdateAll method (via an interface implementation), or uses ours from the library. Accordingly, temporary buffers are either created manually by the user, or an implementation provided by us is used to create them. In the same way, the user may manually free unneeded CPU data after the UpdateAll call.
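As a sketch of this extension mechanism, a user-side class derived from the generated one (Listing 2) might look as follows; ReadBackSumm, ReadBufferToHost and m_classDataBuffer are hypothetical names for illustration, not part of the generator's actual output.

// Hypothetical user extension of the generated class from Listing 2.
class NumbersOnGPU : public Numbers_Generated {
public:
  // Read the reduction result back after the command buffer has executed.
  // In real code this would use a staging buffer (or vkCmdCopyBuffer plus a
  // fence); here it is collapsed into a single hypothetical helper call.
  int ReadBackSumm() {
    int result = 0;
    ReadBufferToHost(&result, m_classDataBuffer, 0, sizeof(int));
    return result;
  }
};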
4. Experimental evaluation

We evaluated our approach on several applications for which we generated different implementations (GPU v1-v3) using different options of our translator (fig. 1). Unlike traditional compiler options, these options force our translator to apply different hardware features and different actual implementations of the same algorithm on the GPU; therefore, the performance difference is significant in some cases. The results of our experiments are presented in Tables 1 and 2. The GPU implementation is usually 30-100 times faster than a multi-core CPU version, which means performance is at the desired level on average. Table 2, on the other hand, demonstrates the high labor intensity of implementing such experiments manually in Vulkan. Thus, performance studies become easier with our solution.

Reduction samples (#1-#3). Here we demonstrate the ability to detect and generate different implementations of parallel reduction. Although the speedup is not very significant (which is expected for such tasks), it was stable, in contrast to multithreaded execution on the CPU, for which #1 and #2 were on average slower than the single-threaded version (we took the smallest time over several runs). GPU v1 is a cross-platform implementation which uses Vulkan 1.0, does not know the warp size and does not use subgroup operations. GPU v2 knows the warp size (passed via a command line argument) and thus may omit synchronization operations for the last several steps of the reduction (Listing 3). GPU v3 knows the warp size and additionally uses subgroup operations.

Figure 1: Example applications of the proposed technology. Bloom filter (top left), spherical harmonics integration (top middle), guided Non-Local Means denoising (bottom left and bottom middle), path tracing with different materials (for testing virtual functions) and procedural textures (right), and an N-body simulation (top right).

Table 1
Execution time in milliseconds for different algorithms and different implementations generated via the proposed technology. Because the applications are different, v1-v3 mean different optimizations for different samples, as described in detail further on. The first two rows show the time for calculating the sum of 1 million numbers; this task is mapped to parallel reduction on the GPU. The third row is spherical harmonics evaluation, which is a 2D reduction with some math. NLM means the guided Non-Local Means filter. For the path tracing implementation on the CPU we used Embree ray tracing. The CPU is an Intel Core i9 10940X, the GPU is an Nvidia RTX 2080. For path tracing, a 512x512 image was rendered (i.e., 256K paths were traced). '*' means that for path tracing and the v3 variant the generated code was finalized by hand due to the early stage of the GLSL generator in our solution.

App/Impl            CPU (1 core)    CPU (14 cores)   GPU v1     GPU v2     GPU v3
                    input source    OpenMP           ours       ours       ours
(#1) Int Arr. Σ     1.263 ms        0.271 ms         0.095 ms   0.089 ms   0.084 ms
(#2) Float Arr. Σ   1.420 ms        0.342 ms         0.104 ms   0.096 ms   0.096 ms
(#3) Sph. Eval.     39.73 ms        2.931 ms         0.399 ms   0.364 ms   0.320 ms
(#4) NBody          250400 ms       11920 ms         118.0 ms   -          -
(#5) Bloom          711.8 ms        52.74 ms         0.733 ms   1.420 ms   0.841 ms
(#6) NLM            88440 ms        6851 ms          422.0 ms   571.1 ms   351.0 ms
(#7) Path Trace     188.4 ms        14.928 ms        4.790 ms   1.310 ms   0.460 ms*

NBody (#4) is a classic GPGPU problem of quadratic complexity, for which we demonstrate a 100x acceleration in comparison to the multi-threaded CPU version.

Bloom and Non-Local Means (#5, #6). In these samples we demonstrate the ability to change the implementation of images from buffers (GPU v1) to textures (GPU v2) and to use half floats for the texture format (GPU v3), which gives an essential speed-up. It is interesting to note that for these image processing examples 32-bit float textures were slower than the buffer variants; for Bloom, buffers were even faster than half-float textures, which is due to the Load/Store access and the general texture layout. In this way we show that the performance question on GPUs is not trivial and requires implementing and testing different variants of the same algorithm. With the proposed solution, such experiments can be automated.

Table 2
Lines of code for different applications. The first column (C++) shows the line count of the original input class. The second column shows the total line count of the generated source code. The "Vulkan (compact)" column approximately estimates the number of lines of Vulkan code written by a human without our approach; we obtained it by excluding the descriptor set setup from the generated code, because it is quite verbose in our generator and can be written more compactly by hand. The last column shows the number of lines of device code in the generated kernels.

App/Lines           C++              Vulkan        Vulkan      Vulkan
                    (input source)   (generated)   (compact)   (shaders only)
(#1) Int Arr. Σ     80               640           500         195
(#2) Float Arr. Σ   80               780           650         285
(#3) Sph. Eval.     120              1280          1000        445
(#4) NBody          115              850           700         140
(#5) Bloom          300              1460          1050        250
(#6) NLM            330              1290          1000        250
(#7) Path Trace     800              4500          3200        1470

For Bloom we get a 100x speedup over the multi-threaded CPU version, while for Non-Local Means it is only 20x, which seems suboptimal for such a heavy task. In fact, there could be a lot of optimizations for image processing (at least more aggressive pixel quantization/compression), but we believe the Halide project [35-37] did most of them, and Taichi did similar work for physics simulation [53]. So we decided to focus our performance investigation on ray tracing. For path tracing, the initial generated code (which uses exactly the same traversal algorithm) outperforms the original CPU variant by a factor of 10x. We then replaced the CPU ray traversal with the optimized Embree implementation (Table 1) and got only 3x. Then we added hardware-accelerated ray tracing in a compute shader (12x) and generated a single kernel via the RTX pipeline, which finally gives us 32x over multithreaded CPU path tracing with Embree.
4.1. Path Tracing experiments (#7)

Light transport simulation algorithms involve heavy mathematical models and require extensibility from the framework in which they are implemented. In existing CPU rendering systems this usually means an object-oriented approach. Therefore, accelerating ray tracing alone is not enough: to achieve efficiency on the GPU, we should study how complex and extendible code can be implemented there. Currently, there are three general approaches:
1. Single-kernel: the whole code for light path evaluation is placed inside a single kernel. There are different options for optimization (like OptiX state machines) [50]. This approach is good for relatively simple models, for example in computer games. However, with a growing code base its performance degrades dramatically due to branch divergence and register pressure.
2. Multiple-kernel: the code is split into several smaller kernels, which communicate by reading and writing data in GPU memory. The necessity to read/write data from/to memory results in a significant performance overhead, depending on the application [51].
3. Wavefront path tracing: this approach extensively uses sorting and compaction of threads to organize them into several queues which execute different computational kernels. This helps to avoid branch divergence, but may result in an even bigger performance overhead [52].

All of these approaches can be used (and are used) in real-life applications, and no single approach is strictly better than the others in all cases. Depending on the available hardware and the needed features, different implementations may be required to achieve optimal performance. Even though the implementations of these approaches differ significantly in code, the essential algorithm (path tracing) and the implemented models (BRDFs) stay the same; what actually differs is how these computer graphics algorithms are translated to GPU code.

4.2. Adding hardware ray tracing

We implemented the basic path tracing algorithm and several material models on the CPU and used the proposed solution to generate a GPU Vulkan-based implementation with software ray tracing in the multiple-kernel fashion (GPU v1). The generation was performed using the ray tracing pattern described in section 3.1. The generated GPU version corresponds to the CPU version and implements the naive path tracing algorithm, consisting of the following kernel launches:
1. Primary ("eye") ray generation.
2. A loop until the maximum tracing depth is reached:
   a. a ray trace kernel, which searches for an intersection and stores the surface hit (6 floats);
   b. a kernel to obtain the material id for the hit surface (2 floats);
   c. a next-bounce kernel which performs shading computations and stores the path state: the new ray position and direction (8 floats) and the accumulated color and throughput (8 floats).
3. A kernel which contributes the accumulated color to the output image.
Therefore, the total ray payload is 24 floats (96 bytes) per thread.

Next, the host part of the generated code was modified via virtual function override (we override the generated ray tracing kernel call) to use hardware-accelerated ray tracing via the VK_KHR_ray_query extension, which allows using RTX functionality in compute shaders (among others). The modification required less than 1000 lines of code (about the same as drawing a single triangle in Vulkan). This is GPU v2.
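A minimal sketch of this kind of override is shown below; the class and member names are hypothetical (following the style of Listing 2), pipeline creation is omitted, and only vkCmdBindPipeline/vkCmdDispatch are real Vulkan calls.

// Hypothetical sketch: replacing the generated software ray trace kernel
// dispatch with a compute pipeline whose shader uses VK_KHR_ray_query.
class PathTracerRQ : public PathTracer_Generated {
protected:
  void RayTraceCmd(uint32_t a_sizeX, uint32_t a_sizeY) override {
    vkCmdBindPipeline(m_currCmdBuffer, VK_PIPELINE_BIND_POINT_COMPUTE,
                      m_rayQueryPipeline); // its shader performs ray queries
    vkCmdDispatch(m_currCmdBuffer, (a_sizeX + 31) / 32, (a_sizeY + 3) / 4, 1);
  }
  VkPipeline m_rayQueryPipeline = VK_NULL_HANDLE; // created by hand-written code
};

The rest of the generated kernel sequence from section 4.2 stays untouched.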
In this way we show how generated and hand-written code can be connected together (both the CPU and GPU parts). Considering the kernel launches described above, we simply replace the ray trace kernel with a new one which uses hardware-accelerated ray queries.

Finally, we implemented several variants of exactly the same path tracing setup via the full RTX pipeline for performance comparison. We took a part of the generated GLSL code (the functions for material sampling, etc.) and finalized it by adding it to a ray tracing pipeline. This is GPU v3. In fact, there were three different implementations of GPU v3, because RTX itself has many options (fig. 2, first three rows). In all cases path tracing was performed with a tracing depth equal to 6 and with 8 samples per pixel (for larger sample counts Nvidia Nsight runs out of memory during GPU trace). Measurements were made on a geometrically simple scene (about 31k triangles) featuring a variety of material models: Lambertian, perfect mirror, glossy GGX, and a blended material (GGX and Lambertian BRDFs mixed with respect to a procedural noise texture mask).

Initially we did not plan to generate a single-kernel version, because for offline ray tracing applications it is not the best option due to significant performance degradation when new materials and light source models are added. We did not even plan to generate GLSL code for the logic using our generator; on the contrary, we planned to replace specific parts of the algorithm (for example, ray tracing) via separate kernel calls. Interaction via DRAM seems natural and common for GPU programs, but it turned out that this approach is rather limited. First, clspv has only basic support for hardware features in shaders. Second, for a relatively simple code base a single-kernel variant can be significantly better, because less data is transferred between the chip and DRAM (fig. 2). Nevertheless, our GPU v2 implementation almost caught up with the most flexible/stable variant of RTX based on callable shaders (first row in fig. 2). The latter seems to be a wavefront path tracing approach implemented by Nvidia inside the RTX pipeline, and it is not cross-platform. Thus, our further studies were related to the following question: can we get the same performance as a solution based on callable shaders within the multiple-kernel framework, for example by regrouping threads at material sampling to avoid branch divergence and high register pressure?

Figure 2: Comparison of performance in millions of paths per second between different variants; 1024x1024; RTX 2080. The measured values are:
[GPU v3] RT pipeline, many callable kernels: 272.6
[GPU v3] RT pipeline, many closest hit ...: 528.4
[GPU v3] RT pipeline, raygen "single"-kernel: 805.3
[GPU v2] multiple kernels, no tiles: 243.7
[GPU v2] multiple kernels, ray query, two ...: 215.1
[GPU v2] multiple kernels, ray query, one ...: 195.1
[GPU v1] multiple kernels: 65.7
The raygen "single"-kernel variant implements all material models inside the ray generation shader; the many closest hit kernels variant performs computations for different material models inside different closest hit shaders; and the many callable kernels variant does so inside different callable shaders invoked from the ray generation shader. The last four rows are GPU v2 (ray query, with different tiling variants) and GPU v1.

4.3. Performance analysis and asynchronous compute for the multiple kernel approach

A problem of the multiple kernel approach is that the data passed between kernels is stored in DRAM.
Actually, if the intermediate data is small enough to fit into the L2 cache, the DRAM workload decreases significantly, which we analyzed via the Nvidia Nsight tool. Unfortunately, we cannot significantly reduce the number of active threads, because barriers between kernels frequently lead to a situation where the previous kernel is still computing in a few threads while the new one cannot be launched yet. So, for an arbitrary tile size, the multiple kernel approach faces a trade-off between the L2 hit rate, the amount of required memory and the DRAM throughput (fig. 3, blue lines).

In fact, for the multiple kernel approach we should try to reduce the tile size, because then less memory is required for intermediate data and the DRAM workload is reduced. This usually means that more complex processing can be done efficiently in the future. To do this efficiently, we have to feed new independent work to the GPU as soon as the previous kernel finishes execution. Thus, we decided to submit new work from an independent queue using asynchronous compute in Vulkan (fig. 3, orange lines). Let us say we have a 1024x1024 image split into tiles of 256x256 pixels. We can process the whole image tile by tile, or we can, for example, process two 256x256 tiles asynchronously. For a fair comparison we should take a tile twice as big for the single queue as for two queues (for example, 512x256 for a single queue and 256x256 for two queues), so that the actual buffer size is the same. Even then, having two asynchronous queues shows ~10-12% better performance in the best cases on the Nvidia GPU, and up to 16% on AMD (fig. 3), over the tile-by-tile approach. It can be seen from figs. 4 and 5 that AMD hardware implements asynchronous compute significantly better than Nvidia.

The implementation of the asynchronous tile-based approach does not require many modifications to the code: we need to create an additional queue from a different queue family, record command buffers for each tile using alternating queues (each tile launches the same kernels as described in section 4.2), and submit the commands in a multithreaded fashion (so we do not block on fence synchronization on the CPU); a code sketch is given below, after fig. 3. Note that the number of executed command buffers depends linearly on the number of tiles.

Figure 3: Measurements for (left) an RTX 2080 with hardware-accelerated ray tracing (VK_KHR_ray_query, GPU v2) and (right) an AMD Vega 10 without it (GPU v1); 8 samples per pixel, ray tracing depth = 6, total image resolution 1024x1024. [Two plots of time (ms) versus tile size, each showing the single-queue average, the two-queue average, and the no-tiles baseline.] DRAM throughput decreases linearly from 90% (1024x1024) to 10% (128x128) and even lower for smaller tile sizes; this is not shown in the figure. Asynchronous compute (orange lines) shows a significant performance increase over the simple multiple kernel approach.

Although rendering without splitting the image into tiles is still slightly faster for this simple scene, the difference is not essential (1-2%), and we obtained significantly better hardware unit metrics for the proposed tiled rendering (table 3). So, with more complex materials, tile splitting may actually become the better option. At the same time, our approach reduces the memory required for intermediate data by 8-16 times, depending on the tile size. This can be especially important for MCMC methods (Metropolis Light Transport), where large vectors are stored for each thread.
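A minimal sketch of the alternating-queue submission described above follows, assuming q0 and q1 were obtained from two different queue families at device creation; fence handling and the worker threads are omitted.

#include <vulkan/vulkan.h>
#include <vector>

// Submit per-tile command buffers to two queues in alternation. In the real
// setup each submission runs on its own thread, so the CPU does not block on
// fence synchronization between tiles.
void SubmitTilesAlternating(VkQueue q0, VkQueue q1,
                            const std::vector<VkCommandBuffer>& tileCmds,
                            const std::vector<VkFence>& tileFences)
{
  for (size_t i = 0; i < tileCmds.size(); ++i) {
    VkSubmitInfo info = {};
    info.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    info.commandBufferCount = 1;
    info.pCommandBuffers    = &tileCmds[i];
    // even tiles go to the main queue, odd tiles to the async compute queue
    vkQueueSubmit((i % 2 == 0) ? q0 : q1, 1, &info, tileFences[i]);
  }
}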
Table 3
Nvidia RTX 2080, Nsight Graphics metrics.

Metric                                     No tiles   One 512x256 tile   Two 256x256 tiles
VRAM throughput                            48.9%      35.3%              23.8%
L2 hit rate                                23.1%      28.9%              48.7%
L2 hit rate from L1                        21.7%      28.4%              47.9%
CS warp can't launch (register limited)    31.6%      15.4%              6.3%
Average time                               35 ms      43 ms              39 ms

Figure 4: Nsight Graphics GPU trace for the RTX 2080: path tracing with asynchronous compute queues from different queue families. The asynchronous compute queue on Nvidia (top part of the image) runs like a "background" task. It can be seen that 8 submits to the main queue (bottom part of the image) take the same time as a single submit to the async compute queue.

Figure 5: Radeon GPU Profiler frame capture for Vega 10: path tracing with asynchronous compute queues from different queue families. Unlike on Nvidia, the workload floats freely between the two queues.

5. Conclusions

We proposed a solution capable of alleviating the difficulties of porting a computationally intensive algorithm to the GPU by source-to-source translation of ordinary C++ code into C++ with the necessary Vulkan API calls and shader code (OpenCL C or GLSL). During code generation, different optimizations can be applied to create several implementations depending on the problem specifics and/or the available hardware. This way we are able to increase GPU developers' productivity by generating the Vulkan code (which can be quite verbose) and applying complex optimizations automatically to achieve maximum performance. We have shown how the generated code can be connected to hand-written code by adding hardware-accelerated ray tracing. Finally, using the proposed solution and asynchronous compute in Vulkan, we carried out a performance study of path tracing with the multiple kernel approach and proposed an improvement which reduces the memory required for intermediate data by an order of magnitude and uses GPU memory units more efficiently (table 3).

6. Acknowledgments

This work is supported by the Russian Science Foundation (RSF) under grant #21-71-00037.

7. References

[1] J. Fang, et al., Parallel programming models for heterogeneous many-cores: a comprehensive survey, CCF Transactions on High Performance Computing 2(4) (2020) 382-400.
[2] Thrust: a powerful library of parallel algorithms and data structures, 2021. URL: https://developer.nvidia.com/thrust
[3] A. Kolesnichenko, C. M. Poskitt, S. Nanz, SafeGPU: Contract- and library-based GPGPU for object-oriented languages, Computer Languages, Systems & Structures 48 (2017) 68-88.
[4] D. Beckingsale, et al., Performance portable C++ programming with RAJA, in: Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2019.
[5] T. Heller, et al., HPX: an open source C++ standard library for parallelism and concurrency, in: Proceedings of OpenSuCo, 2017, 5.
[6] A. Paszke, et al., PyTorch: An imperative style, high-performance deep learning library, arXiv preprint (2019) arXiv:1912.01703.
[7] M. Abadi, et al., TensorFlow: Large-scale machine learning on heterogeneous distributed systems, arXiv preprint (2016) arXiv:1603.04467.
[8] T. Chen, et al., MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems, arXiv preprint (2015) arXiv:1512.01274.
[9] OpenACC, 2021. URL: https://www.openacc.org/
[10] N. Jacobsen, LLVM supported source-to-source translation: translation from annotated C/C++ to CUDA C/C++, MS thesis, University of Oslo, Norway, 2016.
[11] G. D. Balogh, et al., OP2-Clang: a source-to-source translator using Clang/LLVM LibTooling, in: Proceedings of the IEEE/ACM 5th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC), 2018.
[12] P. Yang, et al., Improving utility of GPU in accelerating industrial applications with user-centered automatic code translation, IEEE Transactions on Industrial Informatics 14(4) (2017) 1347-1360.
[13] J. A. Pienaar, S. Chakradhar, A. Raghunathan, Automatic generation of software pipelines for heterogeneous parallel systems, in: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC'12), IEEE, 2012.
[14] N. Konovalov, V. Krukov, 2002. URL: https://www.keldysh.ru/dvm/dvmhtm1107/publishr/dvm-appr-OpSys.htm (in Russian).
[15] V. Bakhtin, et al., 2008. URL: http://agora.guru.ru/abrau2008/pdf/085.pdf (in Russian).
[16] V. A. Bakhtin, V. A. Krukov, DVM-approach to the automation of the development of parallel programs for clusters, Programming and Computer Software 3 (2019) 43-56.
[17] M. A. Mikalsen, OpenACC-based snow simulation, MS thesis, Institutt for datateknikk og informasjonsvitenskap, 2013.
[18] T. Öhberg, Auto-tuning hybrid CPU-GPU execution of algorithmic skeletons in SkePU, MS thesis, Linkoping University, 2018.
[19] A. Ernstsson, L. Lu, C. Kessler, SkePU 2: Flexible and type-safe skeleton programming for heterogeneous parallel systems, International Journal of Parallel Programming 46(1) (2018) 62-80.
[20] M. Steuwer, P. Kegel, S. Gorlatch, SkelCL: a portable skeleton library for high-level GPU programming, in: Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, IEEE, 2011.
[21] SYCL, cross-platform abstraction layer, 2021. URL: https://www.khronos.org/sycl/
[22] SYCL for CUDA developers, examples, 2020. URL: https://developer.codeplay.com/products/computecpp/ce/guides/sycl-for-cuda-developers/examples
[23] T. D. Han, T. S. Abdelrahman, hiCUDA: High-level GPGPU programming, IEEE Transactions on Parallel and Distributed Systems 22(1) (2010) 78-90.
[24] J. Wu, et al., gpucc: an open-source GPGPU compiler, in: Proceedings of the 2016 International Symposium on Code Generation and Optimization, 2016.
[25] P. Sathre, M. Gardner, W. Feng, On the portability of CPU-accelerated applications via automated source-to-source translation, in: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2019.
[26] HIP: C++ runtime API and kernel language, 2021. URL: https://github.com/ROCm-Developer-Tools/HIP
[27] Vulkan specification, indirect dispatch command, 2021. URL: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/vkCmdDispatchIndirect.html
[28] A. Sherin, Resident Evil 2 frame breakdown, 2019. URL: https://aschrein.github.io/2019/08/01/re2_breakdown.html
[29] N. Mammeri, B. Juurlink, VComputeBench: a Vulkan benchmark suite for GPGPU on mobile and embedded GPUs, in: Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), IEEE, 2018.
[30] Nvidia OptiX, 2021. URL: https://developer.nvidia.com/optix
[31] DirectML, 2021. URL: https://github.com/microsoft/DirectML
[32] Wisp renderer, 2021. URL: https://github.com/TeamWisp/WispRenderer
[33] Projects using RTX, 2021. URL: https://github.com.cnpmjs.org/vinjn/awesome-rtx
[34] J. Hegarty, et al., Darkroom: compiling high-level image processing code into hardware pipelines, ACM Transactions on Graphics 33(4) (2014) 144.
[35] J. Ragan-Kelley, et al., Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, ACM SIGPLAN Notices 48(6) (2013) 519-530.
[36] R. T. Mullapudi, et al., Automatically scheduling Halide image processing pipelines, ACM Transactions on Graphics (TOG) 35(4) (2016) 1-11.
[37] A. Adams, et al., Learning to optimize Halide with tree search and random programs, ACM Transactions on Graphics (TOG) 38(4) (2019) 1-12.
[38] A. Rasch, R. Schulze, S. Gorlatch, Developing high-performance, portable OpenCL code via multi-dimensional homomorphisms, in: Proceedings of the International Workshop on OpenCL, 2019.
[39] T.-W. Huang, et al., Taskflow: a general-purpose parallel and heterogeneous task programming system, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2021).
[40] M. Haidl, S. Gorlatch, PACXX: Towards a unified programming model for programming accelerators using C++14, in: Proceedings of the Workshop on the LLVM Compiler Infrastructure in HPC, IEEE, 2014.
[41] M. Haidl, et al., PACXX v2 + RV: an LLVM-based portable high-performance programming model, in: Proceedings of the Fourth Workshop on the LLVM Compiler Infrastructure in HPC, 2017.
[42] A. Sidelnik, et al., Performance portability with the Chapel language, in: Proceedings of the IEEE 26th International Parallel and Distributed Processing Symposium, 2012.
[43] Y. Hu, et al., Taichi: a language for high-performance computation on spatially sparse data structures, ACM Transactions on Graphics (TOG) 38(6) (2019) 1-16.
[44] R. Baghdadi, et al., Tiramisu: a polyhedral compiler for expressing fast and portable code, in: Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2019.
[45] Google clspv, a prototype compiler for a subset of OpenCL C to Vulkan compute shaders, 2021. URL: https://github.com/google/clspv
[46] S. Baxter, Circle C++ shaders, 2021. URL: https://github.com/seanbaxter/shaders
[47] S. S. Huang, D. Zook, Y. Smaragdakis, Morphing: safely shaping a class in the image of others, in: Proceedings of the European Conference on Object-Oriented Programming, Springer, Berlin, Heidelberg, 2007.
[48] Clang documentation, LibTooling, 2021. URL: https://clang.llvm.org/docs/LibTooling.html
[49] Inja, a template engine for modern C++, 2021. URL: https://github.com/pantor/inja
[50] S. G. Parker, et al., OptiX: a general purpose ray tracing engine, ACM Transactions on Graphics 29(4) (2010) 1-13.
[51] V. A. Frolov, V. A. Galaktionov, Memory-compact Metropolis light transport on GPUs, Programming and Computer Software 43 (2017) 196-203. doi:10.1134/S0361768817030057.
[52] S. Laine, T. Karras, T. Aila, Megakernels considered harmful: wavefront path tracing on GPUs, in: Proceedings of the 5th High-Performance Graphics Conference, 2013.
[53] Y. Hu, J. Liu, X. Yang, M. Xu, Y. Kuang, W. Xu, Q. Dai, W. T. Freeman, F. Durand, QuanTaichi: a compiler for quantized simulations, ACM Transactions on Graphics 40(4) (2021).