=Paper=
{{Paper
|id=Vol-2744/short56
|storemode=property
|title=Constructing K-d Tree on GPU through Treelet Restructuring  (short paper)
|pdfUrl=https://ceur-ws.org/Vol-2744/short56.pdf
|volume=Vol-2744
|authors=Vadim Bulavintsev,Dmitry Zhdanov
}}
==Constructing K-d Tree on GPU through Treelet Restructuring  (short paper)==
<pdf width="1500px">https://ceur-ws.org/Vol-2744/short56.pdf</pdf>
<pre>
       Constructing K-d Tree on GPU through Treelet
                      Restructuring *

        Vadim Bulavintsev1[0000-0003-3099-1442], Dmitry Zhdanov2[0000-0001-7346-8155]
           1
             ITMO University, 49 Kronverksky Pr., St. Petersburg, 197101, Russia
                          v.g.bulavintsev@gmail.com
           2 ITMO University, 49 Kronverksky Pr., St. Petersburg, 197101, Russia
                                 ddzhdanov@mail.ru


       Abstract. With every new generation of graphics processing units (GPUs), of-
       floading ray-tracing algorithms to GPUs becomes more feasible. Software-hard-
       ware solutions for ray-tracing focus on implementing its basic components, such
       as building and traversing bounding volume hierarchies (BVH). However, global
       illumination algorithms, such as photon mapping method, depend on another
       kind of acceleration structure, namely k-d trees. In this work, we adapt state-of-
       the-art GPU-based BVH-building algorithm of treelet restructuring to k-d trees.
       By evaluating the performance of the resulting k-d tree, we show that treelet op-
       timisation heuristic suitable for BVHs of triangles is inadequate for k-d trees of
       points.

       Keywords: Graphics Processing Unit, Kd Tree, Bounding Volume Hierarchy,
       Photon Mapping, Ray-tracing.


1      Introduction

Over the last three decades, physically correct rendering became a ubiquitous method
of data visualization in scientific, educational, industrial design and digital entertain-
ment fields [1,2]. Meanwhile, advances in GPU design brought consumer computa-
tional devices enough processing power to make physically correct rendering feasible
in interactive applications.
   In this paper we discuss recent advancements in acceleration structures – building
algorithms used to facilitate physically-correct rendering with GPUs. We apply a
method of building an acceleration structure for ray-tracing for GPUs originally used
with Bounding Volume Hierarchies (BVHs) to k-d trees.
   The paper is organised as follows. In the rest of the "Introduction" section we de-
scribe the necessary preliminaries, such as definitions of acceleration structures and


Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).

* Publication is supported by RFBR grant № 18-08-01484.
2 V. Bulavintsev, D. Zhdanov


heuristics, as well as specifics of GPU programming. In the “Related Works” section
we review the development of GPU-based acceleration structures construction and tra-
versal methods over the years. In the "Methods" section, we describe the choice of the
original BVH-building method and the changes that were necessary to adapt it to k-d
trees. Also, we describe the choice of test scenes and experimental setup. In the "Re-
sults" section, we compare the performance of our method to state-of-the-art top-down
methods for building k-d trees on CPU and to the corresponding BVH-building method.
In the "Discussion" section, we analyse the problems we encountered while adapting
the method and propose the directions for further research.


1.1    Photon mapping
Photon mapping is a robust method for solving the rendering equation [3]. The method
is able to reproduce complex optical effects, such as caustics [4]. It consists of three
stages:

• cast light rays from into the scene and note the luminance at the rays' points of inter-
  section ("photons") with the scene surfaces;
• do the same for visibility rays emitted by the camera;
• for each visibility point, gather the photons in its vicinity to calculate the approxi-
  mate luminance.


1.2    Acceleration structures
To perform the photons gathering in the third stage of the photon mapping algorithm,
it is essential to organise the photons into an acceleration structure, such as the k-d tree
or a Bounding Volume Hierarchy.
    K-d tree. K-d tree is a binary tree that partitions the scene volume into smaller vol-
umes by assigning an axis-aligned split plane for one of the scene's k dimensions at
each internal node [5]. The process of building a tree of volume elements is called
voxelization.
    Bounding volumes hierarchy. Another important acceleration structure used in
rendering is the bounding volumes hierarchy (BVH) [6]. BVH partitions the scene
space into a tree of axis-aligned bounding boxes. Building BVHs using massively par-
allel architectures (e.g. GPUs) got much attention from researchers and industry [7, 8,
9] during the last decade. The goal of the present work is to apply one of the more
successful methods for building BVHs with GPUs to k-d trees.
    Top down vs bottom-up building. When building a tree-like acceleration structure,
there are two primary strategies. Splitting the whole space according to some heuristic,
then proceeding to apply the same heuristic to the resulting subspaces is called top-
down approach. Combining the smallest elements of the set (e.g. photons or triangles)
into gradually bigger units, effectively building the tree from its leaves is called bottom-
up approach.
                               Constructing K-d Tree on GPU through Treelet Restructuring 3


    Top-down methods are known to produce the most effective acceleration structures
[10]. However, top-down methods are not suitable for massively-parallel architectures,
such as GPU. This led to development of alternative, bottom-up construction methods.
    Over the years, several bottom-up methods were suggested by researchers. The basic
method was coined in by Lauterbach et al. in [8]. The method’s idea is to sort the prim-
itives based on their coordinates on a Morton curve [11], then interpret them as a tree.
The resulting structure is called Linear BVH (LBVH). This method was further refined
in [12].
    The next generation of bottom-up methods uses LBVH as a starting point, then
building up to optimize it by e.g. changing the internal nodes [9] or iteratively merging
them bottom-up [13].
    It is important to note that top-down methods, while producing more efficient accel-
eration structures, tend to be much more computationally intensive than bottom-up
methods, therefore requiring choosing a trade-off between the structure’s quality and
building time. This makes finding a fitting trade-off a critical part of building a real-
time rendering system.


1.3    Surface area heuristic
Beside the actual tree construction method, the most important thing in an acceleration-
structure-building algorithm is the heuristic that selects the splitting point. It is im-
portant to note that the actual effectiveness of the acceleration structure is highly scene-
dependent [10]. Nonetheless the most popular heuristics are the median heuristic for k-
d trees and the Surface Area Heuristic (SAH) [14] for BVHs.
    Surface Area Heuristic. The idea of the Surface Area Heuristic is to optimize for a
general case of rays traversing the scene at random angles. At each splitting the heuris-
tic minimizes the summary surface area of the resulting bounding volumes in any given
subtree.


1.4    GPU architecture
Modern GPUs are SIMT [15] devices, which stands for "Single Instruction Multiple
Threads" architecture, a variant of SIMD architecture [16] with hardware control of
branch divergence. SIMD architecture features an asymmetric ratio of control (CUs) to
processing units (PUs), making it efficient at processing extensive collections of ho-
mogenous data [17]. A GPU consists of independent multiprocessors, each of which
has a single CU controlling multiple PUs. To provide data to all PUs of a single multi-
processor at once, GPU memory controller is optimized to fetch long words. Typically,
GPU's on-chip cache size is small in regards to the number of PUs. To hide memory
access latencies a single multiprocessor pipelines execution of several thread groups,
switching between them when the data from GPU memory becomes available. Threads
running on a single multiprocessor share the register file. Thus, register pressure limits
the number of threads running on a single multiprocessor.
4 V. Bulavintsev, D. Zhdanov


   These quirks of GPU architecture result in the following rules of efficient GPU pro-
gramming [17]:

• avoid branch divergence;
• avoid non-coalesced memory access (gather) in a thread group;
• minimize register usage and device memory access.

Modern GPUs have recently received support of hardware acceleration for some ray-
tracing operations, such as BVH tree traversal. Unfortunately, the underlying hardware
instructions remain undocumented [18] and are only usable through a closed-source
API [19].


2      Related works

Being an "embarrassingly parallel" problem, ray-tracing is well suited to GPU archi-
tecture. However, following the rules mentioned above for such stages of a ray-tracing
algorithm as building and traversal of acceleration structures is non-trivial.
   Firstly, top-down methods for constructing BVHs and kd-trees are bottlenecked by
the low parallelism during the beginning of the build process. Researchers proposed
several ways to alleviate this problem, such as using trees and BVHs of higher arity or
using hybrid CPU/GPU algorithm [20]. Alternatively, it is possible to use bottom-up
building algorithms [8, 12] as mentioned in the previous section.
   Secondly, tree traversal can result in high branch divergence when, for example, a
single thread in a thread group proceeds into a deep branch of the tree when the other
threads in the same group have already terminated. The problem can be tackled both
from the side of traversal algorithm by modifying it to use "persistent threads" tech-
nique [21] and from the side of tree construction algorithm by avoiding creating deep
leaves.
   Thirdly, while persistent threads method solves the problem of branch divergence, it
can result in scattered memory access. One way to solve this problem is to use prefix
scans, and regularly sort the threads based on locality [22].
   Fourthly, GPU cache/on-chip memory proved to be too small to support stack-based
tree traversal methods, which motivated the development of stackless methods as an
alternative [23]. Though, in recent hardware generations the cache constraints were
somewhat relieved.
   A detailed analysis of the problem can be found in [24, 25].


3      Methods

One of the reasons ray-tracing became feasible on GPUs is the rapid development of
algorithms for building high-quality BVHs on massively parallel processors, such as
GPUs [7, 9]. We decided to base our algorithm on TRBVH algorithm [9] that is natively
implemented in NVIDIA Optix API [19], has some open-source implementations [13]
and is readily suitable to be converted to work with k-d trees [12]. We reimplemented
                              Constructing K-d Tree on GPU through Treelet Restructuring 5


the algorithm in Python programming language to facilitate rapid prototyping. As the
Python prototype does not run on a GPU, we studied relative and hardware-independent
performance metrics, such as the average number of traversed nodes and SAH cost.


3.1    TRBVH algorithm
The idea of TRBVH algorithm is to first build a lower-quality BVH by utilizing Morton
codes (the so-called Linear BVH) [12] and then enhance its quality selecting small (e.g.
7-9 nodes) "treelets" from it and optimize them by SAH heuristic [9].
   The LBVH is built in the following way. Coordinates of the primitives (e.g. a pho-
tons or triangle centroids) are converted to 30-bit Morton codes. The resulting array is
sorted using Radix sort [17]. Then, the sorted points can be readily interpreted as a k-d
tree by comparing the radix prefix of their Morton codes as described in [12]. The al-
gorithm is very efficient on GPU as it uses the fact that operations for individual ele-
ments are independent of each other.
   At the optimization stage, the algorithm starts to traverse the LBVH bottom-up, se-
lecting 7-nodes breadth-first subtrees (“treelets”) to optimize. For each treelet, the al-
gorithm compares all possible variants of organizing four of its leaves into a new sub-
tree, comparing the results by SAH. To avoid repeated SAH recalculations, the algo-
rithm employs methods of dynamic programming [9]. Treelet optimization processes
run in parallel, using atomic write instructions to avoid data races.


4      Results

4.1    Algorithm modifications
Every k-d tree can be interpreted as a BVH, but not every BVH can be interpreted as a
k-d tree. BVH nodes' volumes can overlap, which is impossible for k-d tree nodes.
Thus, converting the linear part of the TRBVH algorithm to produce k-d trees is
straightforward because of Morton codes natively splitting the scene space into non-
overlapping volumes. However, converting the treelet optimisation part requires mod-
ifying the original SAH heuristic to avoid overlapping volumes. To achieve this, we
assign ∞ heuristic cost to permutations that would result in overlapping child volumes.
Also, TRBVH works with triangles, while the photon map consists of points (spheres),
so we had to change the algorithm accordingly.


4.2    Algorithm performance
We tested the performance of the algorithm with Sibenik, Stanford Dragon and Con-
ference scenes [26] and Lumicept light modeling system [27]. We rendered each scene
using Lumicept’s reverse path tracing mode. In this mode, instead of emitting light rays
(called photons) from light sources, Lumicept emits image photons from the camera.
For each scene, we randomly selected 105 of the generated image photons, then built
two k-d trees by using our Python-based voxelizer prototype. The first tree was built
6 V. Bulavintsev, D. Zhdanov


using linear Morton codes (LBVH) [9]. The second tree was built by refining the first
tree using our treelet-based method. We tested the tree performance by counting the
total number of tree node traversals necessary to find 1000 image photons randomly
selected from the original set.
   Table 1 shows improvements in SAH cost and ray-tracing performance for k-d trees
built with our treelet-based method relative to k-d trees built with linear Morton codes
(LBVH). Usage of relative values allows us to compare results obtained by using our
Python-based emulation of a k-d tree GPU algorithm to the results reported by Karras
for BVHs in [9]. The data was averaged over two camera positions facing opposing
sides of a scene.

         Table 1. k-d tree and TRBVH performance compared to linear BVH (100%)

Performance                 Dragon                    Sibenik               Conference
in comparison
to linear Mor-
ton codes //
LBVH


                    SAH          Search       SAH           Search       SAH       Search
                  surface       speed       surface        speed       surface    speed
TRBVH                 86%            113%       78%             129%      59%        153%
(BVH) [9]
Our method            88%            107%       89%             102%      90%            98%
(k-d tree)


5      Discussion

Table 1 indicates that treelet restructuring does not increase the quality of the resulting
k-d tree tangibly. We can attribute this effect to the following. First, forbidding volume
overlaps cuts off more than 80% of the search space. Second, SAH heuristic is known
to be better suited to BVHs of triangles, not k-d trees of points. This can be clearly seen
with the Conference scene: after applying treelet restructuring the SAH cost goes down
for both k-d tree (our algorithm) and TRBVH. However, the traversal performance goes
up only for the BVH case, while for the k-d tree it actually decreases. Thus, the direct
translation of BVH-building methods to the domain of k-d trees is not feasible. This
observation gives us clear direction for future research: voxelizers based on the idea of
tree optimization should employ an alternative, more suitable heuristic.
                                Constructing K-d Tree on GPU through Treelet Restructuring 7


References
 1. Barladian, B., Voloboy, A., Galaktionov, V., Shapiro, L.: Integration of realistic computer
    graphics into computer-aided design and product lifecycle management systems. Program-
    ming and Computer Software, 44(4), 225-232 (2018)
 2. Zhdanov, D., et al.: Photorealistic rendering of images formed by augmented reality optical
    systems, Programming and Computer Software, 44(4), pp. 213-224 (2018)
 3. Jensen, H.: Realistic image synthesis using photon mapping. AK Peters/CRC Press (2001)
 4. Papadopoulos, C., Papaioannou, G.: Realistic real-time underwater caustics and godrays. In
    Proc. GraphiCon 2009, vol. 9, pp. 89-95 (2009).
 5. Shevtsov, M., Soupikov, A., Kapustin, A.: Highly parallel fast KD‐tree construction for in-
    teractive ray tracing of dynamic scenes. In Computer Graphics Forum, vol. 26(3), pp. 395-
    404, Oxford, UK: Blackwell Publishing Ltd (2007).
 6. Gunther, J., Popov, S., Seidel, H., Slusallek, P.: Realtime ray tracing on GPU with BVH-
    based packet traversal. In 2007 IEEE Symposium on Interactive Ray Tracing, pp. 113-118,
    IEEE (2007)
 7. Garanzha, K., Premože, S., Bely, A., Galaktionov, V.: Grid-based SAH BVH construction
    on a GPU. The Visual Computer, 27(6-8), pp. 697-706 (2011)
 8. Lauterbach, C., et al.: Fast BVH construction on GPUs., Computer Graphics Forum., vol.
    28(2)., Oxford, UK: Blackwell Publishing Ltd (2009)
 9. Karras, T., Aila, T.: Fast parallel construction of high-quality bounding volume hierarchies.
    In Proc. of the 5th High-Performance Graphics Conference, pp. 89-99 (2013)
10. Aila, T., Karras, T., Laine, S.: On quality metrics of bounding volume hierarchies. In Proc.
    of the 5th High-Performance Graphics Conference, pp. 101-107 (2013)
11. Morton, G.: A computer oriented geodetic data base and a new technique in file sequencing,
    Technical Report, Ottawa, Canada: IBM Ltd (1966)
12. Karras, T.: Maximising parallelism in the construction of BVHs, octrees, and k-d trees.,
    In Proc. of the Fourth ACM SIGGRAPH/Eurographics conference on High-Performance
    Graphics (2012)
13. Domingues, L., Pedrini H.: Bounding volume hierarchy optimisation through agglomerative
    treelet restructuring, in Proc. of the 7th Conference on High-Performance Graphics (2015)
14. MacDonald, J., Booth, K.: Heuristics for ray tracing using space subdivision, The Visual
    Computer, 6(3), pp.153-166 (1990)
15. Lindholm, E., Nickolls, J., Oberman, S., & Montrym, J.: NVIDIA Tesla: A unified graphics
    and computing architecture., IEEE micro, 28(2), pp. 39-55 (2008)
16. Flynn, M.: Some computer organizations and their effectiveness., IEEE transactions on com-
    puters, 100(9), pp. 948-960 (1972)
17. Bulavintsev, V.: An evaluation of CPU vs. GPU performance of some combinatorial algo-
    rithms for cryptoanalysis, Vestn. YuUrGU. Ser. Vych. Matem. Inform., 4:3, pp.67–84
    (2015)
18. Sanzharov, V., Gorbonosov, A., Frolov, V., Voloboy, A.: Examination of the Nvidia RTX.
    In Proceedings of the 29th International Conference on Computer Graphics and Vision
    (GraphiCon 2019), vol. 2485, p. 7 (2019)
19. Parker, S., et al.: OptiX: a general purpose ray tracing engine. ACM transactions on graphics
    (TOG) vol. 29.4, p1-13 (2010)
20. Roccia, J. P., Coustet, C., Paulin, M.: Hybrid CPU/GPU KD-Tree construction for versatile
    ray tracing, Eurographics (Short Papers), Euro-graphics Association, pp. 13–16. (2012)
21. Gupta, K., Stuart, J. A., Owens, J. D.: A study of persistent threads style GPU programming
    for GPGPU workloads, IEEE, pp. 1-14., (2012)
8 V. Bulavintsev, D. Zhdanov


22. Garanzha, K., Loop, C.: Fast ray sorting and breadth‐first packet traversal for GPU ray trac-
    ing., Computer Graphics Forum, Vol. 29, No. 2, pp. 289-298. Oxford, UK: Blackwell Pub-
    lishing Ltd. (2010)
23. Popov, S., Günther, J., Seidel, H. P., Slusallek, P.: Stackless KD‐tree traversal for high per-
    formance GPU ray tracing., Computer Graphics Forum, Vol. 26, No. 3, pp. 415-424. Oxford,
    UK: Blackwell Publishing Ltd. (2007)
24. Aila, T., Laine, S.: Understanding the efficiency of ray traversal on GPUs, In Proceedings
    of the conference on high performance graphics 2009, pp. 145-149 (2009)
25. Satish, N., Harris, M., Garland, M.: Designing efficient sorting algorithms for manycore
    GPUs., In 2009 IEEE International Symposium on Parallel & Distributed Processing, pp. 1-
    10, IEEE (2009)
26. McGuire, M.: ”Computer Graphics Archive”, URL https://casual-effects.com/data (2017).
27. Integra Inc., Lumicept – Hybrid Light Simulation Software, URL http://www.inte-
    gra.jp/en/products/lumicept (2020)

</pre>