Evaluation of Data Transfer Methods for Block-based Realtime Audio Processing with CUDA

Christoph Kuhr∗, Alexander Carôt†
Department of Computer Sciences and Languages, Anhalt University of Applied Sciences Köthen
Email: ∗christoph.kuhr@hs-anhalt.de, †alexander.carot@hs-anhalt.de

Abstract: Realtime audio production environments generally do not use GPUs unless they are involved in 3D rendering or video production processes. Thus, the GPU is idle most of the time and can be utilized as an audio co-processor. The block-based streaming nature and floating point representation of computer audio hardware are very well suited for GPGPU programming techniques. In this paper we identify the data transfers as the most expensive part of processing realtime audio data and evaluate different data transfer methods, with a positive outcome with respect to future audio DSP applications.

I. INTRODUCTION

Modern computer systems are equipped with a CPU and a GPU. The CPU controls the peripheral hardware and performs calculations unrelated to 3D graphics or video decoding. A GPU, in contrast, is concerned with rendering 3D graphics or with using special hardware codecs to decode current video codecs such as H.264 [1]. If a computer system is used for any kind of audio production that excludes 3D rendering and video decoding, the GPU is mostly idle. Additionally, GPUs are designed to handle multiple floating point operations at the same time in a threaded fashion. These considerations promote the idea of using a GPU as an audio co-processor for signal processing purposes.

Computation-intensive audio signal processing of realtime data has already been done, e.g. Wefers and Berg have used a GPU to process FIR and IIR filters [2], and Jedrzejewski and Marasek have used the GPU to do impulse response computations for virtual room acoustics [3].

In this paper we investigate the lower limit for the usage of a GPU for such signal processing tasks in a realtime audio production environment. The limit is given as the combination of channel count and sample buffer size in use. The bottlenecks in the communication between CPU and GPU are evaluated and discussed. Further, possible workarounds to improve the performance aspects under investigation are proposed and evaluated.

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model designed for high-performance computing [4]. The idea is to make use of thousands of threads running in parallel, which is not possible with x86/x86_64 CPUs. Such parallel programs are called kernels in the CUDA domain.

Figure 1: CUDA Computing Grids [5]

When a kernel is executed on the GPU, it is launched as a grid of several blocks; the maximum number of blocks depends on the GPU's features. Inside each block on the grid, multiple threads execute the actual computations at runtime. The same computation runs on each thread, but with different data. Threads can be handled in a synchronous or an asynchronous way. The latter requires the concept of streams for a distinct mapping of the data shared between the threads of one block. The structure of CUDA computing grids is shown in fig. 1.

The concept of CUDA streams [6] is very convenient for the problem at hand. Different audio streams can be treated asynchronously, which is a better representation of their orthogonal nature than a matrix with an appropriate number of rows and columns. A matrix could also represent the orthogonality appropriately, but access to the matrix would be centralized and could suffer from race conditions. Beyond that, using a grid dimension (x, y or z) to represent the different audio channels reduces the dimensionality that remains usable for calculations at runtime.

This paper is part of the research project fast-music [7]. The project has the goal to enable symphonic orchestras to rehearse via the public internet by using the realtime communication software Soundjack [8] [9]. Research in the field of packet loss concealment will use GPUs for complex signal processing based on machine learning algorithms.

II. ARCHITECTURE

The work of Wefers and Berg [2] has shown that realtime processing of audio data with a GPU is possible. The communication between CPU and GPU is realized via driver calls and shared memory, either DMA, GPU or CPU RAM. The CPU is also referred to as host and the GPU as device. System architectures in which CPU and GPU share the same cache are used increasingly, albeit mainly in embedded systems. Such an architecture completely eliminates memory copies, since the memory is coherently accessible by the CPU and the GPU. In conventional systems, which communicate via the PCIe bus, data has to be copied from CPU RAM to GPU RAM and back. Since the API calls that copy data between CPU and GPU carry considerable overhead, it is more efficient to copy large amounts of data at once. This makes the use case of small amounts of data, as generated and processed in the audio domain, all the more interesting to investigate.

Figure 2: Legend Data Transfer Method Measurements

Realtime audio data is represented as a two dimensional vector field. At every sample point in time, an analog-to-digital converter generates a sample, with typical bit depths of 16, 24 or 32 bits, encoded either as integer or as floating point [10]. Computer audio hardware manages data by using buffers that consist of a predefined number of samples. The audio driver repeatedly accesses the memory of the audio hardware and copies the sample buffers to the CPU RAM for further usage. The responsiveness of such an audio system depends on the size of the sample buffers: the response time increases with the sample buffer size. Typical sample buffer sizes are 64, 128, 256, 512 and 1024 samples [11].

\[
\mathrm{AudioDataBlock} = \mathrm{SampleDepth} \cdot \mathrm{SampleBufferSize} \cdot \mathrm{ChannelCount}
\]

\[
\mathrm{AudioDataBlock} = 32\,\mathrm{bit} \cdot \{64, 128, 256, 512, 1024\}\ \mathrm{Samples} \cdot \{2, 8, 16, 32, 64\}\ \mathrm{Channels}
\]

Due to this block-based streaming nature, transferring and processing audio data in blocks between CPU and GPU might reduce the impact of the data copying overhead, particularly if multiple audio channels are used.

The audio data that we transfer and process with the GPU is provided by a professional audio driver and server combination called Jack Audio Connection Kit [12]. On top of a Linux ALSA [13] driver, Jack provides the means to interconnect Jack-aware audio software with the audio interface at 32 bit floating point precision. The floating point format requires the development of a prototype, because the Soundjack clients use an integer format instead of floating point and would require additional conversion.

Figure 3: Device to Device Copy Duration Synchronous Data Transfer Method

We developed a minimal Jack client for testing purposes with varying channel counts and sample buffer sizes. The Jack client is linked against a shared library that provides the CUDA kernel [4]. This way, CUDA computations can be integrated into arbitrary C programs. The Jack server configures the audio interface by utilizing the ALSA driver infrastructure. The most important configuration parameters for our investigations are the channel count and the sample buffer size, called frame or period in the Jack domain. At runtime, the Jack server requests our Jack client to process a frame with a callback function. If the callback function does not finish its computations in time, the Jack server reports a buffer underrun, also called xrun in the Jack domain.

Figure 4: Kernel Execution Duration Managed Memory Data Transfer Method

In the system under test, a Nvidia Geforce GT940mx GPU with 2 GB of DDR3 RAM is connected to an Intel i7-6870 4-Core CPU with 16 GB DDR3 RAM via a PCIe x16 2.0 bus [14]. Thus, the transfer rate between CPU and GPU is limited to the bus bandwidth of 8 GB/s simplex. The Nvidia Geforce GT940mx has a compute capability of 5.0 (≥ 3.0), which allows it to use managed memory.

III. CUDA MEMORY ORGANIZATION AND MANAGEMENT

The data structure and the data transfer between CPU and GPU are the bottlenecks for the entire signal processing. Three different data transfer methods can be used:

1) Synchronous data transfer
   A synchronous data transfer returns as soon as the memory operation on the GPU memory is done, with a success or failure result. For the GPU integration of synchronous data transfers, it is irrelevant whether the host memory is pageable or pinned. Either type can be accessed. Pageable memory is memory from the virtual address space of the CPU or the operating system.

2) Asynchronous data transfer
   An asynchronous data transfer returns immediately after invoking the data transfer, regardless of the result. The result of the operation has to be checked separately. It requires the additional concept of streams for the integration on the GPU. Further, the host memory has to be pinned. Pinned (page-locked) memory is allocated in the DMA address space of the host system.

3) Managed memory with coherent caches on CPU and GPU
   With managed memory, the requirement of memory copy operations is eliminated. The GPU driver allocates memory on the CPU and the GPU respectively, implicitly manages any data access to these memory segments and thus keeps the data in both memory locations coherent by small caching operations.
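Whichever of the three methods is used, the payload moved per period follows from the audio data block formula in Section II, and the time budget follows from the sample buffer size and the sample rate. The following is a minimal sketch in C of that arithmetic, not code from the Jack client described here. The function names are hypothetical, the bus rate of 8 GB/s is the simplex figure quoted for the system under test, and the bus time is an idealized lower bound that deliberately ignores all CUDA API and driver overhead.

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of one audio data block:
 * SampleDepth * SampleBufferSize * ChannelCount, with SampleDepth in bits.
 * (Hypothetical helper, named for illustration only.) */
size_t audio_block_bytes(unsigned sample_depth_bits,
                         unsigned sample_buffer_size,
                         unsigned channel_count)
{
    assert(sample_depth_bits % 8 == 0);
    return (size_t)(sample_depth_bits / 8) * sample_buffer_size * channel_count;
}

/* Duration of one period in milliseconds. The process callback,
 * including all transfers, has to finish within this time. */
double period_duration_ms(unsigned sample_buffer_size, unsigned sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return 1000.0 * (double)sample_buffer_size / (double)sample_rate_hz;
}

/* Idealized time in microseconds to move one block over the bus.
 * Assumption: the full bus rate is available and there is no
 * API or driver overhead, so this is only a lower bound. */
double ideal_bus_time_us(size_t block_bytes, double bus_bytes_per_second)
{
    assert(bus_bytes_per_second > 0.0);
    return 1.0e6 * (double)block_bytes / bus_bytes_per_second;
}
```

For example, audio_block_bytes(32, 1024, 32) gives 131072 bytes for 32 channels of 1024 samples. At an assumed 8e9 bytes per second, ideal_bus_time_us returns roughly 16.4 µs, while period_duration_ms(1024, 48000) is about 21.3 ms (compare Table I). Under these assumptions the raw bus time is a very small fraction of the period, even for the largest configuration.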
Figure 5: Host to Device Transfer Duration Data Transfer Method

Figure 6: Device to Host Transfer Duration Data Transfer Method

The direction of the data transfers is crucial as well. Three different directions are distinguished:

1) HostToDevice (HtoD or H2D)
   The HostToDevice mode utilizes the Direct Memory Access (DMA) memory of the host system. This enables the CPU to offload the data transfer operations to the GPU without waiting for the completion or the result.

2) DeviceToDevice (DtoD or D2D)
   Invoking a CUDA memcpy between two GPUs uses memory copy operations between the RAM of both GPUs. If a D2D memory copy operation is issued on a single device, however, the GPU's internal cache is used for the data transfer.

3) DeviceToHost (DtoH or D2H)
   Although the DeviceToHost mode does not utilize the DMA memory, it may also operate asynchronously, but it is slower since the data is copied from GPU to CPU RAM.

Figure 7: Kernel Execution Duration Asynchronous Data Transfer Method

Figure 8: Host to Device Transfer Duration Asynchronous Data Transfer Method

IV. EXPERIMENTS

We investigated the influence of the sample buffer size and the channel count on the data transfer rates. The audio channel count was varied between 2, 8, 16 and 32 channels, and each channel count was tested with each common sample buffer size of 64, 128, 256, 512 and 1024 samples per buffer. The samples were formatted as 32 bit floating point.

A simple CUDA kernel is provided as an exemplary computation. Each thread in a block handles exactly one sample and copies it from the input to the output buffer. This way, 64 up to 1024 threads run in parallel in a single block.

The worst case for the data transfer times is given by the Jack server's buffer size and sample rate, which in this case is 48 kHz:

\[
\mathrm{SampleDuration} = \frac{1}{48\,\mathrm{kHz}} = 20.833\,\mu\mathrm{s}
\]

Sample Buffer Size    Worst Case Latency
64                    1.334 ms
128                   2.667 ms
256                   5.334 ms
512                   10.667 ms
1024                  21.334 ms

Table I: Tolerable Worst Case Latencies for Realtime Audio

The profiling overhead of the NVidia Visual Profiler (NVVP) for 32 channels with 64 samples per buffer pushed the host machine to its limits. Thus, tests with 64 audio channels were omitted.

Figure 10: Device to Host Transfer Duration Asynchronous Data Transfer Method

Figure 9: NVVP Screenshot showing CUDA API Overhead

V. DISCUSSION

All combinations of transfer methods and modes, sample buffer sizes and channel counts take less than 10 µs on average and show peaks of up to 46 µs, as visualized in fig. 4 to fig. 10. The visualized durations neglect the CUDA API and driver calls; they represent the execution on hardware only. The legend in fig. 2 is common to all figures.

A comparison of fig. 5 and fig. 8 shows that the memory mapped H2D mode takes less time, at minimum, on average and at maximum, than the asynchronous copy mode. The kernel execution times for the two other transfer methods, shown in fig. 4 and fig. 7, exhibit no significant difference. In fig. 3 only the device to device copy operation is shown, which does not involve any kernel launch. These findings suggest that the synchronous memory transfer method would also be suitable for the H2D copy mode. Since a kernel has to wait until all data is present in the GPU memory, it is of no consequence at this point whether the data is transferred synchronously or asynchronously. This is in contrast to the D2H mode, where a non-blocking data transfer allows the processing chain to finish sooner. The magnitude of these savings is much lower than that of the overhead introduced by the CUDA API and driver calls. This is observable in the rows below the CUDA context in fig. 9: the three smaller gaps (≈ 7 ms) on the right side and a larger gap (≈ 28 ms) on the left side relate to the small chunks in the rows for the respective streams. These chunks are the hardware based memory operations mentioned above and take only a few microseconds on average.

All three memory organization modes exhibit a common problem of cyclic nature. At a given interval (≈ 11 s for pageable memory, ≈ 5 s for pinned memory and ≈ 2.5 s for managed memory), memory operations last approximately four times longer, resulting in the larger gap on the left side in fig. 9. These API and driver calls introduce jitter to the tested audio signal.

The turning point from which the CUDA API overhead becomes negligible can be quantified:

Channel Count    Sample Buffer Size
2                512
4                512
8                1024
16               1024
32               1024

Table II: Channel Count and Sample Buffer Size Limit for Realtime Audio Processing

VI. CONCLUSIONS

All three memory transfer methods are able to operate on realtime audio data. Managed memory, however, is the most convenient, because host and device pointers do not require any special handling and integrate smoothly into C code as well as CUDA code. For the usage with Jack, however, two memory copy operations are still required, because Jack provides preallocated pointers to its buffer interface.

Low sample buffer sizes increase jitter, but no buffer underruns were detected, although the duration of the CUDA API and driver calls suggests that underruns should occur with sample buffer sizes below 512 samples.

VII. FUTURE WORK

The evaluation of the data transfer methods has been a feasibility study for further goals. In the future, machine learning algorithms as well as common signal processing algorithms will be investigated in this environment, with respect to error concealment techniques and the generation of audio effects.

VIII. ACKNOWLEDGEMENTS

fast-music is part of the fast-project cluster (fast actuators sensors & transceivers), which is funded by the BMBF (Bundesministerium für Bildung und Forschung).

REFERENCES

[1] H.264: Advanced video coding for generic audiovisual services, ITU-T Std. H.264, 2003.
[2] F. Wefers and J. Berg, "High-performance real-time FIR-filtering using fast convolution on graphics hardware," in Proc. of the 13th Int. Conference on Digital Audio Effects (DAFx-10). Graz, Austria: Institute of Technical Acoustics, RWTH Aachen University, Sep. 6–10, 2010, pp. DAFX–1 – DAFX–8.
[3] M. Jedrzejewski and K. Marasek, "Computation of room acoustics using programmable video hardware," in Computer Vision and Graphics, Springer-Verlag Netherlands. PJWSTK, 2006.
[4] Getting Started with CUDA, NVidia Corporation, 2008.
[5] CUDA C Programming Guide, NVidia Corporation, 2013.
[6] CUDA Streams, Best Practices and Common Pitfalls, NVidia Corporation, Year unknown.
[7] (2017, Jun.) fast actuators, sensors and transceivers. [Online]. Available: https://de.fast-zwanzig20.de/
[8] (2017, Jun.) Soundjack - a realtime communication solution. [Online]. Available: http://www.soundjack.eu
[9] A. Carôt, "Musical telepresence - a comprehensive analysis towards new cognitive and technical approaches," Ph.D. dissertation, University of Lübeck, Germany, May 2009.
[10] A. V. Oppenheim and R. W. Schaefer, Discrete-time signal processing, 2nd ed. Englewood Cliffs, NJ: Prentice Hall, Inc., 1989.
[11] K. C. Pohlmann, Principles of Digital Audio, 5th ed. The Mcgraw-Hill Companies, 2005.
[12] (2017, Jun.) Jack audio connection kit. [Online]. Available: https://jackaudio.org
[13] (2017, Jun.) Advanced linux sound architecture. [Online]. Available: https://alsa-project.org/main/index.php/Main_Page/
[14] (2006, Dec.) PCI express base specification revision 2.0. PCI-SIG. [Online]. Available: https://members.pcisig.com/wg/PCI-SIG/document/download/8246