<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluation of Data Transfer Methods for Block-based Realtime Audio Processing with CUDA</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Alexander Carôt</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Christoph Kuhr</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Computer Sciences and Languages, Anhalt University of Applied Sciences Köthen</institution>
        </aff>
      </contrib-group>
      <fpage>71</fpage>
      <lpage>76</lpage>
      <abstract>
        <p>Realtime audio production environments generally do not use GPUs unless they are involved in 3D rendering or video production processes. Thus, the GPU is idle most of the time and can be utilized as an audio co-processor. The block-based streaming nature and floating point representation of computer audio hardware are very well suited for GPGPU programming techniques. In this paper we identify the data transfers as the most expensive part of processing realtime audio data and evaluate different data transfer methods with respect to future audio DSP applications.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>
        Modern computer systems are equipped with a CPU and
a GPU. The CPU controls the peripheral hardware and performs
calculations unrelated to 3D graphics or video decoding. The
GPU, in contrast, is concerned with rendering 3D graphics or
utilizing special hardware codecs to decode modern video
codecs such as H.264 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>If a computer system is used for a kind of audio production
that excludes 3D rendering and video decoding, the GPU
is mostly idle. Additionally, GPUs are designed to execute
many floating point operations at the same time in a
threaded fashion.</p>
      <p>
        These considerations promote the idea of using a GPU as an
audio co-processor for signal processing purposes.
Computation-intensive audio signal processing of realtime data
has already been demonstrated: Wefers and Berg have used a GPU
to process FIR and IIR filters [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Jedrzejewski and Marasek
have used the GPU to do impulse response computations for
virtual room acoustics [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>In this paper we will investigate the lower limit for the
usage of a GPU for such signal processing tasks in a realtime
audio production environment. The limit is given as the
combination of channel count and sample buffer size in use.
The bottlenecks in the communication between CPU and GPU
are evaluated and discussed. Further, possible workarounds
to increase the performance aspects under investigation are
proposed and evaluated.</p>
      <p>
        CUDA (Compute Unified Device Architecture) is a
parallel computing platform and programming model designed
for high-performance computing [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The idea is to make use of thousands of
threads running in parallel, which is not possible with
x86/x86_64 CPUs. Such parallel programs are called kernels
in the CUDA domain.
      </p>
      <p>When a kernel is executed on the GPU, the kernel launches a
grid of several blocks, whose limit depends on the GPU features.
Inside each block of the grid, multiple threads execute the
actual computations at runtime. The same computation runs
on each thread, but with different data. Threads can be
handled in a synchronous or an asynchronous way. The latter
requires the concept of streams for a distinct mapping of the
data shared between the threads of one block. The structure
of CUDA computing grids is shown in fig. 1.</p>
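      <p>The mapping of one thread to one data element can be sketched in plain host-side C. The CUDA built-ins blockIdx, blockDim and threadIdx are modeled here as ordinary loop counters; the function names are illustrative, not part of any API:</p>

```c
#include <stddef.h>

/* Host-side model of the CUDA index arithmetic: every thread of every
 * block computes one global index and handles exactly one element.
 * blockIdx.x and threadIdx.x are modeled as plain loop counters. */
void copy_samples(const float *in, float *out,
                  size_t grid_dim, size_t block_dim)
{
    for (size_t block_idx = 0; block_idx < grid_dim; ++block_idx) {
        for (size_t thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
            /* the same computation runs on each thread, with different data */
            size_t i = block_idx * block_dim + thread_idx;
            out[i] = in[i];
        }
    }
}
```

      <p>On a real GPU the two loops would not run serially: each iteration corresponds to one hardware thread executing concurrently.</p>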
      <p>
        The concept of CUDA streams [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is very convenient for
the problem at hand.
Different audio streams can be treated asynchronously, which
is a better representation of their orthogonal nature than a
matrix with an appropriate number of rows and columns. A
matrix could also represent this orthogonality appropriately,
but access to it would be centralized and prone to
race conditions. Moreover, using a grid
dimension (x, y or z) to represent the different
audio channels reduces the dimensionality that remains
usable for calculations at runtime.
      </p>
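      <p>The per-stream organization can be modeled in plain C as follows: each channel owns a private buffer and is processed independently, so no centralized matrix is shared between channels. The gain kernel is a hypothetical placeholder for a real DSP computation:</p>

```c
#include <stddef.h>

#define CHANNELS 8
#define FRAMES   64

/* Placeholder DSP computation for one channel (+6 dB gain);
 * stands in for an arbitrary per-channel CUDA kernel. */
void process_channel(float *buf, size_t frames)
{
    for (size_t i = 0; i < frames; ++i)
        buf[i] *= 2.0f;
}

/* Model of one CUDA stream per audio channel: the launches share no
 * data, so on the GPU they could be issued asynchronously on distinct
 * streams; the serial order used here is irrelevant to the result. */
void process_all(float bufs[CHANNELS][FRAMES])
{
    for (size_t ch = 0; ch < CHANNELS; ++ch)
        process_channel(bufs[ch], FRAMES);
}
```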
      <p>
        This paper is part of the research project fast-music [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The
project has the goal of enabling symphonic orchestras to rehearse
via the public internet, using the realtime communication
software Soundjack [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Research in the field of packet loss
concealment will use GPUs for complex signal processing
based on machine learning algorithms.
      </p>
    </sec>
    <sec id="sec-2">
      <title>II. ARCHITECTURE</title>
      <p>
        The work of Wefers and Berg [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] has also shown that
realtime processing of audio data with a GPU is possible.
The communication between CPU and GPU is realized via
driver calls and shared memory, either DMA, GPU or CPU
RAM. The CPU is also referred to as the host and the GPU as
the device. System architectures in which CPU and GPU
share the same cache are used increasingly, albeit mainly in
embedded systems. Such an architecture completely eliminates
memory copies, since the memory is coherently accessible
by the CPU and the GPU. In conventional systems, which
communicate via the PCIe bus, data has to be copied from
CPU RAM to GPU RAM and back.
      </p>
      <p>
        Since the API calls that copy data between CPU and GPU
incur considerable overhead, it is more efficient to copy large
amounts of data. This makes it all the more interesting to
investigate the use case of small amounts of data, as generated
and processed in the audio domain.
Realtime audio data is represented as a two-dimensional
vector field. At every sampling point in time an analog-digital
converter generates a sample, with typical bit depths
of 16, 24 or 32 bits, encoded either as integer or floating
point [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Computer audio hardware manages data by using buffers that
consist of a predefined number of samples. The audio driver
repeatedly accesses the memory of the audio hardware and
copies the sample buffers to the CPU RAM for further usage.
The responsiveness of such an audio system depends on the
size of the sample buffers: the response time grows with
increasing sample buffer size. Typical sample buffer
sizes are 64, 128, 256, 512 and 1024 samples [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
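      <p>The trade-off between buffer size and responsiveness follows directly from the sample rate. A minimal sketch, assuming the 48 kHz rate used later in this paper (the function name is illustrative):</p>

```c
/* Minimum response time of a block-based audio system: one full
 * sample buffer must be produced before it can be heard, so the
 * latency floor is buffer_size / sample_rate. */
double buffer_latency_ms(unsigned buffer_size, double sample_rate_hz)
{
    return 1000.0 * (double)buffer_size / sample_rate_hz;
}
```

      <p>At 48 kHz, a 64-sample buffer adds about 1.33 ms of latency per buffering stage, whereas a 1024-sample buffer adds about 21.33 ms.</p>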
      <p>AudioDataBlock = SampleDepth · SampleBufferSize · ChannelCount</p>
      <p>AudioDataBlock = 32 bit · {64, 128, 256, 512, 1024} samples · {2, 8, 16, 32, 64} channels</p>
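      <p>The formula above can be evaluated directly; the helper below (an illustrative name, not from the paper) computes the block size in bytes for any of the listed combinations:</p>

```c
#include <stddef.h>

/* AudioDataBlock = SampleDepth * SampleBufferSize * ChannelCount,
 * evaluated in bytes for a given bit depth. */
size_t audio_block_bytes(unsigned bit_depth,
                         unsigned buffer_size,
                         unsigned channels)
{
    return (size_t)(bit_depth / 8) * buffer_size * channels;
}
```

      <p>The smallest configuration (32 bit, 64 samples, 2 channels) yields 512 bytes per block; the largest (32 bit, 1024 samples, 64 channels) yields 256 KiB, still a tiny transfer by PCIe standards.</p>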
      <p>Due to this block-based streaming nature, the data transfer
and processing of audio data between CPU and GPU might
reduce the impact of the data copying overhead, particularly
if multiple audio channels are used.</p>
      <p>
        The audio data that we will transfer and process with the
GPU is provided by a professional audio driver and server
combination called the Jack Audio Connection Kit [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. On
top of a Linux ALSA [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] driver, Jack provides the means
to interconnect jack-aware audio software with the audio
interface at 32 bit floating point precision. The floating
point format requires the development of a prototype, because
the Soundjack clients use an integer format instead of floating
point and would thus require an additional conversion step.
      </p>
      <p>
        We developed a minimal Jack client for testing purposes
with varying channel counts and sample buffer sizes. The Jack
client is linked against a shared library that provides the CUDA
kernel [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This way CUDA computations can be integrated
into arbitrary C programs. The Jack server configures the audio
interface by utilizing the ALSA driver infrastructure. The most
important configuration parameters for our investigations are
the channel count and the sample buffer size, called frame or
period in the Jack domain. At runtime, the Jack server requests
our Jack client to process a frame via a callback function.
If the callback function does not finish its computations in
time, the Jack server reports a buffer underrun, also called
an xrun in the Jack domain.
      </p>
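      <p>The xrun condition reduces to a simple deadline check: the process callback is granted one buffer period. The helper below is a hypothetical illustration of that rule, not part of the Jack API:</p>

```c
#include <stdbool.h>

/* The Jack server grants the process callback one buffer period
 * (buffer_size / sample_rate) to finish. If the callback takes
 * longer, an xrun (buffer underrun) is reported. Times are in
 * microseconds. Illustrative helper, not a Jack API function. */
bool causes_xrun(double callback_duration_us,
                 unsigned buffer_size, double sample_rate_hz)
{
    double period_us = 1.0e6 * (double)buffer_size / sample_rate_hz;
    return callback_duration_us > period_us;
}
```

      <p>For example, at 48 kHz a 64-sample period lasts about 1333 μs, so even the 46 μs peak transfer times reported later fit comfortably within the deadline.</p>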
      <p>
        In the system under test, a Nvidia Geforce GT940mx GPU with
2 GB of DDR3 RAM is connected to an Intel i7-6870 4-core CPU
with 16 GB of DDR3 RAM via a PCIe x16 2.0 bus [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Thus, the transfer rate between CPU and GPU is
limited to the bus bandwidth of 8 GB/s simplex.
The Nvidia Geforce GT940mx has a compute capability of
5.0 (≥ 2.0), which allows it to use managed memory.
      </p>
    </sec>
    <sec id="sec-2b">
      <title>III. CUDA MEMORY ORGANIZATION AND MANAGEMENT</title>
      <p>The data structures and the data transfers between CPU and
GPU are the bottlenecks for the entire signal processing chain.
Three different data transfer methods can be used:</p>
    </sec>
    <sec id="sec-3">
      <title>1) Synchronous data transfer</title>
      <p>A synchronous data transfer returns as soon as the
memory operation on the GPU memory is done, with
a success or failure result. For the GPU integration
of synchronous data transfers, it is irrelevant whether
the memory is pageable or pinned; either type can be
accessed. Pageable memory is memory from the virtual
address space of the CPU process, managed by the operating system.</p>
    </sec>
    <sec id="sec-4">
      <title>2) Asynchronous data transfer</title>
      <p>An asynchronous data transfer returns immediately
after invoking the data transfer, regardless of the result.
The result of the operation has to be checked separately.
It requires the additional concept of streams for the
integration on the GPU. Further, the host memory has
to be pinned. Pinned memory addresses are allocated in
the DMA address space of the host system.</p>
    </sec>
    <sec id="sec-4b">
      <title>3) Managed memory with coherent caches on CPU and GPU</title>
      <p>With managed memory, the need for explicit memory
copy operations is eliminated. The GPU driver allocates
memory on the CPU and the GPU respectively, manages any
data access to these memory segments implicitly and
thus keeps the data in both memory locations coherent
through small caching operations.</p>
      <p>The direction of a data transfer is crucial as well. Three
different directions are distinguished:</p>
    </sec>
    <sec id="sec-5">
      <title>1) HostToDevice (HtoD or H2D)</title>
      <p>The HostToDevice mode utilizes the Direct Memory
Access (DMA) memory of the host system. This enables
the CPU to offload the data transfer operations to the
GPU without waiting for their completion or result.</p>
    </sec>
    <sec id="sec-6">
      <title>2) DeviceToDevice (DtoD or D2D)</title>
      <p>Invoking a CUDA memcpy between two GPUs uses
memory copy operations between the RAM of both
GPUs. If a D2D memory copy operation is issued on a
single device, however, the GPU's internal cache is used
for the data transfer.</p>
    </sec>
    <sec id="sec-6b">
      <title>3) DeviceToHost (DtoH or D2H)</title>
      <p>Although the DeviceToHost mode does not utilize
the DMA memory, it may also operate asynchronously,
albeit more slowly, since the data is copied from GPU to CPU RAM.</p>
    </sec>
    <sec id="sec-7">
      <title>IV. EXPERIMENTS</title>
      <p>We investigated the influence of the sample buffer size
and the channel count on the data transfer rates. The audio
channel count was varied between 2, 8, 16 and 32 channels,
and each channel count was tested with each common
sample buffer size of 64, 128, 256, 512 and 1024 samples per
buffer. The samples were formatted as 32 bit floating point.
A simple CUDA kernel is provided as an exemplary
computation. Each thread in a block handles exactly one
sample and copies it from the input to the output buffer. This
way 64 up to 1024 threads run in parallel in a single block.
The worst case for the data transfer times is given by the
Jack server's buffer size and sample rate, which in this case is
48 kHz (SampleDuration = 1 / 48 kHz = 20.833 μs):</p>
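      <p>The resulting per-period budget can be computed directly; the function names below are illustrative:</p>

```c
/* At a sample rate of 48 kHz, one sample lasts
 * 1 / 48000 s = 20.833 us. */
double sample_duration_us(double sample_rate_hz)
{
    return 1.0e6 / sample_rate_hz;
}

/* A buffer of buffer_size samples must therefore be transferred and
 * processed within buffer_size * sample_duration microseconds. */
double period_budget_us(unsigned buffer_size, double sample_rate_hz)
{
    return buffer_size * sample_duration_us(sample_rate_hz);
}
```

      <p>Even the smallest period tested (64 samples, ≈ 1333 μs) leaves ample headroom over the measured hardware operation times reported below.</p>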
      <p>The profiling overhead of the NVidia Visual Profiler
(NVVP) for 32 channels with 64 samples per buffer pushed
the host machine to its limits. Thus, tests with 64 audio
channels were omitted.</p>
      <p>All combinations of transfer methods and modes, sample
buffer sizes and channel counts take on average less than
10 μs and show peaks of up to 46 μs, as visualized in fig. 4 to
fig. 10. The visualized durations neglect the CUDA API and
driver calls; they represent the execution on hardware only.
The legend in fig. 2 is common to all figures.</p>
      <p>A comparison of fig. 5 and fig. 8 shows that the memory
mapped H2D mode takes less time, at minimum, average and
maximum, than the asynchronous copy mode.</p>
      <p>The kernel execution times for the two other transfer methods,
shown in fig. 4 and fig. 7, exhibit no significant difference.
In fig. 3 only the device to device copy operation is shown,
which does not involve any kernel launch. These findings
suggest that the synchronous memory transfer method
would also be suitable for the H2D copy mode. Since a
kernel has to wait until all data is present in the GPU
memory, it is of no consequence at this point whether the data is
transferred synchronously or asynchronously. This is in contrast to
the D2H mode, where a non-blocking data transfer allows
the processing chain to finish sooner. The magnitude of these
savings is much lower than that of the overhead introduced by
the CUDA API and driver calls. This is observable in the
rows below the CUDA context in fig. 9: the three smaller
gaps (≈ 7 ms) on the right side and a larger gap (≈ 28 ms)
on the left side relate to the small chunks in the rows for
the respective streams. These chunks are the hardware based
memory operations mentioned above and take only a few
microseconds on average.</p>
      <p>All three memory organization modes exhibit a common
problem of cyclic nature. At a given interval (≈ 11 s for
pageable memory, ≈ 5 s for pinned memory and ≈ 2.5 s for
managed memory) memory operations last approximately
four times longer, resulting in the larger gap on the left side
in fig. 9. These API and driver calls introduce jitter into the
tested audio signal.</p>
      <p>The turning point from which on the CUDA API overhead is
negligible can be quantified as a combination of channel count
and sample buffer size.</p>
      <p>All three memory transfer methods are able to operate
on realtime audio data. Managed memory, however, is most
convenient, because host and device pointers do not require
any special handling and integrate smoothly into C code
as well as CUDA code. For the usage with Jack, however,
two memory copy operations are still required, because Jack
provides preallocated pointers through its buffer interface.
Low sample buffer sizes increase the jitter, but no buffer
underruns were detected, although the duration of the CUDA
API and driver calls suggests that underruns should occur with
sample buffer sizes below 512 samples.</p>
    </sec>
    <sec id="sec-8">
      <title>VII. FUTURE WORK</title>
      <p>The evaluation of the data transfer methods has been a
feasibility study for further goals. In the future, machine
learning algorithms will be investigated in this environment,
as well as common signal processing algorithms, with respect
to error concealment techniques and the generation of audio
effects.</p>
    </sec>
    <sec id="sec-9">
      <title>VIII. ACKNOWLEDGEMENTS</title>
      <p>fast-music is part of the fast-project cluster (fast actuators sensors &amp; transceivers), which is funded by the BMBF (Bundesministerium für Bildung und Forschung).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>H.264: Advanced video coding for generic audiovisual services</article-title>
          ,
          <source>ITU-T Std. H.264</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Wefers</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Berg</surname>
          </string-name>
          , “
          <article-title>High-performance real-time FIR-filtering using fast convolution on graphics hardware</article-title>
          ,” in
          <source>Proc. of the 13th Int. Conference on Digital Audio Effects (DAFx-10)</source>
          , Graz, Austria, Sep. 6-10,
          <year>2010</year>
          , pp. DAFX-1 - DAFX-8.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jedrzejewski</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Marasek</surname>
          </string-name>
          , “
          <article-title>Computation of room acoustics using programmable video hardware</article-title>
          ,” in Computer Vision and Graphics, Springer-Verlag Netherlands. PJWSTK,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <article-title>Getting Started with CUDA</article-title>
          ,
          <source>NVidia Corporation</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <article-title>CUDA C Programming Guide</article-title>
          ,
          <source>NVidia Corporation</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <article-title>CUDA Streams: Best Practices and Common Pitfalls</article-title>
          ,
          <source>NVidia Corporation</source>
          , year unknown.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          (2017, Jun.)
          <article-title>fast actuators, sensors and transceivers</article-title>
          . [Online]. Available: https://de.fast-zwanzig20.de/
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          (2017, Jun.)
          <article-title>Soundjack - a realtime communication solution</article-title>
          . [Online]. Available: http://www.soundjack.eu
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Carôt</surname>
          </string-name>
          , “
          <article-title>Musical telepresence - a comprehensive analysis towards new cognitive and technical approaches</article-title>
          ,”
          <source>Ph.D. dissertation</source>
          , University of Lübeck, Germany, May
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Oppenheim</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Schaefer</surname>
          </string-name>
          ,
          <article-title>Discrete-time signal processing</article-title>
          , 2nd ed. Englewood Cliffs, NJ: Prentice Hall, Inc.,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K. C.</given-names>
            <surname>Pohlmann</surname>
          </string-name>
          , Principles of Digital Audio, 5th ed. The
          <string-name>
            <surname>Mcgraw-Hill Companies</surname>
          </string-name>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          (2017, Jun.)
          <article-title>Jack audio connection kit</article-title>
          . [Online]. Available: https://jackaudio.org
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          (2017, Jun.)
          <article-title>Advanced linux sound architecture</article-title>
          . [Online]. Available: https://alsa-project.org/main/index.php/Main Page/
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          (2006, Dec.)
          <article-title>Pci express base specification revision 2.0</article-title>
          .
          <string-name>
            <surname>PCI-SIG.</surname>
          </string-name>
          [Online]. Available: https://members.pcisig.com/wg/PCISIG/document/download/8246
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>