Advanced mixing of audio streams for heterogeneous computer systems in telecommunications

A.A. Kolpakov1, Ju.A. Kropotov1

1 Murom Institute (branch) of VlSU, Orlovskaya st., 23, 602264, Murom, Vladimir region, Russia

Abstract

This paper presents an enhanced algorithm for mixing audio streams on graphics processors. The algorithm combines several mixing stages by means of two-pass rendering, which significantly reduces the time spent switching between buffers. Comparative computer experiments were carried out to evaluate the performance of the developed algorithm. The purpose of the present paper is the development of an efficient audio stream mixing algorithm for processing on GPUs. A method for transferring the computations to graphics processors with the use of shader programs was developed. The novel feature of the presented solution is the use of two-pass rendering. The results showed that the developed algorithm increases computational performance by up to 6 times. The presented solution can be implemented as software in multiprocessor telecommunication systems.

Keywords: two-pass rendering; algorithm for increasing productivity; parallel computing; heterogeneous computing systems; graphics processors; mixing of audio data

1. Introduction

The development of information and telecommunication systems for digital in-house operational, command, loudspeaker and telephone communication becomes especially urgent with the development of modern computer networks. When designing such systems, it is justified to use panel or tablet computers based on ready-made solutions to ensure optimum performance. Since voice communication plays an important role, the main mode in such systems is a voice conference with several participants.

The simplest scenario for organizing voice communication is for each sound source to send its audio stream to each receiver independently. This method is simple and convenient, but it requires high network bandwidth, which is not always available. The better method is therefore audio mixing, that is, combining the audio streams from all sources into one. Because sound waves superimpose on each other, this method can provide acceptable sound quality while reducing network congestion. However, it can significantly increase the load on the server CPU, which can negatively affect overall system performance. A way out of this situation is to use GPUs to solve this problem [1].
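To make the bandwidth argument concrete, the following C sketch compares the two schemes for a hypothetical conference. The participant count and per-stream bitrate are illustrative assumptions, not figures from this work.

/* Back-of-the-envelope comparison of network load for the naive
 * scheme (every source sends to every receiver) versus server-side
 * mixing (one upstream and one downstream per participant).
 * The participant count and bitrate are assumed example values. */
#include <stdio.h>

int main(void) {
    const int n = 8;            /* conference participants (assumed) */
    const double kbps = 64.0;   /* per-stream bitrate (assumed)      */

    double naive = (double)n * (n - 1) * kbps;  /* n*(n-1) unicast streams */
    double mixed = 2.0 * n * kbps;              /* n uploads + n downloads */

    printf("naive: %.0f kbit/s, mixed: %.0f kbit/s\n", naive, mixed);
    return 0;
}

For n = 8 the naive scheme carries 56 streams against 16 for the mixing scheme; this saving is what the server-side mixer trades for the extra processing load considered below.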
2. The problems of using graphics processors for audio mixing

Although graphics processors are quite efficient, there are several problems in using them for audio mixing, related to their architecture and limited functionality [2]. The first problem is the bus bandwidth between the GPU and main memory, which is smaller than that between the CPU and main memory. For example, the Intel 975X chipset provides a theoretical bandwidth of 10.7 GB/s for the CPU, but only 8 GB/s for the GPU. Practice shows that the lack of support for asynchronous I/O costs considerable time in auxiliary operations such as locking and unlocking the buffer. Since general-purpose calculations on the GPU are based on 3D rendering, the write speed is usually higher than the read speed; this asymmetry makes reading back the result rather slow [3, 4].

Secondly, because general-purpose calculations on the GPU are based on 3D rendering, different tasks require different GPU settings, such as 3D models, transformation matrices and shader programs. While the settings are being loaded, the computational units of the GPU are idle. Worse still, the GPU does not notify the CPU when a task is completed, so the CPU must periodically poll the GPU status. This is a time-consuming operation, as it breaks the parallelism between the GPU and the CPU.

Thirdly, a disadvantage of the GPU is its performance on logical operations. Whereas the CPU predicts branches, the GPU works differently: every branch is executed first, and then the desired result is selected. This simplifies parallelization but requires more resources.

Finally, the GPU instruction set is incompatible with that of the CPU. In addition, execution time and code length are limited. All this makes it difficult to port existing algorithms to graphics processors [5].

3. The structure of the developed algorithm

The basic audio mixing algorithm consists of five steps. The first step is to sum the audio samples from the different sources, which can be represented by the following formula [6, 7]:

M_t = \sum_{k=0}^{n} u_{k,t}, (1)

where u_{k,t} is the vector of samples obtained by microphone k at time t, and M_t is the resulting mixing vector.

The second stage is echo cancellation, which in its basic form consists in excluding the samples of the i-th device from the final vector [8]. This stage is represented as

M_{i,t} = M_t - u_{i,t}, (2)

where M_{i,t} is the resulting mixing vector for the i-th device and u_{i,t} is the sample of the i-th device.

For the resulting vector M_{i,t} to be correct, its dimension must equal the dimension of the incoming vectors. However, after stages 1 and 2 the vector M_{i,t} may overflow, leading to unwanted noise [9]. To use the original large-dimension vector M_{i,t}, it must be compressed for further use. This is done in the third stage by the formula

M_{i,t} = M_{i,t} \cdot frac_{i,t}, (3)

where frac_{i,t} is the attenuation factor for the i-th device. This coefficient must be calculated automatically, starting from the maximum compressed sample. The search for the maximum among the compressed samples takes place in the fourth stage, and the correction of the attenuation coefficient is made in the fifth stage. A block diagram of the basic mixing algorithm is shown in Fig. 1.

Fig. 1. A block diagram of the basic mixing algorithm.

For a better description of the algorithm, the GPU capabilities are represented as a function f of the texture X and the pixel coordinates:

f(X, P) = \{ y_i \mid p_i \in P, \; y_i = f(X, p_i) \}, (4)

where y_i is the projection of pixel p_i onto the texture X and P is the 3D model projection. For each pixel in the 3D model projection P, the function (4) is calculated and the result is written to the render buffer. From formula (4), a computation on the GPU can be represented as A = (X, P, f). Suppose that n is the total number of sound sources in the session and L is the length of one audio sample.
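For illustration, the arithmetic of these first three steps can be written as the following serial C sketch. It is only a CPU reference for formulas (1)-(3) with illustrative identifiers, not the shader program executed on the GPU.

/* Serial CPU reference for mixing steps (1)-(3) for one device i.
 * u[k][t] is sample t of source k; m_i receives the mix for device i.
 * All names are illustrative; in the paper these steps run as a
 * shader pass rather than a loop. */
#include <stddef.h>

void mix_for_device(float **u, float *m_i,
                    size_t n, size_t len, size_t i, float frac_i)
{
    for (size_t t = 0; t < len; ++t) {
        float sum = 0.0f;
        for (size_t k = 0; k < n; ++k)   /* step 1: M_t = sum_k u_{k,t}   */
            sum += u[k][t];
        sum -= u[i][t];                  /* step 2: echo cancellation (2) */
        m_i[t] = sum * frac_i;           /* step 3: attenuation (3)       */
    }
}

Note that running this per device repeats the accumulation n times; removing that repetition is exactly what the fused formulation of the developed algorithm exploits.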
Each of the first three mixing steps (sample accumulation, echo cancellation and compression) yields n sequences of L bytes, while the last two steps (searching for the maximum sample and adapting the attenuation coefficient) yield only n integers. Since the first three steps can be performed within one projection when computing on the GPU, they can be combined into one step, as shown below:

m_i = \left( \sum_{j=0}^{n-1} u_j - u_i \right) \cdot frac_i. (5)

Since the fourth step uses a different dimension than the first three, it cannot be combined within the framework of formula (5). A block diagram of the extended mixing algorithm is shown in Fig. 2.

Fig. 2. A block diagram of the extended mixing algorithm.

Usually, calculations can be combined when their projections do not coincide and have no intersections. Therefore, if these calculations are denoted A_1 = (X_1, P_1, f_1) and A_2 = (X_2, P_2, f_2), they can be combined as shown below:

f(X_1, X_2, p) = \begin{cases} f_1(X_1, p), & p \in P_1, \\ f_2(X_2, p), & p \in P_2, \end{cases} \quad X = X_1 \cup X_2, \; P = P_1 \cup P_2. (6)

As seen from (6), the combined algorithm checks the coordinates of each pixel to select the function to execute. Since the GPU executes all branches before selecting the desired one, each branch would be executed for every pixel, which takes a long time. The developed algorithm therefore uses an alternative method for executing a large number of functions, which outputs data of different lengths into a single render buffer using multi-pass rendering. The main rule here is to move the projection by modifying the projection matrix, so that the computational area of each function is limited to the required region rather than the entire buffer. Depending on the pixel, it can carry a different amount of information without loss of generality. Let w denote the total number of pixels needed to store L bytes; the render buffer can then be represented as (w+1) x n.

The 3D model of the developed algorithm is a rectangle lying in the Z plane, 2w/(w+1) units wide and 2 units high, with vertex coordinates (-1, -1, 0), (1 - 2/(w+1), -1, 0), (-1, 1, 0) and (1 - 2/(w+1), 1, 0). In the first rendering pass, the identity matrix is chosen as the projection matrix, which ensures that the projection matches the 3D model. After conversion of the visible area, formula (5) is applied in this pass to perform the first three mixing steps. The transformations of the 3D model produced in the first pass are shown in Fig. 3.

Fig. 3. The first pass of the algorithm.

In the second rendering pass, the projection is shifted to the left by L pixels, and the shader program switches to the maximum-sample search mode. In this pass only one column can be written, since most of the projection lies outside the render buffer and is automatically ignored by the GPU.
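The two-pass structure can be sketched serially in C as shown below: the first pass applies the fused formula (5) for every device, and the second pass finds each device's maximum sample and corrects its attenuation coefficient. The correction rule (scaling frac_i down when the peak exceeds a limit) is an assumed illustration; the paper does not spell out the exact update.

/* Serial sketch of the developed two-pass algorithm. Pass 1 evaluates
 * the fused formula (5); pass 2 performs the maximum search and the
 * attenuation correction. The clip-limit update rule is an assumption
 * for illustration; names are illustrative, and on the GPU each pass
 * is a rendering pass rather than a loop. */
#include <math.h>
#include <stddef.h>

void mix_two_pass(float **u, float **m, float *frac,
                  size_t n, size_t len, float limit)
{
    /* Pass 1: fused accumulate + echo-cancel + attenuate, formula (5).
     * The total is computed once per sample instead of once per device. */
    for (size_t t = 0; t < len; ++t) {
        float total = 0.0f;
        for (size_t j = 0; j < n; ++j)
            total += u[j][t];
        for (size_t i = 0; i < n; ++i)
            m[i][t] = (total - u[i][t]) * frac[i];
    }

    /* Pass 2: per-device maximum, then attenuation correction. */
    for (size_t i = 0; i < n; ++i) {
        float peak = 0.0f;
        for (size_t t = 0; t < len; ++t)
            peak = fmaxf(peak, fabsf(m[i][t]));
        if (peak > limit)
            frac[i] *= limit / peak;   /* assumed correction rule */
    }
}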
Since clipping of the projection is performed at the beginning of the render, this method calls the function only for the relevant pixels rather than for the entire buffer. The actions performed in the second pass of the algorithm are shown in Fig. 4.

Fig. 4. The second pass of the algorithm.

4. Using a single-texture algorithm

As noted above, both passes of the algorithm require n sample sequences as input data, and each sequence must have its own unique texture, so n textures of dimension L are needed in total. This multi-texture approach is not suitable for audio mixing. First, each sound source requires its own texture, so additional passes may be required. Secondly, loading many small textures is much slower than loading a single large one.

In the developed algorithm, it is proposed to load a single texture. As an example, the first pass of the algorithm is presented. The input contains n sample sequences of L bytes each and n attenuation coefficients. Two texture buffers in RGBA format are used. Texture T1 has dimension [L/4] x n and stores all sample sequences row by row. Texture T2 has dimension 1 x n and stores the attenuation coefficients. The texture coordinates are given in Table 1.

Table 1. Coordinates of textures used in the developed algorithm.

Vertices              | Coordinates of texture T1 | Coordinates of texture T2
(-1, -1, 0)           | (1/2w, 1)                 | (0.5, 1)
(1 - 2/(w+1), -1, 0)  | (1, 1)                    | (0.5, 1)
(-1, 1, 0)            | (1/2w, 0)                 | (0.5, 0)
(1 - 2/(w+1), 1, 0)   | (1, 0)                    | (0.5, 0)

Audio mixing is performed independently for each pixel. The pixel's texture coordinates (x, y) are calculated by interpolating the vertex texture coordinates. According to Table 1, the texture coordinate for T1, pt.t0, is (X, Y), and for T2, pt.t1, it is (0.5, Y). pt.t0 refers to the sample for which the current mixing transformations are made; this sample is called the "aiming point". To exclude the appearance of an echo, a second texture coordinate, ptCur, is used to access T1 instead of pt.t0: the x component of ptCur is identical to that of pt.t0, and the y component is computed from a loop variable that enumerates the sound sources. In the loop, the position register v is used to skip the "aiming point". Finally, the attenuation coefficients are read through texture coordinate pt.t1. Since the coefficient is stored in the first byte, it is sampled from the blue component only. The index arithmetic implied by this layout is sketched below.
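The following C sketch shows the addressing implied by the single-texture layout: all n sequences are packed row by row into one RGBA texture, four bytes per texel. The names are illustrative, and the byte-to-channel order is an assumption (the note above that the coefficient's first byte lands in the blue component suggests a BGRA-style ordering; plain RGBA order is used here purely for illustration).

/* Addressing within the single packed texture of Section 4: n sample
 * sequences of L bytes are stored row by row in a ceil(L/4) x n RGBA
 * texture, 4 bytes per texel. Illustrative sketch only. */
#include <stddef.h>

typedef struct { size_t x, y, channel; } Texel;

/* Map (source index i, byte offset b within its sequence) to a texel
 * and a colour channel. Channel order 0=R, 1=G, 2=B, 3=A is assumed;
 * the actual order depends on the texture format. */
Texel sample_address(size_t i, size_t b)
{
    Texel t;
    t.y = i;            /* one texture row per sound source       */
    t.x = b / 4;        /* four consecutive bytes share one texel */
    t.channel = b % 4;
    return t;
}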
5. Experimental study of the developed algorithm

The test input data for mixing are sequences of 320 samples, all generated randomly. Test bench configuration: Intel Core i3-4130 CPU, 4 GB RAM, NVIDIA GeForce GT730 graphics card. The number of sequences M was varied. The outputs of the basic and the developed algorithms are identical. The results of an experimental study of the dependence of the mixing time t on the number of sequences M are shown in Fig. 5. As can be seen from the test results in Fig. 5, the advanced audio mixing algorithm increases the performance of the computer system by 5-6 times [9].

Fig. 5. The results of an experimental study of the dependence of the mixing time t on the number of sequences M.

Table 2 shows the average running time of the algorithm t for 8 random sequences of 320 samples. During the measurement, 1000 processing cycles were carried out, with new random sample sequences generated for each cycle. For the results obtained, the dispersion of the output sequences was calculated; its values are also given in Table 2.

Table 2. Results of the study of the extended audio mixing algorithm for 8 sample sequences.

Algorithm used       | Average running time, ms | Dispersion of output sequences
Basic algorithm      | 1.351                    | 3.757
Developed algorithm  | 2.226 x 10^-1            | 1.426 x 10^-3

As can be seen from the test results in Table 2, the dispersion of the output sequences for the basic algorithm is substantially higher than for the developed one [12, 13]. This is because the GPU calculations use 32-bit numbers, whereas the calculations on the central processor use 64-bit numbers. In a heterogeneous computer system, the developed algorithm reduces the processing time to 0.2226 x 10^-3 s, compared with 1.351 x 10^-3 s for the basic algorithm.

6. Conclusion

This paper has presented an algorithm for extended mixing of audio streams for computing on the GPU. Its main advantage is the combination of multiple mixing stages by means of two-pass rendering, which significantly reduces the time spent switching between buffers. The use of a single texture for the calculations increases the efficiency of I/O operations. Although I/O operations take approximately half of the computation time, experimental studies of the developed algorithm showed a performance increase of up to 6 times.

References

[1] Lindholm E, Nickolls J, Oberman S, Montrym J. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro 2008; 28(2): 39–55.
[2] Luebke D, Harris M, Kruger J, Purcell T, Govindaraju N, Buck I, Woolley C, Lefohn A. GPGPU: general purpose computation on graphics hardware. ACM SIGGRAPH Course Notes. New York, USA, 2004; 33 p. DOI: 10.1145/1103900.110393.
[3] Kolpakov AA. Theoretical evaluation of growth performance computing systems from the use of multiple computing devices. V mire nauchnykh otkrytii 2012; 1: 206–209. (in Russian)
[4] Boreskov AV. Shaders development and debugging. Sankt-Peterburg: BHV-Peterburg, 2006; 488 p. (in Russian)
[5] Nekrasov KA, Potashnikov SI, Boyarchenkov AS, Kupryazhkin AYa. Parallel computing for general purpose graphics processors: textbook. Ministry of Education and Science of the Russian Federation, Ural Federal University. Ekaterinburg: Publishing house of the Ural University, 2016; 104 p. (in Russian)
[6] Kropotov JuA. Experimental study of the law of distribution of probability of amplitudes of signals of systems of transmission of voice information. Proektirovanie i tekhnologiia elektronnykh sredstv 2006; 4: 37–42. (in Russian)
[7] Belozyorov AS, Korobicyn VV. Implementation of computations on a graphics processor using Nvidia CUDA platform. Programmnye produkty i sistemy 2010; 1: 62–64. (in Russian)
[8] Kropotov JuA, Belov AA, Proskuryakov AJu, Kolpakov AA. Methods of Designing Telecommunication Information and Control Audio Exchange Systems in Difficult Noise Conditions. Sistemy upravleniia, sviazi i bezopasnosti 2015; 2: 165–183. (in Russian)
[9] Ermolaev VA, Kropotov JuA.
Algorithms for processing acoustic signals in telecommunication systems by local parametric methods of analysis. Proceedings of the International Siberian Conference on Control and Communications (SIBCON), 2015. URL: http://ieeexplore.ieee.org/document/7147109/.