<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Compute Shader in Image Processing Development</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Robert Tornai</string-name>
          <email>tornai.robert@inf.unideb.hu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Péter Fürjes-Benke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Proceedings of the 1</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Debrecen, Faculty of Informatics</institution>
        </aff>
      </contrib-group>
      <fpage>218</fpage>
      <lpage>225</lpage>
      <abstract>
        <p>This paper presents the OpenGL compute shader implementation of the BlackRoom software. BlackRoom is a platform-independent image processing program that supports multiple execution branches, such as Vulkan fragment shader, OpenGL fragment shader, and CPU-based rendering. To support a wider range of devices with different amounts of memory, users can utilize tile rendering, and the program can run in browsers thanks to the WebAssembly format. The program's built-in benchmark system makes it easy to determine the performance differences between the implemented CPU- and GPU-based execution branches. We made a comprehensive comparison of the rendering performance of our CPU, OpenGL compute shader, OpenGL fragment shader and Vulkan fragment shader branches. The latter is still under development, which currently results in relatively higher runtimes. A further aim is to optimize our algorithms that use the Vulkan API. In addition, the program will be capable of rendering multiple effects at once with the Vulkan fragment shader. Furthermore, we plan to enable the available GPU rendering and multi-threading features on the WebAssembly platform provided by the Qt framework [2].</p>
      </abstract>
      <kwd-group>
        <kwd>Image processing</kwd>
        <kwd>benchmark</kwd>
        <kwd>CPU</kwd>
        <kwd>shaders</kwd>
        <kwd>Vulkan</kwd>
        <kwd>OpenGL</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        BlackRoom is an image processing application developed with Qt 5.15 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
The goal was to use the most modern techniques, so in addition to the OpenGL and
Vulkan fragment shaders, we also implemented the algorithms as OpenGL compute
shaders. This paper covers the results of these implementations.
      </p>
      <p>
        The structure of BlackRoom is based on the standard skeleton, which is
introduced in GPU Gems [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. As a source operator, we have a load operator for
processed and raw formats. Our system contains image filters. Some filters, such as the
Harris shutter, have additional load operators for the color channels, thus extending
the simple demo structure mentioned in GPU Gems. Consequently, processing paths
need not be linear. We have implemented both an image view and a
save operator as sink operators. Similar to the framework of Seiller et al., each
processing path is implemented in a separate class to follow the basic principles of
object-oriented programming [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        During the research, we studied the existing accelerated image processing
libraries. Although the progressive GPUCV library was found to be one of the
best, it has not been developed since 2010, and it was never made available for web
applications [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Allusse et al. later enhanced the system with CUDA support,
but CUDA is a proprietary technology, and it is not web-enabled either [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Our program
supports the Linux, macOS and Windows platforms, and most of its features are also
available on WebAssembly. Optimizing our image processing
software gave us better, customizable control over the memory
used by the image modification algorithms. Our other approach to achieving better
performance was implementing our algorithms for Vulkan [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The Vulkan API will become
important when we add Android support to our program in the near future.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Utilized Technologies</title>
      <p>Nowadays, several application programming interfaces are available for image
processing. Our goal was to implement the most common platform-independent APIs
and to collect statistics about their performance in different use cases. Therefore,
the BlackRoom software provides multiple execution branches based on the OpenMP,
OpenGL and Vulkan APIs. This section presents these technologies and
gives a short summary of their characteristics.</p>
      <sec id="sec-2-1">
        <title>2.1. OpenMP</title>
        <p>OpenMP (Open Multi-Processing) is an API that supports shared-memory
multiprocessing on CPUs. It was released in 1997 for Fortran, and C and C++
support followed in 1998. The API is a widespread solution
for implementing parallel execution in applications. Its usage is straightforward
in C++ programs, since programmers only need to place #pragma directives before the
relevant parts of the code. In terms of performance, the difference between
sequential and parallel processing depends heavily on the given use case and the
number of threads used, but for complex calculations the latter usually
provides significantly better results.</p>
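        <p>As a minimal sketch of this usage (not BlackRoom's actual code), the following C++ function converts an interleaved RGB float buffer to grayscale and parallelizes the pixel loop with a single #pragma. The function name toGrayscale and the Rec. 601 luma weights are assumptions chosen for illustration.</p>
        <preformat><![CDATA[
```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not taken from BlackRoom: converts interleaved RGB
// floats (r, g, b, r, g, b, ...) to one luminance value per pixel.
// The Rec. 601 luma weights are an assumption chosen for illustration.
std::vector<float> toGrayscale(const std::vector<float>& rgb)
{
    assert(rgb.size() % 3 == 0);
    const long long pixelCount = static_cast<long long>(rgb.size() / 3);
    std::vector<float> gray(rgb.size() / 3);

    // Each iteration writes a distinct element, so the loop can be split
    // across threads without synchronization.
    #pragma omp parallel for
    for (long long i = 0; i < pixelCount; ++i) {
        const std::size_t base = static_cast<std::size_t>(i) * 3;
        gray[static_cast<std::size_t>(i)] =
            0.299f * rgb[base] + 0.587f * rgb[base + 1] + 0.114f * rgb[base + 2];
    }
    return gray;
}
```
]]></preformat>
        <p>With GCC or Clang, OpenMP is typically enabled with the -fopenmp compiler flag. When the code is compiled without OpenMP support, the pragma is ignored and the loop runs sequentially with the same result.</p>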
      </sec>
      <sec id="sec-2-2">
        <title>2.2. OpenGL</title>
        <p>OpenGL (Open Graphics Library) is a platform-independent 2D and 3D graphics API
that was released in 1992. It is a robust, high-level API, and thanks to that
it is widely adopted in the industry. With the release of OpenGL ES, it can be
used on mobile devices, and in web applications thanks to WebGL. Since OpenGL
is used for hardware-accelerated rendering on the GPU, it is very efficient in image
processing. Moreover, thanks to compute shader support, it can be used
for general calculations as well.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Vulkan</title>
        <p>Vulkan is the newest platform-independent 2D and 3D graphics API, and it is based
on the Mantle API developed by AMD. It was released in 2016, and its main goal
was to provide higher performance and balanced CPU/GPU usage. Similarly to
OpenGL, it is available for multiple platforms and hardware, such as Windows, Linux,
Android, and macOS through MoltenVK. Compared to OpenGL, the Vulkan API
can be up to 100% faster, but this depends heavily on the application and the implementation.
The latter is one of the key points of this new API: the developer has almost
full control over the graphics processing unit, and because of that, it is very
difficult for beginner programmers to use the interface efficiently.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Benefits of OpenGL Compute Shader</title>
      <p>
        Previously, BlackRoom used only the OpenGL fragment shader for computing the
effects on the GPU. While this approach has its benefits for our needs,
it also brought some challenges. First of all, the software contains
multiple context-sensitive algorithms where the neighboring pixels are used for the
final result. With the OpenGL fragment shader, accessing the neighborhood was
not really effective, although it improved with the introduction of rectangle textures,
which enabled the use of integer indices instead of float values. Secondly,
histogram generation may be even slower than computing it on the CPU [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Finally,
the implementation of the OpenGL fragment shader is somewhat more complex
than our needs require.
      </p>
      <p>For the above reasons, we started to implement our algorithms as OpenGL compute
shaders. In this execution branch, the program uses the OpenGL
fragment shader only for onscreen rendering, and the effect chain calculation is
done completely by OpenGL compute shaders. In terms of compute performance,
the difference between the two approaches is not significant, since both use the same
hardware. However, the source code is simpler and more straightforward. The code
for histogram generation is more elegant than with the OpenGL fragment shader. From
OpenGL 4.3 onward, atomic counters give a huge boost to histogram calculations.</p>
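      <p>The role of atomic updates in histogram generation can be illustrated with a CPU analogue. The sketch below is neither a compute shader nor BlackRoom's code: it is plain C++ in which the iterations of an OpenMP parallel loop stand in for shader invocations that increment shared bins. The function name atomicHistogram and the 256-bin layout for 8-bit luminance values are assumptions.</p>
      <preformat><![CDATA[
```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU analogue (an assumption for illustration, not shader code): parallel
// loop iterations play the role of shader invocations adding to shared bins.
std::array<std::uint32_t, 256> atomicHistogram(const std::vector<std::uint8_t>& luminance)
{
    // A single bin must be able to hold the count of all pixels.
    assert(luminance.size() <= 0xFFFFFFFFu);

    std::array<std::uint32_t, 256> bins{};  // all bins start at zero
    std::uint32_t* counts = bins.data();
    const long long count = static_cast<long long>(luminance.size());

    #pragma omp parallel for
    for (long long i = 0; i < count; ++i) {
        // Without the atomic update, two threads hitting the same bin
        // could overwrite each other's increment.
        #pragma omp atomic
        ++counts[luminance[static_cast<std::size_t>(i)]];
    }
    return bins;
}
```
]]></preformat>
      <p>In a compute shader, the same pattern would rely on the atomic operations of the shading language instead of the OpenMP directive shown here.</p>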
    </sec>
    <sec id="sec-4">
      <title>4. Performance Comparisons</title>
      <p>Our test system contains an AMD Ryzen 5 3600 processor @ 4.35 GHz and an
Nvidia GTX 960 graphics card with 4 GB memory. The following effects were used
to compare the performance of the different execution branches: basic
modifications, edge detection, Gauss filter, infrared and grayscale effects. The
basic modifications are exposure value and brightness. To obtain
the execution times, we used the QElapsedTimer class, which measures the elapsed
time in nanoseconds. Because our algorithms run on the order of
milliseconds, the timer value is divided by 1 000 000 after readout to
yield milliseconds. The measured times in the figures represent only the
runtimes of the effect executions, without the bus transfer between the main memory
and the video card. The effects were tested with multiple images, and according to
our observations, the execution time scales linearly with the size of the image.
The results below were measured with a 4608 × 3072 PNG image (see Figure 1).
The color depth and the number of color channels do not affect the results, since
the program converts every image to single-precision floating-point format.</p>
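      <p>The measurement procedure described above can be sketched as follows. So that the example compiles without Qt, std::chrono::steady_clock stands in for QElapsedTimer (whose nsecsElapsed() function returns nanoseconds). The names Stats, timeMs and summarize are hypothetical and do not come from BlackRoom.</p>
      <preformat><![CDATA[
```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical summary of a benchmark series, all values in milliseconds.
struct Stats {
    double min;
    double med;
    double avg;
    double max;
};

// Runs `effect` once and returns its duration in milliseconds.
// std::chrono::steady_clock is a stand-in for QElapsedTimer (assumption).
template <typename Effect>
double timeMs(Effect&& effect)
{
    const auto start = std::chrono::steady_clock::now();
    effect();
    const auto stop = std::chrono::steady_clock::now();
    const auto ns =
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    return static_cast<double>(ns) / 1000000.0;  // nanoseconds -> milliseconds
}

// Reduces repeated measurements to minimum, median, average and maximum.
Stats summarize(std::vector<double> timesMs)
{
    assert(!timesMs.empty());
    std::sort(timesMs.begin(), timesMs.end());
    const std::size_t n = timesMs.size();
    const double med = (n % 2 == 1)
                           ? timesMs[n / 2]
                           : 0.5 * (timesMs[n / 2 - 1] + timesMs[n / 2]);
    const double sum = std::accumulate(timesMs.begin(), timesMs.end(), 0.0);
    return Stats{timesMs.front(), med, sum / static_cast<double>(n), timesMs.back()};
}
```
]]></preformat>
      <p>A benchmark loop would call timeMs once per run of an effect, collect the returned values in a vector, and pass that vector to summarize.</p>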
      <sec id="sec-4-1">
        <title>4.1. Context-free Algorithms</title>
        <p>The processing times of the basic modifications (exposure value and brightness),
the infrared effect and the grayscale effect were measured. According to the results (see
Figure 2), rendering on the GPU is approximately twice as fast as rendering on a
single CPU core.</p>
        <p>Utilizing multi-threading and SIMD decreases the gap but raises another
problem: the memory bandwidth limits the all-core performance of the CPU.
Furthermore, thread management also adds to the execution time. Nevertheless,
for the basic, infrared, and grayscale effects we can see an
almost 20% decrease in execution time compared to the single-thread performance.
The multi-core performance is also more consistent, as shown by the
smaller gap between the extreme values.</p>
        <p>Comparing the different GPU execution branches,
we can see a small performance advantage in favor of the OpenGL fragment
shader with the grayscale effect. The Vulkan fragment shader execution branch is
still under development, but even now it has decent performance. For the
basic and infrared effects, the Vulkan fragment shader has the
best execution time. Inconsistency is its worst drawback, since its difference between the extreme
values is the largest among the execution branches. The OpenGL compute
shader is behind the two other GPU rendering branches in terms of execution time,
but it provides very consistent performance.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Context-sensitive Algorithms</title>
        <p>The context-sensitive algorithms were represented by the edge detection and Gauss
filter effects during the benchmarks. These effects calculate each pixel based on its
neighbors. The benchmark tests show that there is quite a big difference between
these two effects in terms of performance (see Figure 3). For example, rendering
the Gauss filter profits from the extra threads of the CPU: it runs
almost eight times faster on multiple threads than on a single thread. In contrast,
edge detection runs slower with multi-threading. The reason
for this is the simplicity of the edge detection compared to the Gauss filter. In
this case, the limited memory bandwidth and the thread management
presumably increase the execution time.</p>
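        <p>To illustrate what calculating each pixel from its neighbors involves, the sketch below applies a 3 × 3 Gauss kernel to a single-channel float image. It is not BlackRoom's implementation: the function name gaussBlur3x3, the kernel size and weights (a 1-2-1 binomial kernel normalized by 16) and the clamp-to-edge border handling are assumptions.</p>
        <preformat><![CDATA[
```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative 3x3 Gauss filter on a row-major, single-channel float image.
// Kernel weights and clamp-to-edge borders are assumptions, not BlackRoom's choices.
std::vector<float> gaussBlur3x3(const std::vector<float>& src, int width, int height)
{
    assert(width > 0 && height > 0);
    assert(src.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));

    static const float kernel[3][3] = {{1.0f, 2.0f, 1.0f},
                                       {2.0f, 4.0f, 2.0f},
                                       {1.0f, 2.0f, 1.0f}};
    std::vector<float> dst(src.size());

    // Rows are independent, so the outer loop can be distributed over threads.
    #pragma omp parallel for
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    // Clamp coordinates so border pixels reuse the nearest valid neighbor.
                    const int ny = std::min(std::max(y + dy, 0), height - 1);
                    const int nx = std::min(std::max(x + dx, 0), width - 1);
                    sum += kernel[dy + 1][dx + 1] *
                           src[static_cast<std::size_t>(ny) * width + nx];
                }
            }
            dst[static_cast<std::size_t>(y) * width + x] = sum / 16.0f;
        }
    }
    return dst;
}
```
]]></preformat>
        <p>Every output pixel reads nine input pixels, so the filter performs noticeably more arithmetic per pixel than a context-free effect, which reads only one.</p>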
        <p>[Figure 3: minimum (Min), median (Med), average (Avg) and maximum (Max) execution times of the edge detection and Gauss filter effects.]</p>
        <p>The variance between the three GPU-based execution branches is greater
than in the results for the context-free algorithms. The OpenGL compute shader
falls behind both the OpenGL fragment shader and the Vulkan fragment shader. According
to our experiments, the overhead of the compute shader causes a huge difference.
Accessing the neighboring pixels in the same work group is very efficient, but getting
the pixel colors from other work groups takes too much time. The OpenGL fragment
shader provides the second best overall performance in edge detection and in Gauss
filtering. The Vulkan execution branch, which is still under development, is a
little faster. However, there is considerable room for improvement, especially in terms
of consistency.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>We implemented OpenGL compute shaders, and most of our effects are now
available in this execution path. Its performance improvement over
CPU computation is significant. However, the execution times of the context-sensitive
algorithms are long because of the pixel accesses to other work groups. Our Vulkan
implementation is mature enough to compete with our OpenGL compute shader
and OpenGL fragment shader implementations. Furthermore, its Android support
makes it useful for the BlackRoom software.</p>
      <p>Unfortunately, the theoretical speedup of the parallelization of algorithms on
either CPU or GPU cannot be achieved because the memory bandwidth is heavily
limiting the multi-core performance of both of them.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Future Work</title>
      <p>Since our Vulkan implementation can currently handle only one effect at a time,
we are planning to develop it further to calculate a whole effect chain at once.
Besides that, WebAssembly also remains in our scope, and we will provide a wider
range of our program's functionality on this platform. Thanks to the compute
shader implementations, creating histograms has become easier, so we will add a
panel to the user interface where the user can see the histogram change in real time.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Allusse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Horain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Saipriyadarshan</surname>
          </string-name>
          :
          <article-title>GpuCV: A GPU-Accelerated Framework for Image Processing and Computer Vision</article-title>
          , in: Advances in Visual Computing.
          <source>ISVC 2008. Lecture Notes in Computer Science</source>
          , Las Vegas, December,
          <year>2008</year>
          , pp.
          <fpage>430</fpage>
          -
          <lpage>439</lpage>
          , doi: http://dx.doi.org/10.1007/978-3-540-89646-3_42.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L. Z.</given-names>
            <surname>Eng</surname>
          </string-name>
          :
          <source>Qt5 C++ GUI Programming Cookbook: Practical recipes for building cross-platform GUI applications, widgets, and animations with Qt 5</source>
          , 2nd ed., Birmingham, England: Packt Publishing Ltd., March 27,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.-P.</given-names>
            <surname>Farrugia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Horain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Guehenneux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Alusse</surname>
          </string-name>
          :
          <article-title>GPUCV: A framework for image processing acceleration with graphics processors</article-title>
          , in:
          <source>IEEE International Conference on Multimedia and Expo</source>
          , Toronto, July,
          <year>2006</year>
          , pp.
          <fpage>585</fpage>
          -
          <lpage>588</lpage>
          , doi: http://dx.doi.org/10.1109/ICME.2006.262476.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Jargstorff</surname>
          </string-name>
          :
          <article-title>A Framework for Image Processing</article-title>
          , in:
          <source>GPU Gems</source>
          , ed. by R. Fernando, 1st ed., Boston: Addison-Wesley Professional, April,
          <year>2004</year>
          , chap. 27, pp.
          <fpage>445</fpage>
          -
          <lpage>467</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kubias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Deinzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kreiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Paulus</surname>
          </string-name>
          :
          <article-title>Efficient computation of histograms on the GPU</article-title>
          , in:
          <source>SCCG '07: Proceedings of the 23rd Spring Conference on Computer Graphics</source>
          , April,
          <year>2007</year>
          , pp.
          <fpage>207</fpage>
          -
          <lpage>212</lpage>
          , doi: http://dx.doi.org/10.1145/2614348.2614377.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Lazar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Penea</surname>
          </string-name>
          :
          <source>Mastering Qt 5: Create stunning cross-platform applications</source>
          , Birmingham, England: Packt Publishing Ltd., December 15,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Seiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Singhal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. K.</given-names>
            <surname>Park</surname>
          </string-name>
          :
          <article-title>Object oriented framework for real-time image processing on GPU</article-title>
          , in:
          <source>Proceedings of 2010 IEEE 17th International Conference on Image Processing</source>
          , Hong Kong, September,
          <year>2010</year>
          , pp.
          <fpage>4477</fpage>
          -
          <lpage>4480</lpage>
          , doi: http://dx.doi.org/10.1109/ICIP.2010.5651682.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Tornai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fürjes-Benke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>File</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Nyitrai</surname>
          </string-name>
          :
          <article-title>WebAssembly and Vulkan API in Image Processing Development</article-title>
          , in:
          <source>Proceedings of the 11th International Conference on Applied Informatics (ICAI)</source>
          (Eger, Hungary, Jan. 29-31,
          <year>2020</year>
          ), ed. by I. Fazekas, G. Kovásznai, T. Tómács, CEUR Workshop Proceedings 2650, Aachen,
          <year>2020</year>
          , pp.
          <fpage>382</fpage>
          -
          <lpage>391</lpage>
          , url: http://ceur-ws.org/Vol-2650/#paper39.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>