<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Framework-based Approach to Implementation of High-Performance Image Processing Library</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Evgeny V. Rusin</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computational Mathematics and Mathematical Geophysics SB RAS</institution>
          ,
          <addr-line>Novosibirsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A framework-based approach to the implementation of a high-performance image processing library is proposed. The implementation of prototype libraries for processing on a computational cluster and on a GPU is described.</p>
        <p>In recent years, remote sensing of the Earth from space has been characterized, on the one hand, by the increasing spatial resolution of satellite images and, on the other, by the use of hyperspectral surveys with a large number of spectral bands. Multispectral data composed of images obtained from several spectral channels can be hundreds of megabytes in size. While data from relatively small areas can be handled by an average desktop computer, large-scale and global geodata analysis tasks demand approaches based on high-performance technologies [1] such as parallel, cluster, and distributed computing, or computation on GPUs or Intel Xeon Phi. Today there are many libraries of image processing subroutines [2-5], some of which support modern high-performance computing facilities. All of them, however, share one limitation that is significant from our point of view: a lack of extensibility, i.e. the impossibility of adding algorithms to the library; the set of algorithms provided by a typical library is limited to those implemented by its developer. This strongly limits the use of existing libraries in research projects aimed at creating new methods and algorithms for satellite data processing. In the present work, we discuss a "framework" approach to the architecture of an image processing library that avoids this limitation.</p>
      </abstract>
      <kwd-group>
        <kwd>remote sensing</kwd>
        <kwd>image processing</kwd>
        <kwd>high-performance computing</kwd>
        <kwd>parallel computing</kwd>
        <kwd>computational cluster</kwd>
        <kwd>GPGPU</kwd>
        <kwd>library of subprograms</kwd>
        <kwd>framework</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>other; and the composition of A and B, the third, perhaps the same as the first or second). Besides, a user, as a rule,
has a general idea of how to parallelize an algorithm efficiently; for example, if the algorithm to be parallelized
is the composition of three filters with 5×5-pixel kernels, then the best parallelization is to cut the image into
strips with a six-pixel overlap. Therefore, to provide the most efficient parallelization of an algorithm, the library
allows the user to specify how the image is represented on the set of processing nodes:
– A full copy of the image on each processing node.
– Cutting the image into horizontal non-overlapping strips: each node stores "its" strip for processing.
– Cutting the image into horizontal strips that overlap by a given number of pixels.</p>
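      <p>The six-pixel figure in the example above follows from the filter radii: a 5×5 kernel reaches two pixels from its center, so three such filters applied in sequence reach six. The following is a minimal C++ sketch of that arithmetic and of cutting an image into overlapping horizontal strips. The names RequiredOverlap, Strip, and CutIntoStrips are hypothetical and are not part of the library, and the sketch assumes that the overlap is added on each side of a strip.</p>
      <preformat preformat-type="code"><![CDATA[
```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of extra rows a strip needs on each side so
// that a chain of square filters can run without fetching rows from
// neighboring nodes. A k x k kernel (k odd) reaches (k - 1) / 2 rows.
int RequiredOverlap(const std::vector<int>& kernelSizes) {
    int overlap = 0;
    for (int k : kernelSizes) {
        assert(k > 0 && k % 2 == 1);
        overlap += (k - 1) / 2;
    }
    return overlap;
}

// Rows [begin, end) are stored by a node; rows [ownBegin, ownEnd) are the
// ones whose results the node is responsible for.
struct Strip {
    int begin;
    int end;
    int ownBegin;
    int ownEnd;
};

// Hypothetical helper: cut `height` rows into `nodes` near-equal horizontal
// strips, each extended by `overlap` rows on both sides (an assumption of
// this sketch) and clamped to the image.
std::vector<Strip> CutIntoStrips(int height, int nodes, int overlap) {
    assert(nodes > 0 && height >= nodes && overlap >= 0);
    std::vector<Strip> strips;
    strips.reserve(static_cast<std::size_t>(nodes));
    for (int n = 0; n < nodes; ++n) {
        const int ownBegin = height * n / nodes;
        const int ownEnd = height * (n + 1) / nodes;
        strips.push_back({std::max(0, ownBegin - overlap),
                          std::min(height, ownEnd + overlap),
                          ownBegin, ownEnd});
    }
    return strips;
}
```
]]></preformat>
      <p>For example, RequiredOverlap({5, 5, 5}) yields 6, and CutIntoStrips(100, 4, 6) gives the second of four nodes rows 19 to 55, of which rows 25 to 49 are its own.</p>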
      <p>2. A framework architecture, which both avoids duplication of the program code implementing parallel
algorithms and makes the library extensible.</p>
      <p>3. The library code has to minimize the inevitable overhead caused by raising the abstraction level of the
computational model. For this purpose, C++ was chosen as the implementation language, since its template
mechanism makes efficient generic code possible.</p>
      <p>4. Portability of the library to a wide range of cluster computers, which is provided by the choice of C++ and
MPI (the standard for parallel programming on distributed-memory computers) as development tools.</p>
      <p>The basic elements of the library are the following classes:</p>
      <p>Image is the class implementing the image abstraction. As one of its responsibilities, it guarantees
(transparently to the user) that the distributed representation of the image data on the set of processing nodes is
consistent and up to date.</p>
      <p>NeighborhoodManipulator is the class implementing the abstraction of a manipulator for the
neighborhood of a pixel. It allows processing algorithms to be implemented independently of the concrete parameters of the image
(physical size, representation on the processing nodes, and so forth), in terms of the neighborhood of the pixel
being processed. This abstraction makes the library extensible, since new
processing functions compatible with the library can be created. The following fragment of C++ code shows the SobelFilter
class, an implementation of the well-known Sobel filter that is compatible with the library:
– Lines 3-6 define the size of the neighborhood affecting the result at a pixel; the library needs it to calculate the
result at pixels on the margins of the image and, for the "distributed" representation of the image, on the margins of strips. For
the Sobel filter, all four sizes are equal to one, as the filter takes into account only the closest neighbors.</p>
      <p>– Lines 8-23 describe the filter in terms of the processing of a single pixel. The parameter of this operation is the
neighborhood manipulator, which, by means of the PixelRelativeToCurrent(i, j) method, provides
access to the neighboring pixels (that is, those displaced by i pixels horizontally and j pixels vertically from the pixel being
processed).</p>
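      <p>As a hedged illustration of the interface described above, the following self-contained C++ sketch expresses a Sobel filter purely in terms of a neighborhood manipulator. It is not the library's listing, and the line numbers cited in the text do not refer to it. Only the names SobelFilter, NeighborhoodManipulator, and PixelRelativeToCurrent come from the text; the constructor, the Left/Right/Top/Bottom members, the border clamping, and the serial driver ApplyNeighborhoodToPixel are assumptions made for this example.</p>
      <preformat preformat-type="code"><![CDATA[
```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the library's manipulator: access to pixels relative to the
// one being processed. Out-of-image coordinates are clamped to the border
// (an assumption of this sketch).
class NeighborhoodManipulator {
public:
    NeighborhoodManipulator(const std::vector<std::uint8_t>& data,
                            int width, int height, int x, int y)
        : data_(data), width_(width), height_(height), x_(x), y_(y) {}

    // Pixel displaced by i horizontally and j vertically from the current one.
    int PixelRelativeToCurrent(int i, int j) const {
        const int x = std::min(std::max(x_ + i, 0), width_ - 1);
        const int y = std::min(std::max(y_ + j, 0), height_ - 1);
        return data_[static_cast<std::size_t>(y) * width_ + x];
    }

private:
    const std::vector<std::uint8_t>& data_;
    int width_, height_, x_, y_;
};

struct SobelFilter {
    // Neighborhood extents (hypothetical names): one pixel in each direction.
    // A distributed framework could use them to size strip margins; the
    // serial driver below does not need them.
    static constexpr int Left() { return 1; }
    static constexpr int Right() { return 1; }
    static constexpr int Top() { return 1; }
    static constexpr int Bottom() { return 1; }

    // Processing of a single pixel: gradient magnitude clamped to 0..255.
    std::uint8_t operator()(const NeighborhoodManipulator& nm) const {
        auto p = [&nm](int i, int j) { return nm.PixelRelativeToCurrent(i, j); };
        const int gx = (p(1, -1) + 2 * p(1, 0) + p(1, 1)) -
                       (p(-1, -1) + 2 * p(-1, 0) + p(-1, 1));
        const int gy = (p(-1, 1) + 2 * p(0, 1) + p(1, 1)) -
                       (p(-1, -1) + 2 * p(0, -1) + p(1, -1));
        const double magnitude = std::sqrt(static_cast<double>(gx * gx + gy * gy));
        return static_cast<std::uint8_t>(std::min(magnitude, 255.0));
    }
};

// Serial stand-in for a generic neighborhood-to-pixel operation: calls the
// processing operator once per pixel of a row-major 8-bit image.
template <class Operation>
std::vector<std::uint8_t> ApplyNeighborhoodToPixel(
        const std::vector<std::uint8_t>& src, int width, int height,
        const Operation& op) {
    assert(width > 0 && height > 0);
    assert(src.size() == static_cast<std::size_t>(width) * height);
    std::vector<std::uint8_t> dst(src.size());
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            dst[static_cast<std::size_t>(y) * width + x] =
                op(NeighborhoodManipulator(src, width, height, x, y));
    return dst;
}
```
]]></preformat>
      <p>The filter itself never sees the image size or its layout; only the driver does, which is what would allow a framework to replace the serial loop with a distributed one without touching the filter.</p>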
      <p>The following code fragment shows how the filter can be applied to an image:
1. Image im("image.jpg", PartitioningInfo(CutWithOverlap, 1));
2. im.DoNeighborhoodToPixelOperation(SobelFilter());
– Line 1 reads the image from a disk file and distributes it among the processing nodes as strips overlapping by one
pixel.</p>
      <p>– Line 2 applies to the image the generic neighborhood-to-pixel operation parameterized with an object of the class
implementing the Sobel filter. The generic operation, simultaneously for all strips, calls the processing operator
for each pixel in turn.</p>
      <p>– Generally speaking, the strip overlap specified in line 1 is not strictly required, as the code of line 2 will, if
necessary, load the missing data to each node from its neighbor nodes. However, it makes sense as an
optimization that the user of the library can perform, knowing which operations they want to apply.</p>
      <p>A distinctive and important point in using and extending the library is that the user does not need to know the MPI
model or understand the details of internode exchange in a cluster. The approach has proved efficient:
1. The performance of the library-compatible implementation of the algorithm for circle structure detection in
aerospace images [7] is about 10 percent lower than that of the implementation "from scratch" (that is, in pure C++ and
MPI, with direct access to the image data). At the same time, creating a parallel program with the
developed framework is much simpler than writing one "from scratch".</p>
      <p>2. The efficiency of the library-compatible implementation of the algorithm for circle structure detection is about 95%
when executing on 8 nodes of the NKS-30T+GPU cluster of the Siberian Supercomputer Center (SSCC) hosted at
ICM&amp;MG SB RAS [8].</p>
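      <p>For orientation, and assuming the usual definition of parallel efficiency as speedup divided by the number of nodes (the text does not state the definition used), an efficiency of about 95% on 8 nodes corresponds to a speedup of roughly 7.6 over a single node. Here T_1 and T_p denote the run times on one node and on p nodes:</p>
      <disp-formula><tex-math><![CDATA[
E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p}
\quad\Longrightarrow\quad
S(8) \approx 0.95 \cdot 8 = 7.6
]]></tex-math></disp-formula>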
    </sec>
    <sec id="sec-2">
      <title>SSCCIPGPU Library</title>
      <p>In recent years, there has been great practical interest in using modern graphics processors (Graphics Processing
Units, GPUs) as general-purpose computing devices. Generally speaking, a GPU is oriented toward the efficient solution of
computer graphics tasks; in particular, it contains hardware that performs bulk computations (the same
operations over a large volume of data) efficiently, with a throughput of hundreds of gigaflops. These
capabilities make it possible to use GPUs in tasks that are unrelated to visualization but are also based on bulk computations,
for example in image processing and analysis. In a number of practical problems, GPU computation has provided a 70-fold
speedup over CPU computation, which corresponds to performance typical of supercomputers. The
concept of general-purpose computation on GPUs has also received the support of GPU vendors (e.g. the CUDA technology
from NVIDIA), which makes it possible to write GPU programs in high-level languages without knowledge
of the coprocessor architecture. This fact, together with the low cost of modern GPUs, makes them popular equipment in
modern supercomputer centers; thus, the main computing resource of the SSCC SB RAS at the moment is the hybrid
cluster NKS-30T+GPU, which includes 40 nodes, each equipped with three NVIDIA Tesla M2090 GPUs with the
Fermi architecture (compute capability 2.0), 512 cores, and 6 GB of GDDR5 memory. All of this makes it important
to develop software that involves GPUs in the processing and analysis of remote sensing
data.</p>
      <p>
        The principles on which the development of the SSCCIPGPU [
        <xref ref-type="bibr" rid="ref1">9</xref>
        ] library was based have much in common with the
principles stated above, on which the earlier ParImProLib library was built. We add only that:
1. C++ and CUDA were chosen as the development tools, which made the library portable and made it possible to
implement the extensible framework architecture efficiently.
      </p>
      <p>2. The multilevel hierarchy of GPU memory makes it desirable to be able to choose where images and
algorithm parameters are stored: in global, texture, or constant GPU memory.</p>
      <p>3. The complexity of the GPU programming model makes it desirable to be able to choose
between the synchronous and asynchronous CUDA APIs, and also to choose how computations are distributed
among several GPUs of a single computing node (multi-GPU processing).</p>
      <p>The following code fragment shows the SobelFilter_GPU class, a GPU implementation of the Sobel
filter:</p>
      <p>__host__ ParamsForDevice ExportParamsForDevice() const { return ParamsForDevice(); }
– The implementation of the class for the GPU is very similar to the corresponding implementation for the cluster;
the difference is the use of CUDA qualifiers (__host__ for CPU code, __device__ for GPU code).</p>
      <p>– Line 7 defines the ExportParamsForDevice method, which serializes the algorithm parameters (translates them into a block of
memory). The framework calls this method when copying the algorithm parameters to
GPU memory. For the Sobel filter, the implementation of the method is trivial, as the filter has no
parameters.</p>
      <p>– Line 10 adds, to the neighborhood manipulator parameter of the single-pixel processing operator described above,
a params parameter, which is the serialized form of the algorithm parameters (a memory block) copied by the framework
to GPU memory.</p>
      <p>– The implementation of the processing operator (lines 10-25), as a rule, deserializes (restores) the algorithm
parameters from the params memory block (this step is absent for the trivial Sobel filter) and processes a
single pixel using the neighborhood manipulator.</p>
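      <p>The serialize, copy, and deserialize cycle described above can be sketched in plain C++ that runs on the host only. This is an illustration under stated assumptions, not the library's code: ThresholdFilter_GPU is a hypothetical filter chosen because, unlike the Sobel filter, it has a parameter to serialize; Manipulator and RunWithSerializedParams are stand-ins invented for the example; and the CUDA qualifiers appear only as comments so that the block compiles without the CUDA toolkit. Only the names ExportParamsForDevice, ParamsForDevice, params, and PixelRelativeToCurrent come from the text.</p>
      <preformat preformat-type="code"><![CDATA[
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Minimal stand-in for the neighborhood manipulator (no border handling;
// the caller must keep (x + i, y + j) inside the image).
struct Manipulator {
    const std::uint8_t* data;
    int width, x, y;
    int PixelRelativeToCurrent(int i, int j) const {
        return data[static_cast<std::size_t>(y + j) * width + (x + i)];
    }
};

// Hypothetical filter with one parameter, so serialization is not trivial.
class ThresholdFilter_GPU {
public:
    // Plain data that can be copied as a raw block of memory.
    struct ParamsForDevice { int threshold; };

    explicit ThresholdFilter_GPU(int threshold) : threshold_(threshold) {}

    // Serialization step; would carry __host__ in CUDA code.
    /* __host__ */ ParamsForDevice ExportParamsForDevice() const {
        return ParamsForDevice{threshold_};
    }

    // Single-pixel operator; would carry __device__ in CUDA code.
    /* __device__ */ std::uint8_t operator()(const Manipulator& nm,
                                             const void* params) const {
        ParamsForDevice p;
        std::memcpy(&p, params, sizeof p);  // deserialize the memory block
        return nm.PixelRelativeToCurrent(0, 0) >= p.threshold ? 255 : 0;
    }

private:
    int threshold_;
};

// Host-only stand-in for the framework: serialize the parameters once, copy
// them into a byte buffer (where a real framework would copy them to GPU
// memory, e.g. with cudaMemcpy), then call the operator for every pixel.
template <class Filter>
std::vector<std::uint8_t> RunWithSerializedParams(
        const std::vector<std::uint8_t>& src, int width, int height,
        const Filter& filter) {
    using Params = typename Filter::ParamsForDevice;
    static_assert(std::is_trivially_copyable<Params>::value,
                  "parameters must be copyable as a raw memory block");
    assert(width > 0 && height > 0);
    assert(src.size() == static_cast<std::size_t>(width) * height);

    const Params params = filter.ExportParamsForDevice();
    std::vector<unsigned char> block(sizeof(Params));
    std::memcpy(block.data(), &params, sizeof(Params));

    std::vector<std::uint8_t> dst(src.size());
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            dst[static_cast<std::size_t>(y) * width + x] =
                filter(Manipulator{src.data(), width, x, y}, block.data());
    return dst;
}
```
]]></preformat>
      <p>Requiring the parameter type to be trivially copyable is what makes a byte-wise copy between host and device memory meaningful; a filter without parameters, such as the Sobel filter, would simply export an empty structure.</p>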
      <p>The following code fragment shows how the filter can be applied to an image:
1. Image im("image.jpg");
2. im.DoNeighborhoodToPixelOperation&lt;Global1D, Synchronous&gt;(SobelFilter_GPU());
This is very similar to the corresponding code for the cluster, except that line 2 specifies "one-dimensional" access
to the image data in global GPU memory and the use of the synchronous CUDA API for processing. Again, we
emphasize that the user of the library does not need deep knowledge of GPU programming to use and extend the
library.</p>
      <p>A similar approach was also used to implement multi-GPU processing. Here, the user has to specify the
strategy for parallelizing the computation across several graphics accelerators (OS multithreading versus
switching the active GPU within a single OS thread).</p>
      <p>The following results indicate the efficiency of the approach:
1. The performance of the library-compatible implementation of the algorithm for circle structure detection is the
same as that of the implementation "from scratch" (that is, in pure C++ and CUDA). At the same time, creating a
parallel program with the developed framework is much simpler than writing one "from scratch".</p>
      <p>2. The library-compatible implementation of the algorithm for circle structure detection, when running on a single
NVIDIA Tesla M2090 GPU of the NKS-30T+GPU cluster, is about 90 times faster than a similar CPU implementation
running on a single Intel Xeon X5670 (2.93 GHz) processor of the cluster.</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>
        The results obtained demonstrate that the framework approach to the implementation of high-performance image
processing libraries is promising. Currently, the prototype libraries are being used in the development of SSCCIP [
        <xref ref-type="bibr" rid="ref2">10</xref>
        ], an experimental framework for high-performance image processing that is part of an experimental cloud framework [
        <xref ref-type="bibr" rid="ref3">11</xref>
        ]. The
authors further plan to extend the approach to image processing on Intel Xeon Phi processors,
with which the new NKS-1P cluster of SSCC SB RAS is equipped.
      </p>
      <p>Acknowledgements. This work was conducted within the framework of the budget project 0315-2016-0003 for
ICM&amp;MG SB RAS with the supercomputer facilities provided by the Siberian Supercomputer Center SB RAS.</p>
      <p>[1] Buchnev A., Pyatkin V., Rusin E.V. Software Technologies for Processing of Earth Remote Sensing Data // Pattern Recognition and Image Analysis. 2013. Vol. 23. No. 4. P. 474-480.</p>
      <p>[2] OpenCV. https://opencv.org.</p>
      <p>[3] TensorFlow. https://www.tensorflow.org.</p>
      <p>[4] Computer Vision Toolbox. https://www.mathworks.com/help/vision.</p>
      <p>[5] Intel® Integrated Performance Primitives (Intel® IPP). https://software.intel.com/intel-ipp.</p>
      <p>[6] Rusin E.V. Object-Oriented Parallel Image Processing Library // Parallel Computing Technologies. PaCT 2009. Lecture Notes in Computer Science. 2009. Vol. 5698. P. 344-349.</p>
      <p>[7] Alekseev A.S., Pyatkin V.P., Salov G.I. Crater Detection in Aero-space Imagery Using Simple Nonparametric Statistical Tests // Lecture Notes in Computer Science. 1993. Vol. 179. P. 793-799.</p>
      <p>[8] SSCC SB RAS. http://www.sscc.icmmg.nsc.ru/hardware.html.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Rusin</surname>
            <given-names>E.V.</given-names>
          </string-name>
          <article-title>Tehnologii obrabotki dannyh distancionnogo zondirovanija Zemli na gibridnom klastere NKS30T+GPU [Technologies for processing Earth remote sensing data on the NKS-30T+GPU hybrid cluster]</article-title>
          // <source>Interekspo Geo-Sibir'</source>.
          <year>2016</year>
          . Vol.
          <volume>4</volume>
          . No. 1. P.
          <fpage>46</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Rusin</surname>
            <given-names>E.V.</given-names>
          </string-name>
          <article-title>SSCCIP - A Framework for Building Distributed High-Performance Image Processing Technologies</article-title>
          // <source>Parallel Computing Technologies. PaCT 2011. Lecture Notes in Computer Science</source>
          .
          <year>2011</year>
          . Vol.
          <volume>6873</volume>
          . P.
          <fpage>467</fpage>
          -
          <lpage>472</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Buchnev</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pyatkin</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pyatkin</surname>
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rusin</surname>
            <given-names>E.</given-names>
          </string-name>
          <article-title>Framework of cloud web services for processing remote sensing data</article-title>
          // <source>E3S Web of Conferences</source>.
          <year>2019</year>
          . Vol.
          <volume>75</volume>
          . Paper 03001.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>