<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A SIMD-based Approach to the Enhancement of Convolution Operation Performance</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Bogolyubov Institute for Theoretical Physics</institution>
          ,
          <addr-line>Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Aviation University</institution>
          ,
          <addr-line>Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Optimization of two-dimensional convolution by means of 16-bit SIMD technologies (ARM NEON) is considered. It is shown that by utilizing 16-bit SIMD NEON and inline assembler one can achieve a significant increase in performance compared to the analogous functions of the OpenCV and ARM Compute (ACL) libraries. Throughout the research, filter coefficients were quantized to match the 8-bit range.</p>
      </abstract>
      <kwd-group>
        <kwd>Convolution</kwd>
        <kwd>vectorization</kwd>
        <kwd>SIMD</kwd>
        <kwd>optimization</kwd>
        <kwd>CPU</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>where i  a,,W  (r  a) 1, j  a,, H  (c  a) 1 are indexing pixels of the
destination image p ; W and H are width and height of the source P and
destination p images (we neglect border effects in the destination at the moment),  is the
kernel of the convolution (matrix r  c ), and a , a - so-called “anchors” that define
relative position of a filtered point within the kernel.</p>
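      <p>As an illustration of (1), the operation can be sketched as a plain scalar routine in C++ (a minimal sketch: the function name, the row-major layout, and copying the source into the untouched border pixels are our assumptions, not the paper's):</p>

```cpp
#include <cassert>
#include <vector>

// Scalar reference for the convolution (1): each destination pixel inside the
// region where the whole r x c window fits accumulates kernel * source values;
// (ar, ac) is the anchor, i.e. the position of the filtered point in the window.
std::vector<int> convolve2d(const std::vector<int>& src, int W, int H,
                            const std::vector<int>& kernel, int r, int c,
                            int ar, int ac) {
    std::vector<int> dst(src);  // border effects neglected: borders keep source values
    for (int j = ac; j <= H - c + ac; ++j) {       // rows where the window fits
        for (int i = ar; i <= W - r + ar; ++i) {   // columns where the window fits
            int acc = 0;
            for (int v = 0; v < c; ++v)
                for (int u = 0; u < r; ++u)
                    acc += kernel[v * r + u] * src[(j + v - ac) * W + (i + u - ar)];
            dst[j * W + i] = acc;
        }
    }
    return dst;
}
```

      <p>With a 3×3 identity kernel anchored at its center the routine reproduces the source, while a box kernel sums the whole window.</p>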
      <p>
        Equation (1) is rather general and perfectly compatible with the cv::filter2D(...)
function of the OpenCV library [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and the ARM Compute Library (ACL), but in what follows we will
give our attention to square kernels, i.e. c = r; thus from now on we presume the
kernel to be square without special mention.
      </p>
      <p>The good part is that every pixel of the destination can, at least in principle, be
calculated simultaneously: the task is parallelizable. Parallelizable problems are of such
importance that hardware developers have created a set of parallel computing platforms
(PCPs): Nvidia CUDA, ATI Stream Technology (ATI-ST), etc., all accessible through
the OpenCL API. Every PCP developer provides a software toolkit to interact with the
PCP: a programming language with C-like syntax, additional modules, frameworks, etc.
The basis for PCPs are modern GPUs, which are able to perform nearly any
parallelizable task in parallel. For example, at the moment the GeForce GTX 1080 Ti is very popular
for CNN training and significantly accelerates the process. One should notice that the shader
blocks of a GPU are similar to the mobile CPUs with RISC architecture that are used in
modern smartphones.</p>
      <p>In conclusion, improvement of CO performance positively affects nearly any
software on any platform due to the wide variety of tasks it is involved in: DI filtering
(sharpening, edge detection, blurring, etc.), DI scaling, CNN training, multimedia,
etc. It is worth noting that CO (1) is the basis of convolutional neural network
(CNN) functioning. Therefore, by accelerating CO we achieve higher CNN performance
and decrease its training time. Moreover, in some sense CO and similar tasks have
shaped modern approaches to GPU architecture, which emphasizes their importance.</p>
      <p>In the current contribution we propose a new method of CO optimization that utilizes
SIMD. Since SIMD can be applied to integer-valued kernels only, we will
demonstrate a reduction method for real-valued kernels that allows this technique to be
applied. Besides, we provide an experimental comparison of hand-written code based on
this approach with other recognized solutions (the OpenCV library and the ARM
Compute Library). The rest of the paper is organized as follows. First, we
consider SIMD pros and cons, and in subsection II-B we introduce the reduction method
itself.</p>
    </sec>
    <sec id="sec-2">
      <title>A brief overview of modern software optimization</title>
      <p>We will perform the overview in a “bottom to top” style: we consider hardware first,
then software, and then algorithmic methods of performance enhancement.</p>
      <sec id="sec-2-1">
        <title>2.1 Acceleration by means of hardware</title>
        <p>
          Well-chosen hardware architecture may significantly enhance software product
performance. A general perspective on the possible options is given by Flynn’s
taxonomy [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], and the first question to answer is whether the task (e.g. CO) allows
parallelization of the data flow or the instruction flow. Today SIMD principles are implemented in
both RISC (e.g. Cortex-A8-23 and Cortex-A53-72 ARM CPUs) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and CISC (e.g.
the Intel x86 series) CPUs. One can access this feature through specific extensions of the
assembly language: NEON for the RISC architecture, while the CISC architecture implies SSEn
and AVX1/2 usage. In contrast, MIMD principles are not implemented in modern
CPUs, but are partially supported by GPUs.
        </p>
        <p>Modern GPUs provide parallel computing features by means of shader-block CPUs
(SCPUs), based on the RISC architecture, that are used in parallel. Principles of MIMD
are achieved via SIMD/MIMD-like instructions for SCPUs: vector instructions for
numerous 128/256/...-bit registers (there are 32 or more such registers, which is more than
modern CPUs have). The bad part is that you cannot access the SIMD/MIMD
instructions directly; only a small number of intrinsics and pre-implemented
operations are accessible: bit shifts, binary logic, etc. In most cases programmers use specific
frameworks to access the mentioned features, e.g. CUDA and OpenCL for GPUs, or
OpenCL for CPU/DSP/FPGA. The prevalence of these frameworks has led most GPU
manufacturers to get certified by AMD/ATI (compatibility with OpenCL) or Intel/NVidia
(compatibility with CUDA).</p>
        <p>
          Besides GPUs, one can employ co-processor units, e.g. a Digital Signal
Processor (DSP). For example, Qualcomm has developed the Hexagon DSP for embedding into
Snapdragon-625/635/835/825 CPUs. This DSP provides a very long instruction word
(VLIW), which means multithreading at the assembly level: during one clock cycle
3 assembly instructions with different inputs are processed. Compared to plain
SIMD (NEON32 or NEON64) its performance is 4 times higher. Algorithms
optimized for the DSP reduce CPU load by up to ∼75% and improve audio/video
encoding/decoding performance by up to ∼18 times [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Optimization by means of software</title>
        <p>
          The software we use to produce binary code (e.g. the compiler itself, additional libraries,
frameworks) highly influences program performance by employing different
optimizations and hardware platform capabilities. Within the scope of the current article we are mostly
concerned with their ability to perform vectorization without significant loss of
precision. Let us consider three well-known compilers: the GNU Compiler Collection
(GCC/G++) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], Clang [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], and nvcc (compiles cu-files for CUDA).
        </p>
        <p>
          Probably the most popular nowadays is GCC, developed and supported by the FSF
community. GCC, first developed by Richard Stallman, is actually a whole
collection of compilers suitable for different programming languages and architectures. Its
main competitor is the ``rising star'' of compilers, Clang. For example, Apple already
uses it as the base compiler for its products. Clang itself is a frontend for different
programming languages, e.g. C, C++, Objective-C, Objective-C++, and OpenCL. The
actual generation of binary code and vectorization is performed by the LLVM
framework. Both GCC and Clang are performance-oriented, but they still fall short of
hand-written assembly code (see the comparison [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] of Clang ndk-r14b vs. inline assembly
on an Android 5.1.1 (x64) phone with an MT6752 CPU).
        </p>
        <p>The last compiler we want to mention is nvcc. It utilizes CUDA and thus allows
significant improvement of performance on platforms with an NVidia GPU. As we
can see, the mentioned compilers and technologies introduce large heterogeneity to the
field of program optimization. In response, the OpenCL standard was developed (by The
Khronos Group Inc.); it is supported by all the mentioned hardware developers and
provides access to parallel computations on GPU/DSP/CPU.</p>
        <p>For all their advantages, PCPs have a drawback: a big overhead on
transferring data. To mitigate the problem, programmers organize data into pools of ∼100-200
items, which allows a 20-fold increase in performance compared to the CPU. But using
big pools is not always a solution: while CNN training fits this model perfectly,
processing a stream from a video camera does not at all.</p>
        <p>
          Besides a good choice of compiler, one can achieve performance enhancement
by using optimized binary code of frequently used functions (CO, scaling, etc.) supplied
by various libraries. Many of them contain SIMD-optimized code for armeabi-v7a
and arm64-v8a. Besides, a collection of libraries can be combined into a single
framework in such a way that the advantages of one library compensate for the drawbacks of the
others. OpenCV and ACL [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] are good examples of libraries comprising a wide
variety of algorithms, including DI processing, DI analysis, and even a module for
CNN training, that are optimized for different CPU architectures and their SIMD extensions:
AVX1/2, SSE4.x, ARM NEON x32/x64. OpenCV is well known and of high quality,
but ACL has better extensibility due to its modular architecture and seems to perform
better on CO-like tasks (e.g. it is up to 14× faster than OpenCV on CO
for CNNs in one thread [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]). Thus, further we will use both of them as a
reference for comparison.
        </p>
        <p>At the moment SIMD optimization has spread over a wide range of software
products, both proprietary and open source. For example, the kernel of Windows 10 uses
AVX1/2 to achieve better performance (obviously, this influences the whole system),
while the Oracle Java VM utilizes AVX1/2/3DNow!, and thus any Java application runs
faster. The id Tech 2-4 game engines (e.g. the one used in Quake III Arena) are a good
example of open-source projects with SIMD optimization. But, using SIMD, they all
face the issue of translating floating-point code to fixed-point with an acceptable loss of
precision. This can be quite complicated, thus the SIMD optimizations used in proprietary
software are mostly not disclosed.</p>
        <p>
          One more technique to mention is so-called loop unrolling and tiling [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ],
[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] that allows avoidance of redundant comparison operations (e.g. &lt;,&gt;) in cost
of slightly enlargening the code. It is mostly performed by means of compiler or by
introducing appropriate assembly inline-code into the application. Some libraries like
ACL may take advantage of high-level programming language features (e.g.
templates in C++) to perform loop unrolling. A simplified ACL-style code is provided in
listing to demonstrate example implementation (see Помилка! Джерело
посилання не знайдено.).
        </p>
        <p>
          Passing appropriate parameters to do_unroll&lt;...&gt;::run(...) (see Fig. 1), one may
call the function f(...) baseStep×unrollDelta+restSteps times, avoiding unrollDelta−1
counter comparisons and decrements per block. The achieved performance improvement is
discussed in more detail in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
Instead of the rather general considerations above, let us now focus on CO. The
main obstruction to SIMD optimization is translating floating-point CO to
fixed-point with an acceptable loss of precision. First of all, SIMD operations are performed on
integers only.
        </p>
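        <p>The template-based unrolling referred to above might look roughly as follows (our sketch, not ACL's actual code; the names do_unroll, baseStep, unrollDelta, and restSteps follow the text):</p>

```cpp
#include <cassert>
#include <cstddef>

// Compile-time unrolling: do_unroll<N>::run(f) expands into N calls of f()
// with no runtime counter comparisons or decrements.
template <std::size_t N>
struct do_unroll {
    template <typename F>
    static void run(F&& f) {
        f();
        do_unroll<N - 1>::run(f);  // recursion is resolved at compile time
    }
};

template <>
struct do_unroll<0> {
    template <typename F>
    static void run(F&&) {}  // recursion terminator
};

// Driver: baseStep unrolled blocks of unrollDelta calls, then restSteps
// leftover calls, i.e. baseStep * unrollDelta + restSteps calls in total.
template <std::size_t unrollDelta, typename F>
void unrolled_loop(std::size_t baseStep, std::size_t restSteps, F&& f) {
    for (std::size_t b = 0; b < baseStep; ++b)
        do_unroll<unrollDelta>::run(f);
    for (std::size_t rest = 0; rest < restSteps; ++rest)
        f();
}
```

        <p>The unrolled block compiles down to straight-line calls, which is the effect the text attributes to the ACL listing.</p>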
        <p>Thus we should represent the elements of the kernel Ω from (1) in a suitable form:
ω_{i,j} = λ · ω̃_{i,j},  ω̃_{i,j} ∈ ℤ,  (2)
where λ is a coefficient for normalization. Now we can perform the most
resource-demanding part (additions and multiplications) in a SIMD style and afterwards
normalize the result.</p>
        <p>Any kernel can be represented in form (2), but the more precise a result we want, the
more digits ω̃_{i,j} should have. At this point we meet the limitations of the platform on which
we intend to run the program. Thus, we should set some constraints on ω̃ to avoid
overflow when doing CO.</p>
        <p>Suppose every pixel in the original image is represented as a byte and thus takes
8-bit values 0, …, 255. The same range is taken by the kernel elements ω̃_{i,j}.
Intermediate results are stored as 16-bit signed or unsigned values. To guarantee that no
overflow occurs, we should make sure that it does not occur at any step of the algorithm.
If the kernel has positive elements only, the condition we need looks as follows:
(2^8 − 1) · Σ_{i=0}^{r} Σ_{j=0}^{r} ω̃_{i,j} ≤ 2^16 − 1.  (3)
Substantially, this means that even the largest possible inputs from the image do not lead
to overflow.</p>
        <p>If the kernel contains negative elements, the condition would be much more complicated
and would depend on the order of additions when doing CO. Instead, we will use the much
stronger but easier to check condition
(2^8 − 1) · Σ_{i=0}^{r} Σ_{j=0}^{r} |ω̃_{i,j}| ≤ 2^{16−1} − 1,  (4)
which is independent of the order of operations. This condition can be slightly relaxed:
we can use it for the positive and negative entries of the kernel ω̃ separately. One last
thing to mention: one can easily obtain similar results for signed/unsigned 32-bit
intermediate values by substituting 16 → 32 in (3) and (4).</p>
        <p>What we propose is selecting, for a given Ω, the biggest λ possible such that ω̃ still
satisfies (3) or (4) (which one depends on whether the kernel is purely positive or not). Of
course, we should be concerned whether there exist any useful kernels that can be
reduced to a suitable form. And it seems there are plenty of them.</p>
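        <p>One possible reading of this reduction, sketched in C++ (the names are ours; truncation toward zero is used so that the bound of (3) or (4) is guaranteed to hold after quantization):</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedKernel {
    std::vector<int16_t> coeffs;  // integer kernel, the omega-tilde of (2)
    double lambda;                // normalization coefficient of (2)
};

// Pick the largest integer scale s = 1/lambda such that 255 * sum|trunc(w*s)|
// stays within the 16-bit accumulator: 2^16 - 1 unsigned, 2^15 - 1 signed.
QuantizedKernel quantize(const std::vector<double>& w, bool has_negative) {
    const double budget = (has_negative ? 32767.0 : 65535.0) / 255.0;
    double abs_sum = 0.0;
    for (double v : w) abs_sum += std::fabs(v);
    double s = std::floor(budget / abs_sum);
    if (s < 1.0) s = 1.0;  // degenerate case: kernel too large to represent
    QuantizedKernel q;
    q.lambda = 1.0 / s;
    for (double v : w)  // trunc keeps sum|coeffs| <= s * abs_sum <= budget
        q.coeffs.push_back(static_cast<int16_t>(std::trunc(v * s)));
    return q;
}
```

        <p>E.g. the smoothing kernel (0.25, 0.5, 0.25) quantizes to (64, 128, 64) with λ = 1/257, and condition (3) holds by construction.</p>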
        <p>In conclusion, modern hardware provides mechanisms for vectorization, i.e. SIMD
technologies, that can be used by programmers to enhance application
performance. In most cases this technology is utilized by the compiler to generate binary code
without the programmer's participation. A suitable choice of library may be handy as well:
many libraries contain SIMD-optimized code. But in some cases human intervention
is needed to get the best result. When developing assembly code, one should
represent the function in a form suitable for SIMD optimization. This is not always possible, and
often restrictions (3) and (4) should be satisfied. In the next section we will provide a new
method of CO optimization and then compare it with existing solutions, e.g. OpenCV
and ARM CL.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Optimization of Convolution Operation by means of SIMD</title>
      <p>In the current contribution we propose a new method of Convolution Operation (CO)
optimization based on the SIMD technique. We presume the target kernel satisfies condition
(3). In this section we will provide all the necessary considerations and the assembly code
that illustrates the proposed approach. The next section will be devoted to an experimental
comparison of this method's performance against known CO implementations (OpenCV and
ARM CL).</p>
      <p>Regarding condition (4), the provided code need only be slightly modified. We will
avoid redundant listings and provide code considering only condition (3), while at the
end of the section all the necessary modifications for condition (4) will be described.
We start with a basic implementation of CO (see Fig. 2). It contains no specific
optimizations, but is still a good starting point for our considerations.</p>
      <p>Here qn are ARM NEON registers; regarding syntax and instruction order we
strictly follow the ARM reference manuals. For the sake of simplicity we avoided
normalization by the coefficient λ in Fig. 2, but for completeness we provide it as a
separate listing (see Fig. 3).</p>
      <p>In Fig. 3 we suppose the data for normalization to be stored in registers q12 …
q15, while d3 contains the normalization coefficient λ. The presented code is in some sense
multipurpose and may be used with different CO implementations.</p>
      <p>Now we switch gears to the CO optimization itself. In Fig. 2 we provided
an initial version of this operation in assembly code. But it has one significant
drawback: slow data loading. The following listing (see Fig. 4) avoids this problem by using
one of the registers as a buffer. It is known that simultaneous loading of 16 bytes is
quicker than loading them one by one (approximately 10 and 40 cycles, respectively).
Thus we use one register for preloading extra data and then use this data byte by byte
without redundant load operations.
The main feature of the presented approach (see Fig. 4) is the usage of a cyclic shift (i.e.
vext.8 q0,q0,q1,#1) that allows kernel buffering, so fewer loading
operations are needed (for more details please see the comments in Fig. 4). It is worth noting that the
code in Fig. 4 requires the kernel to contain no more than 16 elements in one row. If
we need kernels with more than 16 elements in a row, the listing need only be
slightly modified.</p>
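      <p>The effect of the cyclic-shift instruction can be modeled in scalar C++ for illustration (our sketch; real code uses the NEON instruction or intrinsics):</p>

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar model of "vext.8 qd, qn, qm, #1": concatenate two 16-byte registers
// qn:qm and extract 16 bytes starting at byte 1, i.e. slide the window by one
// byte without issuing a new load.
std::array<uint8_t, 16> vext1(const std::array<uint8_t, 16>& qn,
                              const std::array<uint8_t, 16>& qm) {
    std::array<uint8_t, 16> out{};
    for (int i = 0; i < 16; ++i)
        out[i] = (i < 15) ? qn[i + 1] : qm[0];
    return out;
}
```

      <p>Loading 16 bytes once and sliding the window this way replaces up to 15 one-byte loads, which is where the cycle savings come from.</p>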
      <p>As we mentioned earlier, this code works for kernels satisfying condition (3). To
make it applicable to kernels satisfying (4) we need to change all vmlal.u8/u16
operations to vmlal.s8/s16. These small but crucial changes transform the code of Fig. 4 into code
capable of working with signed integer kernels. Depending on the given kernel, one can
choose between these two options.</p>
      <p>
        In conclusion, we found a class of kernels that allows significant optimization of CO
by means of SIMD and were able to implement appropriate code combining
loop-unrolling approaches and the method from [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. By exploiting the significant difference in
time between simultaneous 16-byte loading and one-by-one loading, we were able
to achieve a significant speedup of CO. More detailed results and a description of the
measurement procedure will be presented in the following section.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Experimental setup and Results</title>
      <p>Ground truth. To evaluate our results a certain reference is needed. As such we chose
the functions cv::filter2D(...) from the OpenCV library and NEConvolution{N}x{N}::run()
from the ACL library. The latter is well known among AI and DIP researchers due to its
high-quality, optimized code.</p>
      <p>For comparison we used the latest stable tags available at the moment we started the
research; the release tags are 4.0.0 (2018-11-18 [11:08:36]) for OpenCV and v19.02
(2019-02-28 [14:25:18]) for ACL. Compilation was performed with NDK-r18b, the latest
stable NDK at that moment, to ensure API compatibility between them. We ensured that the
libraries utilize vectorization by compiling them with the flags
ANDROID_ABI=armeabi-v7a with NEON, ANDROID_NATIVE_API_LEVEL=22,
CPU_BASELINE/CPU_BASELINE_FINAL=NEON,
CPU_BASELINE_FLAGS=-mfpu=neon -O3 -DNDEBUG. Both OpenCV and ACL were linked as static libraries.
Devices. To make our measurements more relevant we used a set of different devices.
This helps us to understand the influence of architecture, CPU series, and other
parameters on the execution time. The following table lists the devices we used and the
parameters of their CPUs.
Measurement procedure. The pivotal parameter we need to measure is the execution
time of each function. Such measurement might be tricky, since it is highly
susceptible to transient processes in Android OS. To avoid this problem we used the
following procedure: each function (cv::filter2D(...), NEConvolution{N}x{N}::run(), and
newCO(...)) was successively called 3 times (for robustness and to simulate RGB
processing) and the result was stored to an array. After collecting 35 data points we
calculated the median value and treated it as thrice the execution time of the function under
consideration.</p>
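      <p>The timing procedure can be sketched as follows (a simplified sketch; the clock choice and the names are our assumptions):</p>

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Measure a function as described in the text: call it calls_per_sample times
// per sample (simulating the three RGB planes), collect `samples` timings, and
// take the median to suppress OS-induced transients; divide to get per-call time.
double median_runtime_ns(const std::function<void()>& fn,
                         int samples = 35, int calls_per_sample = 3) {
    std::vector<double> t;
    for (int s = 0; s < samples; ++s) {
        auto start = std::chrono::steady_clock::now();
        for (int c = 0; c < calls_per_sample; ++c) fn();
        auto stop = std::chrono::steady_clock::now();
        t.push_back(std::chrono::duration<double, std::nano>(stop - start).count());
    }
    std::nth_element(t.begin(), t.begin() + t.size() / 2, t.end());
    return t[t.size() / 2] / calls_per_sample;
}
```

      <p>The median is preferred over the mean here because a single OS-induced stall would skew an average but leaves the median intact.</p>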
      <p>Kernel sizes varied over 2×2, 3×3, …, 15×15 for experiments with our implementation
and cv::filter2D(...), while the implementation of NEConvolution{N}x{N}::run()
necessitates the usage of odd-sized kernels only, e.g. 3×3, 5×5, etc. Digital images (DIs) were
generated with equal width and height according to the formula
W_image = H_image = [125·n/8]·32 + W_kernel − 1, where square brackets […] denote the integer
part of the number. Results are further presented in the form of fractions: cv::filter2D(...)
execution time divided by the execution time of our implementation, and
NEConvolution{N}x{N}::run() execution time divided by the execution time of our implementation.
Results. First we compared the time consumption of our code (see Fig. 4) and the reference
function cv::filter2D(...); the result is presented in figures 5a and 5b.</p>
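      <p>Under our reading of the sizing formula above (the reconstruction is ours), the generated side lengths can be sketched as:</p>

```cpp
#include <cassert>

// Side length of the generated square test images, reading the text's formula
// as W = H = floor(125 * n / 8) * 32 + W_kernel - 1 (integer division = floor).
int image_side(int n, int kernel_side) {
    return (125 * n / 8) * 32 + kernel_side - 1;
}
```

      <p>This reading is consistent with the images of up to roughly 4500×4500 pixels used in the experiments.</p>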
      <p>As coordinates we use the sizes of the kernel and the image, while color intensity designates
the acceleration one may achieve using the proposed method instead of the reference one
(i.e. the fraction of execution times: reference function over proposed).</p>
      <p>
        Although the presented results demonstrate the advantage of the proposed method, there is
still room for improvement. It seems the compiler is unable to unroll cycles effectively
on its own; one may check this by compiling the presented code and exploring the binary
with any suitable disassembler (e.g. IDA or the objdump tool). Thus, we may achieve
an additional 30%-40% of acceleration by utilizing the techniques of [
        <xref ref-type="bibr" rid="ref12 ref13">12-13</xref>
        ].
      </p>
      <p>
        Results for the modified code are shown in figures 5c-5f. We compared the time
consumption of the code of Fig. 4, modified with the approaches of [
        <xref ref-type="bibr" rid="ref12 ref13">12-13</xref>
        ], against both reference
functions (cv::filter2D(...) and NEConvolution{N}x{N}::run()). Besides, we varied
image sizes up to 4500×4500 (∼20 MP) to emulate modern cameras.
      </p>
      <p>As fig. 5 suggests, the acceleration is (almost) independent of the input size, i.e. the
complexity (big-O) of our solution and of the reference solutions coincide. Some small decline
in acceleration (though it still remains greater than 1) may be noted for big kernels (9×9 to
15×15). The mean acceleration is estimated as approximately 1.7 times.</p>
      <p>
        It is worth noting that we did not use parallelism for acceleration. Employing OpenMP
or implementing parallelism by any other means may improve the presented results by a factor of
two or more. Moreover, no preprocessing, e.g. image tiling, was performed.
This technique would probably increase the performance of the approach as well [
        <xref ref-type="bibr" rid="ref16 ref17">16-17</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In conclusion, we propose a method of convolution operation (CO) acceleration. We
show that many kernels utilized in practical applications can be reduced to integer
form (table I), which allows for the use of SIMD optimization. Although SIMD itself leads to a
significant boost in performance, we were able to push the frontier even further by
exploiting the significant difference in time between simultaneous 16-byte loading
(approximately 10 cycles) and one-by-one loading (approximately 40 cycles):
the q2 register is used as a buffer and loading operations are partially substituted with a
cyclic shift (see Fig. 4).</p>
      <p>To test the approach we performed a comparison with the cv::filter2D(...) function from
the OpenCV library and with NEConvolution{N}x{N}::run() from the ACL library (fig. 5).
Our results suggest that the current approach leads to a significant speedup (mean values:
∼1.7× compared to OpenCV and ∼1.5× compared to ACL). Measuring the acceleration
for different kernels and images, we observed no dependence on image size, but kernel
size may influence the result: for kernels smaller than 9×9 we were able to achieve
4.5× acceleration (compared to the cv::filter2D(...) function from OpenCV), while for
larger kernels the presented approach allows only a 1.5× speedup. We did not use
parallelism in our code, thus an additional 2× or more acceleration is possible by employing
appropriate techniques, e.g. the OpenMP library.</p>
      <p>We expect the current approach to be useful for real-time image processing and
convolutional neural network training, as it significantly reduces processing time.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Chyrkov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prystavka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Suspicious Object Search in Airborne Camera Video Stream</article-title>
          . In: Hu Z. et al. (
          <article-title>eds) Advances in Computer Science for Engineering and Education</article-title>
          .
          <source>ICCSEEA 2018. Advances in Intelligent Systems and Computing</source>
          , vol
          <volume>754</volume>
          , pp.
          <fpage>340</fpage>
          -
          <lpage>348</lpage>
          . Springer, Cham, Switzerland (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>S.</given-names>
            <surname>Gnatyuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kinzeryavyy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iavich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Prysiazhnyi</surname>
          </string-name>
          , Kh. Yubuzova,
          <article-title>High-Performance Reliable Block Encryption Algorithms Secured against Linear and Differential Cryptanalytic Attacks</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          , Vol.
          <volume>2104</volume>
          , pp.
          <fpage>657</fpage>
          -
          <lpage>668</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. “Documentation for open-cv,” https://docs.opencv.org/trunk/d4/d86/group__imgproc__filter.
          <source>html#ga27c049795ce870216ddfb366086b5a04</source>
          ,
          <year>2017</year>
          , [Online; accessed 27-November-2017].
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>M. J. Flynn</surname>
          </string-name>
          , “
          <article-title>Very high-speed computing systems</article-title>
          ,
          <source>” Proceedings of the IEEE</source>
          , vol.
          <volume>54</volume>
          , no.
          <issue>12</issue>
          , pp.
          <fpage>1901</fpage>
          -
          <lpage>1909</lpage>
          ,
          <year>1966</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. “Arm® Cortex®
          <article-title>- a53 mpcore processor: Reference book of cortex-a53 cpus</article-title>
          ,”http://infocenter.arm.com/help/topic/com.arm.
          <source>doc.ddi0500g/DDI0500G_cortex_a53_tr m.pdf</source>
          ,
          <year>2013</year>
          , [Online; accessed 27-November-2017].
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. “Qualcomm extends Hexagon DSP,” http://pages.cs.wisc.edu/~danav/pubs/qcom/hexagon_microreport2013_v5.pdf,
          <year>2013</year>
          ,[Online; accessed 27-November-2017].
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. “
          <article-title>Qualcomm hexagon dsp: An architecture optimizedfor mobile multimedia</article-title>
          and communications,” https://developer.qualcomm.com/download/hexagon/hexagon-dsp-architecture.pdf,
          <year>2013</year>
          , [Online; accessed 27-November-2017].
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Griffith</surname>
          </string-name>
          ,
          <article-title>GCC: the complete reference</article-title>
          .
          <source>McGraw-Hill</source>
          , Inc.,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Lopes</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Auler</surname>
          </string-name>
          ,
          <article-title>Getting started with LLVM core libraries</article-title>
          .
          <source>Packt Publishing Ltd</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. “Presentation of ARM-CL,” https://community.arm.com/graphics/b/blog/posts/arm-computelibrary-for-computer-vision-and-machine-learning-now-publicly-available,
          <year>2017</year>
          , [Online; accessed 24-May-2017].
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. “
          <article-title>Video presentation for arm-cl</article-title>
          ,” https://developer.arm.com/technologies/computelibrary?_ga=2.909169.1792656346.1530630636-1257957724.1521634632,
          <year>2017</year>
          , [Online; accessed 24-May-2017].
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. A. Nicolau, “
          <article-title>Loop quantization: unwinding for fine-grain parallelism exploitation</article-title>
          ,” Cornell University, Tech. Rep.,
          <year>1985</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. J. Xue, “
          <article-title>Loop tiling for parallelism</article-title>
          , volume
          <volume>575</volume>
          of kluwer international series in engineering and computer science,”
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. T. Veldhuizen, “Template metaprograms. c++ report,”
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. V. Sarkar, “
          <article-title>Optimized unrolling of nested loops</article-title>
          ,”
          <source>in Proceedings of the 14th International Conference on Supercomputing, ser. ICS '00</source>
          . New York, NY, USA: ACM,
          <year>2000</year>
          , pp.
          <fpage>153</fpage>
          -
          <lpage>166</lpage>
          . [Online]. Available: http://doi.acm.org/10.1145/335231.335246
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Fedushko</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benova</surname>
            <given-names>E</given-names>
          </string-name>
          .
          <article-title>Semantic analysis for information and communication threats detection of online service users</article-title>
          .
          <source>The 10th International Conference on Emerging Ubiquitous Systems and Pervasive Networks (EUSPN 2019) November 4-7</source>
          ,
          <year>2019</year>
          , Coimbra, Portugal.
          <source>Procedia Computer Science</source>
          , Volume
          <volume>160</volume>
          ,
          <year>2019</year>
          , Pages
          <fpage>254</fpage>
          -
          <lpage>259</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Gnatyuk</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Akhmetov</surname>
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozlovskyi</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kinzeryavyy</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aleksander</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prysiazhnyi</surname>
            <given-names>D</given-names>
          </string-name>
          .
          <article-title>New Secure Block Cipher for Critical Applications: Design, Implementation, Speed and Security Analysis</article-title>
          ,
          <source>Advances in Intelligent Systems and Computing</source>
          , Vol.
          <volume>1126</volume>
          , pp.
          <fpage>93</fpage>
          -
          <lpage>104</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>