Performance Analysis of Roberts Edge Detection Using CUDA and OpenGL

Marco Calì and Valeria Di Mauro
University of Catania, Catania, Italy

Abstract—The evolution of high-performance, programmable graphics processing units (GPUs) has generated considerable advancements in graphics and parallel computing. In this paper we present an edge detection algorithm based on the Roberts filter, implemented using the CUDA and OpenGL architectures. The basic idea is to use the Pixel Buffer Object (PBO) to create images with CUDA on a pixel-by-pixel basis and display them using OpenGL. The images can then be processed by applying a Roberts filter for edge detection. Finally, we describe the results of an extensive measurement campaign, as well as several comparisons between the code performance on CPUs and GPUs. The results are very promising, since the execution time of the GPU parallel version is much lower than that of the equivalent CPU sequential version.

Index Terms—CUDA, GPU, Image Processing, OpenGL, PBO, Roberts Edge Detection.

Copyright © 2016 held by the authors.

I. INTRODUCTION

Technological development has increased computer performance, so that applications can perform more complex tasks [1], [2]. Recently the field of information retrieval has seen tremendous growth due to the newly available computational power and architectural support. A large variety of scientific and industrial applications require the analysis of 2D images. In order to extract information from images it is often possible to apply specific filters to perform segmentation, edge identification, sharpening, etc. A commonly used edge detection method is based on the Roberts operator.

In this paper we present a GPU parallel algorithm that processes a great number of images in a short time. The solution can then be readily used for the realization of real-time devices and systems for industrial applications. Real-time execution requires optimizing the processing time for image elaboration and analysis. In this work we use the Compute Unified Device Architecture (CUDA), an NVIDIA architecture which provides support for interacting with the Graphics Processing Unit (GPU) for general-purpose computing. The solution presented works with pgm grayscale images by means of functions implemented with the OpenGL libraries. Examples of the features enabled by such functions are:
• loading an image in RAM;
• mapping an image to a Pixel Buffer Object (PBO);
• fast transfer of the PBO data to GPU memory.

The OpenGL libraries provide direct control over the image and make it possible to track the changes of the image without developer intervention. In this paper we have also measured the execution times and compared them with the execution of the equivalent code on the CPU, observing the resulting speedup. The experiments were performed taking into account different image sizes and numbers of images.

In Section II the basis of edge recognition operators is introduced, with particular attention to the Roberts operator and the respective kernel. In Section III an introduction to GPU programming is presented, together with the multithreading logic used in CUDA; moreover, the use of the OpenGL libraries for image processing is explained in relation to the Pixel Buffer Object. Portions of the developed algorithm and the related comments are given in Section IV. Finally, in Section V a performance analysis is presented, as well as a comparison between the sequential and the parallel algorithm.

II. IMAGE PROCESSING

The digital filters used in image pre-processing can be represented as mathematical operators. Such digital filters may be used to reduce noise, improve contrast, separate objects from the background, etc. [3], [4], [5], [6]. It is possible to obtain substantial image improvement using parallel processing, also for real-time solutions [7]. Among the different types of image enhancement algorithms, spatial convolution kernel filtering represents the worst-case scenario.

A convolution kernel replaces every pixel with a new one whose value is based on the relationship between the old pixel value and the values of the pixels that surround it. In such a convolution procedure, two functions are overlaid: the pixel values of the original image, which are stored in memory, and the mask of the convolution kernel, which is applied to them. The kernels can vary in dimension, determining a different number of neighbouring pixels involved in the convolution. The kernel operates on the image by replacing the original pixels one by one, therefore the operation must be performed by applying the convolution operator to each pixel in the image. A typical convolution kernel mask operation is represented in Fig. 1.

Fig. 1. Convolution kernel mask operation.

The fields of image processing and computer vision often make use of procedures like edge detection. Such a procedure is particularly useful for feature extraction. Edge detection discards most of the information in the image in order to focus only on a specific set of variations, based on the geometric and structural characteristics of the examined objects, so that their borders can be recognized. An example of such a variation is the region of pixels where the light intensity undergoes abrupt changes. Those rapidly varying regions can be identified as the edges of a certain object in the image. The edge detection method used here is based on the digital Roberts filter. The Roberts filter operator approximates the intensity gradient of the brightness using two different kernels:

\[
G_x = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}, \qquad
G_y = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \tag{1}
\]

G_x, also called the horizontal kernel, enhances the horizontal component of the intensity gradient at each point, while G_y, also called the vertical kernel, enhances the vertical component. When combined, G_x and G_y give the Roberts kernel

\[
|G| = \sqrt{G_x^2 + G_y^2} \tag{2}
\]

When G is applied to an image, a threshold is used to determine which values indicate a boundary and which do not. As the gradient value of a pixel increases, the corresponding line tends towards white.

III. GPU COMPUTING

The GPU architecture consists of a scalable number of streaming multiprocessors (SMs), each containing eight streaming processor (SP) cores, two special function units (SFUs), a multithreaded instruction fetch and issue unit, a read-only constant cache, and a 16 KB read/write shared memory [8]. The SM executes a batch of 32 threads together, called a warp [9]. Unlike SIMD instructions, the concept of warp is not exposed to programmers; rather, programmers write a program for one thread and then specify the number of parallel threads in a block and the number of blocks in a kernel grid. (SIMD and MIMD are types of parallel architectures identified in Flynn's taxonomy, which classifies computers according to whether they have single or multiple streams of instructions processing single or multiple data.) All the threads of a block are executed together on the same SM. On the other hand, an SM can manage multiple concurrently running blocks. The number of blocks running on an SM is determined by the resource requirements of each block, such as register and shared memory usage. We will use the adjective active for the blocks being executed on one SM at a certain moment. One block is typically executed as several warps; the number of warps is the number of threads in the block divided by the number of threads contained in one warp. The latter has generally been 32 on NVIDIA GPU cards, but could change in the future for new card models.

The CUDA architecture makes it possible to have direct access to the GPU instruction set, enabling us to use the GPU card for general-purpose parallel computing. Management and programming are supported by the CUDA APIs [10]. Each CUDA thread is mapped to a GPU core. The GPU card can execute one or more kernel grids, while each streaming multiprocessor (SM) executes one or more blocks. The CUDA architecture provides APIs and directives in order to be compatible with different standard programming languages such as C, C++, Fortran, etc. The main advantages are therefore the ease of implementation offered to developers who have experience in those standard programming languages, and the possibility to achieve highly modular software systems [11], [12], [13]. Moreover, the CUDA architecture has proven to be a more efficient way to port developed software systems than other GPU-oriented technologies such as Direct3D and OpenGL.

Nevertheless, OpenGL (Open Graphics Library) [14], [15], [16] offers a great number of software tools for the development of GPU-oriented code. OpenGL offers cross-language and multi-platform application programming interfaces (APIs) generally used in rendering applications. These APIs are used to interact with a GPU to achieve hardware-accelerated rendering, therefore OpenGL allows applications to use advanced graphics on relatively small systems. It is possible to mix OpenGL and CUDA technology in order to enhance the performance of an application. An example is given by the use of the PBO (Pixel Buffer Object), which makes it possible to import multidimensional structures on a pixel-by-pixel basis and display them using the related OpenGL APIs. The advantage of such an approach is that it directly yields an efficient mapping of pixels onto threads [17]. In fact, a PBO is a specific portion of the video memory in which it is possible to render images that can be transformed into textures. Another important advantage of the approach is the speed of pixel data transfer through the DMA (Direct Memory Access) channel [18]. The OpenGL PBO mechanism also permits asynchronous data transfers between the host and the device [19]. It is therefore important to correctly schedule the workload between different memory transfers in order to maximise the performance obtainable with the asynchronous approach.

The texture data are loaded from an image source (image file or video stream) that can be directly loaded into a PBO, which is controlled by OpenGL. Fig. 2 gives a simplified schema of the texture transfer using the Pixel Buffer Object. Of course, although the transfer is asynchronous, a certain amount of CPU workload is still needed in order to transfer data from the host memory to the PBO. Then, after such transfers, the GPU controllers (driven by the OpenGL drivers) copy the data from the PBO to a texture object. This means that OpenGL performs a DMA transfer operation without wasting CPU cycles, so the CPU benefits from a lower workload and can perform other operations without waiting for the transfer to be completed.

Fig. 2. Texture loading with PBO.

Since in this work we make use of both OpenGL and CUDA directives, in order to correctly map the pixels using the PBO it is first necessary to create a GL context (which is OS specific) and a CUDA buffer registration. After that it is necessary to set up the GL viewport and coordinate system [20], generate one or more GL buffers to be shared with the CUDA application and, subsequently, register these buffers within the application itself (see Fig. 3).

Fig. 3. CUDA Buffer Registration.

A. Upload

The algorithm used to upload data using PBOs [21] is as follows.
1) Creating a PBO on the GPU using glGenBuffers
2) Binding the PBO to the unpack buffer target
3) Allocating buffer space on the GPU with glBufferData
4) Mapping the PBO to CPU memory, denying GPU access for now; glMapBuffer returns a pointer to the place in GPU memory where the PBO resides
5) Copying data from CPU to GPU using the pointer from glMapBuffer
6) Unmapping the PBO (glUnmapBuffer) to allow the GPU full access to the PBO again
7) Transferring data from the buffer to a texture target
8) Unbinding the PBO to allow for normal operation again

Steps one to three are only necessary during initialization, while steps four to eight have to be performed every time the texture needs to be updated. Fig. 4 shows a generic upload application layout.

Fig. 4. Upload Application Layout.

B. Download

PBOs can also be used to download data back to the CPU. There is one problem which must be addressed, though: as the download is an asynchronous operation (using DMA), one must make sure that the GPU does not clear the render buffer before the transfer is complete. The following is a description of how to download data using PBOs.
1) Generating a PBO on the GPU using glGenBuffers
2) Binding the PBO to the pack buffer target
3) Allocating buffer space on the GPU according to the data size using glBufferData
4) Deciding which framebuffer to read using glReadBuffer. One can also read directly from a texture using glGetTexImage, then skipping the next step
5) Using glReadPixels to read pixel data from the targeted framebuffer to the bound PBO. This call does not block, as it is asynchronous when reading to a PBO as opposed to CPU-controlled memory
6) Mapping the PBO to CPU memory, denying GPU access for now; glMapBuffer returns a pointer to the place in GPU memory where the PBO resides
7) Copying data from GPU to CPU using the pointer from glMapBuffer
8) Unmapping the PBO (glUnmapBuffer) to allow the GPU full access to the PBO again
9) Unbinding the PBO to allow for normal operation again

Steps one to three are only necessary during initialization, while steps four to nine need to be performed every time new data have to be downloaded. However, downloads and uploads still involve a GPU context switch and cannot be done in parallel with GPU processing or drawing. Multiple PBOs can potentially speed up the transfers.

IV. ROBERTS ALGORITHM

The Roberts edge detection filter can be broken down into different steps. First we need to set up the image data objects which will store the intermediate and final results, obtaining the size of the input image with the OpenGL libraries. Next, we have to do the computation using the Roberts operators and the sum of the Roberts operators to create an edge image with CUDA. The Roberts Edge Detection procedure is displayed in Fig. 5. The algorithm using the features provided by OpenGL and CUDA is described below [22].

Fig. 5. Roberts Edge Detection procedure.

We now describe how to create a Roberts convolution kernel and apply it for image processing in CUDA. The following code is for Roberts Edge Detection.

    // Roberts Edge Detection
    Matrix2x2_original_image(upper_left, upper_right, lower_left, lower_right)
    // Horizontal kernel
    principal_diagonal = lower_right - upper_left;
    // Vertical kernel
    secondary_diagonal = lower_left - upper_right;
    // Module of gradient
    Median = square(abs(principal_diagonal)+abs(secondary_diagonal))
    // Boundary control
    Median range control
    if Median lower than inf then
        return inf
    else
        if Median higher than sup then
            return sup
        end
    end
    return (unsigned char) Median  // because the PBO works using data char

The following code is for copying the original image. The main operations are: copy of image, insert original data.

    // Definition space memory to be used in GPU
    *pointer new array = pointer original array + blockIdx.x*Dimension z
    cycle (start to threadIdx.x; stop to width; increase of blockDim.x)

Then, a texture lookup is performed in a given 2D sampler with the function tex2D, and the result is associated with the pointer of the new array at the position given by the cycle step.

The elaboration of the image is performed by means of the following steps: initializing an array in CUDA for the elaborated image, inserting the original data.

    // Definition space memory to be used in GPU
    *pointer new array = pointer original array + blockIdx.x*Dimension z
    cycle (start to threadIdx.x; stop to Dimension z; increase of blockDim.x)

Then, a texture lookup is performed in a given 2D sampler with the function tex2D, and the results are inserted in the Roberts matrix.

    upper_left  = tex2D(lookup, coordinate_to_perform_lookup -1, dimension_x-1)
    upper_right = tex2D(lookup, coordinate_to_perform_lookup, dimension_x-1)
    lower_left  = tex2D(lookup, coordinate_to_perform_lookup -1, dimension_x)
    lower_right = tex2D(lookup, coordinate_to_perform_lookup, dimension_x)

Finally, the Roberts matrix is associated with the pointer of a new array.

Then, it is necessary to define the setup of the texture in an external C function, according to the following steps.
• Describe the format of the value that is returned when fetching the texture through the "cudaChannelFormatDesc" function:

    if the image is "pgm" then
        assigned format unsigned char
    else
        assigned format uchar4
    end

• Allocate a CUDA array using the cudaMallocArray() function, wrapped in checkCudaErrors, which is used to report the error string in case of errors.
• Copy from the memory area pointed to by src to the CUDA array dst using the cudaMemcpyToArray() function, specifying the direction of the copy with cudaMemcpyHostToDevice.

For deleting the texture with an external C function, the following steps are needed.
• Release the CUDA array, which must have been returned by a previous call to cudaMallocArray(), using the cudaFreeArray() function.
• Wrap the global call that sets up the texture and threads in an external void C function.
• Bind the CUDA array to the texture reference tex through the cudaBindTextureToArray() function.

    switch for different cases do
        case original image
            sets up the texture and threads
            break
        case elaborate image with ROBERTS
            sets up the texture and threads
            break
    endsw

• Unbind the texture bound to tex.

For elaborating an image with OpenGL the steps below are used. Such a graphics library is designed to aid in rendering computer graphics. This typically involves providing optimized versions of functions that handle common rendering tasks. It can be realized purely by code running on the CPU, which is common in embedded systems, or by code running on hardware accelerated by a GPU, which is more common in PCs. By employing these functions, a program can prepare an image to be output to a monitor. These libraries load the image data into memory, map between screen and world coordinates, generate texture mipmaps and also ensure interoperability with other third-party libraries and SDKs.

More specifically, for function display() the following steps have to be performed.
• Map the graphics resources (PBO) for access by CUDA using the cudaGraphicsMapResources() function.
• Return a pointer through which the mapped graphics resource may be accessed using the cudaGraphicsResourceGetMappedPointer() function.
• Unmap the graphics resources in the PBO using cudaGraphicsUnmapResources() (once unmapped, the resources cannot be accessed by CUDA until they are mapped again).
• Clear the buffers with glClear(), which takes the bitwise OR of masks that indicate the buffers to be cleared; the mask GL_COLOR_BUFFER_BIT is used, indicating the buffers currently enabled for color writing.
• Bind a named texture to a texturing target using the glBindTexture() function; then bind a named buffer object using the glBindBuffer() function, taking it from PBO buffer.
• Specify a two-dimensional texture subimage, with target, level, xoffset, yoffset, width, height, format, type, pixels, using the glTexSubImage2D() function.
• Disable the server-side capability GL_DEPTH_TEST with glDisable() and enable the server-side capability GL_TEXTURE_2D with glEnable().
• Set the texture parameters by means of glTexParameterf(), with target GL_TEXTURE_2D, which specifies the target texture of the active texture unit; pname GL_TEXTURE_MIN_FILTER, GL_TEXTURE_MAG_FILTER, GL_TEXTURE_WRAP_S, GL_TEXTURE_WRAP_T, which specifies the symbolic name of a single-valued texture parameter; and finally param GL_LINEAR, GL_REPEAT, which specifies the value of pname.
• Specify the primitive or primitives that will be created from the vertices presented between glBegin and the subsequent glEnd (the primitives are specified by glVertex2f and glTexCoord2f).
• Perform a buffer swap on the layer in use for the current window through glutSwapBuffers.

Function reshape() consists of the following main steps.
• Specify the affine transformation of x and y from normalized device coordinates to window coordinates using glViewport.
• Specify that GL_PROJECTION is the target for subsequent matrix operations by means of glMatrixMode.
• Replace the current matrix with the identity matrix through glLoadIdentity, and describe a transformation that produces a parallel projection using glOrtho.
• Specify that GL_MODELVIEW is the target for subsequent matrix operations by means of glMatrixMode.
• Replace the current matrix with the identity matrix using glLoadIdentity.

For function cleanup() the following steps are performed.
• Unregister a graphics resource for access by CUDA using cudaGraphicsUnregisterResource, with resource cuda pbo.
• Bind a named buffer object using glBindBuffer, taking it from GL_PIXEL_UNPACK_BUFFER.
• Delete the buffer object named by the elements of pbo buffer by means of glDeleteBuffers, and delete the texture named by the elements of texid using glDeleteTextures.

In addition to the above functions, a main() function has been implemented in order to initialize data, load a default image, run the processing functions, and save results.

V. PERFORMANCE ANALYSIS

The performance analysis has been done by comparing the execution time of the parallel computation with that of its sequential counterpart. An Intel i7 processor and an NVIDIA GeForce 845M device have been used. Detailed technical specifications are given in TABLE I and TABLE II. The code has been compiled using CUDA runtime v.7 and CUDA compute capability 5.

TABLE I
GPU TECHNICAL SPECIFICATION

Memory Bandwidth              16 GB/s
Pixel Rate                    13.8 GPixel/s
Texture Rate                  27.6 GTexel/s
Floating Point Performance    883.7 GFLOPS
Shading Units                 512
Texture Mapping Units         32

TABLE II
CPU TECHNICAL SPECIFICATION

Processor Number                  i7-4710HQ
Intel® Smart Cache                6 MB
DMI2                              5 GT/s
# of Cores                        4
# of Threads                      8
Processor Base Frequency          2.5 GHz
Max Turbo Frequency               3.5 GHz
Processor Graphics                Intel® HD Graphics 4600
Graphics Base Frequency           400 MHz
Graphics Max Dynamic Frequency    1.2 GHz

The tests were carried out using the same grayscale image in four different sizes, in order to evaluate how the execution time varies with the image size. The image size has been varied from 128x128 pixels to 1024x1024 pixels. For each of them the algorithm works so that, after loading the image, it is possible to select the original image or the processed one (through the use of Roberts edge detection) by using a selection menu, as shown in Fig. 6.

Fig. 6. Original image (left) and Processed image (right).

The Roberts operator provides the edge detection of the image at small scales. If an object with a jagged boundary is present, as shown in Fig. 6 (left), the Roberts operator will find the edges at each spike and twist of the perimeter, as in Fig. 6 (right). The operator is sensitive to high-frequency noise in the image and will generate only local edge data instead of recovering the global structure of a boundary.

For each image, multiple runs were carried out and the longest processing times were recorded, by using the cudaEventElapsedTime() function for the GPU version and the SDK Timer function for the CPU version. Then the average was calculated for the different image sizes, as shown in TABLE III.

TABLE III
AVERAGE EXECUTION TIME

Image Size (Pixel)    CPU Average Processing Time    GPU Average Processing Time    Average Processing Time Ratio
128 x 128             0.000488625                    0.000114875                    4.2535365
256 x 256             0.001517857                    0.000360125                    4.2148064
512 x 512             0.005681625                    0.001378625                    4.1212259
1024 x 1024           0.0205595                      0.00464925                     4.4221111

The execution times for the sequential and parallel versions on CPU and GPU are plotted in Fig. 7. Looking at the data in Fig. 7, we can state that the average CPU processing time is greater than the average GPU processing time; in particular, as the image size increases, the sequential version of the program has a considerably longer execution time. For each image size, the average processing time ratio between CPU and GPU has been computed; it is in the range between 4.1 and 4.4, and it is shown in Fig. 8.

Fig. 7. Average Processing Times.

Fig. 8. Average Processing Times Ratio.

Fig. 9 shows the comparison between the average processing time on the GPU and on the CPU for a number of images equal to 50. It illustrates that the time of the CPU is higher than the time of the GPU. Moreover, if we process a greater number of images, the time difference is considerable.

Fig. 9. Average Processing Times for 50 Images.

VI. RELATED WORKS

Edge detection can be carried out by different types of operators such as Roberts, Canny, Deriche, Differential, Prewitt, Sobel. Examples of how to use Sobel edge detection can be seen in the following articles.

[...] the code the steps for loading the file, storing it in memory, the data transport from CPU to GPU and back, and the visualization on a window screen, without direct control of images by OpenGL libraries. Shah [20] has proposed the implementation of the Sobel filter on a GeForce GT 130. Image processing is performed in the CUDA environment with GPU computing; the OpenGL libraries provide control of the frame and its insertion into memory. In the end, the execution times were compared with each other, taking into account the equivalent code on the CPU and observing the speedup improvements on the GPU.

VII. CONCLUSIONS

As shown by the comparison between the execution time of the parallel computation and that of its sequential counterpart, the GPU performance results indicate that a significant speedup can be achieved. In fact, Roberts edge detection can obtain a substantial speedup compared to the CPU-based implementation, with an average CPU over GPU processing time ratio in the range 4.1 to 4.4.

The time reduction is significant in a great number of processing tasks, allowing the detection of edges in high-quality images and in real time. This is due to GPUs, which provide a novel and efficient acceleration technique for image processing, cheaper than a hardware implementation.
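
As an illustration of the Roberts operator of equations (1) and (2) and of the boundary control described in Section IV, the following is a minimal sequential sketch in C. It is not the code used for the measurements. The function names (roberts_pixel, roberts_image, isqrt_floor), the row-major 8-bit grayscale layout, the use of an integer (floor) square root, and the choice of setting the first row and column to 0 are all assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical helper: floor of the square root, computed with integers
   only. The argument never exceeds 2 * 255 * 255 in this sketch. */
static unsigned int isqrt_floor(unsigned int v)
{
    unsigned int r = 0;
    while ((r + 1) * (r + 1) <= v)
        ++r;
    return r;
}

/* Hypothetical helper: Roberts response for one 2x2 neighbourhood.
   ul, ur, ll, lr are the upper-left, upper-right, lower-left and
   lower-right pixel values. */
unsigned char roberts_pixel(unsigned char ul, unsigned char ur,
                            unsigned char ll, unsigned char lr)
{
    /* Differences along the two diagonals (kernels Gx and Gy). */
    int gx = (int)ul - (int)lr;
    int gy = (int)ur - (int)ll;

    /* Gradient magnitude of equation (2), rounded down. */
    unsigned int mag = isqrt_floor((unsigned int)(gx * gx + gy * gy));

    /* Boundary control: clamp to the range of an unsigned char. */
    if (mag > 255u)
        mag = 255u;
    return (unsigned char)mag;
}

/* Hypothetical helper: apply the operator to a row-major grayscale image.
   The pixel at (x, y) is the lower-right corner of its neighbourhood, as
   in the tex2D lookups of Section IV; the first row and column have no
   complete neighbourhood and are set to 0 (an assumption of this sketch). */
void roberts_image(const unsigned char *src, unsigned char *dst,
                   size_t width, size_t height)
{
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            if (x == 0 || y == 0) {
                dst[y * width + x] = 0;
                continue;
            }
            dst[y * width + x] = roberts_pixel(
                src[(y - 1) * width + (x - 1)], src[(y - 1) * width + x],
                src[y * width + (x - 1)],       src[y * width + x]);
        }
    }
}
```

Each output pixel depends only on a 2x2 neighbourhood of the input, which is why the computation maps naturally onto one GPU thread per pixel; the sequential loops above correspond to the CPU counterpart of such a kernel.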
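
The ratio quoted above is the average CPU processing time divided by the average GPU processing time for each image size. The short C sketch below recomputes it from the values reported in TABLE III; the names speedup, print_speedups, sizes, cpu_avg and gpu_avg are illustrative only and do not come from the measured code.

```c
#include <stdio.h>

/* Hypothetical helper: CPU-over-GPU processing time ratio. */
double speedup(double t_cpu, double t_gpu)
{
    return t_cpu / t_gpu;
}

/* Average processing times as reported in TABLE III
   (image side in pixels, CPU time, GPU time). */
static const int    sizes[4]   = {128, 256, 512, 1024};
static const double cpu_avg[4] = {0.000488625, 0.001517857,
                                  0.005681625, 0.0205595};
static const double gpu_avg[4] = {0.000114875, 0.000360125,
                                  0.001378625, 0.00464925};

/* Print one line per image size with the recomputed ratio. */
void print_speedups(void)
{
    for (int i = 0; i < 4; ++i)
        printf("%4d x %-4d  ratio = %.4f\n", sizes[i], sizes[i],
               speedup(cpu_avg[i], gpu_avg[i]));
}
```

Calling print_speedups() from a main() function prints the four ratios, all of which fall between 4.1 and 4.4.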