123 Development and Research of Conveyor Structures of Binary Number Sorting Algorithms Volodymyr Gryga1, Yaroslav Nykolaichuk2, Nataliia Vozna2, Artur Voronych3, Boris Krulikovskyi4 1. Department of Computer Engineering and Electronics, Vasyl Stefanyk Precarpathian National University, UKRAINE, Ivano-Frankivsk, 57 Shevchenko str., email: v.dr_2000@ukr.net 2. Department of Specialized Computer Systems, Ternopil National Economic University, UKRAINE, Ternopil, 8 Chekhova str., email: ya.nykolaichuk@tneu.edu.ua, n.vozna@ukr.net 3. Department of Computer Systems and Networks, Ivano-Frankivsk National Technical University of Oil and Gas, UKRAINE, Ivano- Frankivsk, 15 Karpatska str., email: a.voronych@it.nung.edu.ua 4. Department of Computer Engineering, National University of Water Management and Nature Resources Use, UKRAINE, Rivne, 11 Soborna str., email: kboris@ukr.net Abstract: The characteristics of the structures the "pairwise-odd" permutation method, the fusion method complexity of conveyor sorting devices, constructed on and other [3]. the basis of parallel algorithms for sorting binary arrays If we use the hardware method, then there is ensured are analyzed. The improved structure of the conveyor implementation of the binary number sorting operation in real sorting device, which compared to the known one had 1.9 time and the possibility of applying new circuit design times less hardware complexity and 1.5 times higher solutions with the use of a modern element base with an performance, was developed and researched. Modeling of orientation towards implementation in the FPGA form. VHDL models of conveyor sorting devices was carried The implementation of highly efficient conveyor devices out, their synthesis and implementation was made in the for sorting binary arrays requires extensive use of a modern Xilinx FPGA by automated designing Vivado system element base, the improvement of existing and the applications. The received practical results of the development of new methods, algorithms and device complexity characteristic of the developed conveyor structures. devices coincide with theoretical calculations. There are In this regard, an important t ask is to develop efficient also determined the relevant areas of application of the structures for conveyor sorting devices (CSDs), which will developed devices. improve the time characteristics of multithreaded data Keywords: special device for sorting arrays of data, processing. The implementation of such devices in the conveyor sorting device, basic sorting elements, languages of the hardware description and their synthesis in comparison scheme. FPGA will enable the designer to choose the optimal hardware cost and time-efficient characteristics of the I. INTRODUCTION structure of the CSD. Sorting is one of the common problems of data processing and it is generally understood as a problem of placing II. CONVEYOR STRUCTURES ANALYSIS OF PARALLEL elements of disordered set of data sets values in order of ALGORITHMS FOR BINARY NUMBERS SORTING monotonic increase or decrease [1]. The sorting operation The conveyor processing principle involves the alignment takes on average 25% of machine time [2] and it is most in time operators of the algorithms sorting on different data often used in the tasks of digital processing of signals and [5-8]. images, similar to convolution operations, fast Fourier Conveyor structures of sorting devices of binary arrays is transform, and others [3]. We know many methods of developed and analyzed below. They are based on the known sequential and parallel sorting of binary numbers [1-9]. There parallel sorting algorithms presented in a graphical form, and is a large number of algorithms to organize sorting in modern their hardware and time complexity are estimated. (Their digital signal processors, and each of these algorithms has its system characteristics) advantages and disadvantages. Performing a software sorting The hardware implementation of the known parallel operation is time-consuming and generally used to perform sorting algorithms in the conveyor structures involves the full consistent data sorting methods. Particularly effective is the reflection of their flow graphs algorithm (tiered-parallel parallel performance of algorithm sorting operations using algorithm forms) [4-8,10,11,13] into the structure of the hardware methods that significantly accelerate the execution operating device. The vertices of the graph (functional time of the algorithm. Such methods of parallel sorting operators) will correspond to the hardware block (operation) algorithms, in which the sequence of executed operations and the arcs will correspond to lines for the transmission of only depends on the number of input data, are the following: input data of intermediate and final results. The conveyor the Batcher’s method [2], the modified "bubbles" method [1], registers are placed on the lines that connect the operating ACIT 2018, June 1-3, 2018, Ceske Budejovice, Czech Republic 124 units of one tier with the operating units of the previous tier. x1 x2 x3 x4 x5 x6 Tier-parallel algorithm forms determines the degree of Rg1 Rg2 Rg3 Rg4 Rg5 Rg6 parallelism of the graph (the maximum number of vertices in a single tier or the width of the graph), as well as the BSE1 BSE2 BSE3 minimum-possible time of calculation of the given algorithm (number of tiers or height of the graph) [6-8]. Rg7 Rg8 Rg9 Rg10 Rg11 Rg12 It is shown the structure of the CSD in Fig. 1, which is based on the numbers sorting by modified “bubbles” sorting BSE4 BSE5 method [1,13]. x1 x2 x3 x4 Rg13 Rg14 Rg15 Rg16 Rg17 Rg18 Rg1 Rg2 Rg3 Rg4 BSE6 BSE7 BSE8 Rg19 Rg20 Rg21 Rg22 Rg23 Rg24 BSE1 BSE9 BSE10 Rg5 Rg6 Rg7 Rg8 Rg25 Rg26 Rg27 Rg28 Rg29 Rg30 BSE2 BSE11 BSE12 BSE13 Rg9 Rg10 Rg11 Rg12 Rg31 Rg32 Rg33 Rg34 Rg35 Rg36 BSE3 BSE4 BSE14 BSE15 Rg13 Rg14 Rg15 Rg16 Rg37 Rg38 Rg39 Rg40 Rg41 Rg42 BSE5 y1 y2 y3 y4 y5 y6 Fig.2. The structure of Conveyor sorting device for 6 values by Rg17 Rg18 Rg19 Rg20 "pairwise-odd" permutation method. x1 x2 x3 x4 x5 x6 x7 x8 BSE6 Rg1 Rg2 Rg3 Rg4 Rg5 Rg6 Rg7 Rg8 Rg21 Rg22 Rg3 Rg24 BSE1 BSE2 BSE3 BSE4 y1 y2 y3 y4 Rg9 Rg10 Rg11 Rg12 Rg13 Rg14 Rg15 Rg16 Fig.1. The structure of Conveyor sorting device for 4 values by the modified “bubbles” sorting method. BSE5 BSE6 BSE7 BSE8 The basis for the above-presented structure of conveyor Rg17 Rg18 Rg19 Rg20 Rg21 Rg22 Rg23 Rg24 sorting device is the basic sorting elements (BSE) and conveyor register (Rg). BSE9 BSE10 The structure of this conveyor device consists of the same Rg25 Rg26 Rg27 Rg28 Rg29 Rg30 Rg31 Rg32 basic sorting operations (max/min). CSD structure (Fig. 1) contains N(N-1)/2 basic sorting operations and N(2N-3) BSE11 BSE12 BSE13 BSE14 conveyor registers for input N values. The time complexity of this conveyor device structure is Rg33 Rg34 Rg35 Rg36 Rg37 Rg38 Rg39 Rg40 determined by the critical distribution of the signal through (2N-3) basic operations of sorting and BSE15 BSE16 BSE17 (2N-3)+1 conveyor registers. Rg41 Rg42 Rg43 Rg44 Rg45 Rg46 Rg47 Rg48 It is shown the structure of the CSD in Fig. 2, which is constructed on the basis of sorting algorithm by "pairwise- BSE18 BSE19 odd" permutation method [2,13]. CSD structure (Fig. 2) contains N(N-1)/2 basic sorting Rg49 Rg50 Rg51 Rg52 Rg53 Rg54 Rg55 Rg56 operations and (N × N) conveyor registers for N input values. The time complexity of this conveyor device structure is y1 y2 y3 y4 y5 y6 y7 y8 determined by the critical distribution of the signal through N Fig.3. The structure of Conveyor sorting device for 8 values by basic operations of sorting and (N+1) conveyor registers. Batcher`s method. It is shown the CSD structure in Fig. 3, which is constructed on the basis of the sorting algorithm graph by CSD structure (Fig. 3) contains 0, 48 N ln 2 N basic sorting Batcher’s method [3,13]. ACIT 2018, June 1-3, 2018, Ceske Budejovice, Czech Republic 125 1 τ CS - time complexity of comparison scheme, operations and ( log 2 N )([log 2 N + 1]) × N conveyor 2 τ - time complexity of multiplexer, MP registers for N input values. The time complexity of this conveyor device structure is τ REG - time complexity of register. determined by the critical distribution of the signal through 1 III. DEVELOPMENT OF AN IMPROVED STRUCTURE OF ( log 2 N )( log 2 N + 1) basic operations of sorting and (N-1) THE CONVEYOR SORTING DEVICE 2 conveyor registers. The bubble CSD (Fig. 1) can be improved by performing The conveyor structures of the sorting devices (Fig. 1-3) independent base-sorting operations separately for the first consist of the same type of basic sorting elements that and second half of the input data on the two conveyor compare two numbers and more of them are sent to the first structures. output and less to the second output. The internal structure of Values X are already ordered by recession and need to be sorting basic element is presented in Fig. 4. i X1 X2 2 further applied to (( N 2) − N 2) / 2 + N 2 basic sorting elements and 4N conveyor registers. It is shown an improved CSD for 4-bit binary numbers (n = 4) for 8 input values (N =8) in Fig. 5. x8 x7 x6 x5 x4 x3 x2 x1 Comparison Rg8 Rg7 Rg6 Rg5 Rg4 Rg3 Rg2 Rg1 Scheme BSE2 BSE1 М1 1 М2 Rg16 Rg15 Rg14 Rg13 Rg12 Rg11 Rg10 Rg9 BSE4 BSE3 Y1(Min) Y2 (Max) Rg24 Rg23 Rg22 Rg21 Rg20 Rg19 Rg18 Rg17 Fig.4. Internal structure of the basic sorting element BSE8 BSE7 BSE6 BSE5 When comparing two numbers (X 1 , X 2 ) in the comparison Rg32 Rg31 Rg30 Rg29 Rg28 Rg27 Rg26 Rg25 scheme "for more" (X 1 > X 2 ), the output of the scheme is formed a comparison sign, the direct value of which controls BSE10 BSE9 the multiplexer M1, to issue a smaller number to the output (Y 1 ) and the inverse value controls the multiplexer M2 to Rg40 Rg39 Rg38 Rg37 Rg36 Rg35 Rg34 Rg33 issue a larger number at the output (Y 2 ) [5,13]. The internal structure of the comparison scheme can be BSE12 BSE11 constructed on the basis of logical elements [5] or accelerated Rg48 Rg47 Rg46 Rg45 Rg44 Rg43 Rg42 Rg41 carry adders [13]. The calculation of the hardware and time complexity of different comparison schemes and 2-input multiplexers is given in paper [6]. BSE16 BSE15 BSE14 BSE13 With the account to these data, the hardware complexity of Rg56 Rg55 Rg54 Rg53 Rg52 Rg51 Rg50 Rg49 the CSD by the modified "bubbles" method for N = 8 and n= 4 equals: BSE19 BSE18 BSE17 ACCSD = N × (N -1)/2 × (A CS + 2 × A MP )+(n × A REG × ×LREG ) = 28 × (24 + 56)+(4 × 4 × 112) = 4032 ( gates ), Rg64 Rg63 Rg62 Rg61 Rg60 Rg59 Rg58 Rg57 where ACCSD - hardware complexity of classical CSD, BSE20 BSE21 A CS - hardware complexity of comparison scheme, Rg72 Rg71 Rg70 Rg69 Rg68 Rg67 Rg66 Rg65 A MP - hardware complexity of multiplexer, BSE22 A REG - hardware complexity of register, Rg80 Rg79 Rg78 R77 Rg76 Rg75 Rg74 Rg73 LREG -quantity of conveyor register. The time complexity of this conveyor device equals: y8 y7 y6 y5 y4 y3 y2 y1 τ CCSD = (τ CS + τ MP ) + (14 ×τ REG ) = Fig.5. An improved structure of the CSD = (6 + 3) + (14 × 2) = 37ν ( micro − cycles ), The number of basic sorting elements for this CSD for the where τ - time complexity of classical CSD, binary numbers is equal to 3(( N 2) 2 − N 2)) / 2 + N 2 , and CCSD ACIT 2018, June 1-3, 2018, Ceske Budejovice, Czech Republic 126 the number of conveyor registers is 6 N + 4 N for N input values. The hardware complexity of an improved CSD for N = 8 and n = 4 is equal to: AICSD = 22 × (A CS + 2 × A MP )+(n × A REG × LREG ) = =22 × (16 + 24)+(4 × 4 × 80) =2160 ( gates ), where AICSD - hardware complexity of improved CSD, A CS - hardware complexity of improved comparison scheme [13], A - hardware complexity of improved multiplexer [13], MP A REG - hardware complexity of register, LREG -quantity of conveyor register. The time complexity of such CSD is equal to: Fig.7. The dependence graph of the time complexity on the τ ICSD = (τ CS + τ MP ) + (10 ×τ REG ) = number of the input data for the CSDs = (3 + 2) + (10 × 2) = 25ν ( micro − cycles ), It is shown in the graph that the time complexity of an where τ - time complexity of classical CSD; improved CSD requires about 1.5 times less microtacts than CCSD the classical CSD. τ CS - time complexity of comparison scheme; V. RESULTS OF CSD’S SIMULATION AND SYNTHESIS τ - time complexity of multiplexer; MP ON FPGA τ REG - time complexity of register. Structures of classical and improved CSD for 8 input –one- So, in comparison with the classic device, we get a byte numbers are described in VHDL (Virtual Hardware = reduction in the hardware complexity in K = 4032 2160 1,9 Description Language). The simulation of the developed A CSDs on the functional level was carried out, their RTL times and increase the speed in= = K 37 25 1,5 time. circuits were obtained and the Xilinx FPGA synthesis was τ IV. RESULTS OF THE CONVEYOR SORTING DEVICES performed. RESEARCH It is shown in Fig. 8 a functional diagram of the improved CSD operation for 8 input one-byte numbers. It is shown a dependence graph of logical elements (valves) number on the input data number for the conveyor structures of the known (classical) and improved sorting devices by the modified "bubbles" method. Fig.8. Functional simulation diagram for CSD of 8 one-byte numbers Fig.6. Graph of the dependence of the number of valves on the The diagram shows that an array of unsorted 8-bit values is input data number for the CSDs given at the inputs of the CSD (D_in1, ..., D_in8) on the 1st The graph shows that the number of logical elements for cycle. Then we get a sorted descending order at the outputs an improved CSD is 1.9 times less than for a classic CSD. (D_out1, ..., D_out8) at the 9th cycle. It is shown a graph of the dependence of the time It is shown in Fig. 9 the topology view of improved CSD complexity in Fig. 7 expressed in the microtacts on the input with the VHDL implementation model on the data value for the structures of the known (classical) CSD xc7a100tcsg324-1 crystal (Artix-7 family) of the Xilinx and the improved CSD with the modified "bubbles" method. firmware by Vivado CAD. ACIT 2018, June 1-3, 2018, Ceske Budejovice, Czech Republic 127 [2] Thomas Braunl,. Parallel Programming: Transl. from german ,К.: Vyscha Shkola., pp. 358, 1997. [3] S.Y. Kung, VLSI array processors. Trans. from English - М.: “Peace”, pp. 672, 1991. [4] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein. Introduction to algorithms, 3rd edition, Massachusetts Institute of Technology, pp.1313, 2009. [5] A. Melnyk, Memory with ordered access: monograph, Lviv: Press Lviv Polytechnic, pp. 296, 2014. [6] A. Melnyk, Computer Architecture, Lutsk:Regional publisher, pp. 470, 2008. [7] V. Voevodin, Parallel Computing. Textbook for high schools, BHV:Peter, pp. 608, 2002. [8] Andrew S. Tanenbaum Computer Architecture Austin. 6th ed., St. Petersburg: “Peter”, pp. 816, 2013. [9] O. Kehret, A. Walz and A. Sikora, “Integration of hardware security modules into a deeply embedded TLS stack” International Journal of Computing, vol. 15, issue 1, pp. 24-32, 2016. [10] V. Zadiraka, Ya. Nykolaichuk. Methods of effective Fig.9. View of the crystal topology of CSD during implementation protection of information flows, Ternopil: Terno-graf, at the FPGA pp.308, 2014. Vivado CAD tools placed the implemented VHDL project [11] S. Melnychuk, A. Voronych, L. Nykolaichuk and B. of the improved CSD almost in the center of the crystal. Krulikovskyi “The Structure and Components of Table 1 presents the results of the synthesis of Embedded Special Processors for Determination of implemented CSDs for sorting 8 one-byte numbers on the Entropy Signals and Random Massages” Perspective FPGA of the Xilinx firmware. technologies and methods in MEMS design. Proceedings of XIIIth International Conference. TABLE 1. RESULTS OF THE CSDS SYNTHESIS ON FPGA MEMSTECH 2017, Lviv-Svalyava, Ukraine, pp.81-84, Classic CSD Improved CSD February, 2017. Blocks Blocks [12] B. Krulikovskyi, A. Davletova, V. Gryga and Y. Clock Clock Nykolaichuk “Synthesis of Components of High FPGA quantity quantity frequency frequency Performance Special Proccessors of Execution of FPGA FPGA (MHz) (MHz) Arithmetic and Logical Operations Dsts Proccessing in (CLB) (CLB) Theoretical and Numerical Basis Rademacher”, 1 Artix 7 950 109,9 475 174,7 Proceedings of XIVth International Conference. As can be seen from Table 1, experimental results coincide CADSM’2017, Lviv-Poljana, Ukraine, pp. 214-217, with analytical calculations. February,2017. [13] V. Gryga, Y. Nykolaichuk, N. Vozna and B. VI. CONCLUSION Krulikovskyi “Synthesis of a microelectronic structure During the research of CSDs it is determined that they are of a specialized processor for sorting an array of binary widely used in multithreaded data processing, and are often numbers” Perspective technologies and methods in used in digital processing of signals, images and sorting MEMS design. Proceedings of XIIIth International networks for the rapid transfer of large data arrays. Conference. MEMSTECH 2017, Lviv-Svalyava, The improved structure of the CSD of binary arrays by the Ukraine, pp.170-173, April, 2017. modified "bubble" method was developed, the hardware and [14] V. Gryga, I. Kolosov and O. Danyluk “The development time complexity calculation was performed. of a fast iterative algorithm structure of cosine As a result of the comparison with the classical CSD for transform” The Experience of Designing and binary number array by the modified bubble method it is Application of CAD System in Microelectronics. received decreasing the cost of the equipment in 1.9 times Proceedings of XIIIth International Conference. and increasing of speed in 1.5 times, which is confirmed by TCSET’2016, Lviv-Poljana, Ukraine, pp.506-509, the results of the practical implementation of the Xilinx February,2016. FPGA. [15] R. Dunets and V. Gryga “Spatio-temporal synthesis of transformation matrix of reverse fast cosine REFERENCES transformation”, Proceedings of XIIIth International [1] D. Knuth, The art of computer programming, V.3: Transl. Conference. CADSM’2015, Lviv-Poljana, Ukraine, from eng. М.: “Peace”, pp. 841, 1978. pp.45-49, February, 2015. ACIT 2018, June 1-3, 2018, Ceske Budejovice, Czech Republic