123


    Development and Research of Conveyor Structures of
           Binary Number Sorting Algorithms
      Volodymyr Gryga1, Yaroslav Nykolaichuk2, Nataliia Vozna2, Artur Voronych3, Boris
                                    Krulikovskyi4
 1. Department of Computer Engineering and Electronics, Vasyl Stefanyk Precarpathian National University, UKRAINE, Ivano-Frankivsk,
                                           57 Shevchenko str., email: v.dr_2000@ukr.net
  2. Department of Specialized Computer Systems, Ternopil National Economic University, UKRAINE, Ternopil, 8 Chekhova str., email:
                                            ya.nykolaichuk@tneu.edu.ua, n.vozna@ukr.net
  3. Department of Computer Systems and Networks, Ivano-Frankivsk National Technical University of Oil and Gas, UKRAINE, Ivano-
                                  Frankivsk, 15 Karpatska str., email: a.voronych@it.nung.edu.ua
  4. Department of Computer Engineering, National University of Water Management and Nature Resources Use, UKRAINE, Rivne, 11
                                                Soborna str., email: kboris@ukr.net


   Abstract: The characteristics of the structures                 the "pairwise-odd" permutation method, the fusion method
complexity of conveyor sorting devices, constructed on             and other [3].
the basis of parallel algorithms for sorting binary arrays            If we use the hardware method, then there is ensured
are analyzed. The improved structure of the conveyor               implementation of the binary number sorting operation in real
sorting device, which compared to the known one had 1.9            time and the possibility of applying new circuit design
times less hardware complexity and 1.5 times higher                solutions with the use of a modern element base with an
performance, was developed and researched. Modeling of             orientation towards implementation in the FPGA form.
VHDL models of conveyor sorting devices was carried                   The implementation of highly efficient conveyor devices
out, their synthesis and implementation was made in the            for sorting binary arrays requires extensive use of a modern
Xilinx FPGA by automated designing Vivado system                   element base, the improvement of existing and the
applications. The received practical results of the                development of new methods, algorithms and device
complexity characteristic of the developed conveyor                structures.
devices coincide with theoretical calculations. There are             In this regard, an important t ask is to develop efficient
also determined the relevant areas of application of the           structures for conveyor sorting devices (CSDs), which will
developed devices.                                                 improve the time characteristics of multithreaded data
   Keywords: special device for sorting arrays of data,            processing. The implementation of such devices in the
conveyor sorting device, basic sorting elements,                   languages of the hardware description and their synthesis in
comparison scheme.                                                 FPGA will enable the designer to choose the optimal
                                                                   hardware cost and time-efficient characteristics of the
                     I. INTRODUCTION                               structure of the CSD.
   Sorting is one of the common problems of data processing
and it is generally understood as a problem of placing
                                                                   II. CONVEYOR STRUCTURES ANALYSIS OF PARALLEL
elements of disordered set of data sets values in order of             ALGORITHMS FOR BINARY NUMBERS SORTING
monotonic increase or decrease [1]. The sorting operation             The conveyor processing principle involves the alignment
takes on average 25% of machine time [2] and it is most            in time operators of the algorithms sorting on different data
often used in the tasks of digital processing of signals and       [5-8].
images, similar to convolution operations, fast Fourier               Conveyor structures of sorting devices of binary arrays is
transform, and others [3]. We know many methods of                 developed and analyzed below. They are based on the known
sequential and parallel sorting of binary numbers [1-9]. There     parallel sorting algorithms presented in a graphical form, and
is a large number of algorithms to organize sorting in modern      their hardware and time complexity are estimated. (Their
digital signal processors, and each of these algorithms has its    system characteristics)
advantages and disadvantages. Performing a software sorting           The hardware implementation of the known parallel
operation is time-consuming and generally used to perform          sorting algorithms in the conveyor structures involves the full
consistent data sorting methods. Particularly effective is the     reflection of their flow graphs algorithm (tiered-parallel
parallel performance of algorithm sorting operations using         algorithm forms) [4-8,10,11,13] into the structure of the
hardware methods that significantly accelerate the execution       operating device. The vertices of the graph (functional
time of the algorithm. Such methods of parallel sorting            operators) will correspond to the hardware block (operation)
algorithms, in which the sequence of executed operations           and the arcs will correspond to lines for the transmission of
only depends on the number of input data, are the following:       input data of intermediate and final results. The conveyor
the Batcher’s method [2], the modified "bubbles" method [1],       registers are placed on the lines that connect the operating


                          ACIT 2018, June 1-3, 2018, Ceske Budejovice, Czech Republic
                                                                      124

units of one tier with the operating units of the previous tier.                      x1           x2        x3         x4            x5             x6
  Tier-parallel algorithm forms determines the degree of
                                                                                     Rg1           Rg2       Rg3       Rg4         Rg5               Rg6
parallelism of the graph (the maximum number of vertices in
a single tier or the width of the graph), as well as the                                BSE1                     BSE2                   BSE3
minimum-possible time of calculation of the given algorithm
(number of tiers or height of the graph) [6-8].                                       Rg7          Rg8       Rg9      Rg10         Rg11            Rg12
  It is shown the structure of the CSD in Fig. 1, which is
based on the numbers sorting by modified “bubbles” sorting                                           BSE4                    BSE5
method [1,13].
                x1         x2          x3          x4                             Rg13             Rg14     Rg15      Rg16         Rg17             Rg18


               Rg1        Rg2       Rg3        Rg4                                      BSE6                     BSE7                      BSE8


                                                                                  Rg19             Rg20     Rg21      Rg22         Rg23            Rg24
                                            BSE1
                                                                                                     BSE9                    BSE10
               Rg5        Rg6       Rg7        Rg8

                                                                                  Rg25             Rg26     Rg27      Rg28         Rg29             Rg30

                                BSE2
                                                                                       BSE11                     BSE12                  BSE13

               Rg9       Rg10       Rg11       Rg12
                                                                                      Rg31         Rg32      Rg33         Rg34      Rg35             Rg36


                 BSE3                       BSE4                                                      BSE14                  BSE15


              Rg13      Rg14        Rg15       Rg16                                   Rg37         Rg38      Rg39         Rg40         Rg41          Rg42


                            BSE5
                                                                                       y1           y2           y3       y4            y5            y6
                                                                            Fig.2. The structure of Conveyor sorting device for 6 values by
               Rg17      Rg18      Rg19        Rg20
                                                                                          "pairwise-odd" permutation method.
                                                                                x1          x2       x3         x4      x5       x6           x7           x8
                                            BSE6
                                                                               Rg1          Rg2      Rg3     Rg4      Rg5        Rg6          Rg7         Rg8


              Rg21       Rg22       Rg3        Rg24                              BSE1                    BSE2             BSE3                  BSE4


                y1         y2          y3          y4                          Rg9         Rg10     Rg11     Rg12     Rg13       Rg14        Rg15         Rg16

Fig.1. The structure of Conveyor sorting device for 4 values by the
                modified “bubbles” sorting method.                                BSE5                    BSE6            BSE7                  BSE8

  The basis for the above-presented structure of conveyor                      Rg17         Rg18    Rg19     Rg20     Rg21       Rg22         Rg23         Rg24

sorting device is the basic sorting elements (BSE) and
conveyor register (Rg).                                                                       BSE9                                  BSE10
  The structure of this conveyor device consists of the same
                                                                               Rg25         Rg26    Rg27     Rg28     Rg29       Rg30         Rg31         Rg32
basic sorting operations (max/min). CSD structure (Fig. 1)
contains N(N-1)/2 basic sorting operations and N(2N-3)
                                                                                 BSE11                   BSE12            BSE13                 BSE14
conveyor registers for input N values.
  The time complexity of this conveyor device structure is                     Rg33         Rg34    Rg35     Rg36     Rg37       Rg38         Rg39         Rg40
determined by the critical distribution of the signal through
(2N-3)      basic      operations       of      sorting     and                               BSE15               BSE16             BSE17
(2N-3)+1 conveyor registers.
                                                                               Rg41         Rg42    Rg43     Rg44     Rg45       Rg46        Rg47          Rg48
  It is shown the structure of the CSD in Fig. 2, which is
constructed on the basis of sorting algorithm by "pairwise-
                                                                                                         BSE18            BSE19
odd" permutation method [2,13].
  CSD structure (Fig. 2) contains N(N-1)/2 basic sorting                       Rg49         Rg50    Rg51     Rg52     Rg53       Rg54         Rg55        Rg56
operations and (N × N) conveyor registers for N input values.
  The time complexity of this conveyor device structure is                       y1          y2      y3          y4     y5        y6           y7           y8
determined by the critical distribution of the signal through N
                                                                            Fig.3. The structure of Conveyor sorting device for 8 values by
basic operations of sorting and (N+1) conveyor registers.                                          Batcher`s method.
  It is shown the CSD structure in Fig. 3, which is
constructed on the basis of the sorting algorithm graph by                  CSD structure (Fig. 3) contains 0, 48 N ln 2 N basic sorting
Batcher’s method [3,13].


                           ACIT 2018, June 1-3, 2018, Ceske Budejovice, Czech Republic
                                                                   125

                           1                                           τ CS - time complexity of comparison scheme,
operations     and           ( log 2 N )([log 2 N + 1]) × N conveyor
                           2                                           τ - time complexity of multiplexer,
                                                                         MP
registers for N input values.
   The time complexity of this conveyor device structure is            τ REG - time complexity of register.
determined by the critical distribution of the signal through
 1
                                                                       III. DEVELOPMENT OF AN IMPROVED STRUCTURE OF
   ( log 2 N )( log 2 N + 1) basic operations of sorting and (N-1)              THE CONVEYOR SORTING DEVICE
 2
conveyor registers.                                                       The bubble CSD (Fig. 1) can be improved by performing
   The conveyor structures of the sorting devices (Fig. 1-3)           independent base-sorting operations separately for the first
consist of the same type of basic sorting elements that                and second half of the input data on the two conveyor
compare two numbers and more of them are sent to the first             structures.
output and less to the second output. The internal structure of           Values X are already ordered by recession and need to be
sorting basic element is presented in Fig. 4.                                         i
                             X1       X2                                                                         2
                                                                       further applied to (( N 2) − N 2) / 2 + N 2 basic sorting
                                                                       elements and 4N conveyor registers.
                                                                          It is shown an improved CSD for 4-bit binary numbers
                                                                       (n = 4) for 8 input values (N =8) in Fig. 5.
                                                                                 x8    x7    x6    x5      x4    x3 x2 x1
                           Comparison
                                                                               Rg8         Rg7      Rg6         Rg5          Rg4      Rg3      Rg2     Rg1
                            Scheme
                                                                                                          BSE2                                     BSE1

              М1                           1       М2                         Rg16        Rg15      Rg14      Rg13           Rg12   Rg11       Rg10       Rg9


                                                                                               BSE4                                     BSE3

               Y1(Min)                             Y2 (Max)                   Rg24         Rg23       Rg22    Rg21       Rg20         Rg19     Rg18    Rg17


        Fig.4. Internal structure of the basic sorting element                   BSE8                    BSE7                  BSE6                BSE5

   When comparing two numbers (X 1 , X 2 ) in the comparison
                                                                              Rg32        Rg31      Rg30      Rg29           Rg28   Rg27      Rg26    Rg25
scheme "for more" (X 1 > X 2 ), the output of the scheme is
formed a comparison sign, the direct value of which controls
                                                                                                 BSE10                                      BSE9
the multiplexer M1, to issue a smaller number to the output
(Y 1 ) and the inverse value controls the multiplexer M2 to                    Rg40       Rg39      Rg38      Rg37           Rg36   Rg35      Rg34    Rg33

issue a larger number at the output (Y 2 ) [5,13].
   The internal structure of the comparison scheme can be                                                BSE12                                     BSE11

constructed on the basis of logical elements [5] or accelerated                Rg48       Rg47      Rg46      Rg45           Rg44   Rg43      Rg42    Rg41
carry adders [13].
   The calculation of the hardware and time complexity of
different comparison schemes and 2-input multiplexers is
given in paper [6].                                                             BSE16                 BSE15                   BSE14                BSE13
   With the account to these data, the hardware complexity of
                                                                              Rg56        Rg55     Rg54       Rg53       Rg52       Rg51       Rg50   Rg49
the CSD by the modified "bubbles" method for N = 8 and
n= 4 equals:
                                                                                               BSE19                 BSE18             BSE17
        ACCSD = N × (N -1)/2 × (A CS + 2 × A MP )+(n × A REG ×
     ×LREG ) = 28 × (24 + 56)+(4 × 4 × 112) = 4032 ( gates ),                 Rg64    Rg63         Rg62         Rg61     Rg60       Rg59       Rg58   Rg57


where ACCSD - hardware complexity of classical CSD,                                                      BSE20                BSE21

A CS - hardware complexity of comparison scheme,                              Rg72    Rg71         Rg70       Rg69       Rg68       Rg67      Rg66     Rg65

A MP - hardware complexity of multiplexer,
                                                                                                                     BSE22
A REG - hardware complexity of register,
                                                                              Rg80    Rg79         Rg78         R77      Rg76       Rg75       Rg74   Rg73
LREG -quantity of conveyor register.
  The time complexity of this conveyor device equals:                          y8         y7        y6           y5          y4       y3       y2         y1
τ CCSD = (τ CS + τ MP ) + (14 ×τ REG ) =                                              Fig.5. An improved structure of the CSD

     = (6 + 3) + (14 × 2) = 37ν ( micro − cycles ),                      The number of basic sorting elements for this CSD for the
where τ      - time complexity of classical CSD,                       binary numbers is equal to 3(( N 2) 2 − N 2)) / 2 + N 2 , and
        CCSD


                             ACIT 2018, June 1-3, 2018, Ceske Budejovice, Czech Republic
                                                                      126

the number of conveyor registers is 6 N + 4 N for N input
values.
  The hardware complexity of an improved CSD for N = 8
and n = 4 is equal to:
         AICSD = 22 × (A CS + 2 × A MP )+(n × A REG × LREG ) =

         =22 × (16 + 24)+(4 × 4 × 80) =2160 ( gates ),
where AICSD - hardware complexity of improved CSD,
A CS - hardware complexity of improved comparison scheme
[13],
A - hardware complexity of improved multiplexer [13],
 MP

A REG - hardware complexity of register,
LREG -quantity of conveyor register.
  The time complexity of such CSD is equal to:                               Fig.7. The dependence graph of the time complexity on the
           τ ICSD = (τ CS + τ MP ) + (10 ×τ REG ) =                                   number of the input data for the CSDs

                   = (3 + 2) + (10 × 2) = 25ν ( micro − cycles ),          It is shown in the graph that the time complexity of an
where τ          - time complexity of classical CSD;                    improved CSD requires about 1.5 times less microtacts than
          CCSD                                                          the classical CSD.
τ CS - time complexity of comparison scheme;
                                                                        V. RESULTS OF CSD’S SIMULATION AND SYNTHESIS
τ - time complexity of multiplexer;
 MP                                                                                       ON FPGA
τ REG - time complexity of register.
                                                                           Structures of classical and improved CSD for 8 input –one-
  So, in comparison with the classic device, we get a                   byte numbers are described in VHDL (Virtual Hardware
                                 =
reduction in the hardware complexity in K  =
                                          4032 2160 1,9                 Description Language). The simulation of the developed
                                                      A
                                                                        CSDs on the functional level was carried out, their RTL
times and increase the speed in=  =
                               K 37 25 1,5 time.                        circuits were obtained and the Xilinx FPGA synthesis was
                                        τ
IV. RESULTS OF THE CONVEYOR SORTING DEVICES                             performed.
                  RESEARCH                                                 It is shown in Fig. 8 a functional diagram of the improved
                                                                        CSD operation for 8 input one-byte numbers.
   It is shown a dependence graph of logical elements
(valves) number on the input data number for the conveyor
structures of the known (classical) and improved sorting
devices by the modified "bubbles" method.


                                                                            Fig.8. Functional simulation diagram for CSD of 8 one-byte
                                                                                                      numbers
      Fig.6. Graph of the dependence of the number of valves on the        The diagram shows that an array of unsorted 8-bit values is
                    input data number for the CSDs                      given at the inputs of the CSD (D_in1, ..., D_in8) on the 1st
  The graph shows that the number of logical elements for               cycle. Then we get a sorted descending order at the outputs
an improved CSD is 1.9 times less than for a classic CSD.               (D_out1, ..., D_out8) at the 9th cycle.
  It is shown a graph of the dependence of the time                        It is shown in Fig. 9 the topology view of improved CSD
complexity in Fig. 7 expressed in the microtacts on the input           with the VHDL implementation model on the
data value for the structures of the known (classical) CSD              xc7a100tcsg324-1 crystal (Artix-7 family) of the Xilinx
and the improved CSD with the modified "bubbles" method.                firmware by Vivado CAD.


                              ACIT 2018, June 1-3, 2018, Ceske Budejovice, Czech Republic
                                                                   127

                                                                     [2] Thomas Braunl,. Parallel Programming: Transl. from
                                                                         german ,К.: Vyscha Shkola., pp. 358, 1997.
                                                                     [3] S.Y. Kung, VLSI array processors. Trans. from English -
                                                                         М.: “Peace”, pp. 672, 1991.
                                                                     [4] Thomas H. Cormen, Charles E. Leiserson, Ronald L.
                                                                         Rivest and Clifford Stein. Introduction to algorithms, 3rd
                                                                         edition, Massachusetts Institute of Technology, pp.1313,
                                                                         2009.
                                                                     [5] A. Melnyk, Memory with ordered access: monograph,
                                                                         Lviv: Press Lviv Polytechnic, pp. 296, 2014.
                                                                     [6] A. Melnyk, Computer Architecture, Lutsk:Regional
                                                                           publisher, pp. 470, 2008.
                                                                     [7] V. Voevodin, Parallel Computing. Textbook for high
                                                                           schools, BHV:Peter, pp. 608, 2002.
                                                                     [8] Andrew S. Tanenbaum Computer Architecture Austin.
                                                                           6th ed., St. Petersburg: “Peter”, pp. 816, 2013.
                                                                     [9] O. Kehret, A. Walz and A. Sikora, “Integration of
                                                                           hardware security modules into a deeply embedded TLS
                                                                           stack” International Journal of Computing, vol. 15,
                                                                           issue 1, pp. 24-32, 2016.
                                                                     [10] V. Zadiraka, Ya. Nykolaichuk. Methods of effective
Fig.9. View of the crystal topology of CSD during implementation           protection of information flows, Ternopil: Terno-graf,
                            at the FPGA                                    pp.308, 2014.
   Vivado CAD tools placed the implemented VHDL project              [11] S. Melnychuk, A. Voronych, L. Nykolaichuk and B.
of the improved CSD almost in the center of the crystal.                   Krulikovskyi “The Structure and Components of
   Table 1 presents the results of the synthesis of                        Embedded Special Processors for Determination of
implemented CSDs for sorting 8 one-byte numbers on the                     Entropy Signals and Random Massages” Perspective
FPGA of the Xilinx firmware.                                               technologies and methods in MEMS design.
                                                                           Proceedings of XIIIth International Conference.
     TABLE 1. RESULTS OF THE CSDS SYNTHESIS ON FPGA                        MEMSTECH 2017, Lviv-Svalyava, Ukraine, pp.81-84,
                     Classic CSD           Improved CSD                    February, 2017.
                  Blocks                 Blocks                      [12] B. Krulikovskyi, A. Davletova, V. Gryga and Y.
                              Clock                 Clock                  Nykolaichuk “Synthesis of Components of High
       FPGA      quantity               quantity
                            frequency             frequency                Performance Special Proccessors of Execution of
                  FPGA                   FPGA
                              (MHz)                 (MHz)                  Arithmetic and Logical Operations Dsts Proccessing in
                  (CLB)                  (CLB)
                                                                           Theoretical and Numerical Basis Rademacher”,
 1     Artix 7     950       109,9         475        174,7                Proceedings of XIVth International Conference.
  As can be seen from Table 1, experimental results coincide               CADSM’2017, Lviv-Poljana, Ukraine, pp. 214-217,
with analytical calculations.                                              February,2017.
                                                                     [13] V. Gryga, Y. Nykolaichuk, N. Vozna and B.
                     VI. CONCLUSION                                        Krulikovskyi “Synthesis of a microelectronic structure
   During the research of CSDs it is determined that they are              of a specialized processor for sorting an array of binary
widely used in multithreaded data processing, and are often                numbers” Perspective technologies and methods in
used in digital processing of signals, images and sorting                  MEMS design. Proceedings of XIIIth International
networks for the rapid transfer of large data arrays.                      Conference. MEMSTECH 2017, Lviv-Svalyava,
   The improved structure of the CSD of binary arrays by the               Ukraine, pp.170-173, April, 2017.
modified "bubble" method was developed, the hardware and             [14] V. Gryga, I. Kolosov and O. Danyluk “The development
time complexity calculation was performed.                                 of a fast iterative algorithm structure of cosine
   As a result of the comparison with the classical CSD for                transform” The Experience of Designing and
binary number array by the modified bubble method it is                    Application of CAD System in Microelectronics.
received decreasing the cost of the equipment in 1.9 times                 Proceedings of XIIIth International Conference.
and increasing of speed in 1.5 times, which is confirmed by                TCSET’2016, Lviv-Poljana, Ukraine, pp.506-509,
the results of the practical implementation of the Xilinx                  February,2016.
FPGA.                                                                [15] R. Dunets and V. Gryga “Spatio-temporal synthesis of
                                                                           transformation matrix of reverse fast cosine
                         REFERENCES                                        transformation”, Proceedings of XIIIth International
[1] D. Knuth, The art of computer programming, V.3: Transl.                Conference. CADSM’2015, Lviv-Poljana, Ukraine,
    from eng. М.: “Peace”, pp. 841, 1978.                                  pp.45-49, February, 2015.


                          ACIT 2018, June 1-3, 2018, Ceske Budejovice, Czech Republic