        The GPU-Oriented Tree Representation Based on the
                   Method of Finding the Remainder


        Vladimir Voronkin          Andrey Malikov          Elmira Azarova
     vl.voronkin@raxperi.com      malikov@ncstu.ru     elmira.azarova@mail.ru

                            Aleksey Shchegolev
                           a.wegolev@gmail.com


                     North-Caucasus Federal University,
       Institute of Information Technologies and Telecommunications,
                           Russian Federation




                                  Abstract

           This paper describes a new method of tree-like structure representation
           in databases. The developed method of finding relationships between
           nodes is based on the unique prime factorization theorem, which states
           that every natural number can be written as a product of prime numbers.
           The developed method is implemented using NVIDIA graphics processing
           unit parallel computing and NVIDIA CUDA technology. The paper analyzes
           the process of tree construction with the developed method applied,
           presents the algorithms of GPU-based processing of trees coded using
           the developed method, and also the results of performance measurements.




1    Introduction

Computer systems are developing rapidly, processor architectures are changing, and main and disk memory sizes are increasing. Nowadays the rapid growth of computing power and data volumes allows for the development of high-performance data management and processing systems.
    A large amount of data in the world is currently represented, in one form or another, as a hierarchy or mutual dependency, where some pieces of information depend on others.
    What is more, data management systems are beginning to migrate from disk-oriented technology to memory-oriented technology. The development of DBMSs has led to many database management systems integrating in-memory technology support [GBE13].
    Data processing in main memory enables a considerable increase of the data processing rate at the expense of safety. Modern DBMSs supporting in-memory technology process data in memory while mirroring changes to the disk.
    The application of memory-oriented data processing technologies, together with new methods of representing trees using GPU computing, can increase the data processing rate.
    This research aims to develop a new method of hierarchy representation in databases that exhibits high performance in tree manipulation operations.


Copyright © 2017 by the paper's authors. Copying permitted for private and academic purposes.

In: S. Hölldobler, A. Malikov, C. Wernhard (eds.): YSIP2 – Proceedings of the Second Young Scientist's International Workshop on Trends in Information Processing, Dombai, Russian Federation, May 16–20, 2017, published at http://ceur-ws.org




2     Related Works

This work is devoted to the description of the developed method of representing trees in databases. The development of the method was initially oriented towards maximum parallelization of tree processing and maximum independence of each node code from other node codes. The paper focuses on a radically new method of representing trees in relational and nonrelational databases, and also demonstrates that GPU computing of hierarchical relationships can be more effective than CPU computing. The developed method of representing trees is oriented towards the parallel processing mode with the use of a GPU.


2.1    Methods of Representing Trees

All the methods of representing trees in databases can be put into two categories:
    - methods of representing trees in non-graph databases;
    - methods of representing trees in graph and hierarchical databases.
    The following methods belong to the first category:
    - The adjacency list method of relational hierarchical modeling.
    The adjacency list method is based on the storage of direct parent-child relationships. Each child in a treelike hierarchy relates to its parent, which is one level higher. Using this hierarchy property, a table-level relational dependence can be created.
    - The materialized path method.
    A materialized path is a string field that consists of element names separated by a separator character [Tro03]. The names of the parents of the node for which a materialized path is built are used as the materialized path elements.
    - The Nested Sets method.
    The Nested Sets method implies that there are two additional fields for the node description. Tree coding using the nested sets algorithm introduced by Joe Celko [Cel12] [Cel] [Cel04] is done as follows: moving down the left side of a tree, it is necessary to travel through all the subtrees, beginning with the leftmost and ending with the rightmost, and assign each node an auto-increment value. When moving down the tree, the auto-increment value is assigned to the variable responsible for the interval starting point ("left bound"); when moving up, to the variable responsible for the interval end ("right bound").
    - The nested intervals method.
    The nested intervals method, introduced in [Tro05], is based on coding the materialized path as a finite continued fraction [q1, q2, ..., qn], where q1, q2, ..., qn are the steps of the materialized path. Rational numbers a/b, where a >= b >= 1 and GCD(a, b) = 1, are used as tree element codes.
    As an example, consider how the code for the materialized path "3.2.2" is created using a continued fraction:

                                  3 + 1/(2 + 1/2) = 17/5

    To transform a materialized path to a rational expression, the convergence principle can be used, while for the inverse transformation the algorithm of gradual truncation should be used, which in the case of a rational expression is the same as Euclid's algorithm for finding the GCD [Gai99]. (A code sketch of these two conversions is given at the end of this list.)
    - Modifications and combinations of the aforementioned methods.
    There are also modifications of the given methods that can fix their inherent problems [Kol09] [vor14] [MT11]. For example, in [MT11] an improvement of the nested intervals method is introduced through interval coding in the residue number system (RNS). Using RNS enables reducing the length of the numbers used by storing a number as short remainders. Representing node codes as numbers expressed in RNS enables parallel processing not only of different tree nodes but also of the remainders of one and the same node. Thus, the maximum parallelization of data processing is increased. In [vor14] a method of storing nested intervals expressed in RNS is introduced that solves some problems of sorting numbers expressed in RNS and of index construction.
    Such methods as the materialized path, nested intervals and modifications of both include the data on children and/or parents in the key while coding and can be processed in parallel with sufficient effectiveness.
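    To make the nested intervals coding concrete, here is a small illustrative sketch (host-side CUDA/C++; the function names pathToRational and rationalToPath are ours, not from [Tro05]) of the two conversions: folding the steps of a materialized path into a rational number, and recovering the steps by the gradual truncation that mirrors Euclid's GCD algorithm.

    #include <cstdint>
    #include <cstdio>

    // Fold the path steps right-to-left into a continued fraction:
    // [3, 2, 2] -> 3 + 1/(2 + 1/2) = 17/5.
    void pathToRational(const int *steps, int n, int64_t &a, int64_t &b) {
        a = steps[n - 1]; b = 1;                 // innermost term
        for (int i = n - 2; i >= 0; --i) {       // q_i + 1/(fraction so far)
            int64_t na = steps[i] * a + b;
            b = a;
            a = na;
        }
    }

    // Inverse transformation: peel off integer parts, truncating the
    // remainder exactly as in Euclid's algorithm for the GCD.
    void rationalToPath(int64_t a, int64_t b) {
        while (b > 0) {
            printf("%lld.", (long long)(a / b)); // next path step
            int64_t r = a % b;                   // truncation step
            a = b;
            b = r;
        }
        printf("\n");
    }

    int main() {
        int steps[] = {3, 2, 2};
        int64_t a, b;
        pathToRational(steps, 3, a, b);          // yields a/b = 17/5
        printf("%lld/%lld -> ", (long long)a, (long long)b);
        rationalToPath(a, b);                    // prints 3.2.2.
        return 0;
    }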
    The methods of representing trees in graph and hierarchical databases are difficult to classify, as most graph databases use their own formats and structures of data representation.
    In some cases pointers are used for linking neighbor nodes. The pointers are bound to the data that appear on arcs. In its structure this hierarchy model is similar to the adjacency lists.
    Other graph databases [Kup86] [KV93] use a table model for storing relationship arcs.
    Another format of storing graph data is XML and XML-based formats [MS11] [BELP].




[Figure 1: (a) a graph representation of persons and companies linked by WORKS (since) relations; (b) the equivalent relational representation with index lookups on Person.id, CompanyId and Company.Name]

                        Figure 1: Representation types of dependent data

    Fig. 1a and 1b illustrate the aforesaid. Fig. 1b illustrates a relational representation of hierarchical data in a DB, while Fig. 1a illustrates data representation in the form of a graph.
    The illustration data exemplify the storage of hierarchy data in databases.
    One of the distinctive features of graph databases is the base chosen for the representation of database objects, relations between the data, complex objects and their attributes.
    All graph databases are based on a mathematical description of a graph that determines the graph manipulation operations. A graph model depends on the base, for example: a directed or undirected graph, marked or unmarked graph nodes, weighted or unweighted arcs.
    Some models such as GOOD, GMOD, G-Log and Gram represent both the scheme and the instance as a labeled digraph. LDM is an exception: its schemes are digraphs with leaves representing data and internal nodes representing structured data. An LDM instance consists of two-column tables, each of which associates entities with data of a type (primitive, tuple or set). A more detailed description of graph models is given in [AG08].


2.2      GPU Data Processing

The graphics processing unit seems to be an advanced high-performance platform for computing. The basic distinctive feature enabling the GPU's high performance is its architecture with a great number of ALUs. The GPU and CPU architectures are compared in [PT11]. A number of works [BS10] [HKOO11] point to the advantage of GPU data processing over CPU. On average, the queries were processed 5–30 times faster in relational databases. Some works [PSHL10] [BMCN09] show the increased rate of sorting data using the GPU. As far as hierarchical and graph structures are concerned, the works [HN07] [HKOO11] [WO10] are worth mentioning, in which GPU graph computations are made and which describe the speedup of data processing using the GPU compared to the CPU (when processing graph structures).
    Hereafter, mentioning GPU computing means the usage of NVIDIA CUDA technology.
    CUDA (Compute Unified Device Architecture) is an NVIDIA technology designed for the development of applications for massively parallel computing units. Due to its great number of ALUs, the GPU implements the thread model of computation: there are input and output data that consist of identical elements which can be processed independently. Element processing is performed by the kernel.
    A kernel is a function executed on the GPU and called from the CPU. A kernel is composed of a grid of multiple threads grouped into a hierarchy.
    The highest level, the grid, corresponds to the kernel and groups all the kernel threads. A grid is a one- or two-dimensional array of blocks. Each block is a one-, two- or three-dimensional array of threads. Furthermore, each block is an entirely independent set of interacting threads.
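    As a minimal sketch of this hierarchy (the kernel name touchElements and the sizes are illustrative, not from the paper), the following CUDA fragment launches a one-dimensional grid of one-dimensional blocks, and each thread derives its unique global index from the block and thread coordinates:

    #include <cuda_runtime.h>

    // Each thread processes one element independently, as described above.
    __global__ void touchElements(int *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
        if (i < n)                                     // the grid may overshoot n
            data[i] += 1;                              // independent per-element work
    }

    int main() {
        const int n = 1 << 20;
        int *d_data;
        cudaMalloc(&d_data, n * sizeof(int));
        cudaMemset(d_data, 0, n * sizeof(int));

        // 256 threads per block; enough blocks to cover all n elements.
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        touchElements<<<blocks, threadsPerBlock>>>(d_data, n);
        cudaDeviceSynchronize();

        cudaFree(d_data);
        return 0;
    }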



3       Principles of Tree Construction

By "tree" we will understand a connected acyclic graph composed of a set of nodes n = (k, d), where k is the node code and d is the data associated with the node:

                                T = {n} = {(k, d)}

    Let us define for a node n the operations key(n) = k (access to the node code) and data(n) = d (access to the data associated with the node).




    Let us call a prime number unique within the node Ids the node identifier. Let us introduce a set of primes I, listing the primes in increasing order of their values:

                                I = s1, s2, ..., sm, ...

    Let us call the product of the node's parent identifiers and the node's own Id the node code.
    The code of a tree root is the root identifier.
    Let us introduce the operation of getting the next prime number from the set of primes I:

                                next() = s_i, i := i + 1

    Primes are needed for tree construction because the decomposition of a code must be unique: by the unique prime factorization theorem, every code decomposes into exactly one multiset of identifiers.
    The usage of one and the same identifier for more than one node is impossible.
    A tree is formed as follows:
    1. A prime is chosen as the root node code.
    2. Each node added to the tree receives a code equal to the product of the parent code and the identifier of the added node.
    Each tree node thus encodes in its code the entire path starting from the root. This method of node code creation is similar to the materialized path method.
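    The following host-side sketch illustrates the coding scheme (the helper addNode and the Node structure are ours; next() is the operation defined above). Note that plain 64-bit codes overflow quickly, which motivates the key compression discussed in Section 7.4:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // next() returns primes in increasing order: next() = s_i, i := i + 1.
    // (A fixed table is enough for this illustration.)
    static const uint64_t primes[] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29};
    static int primeIdx = 0;
    uint64_t next() { return primes[primeIdx++]; }

    struct Node { uint64_t key; const char *data; };

    // A node's code is the product of its parent's code and its own prime Id.
    uint64_t addNode(std::vector<Node> &tree, uint64_t parentKey, const char *d) {
        uint64_t key = parentKey * next();
        tree.push_back({key, d});
        return key;
    }

    int main() {
        std::vector<Node> tree;
        uint64_t root = addNode(tree, 1, "root"); // root code = its identifier: 2
        uint64_t a    = addNode(tree, root, "a"); // 2 * 3 = 6
        uint64_t b    = addNode(tree, root, "b"); // 2 * 5 = 10
        addNode(tree, a, "a1");                   // 6 * 7 = 42, i.e. the path 2*3*7
        printf("codes: %llu %llu %llu\n", (unsigned long long)root,
               (unsigned long long)a, (unsigned long long)b);
        return 0;
    }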


4     Tree Processing Operations

Let us show the admissible tree operations. Let us introduce the function of finding the remainder on dividing a by b:

                                mod(a, b) = a − b · ⌊a/b⌋

    The following tree operations are admissible:
    1. add(T, d, n_parent), where n_parent is the parent node and d is the data added to the tree – the operation of adding a node to the tree.
    Precondition:
                                exists(T, n_parent)
    Realization:
                        T := T ∪ {n(key(n_parent) · next(), d)}                                       (1)

    The predicate exists(T, n) determines whether the node n exists in the tree T:

                                exists(T, n) := n ∈ T

    2. node(T, k) – the operation of searching for a node with the code k in the tree.

                        node(T, k) := {x ∈ T | key(x) = k}                                            (2)

    3. parents(T, k) – the operation of searching for the node's parents in the tree.

                parents(T, k) := {x ∈ T | mod(k, key(x)) = 0 ∧ k ≠ key(x)}                            (3)

    4. parent(T, k) – the operation of searching for the direct parent.

                        parent(T, k) := MAX(parents(T, k))                                            (4)

    5. subtree(T, k) – the operation of searching for a subtree.

                        subtree(T, k) := {x ∈ T | mod(key(x), k) = 0}                                 (5)

    6. children(T, k) – the operation of searching for direct children.

                children(T, k) := {x ∈ T | k = key(parent(T, key(x)))}                                (6)

    7. remove(T, k) – the operation of subtree removal.




    Realization:
                        T := {x ∈ T | mod(key(x), k) ≠ 0}                                             (7)

    8. move(T, n_old, n_new) – the operation of transferring the subtree rooted at n_old to a new parent n_new.
    Precondition:
                exists(T, n_old) ∧ exists(T, n_new) ∧ mod(key(n_new), key(n_old)) ≠ 0
    Realization:

            T := T ∪ {n(recalcKey(T, key(x), key(n_old), key(n_new)), data(x)) |
                      x ∈ subtree(T, key(n_old))}                                                     (8)

    where recalcKey(T, k, k_old, k_new) is the operation of node code recalculation:

                        recalcKey(T, k, k_old, k_new) := (k / k_old) · k_new
    Considering the possibility of executing parallel operations on a tree as a set, the following features should be noted:
    1. The possibility of parallel tree processing using the GPU. Tree operations (2, 3, 4, 5, 6) that do not modify the tree need no interaction between nodes, so all such operations can be processed in parallel, or simultaneously. Parallelization of tree manipulation operations down to one thread per node is possible (a code sketch of the underlying predicates follows this list).
    2. The order of records in the tree representation in memory influences only the order of records in the result set and does not influence the number of records in it.
    3. The usage of primes in a tree in sequential order is optional.
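    Below is a brief sketch of the divisibility predicates behind operations (3) and (5) (the function names are ours). For unsigned integers, mod(a, b) = a − b · ⌊a/b⌋ is simply a % b, and marking the functions __host__ __device__ lets the same code run on the CPU and inside GPU kernels, one node per thread:

    #include <cstdint>

    // parents(T, k): x is an ancestor of the node with code k
    // iff key(x) divides k and x is not the node itself.
    __host__ __device__ inline bool isParentOf(uint64_t key_x, uint64_t k) {
        return (k % key_x == 0) && (k != key_x);
    }

    // subtree(T, k): x lies in the subtree rooted at code k
    // iff k divides key(x).
    __host__ __device__ inline bool inSubtreeOf(uint64_t key_x, uint64_t k) {
        return key_x % k == 0;
    }

    // recalcKey for move(T, n_old, n_new): replace the old code prefix
    // with the new parent's code.
    __host__ __device__ inline uint64_t recalcKey(uint64_t k, uint64_t k_old,
                                                  uint64_t k_new) {
        return (k / k_old) * k_new;
    }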


5    GPU Computation of Trees

Each node represents a pair: { node code : pointer to node properties }. Let us call this pair a record. Since the node properties can represent a sequence of dynamically-sized free fields, each node code is associated with a pointer to the properties' location in the file rather than with the properties themselves.
    In the file, data is stored as a sequence of records. When loading, records are split as follows: the keys are placed into GRAM (GPU memory), the pointers to the properties are placed into RAM. The loaded data (keys and pointers) is stored as a sequence. For such a form of representation it is important that the order of records in RAM is strictly the same as in GRAM. When swapping some nodes in GRAM, it is necessary to swap the corresponding nodes in RAM in a similar way.
    When executing a query, say a lookup query, all the records in the tree are reviewed. The query result is computed on the fly while the query is executed; thus the search is not narrowed.
    To save the time of transferring the processing result to RAM, the GPU builds a bit map of the query result, in which a set bit denotes a node matching the query and a cleared bit denotes a non-matching node (Fig. 2). The bit map is formed in the kernels (a kernel is a function executed on the graphics card) of query execution; as a result, one map bit corresponds to each node. For example, with a 4-byte key the size of the data transmitted through the bus (between GPU and CPU) is decreased 32 times.
    To get the actual nodes of the resulting selection, it is necessary to find the set bits in the bit map. The ordinal number of a set bit is the number of the storage cell in RAM (and GRAM) in which the data address of the found node (the found node code) is stored.
    Let us consider the process of executing some tree operations. Each operation is executed for each node in the set.
    The whole tree processing is realized on the GPU.
    The result of complex query processing, for example a common parent search, is formed by combining the bit maps of the individual query results, thus avoiding long per-node search operations.
    For example, commonParents(A, k, r), the operation of searching for the common parents of two nodes with the codes k and r, looks like this:

                commonParents(A, k, r) := parents(A, k) ∩ parents(A, r)

For finding common parents, the most rational (with regard to the time of finding a solution) method is executing two parent-search queries, one for each node, and combining the bit maps of the query results through the bitwise AND operator (Fig. 2).




                                Figure 2: Bitwise AND scheme

The combining of bit maps, as well as the calculations themselves, takes place on the GPU. The GPU execution of the bitwise AND operator for a 4-byte integer requires four multiprocessor cycles, without considering the memory access delay (4 cycles for the shared memory).
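    A sketch of this bit map formation is given below (the kernel names are illustrative, not the authors' actual kernels). Each thread tests one key against the subtree condition mod(key(x), k) = 0 and sets the corresponding bit; the bit map words are assumed to be zero-initialized (e.g. with cudaMemset). A second kernel combines two result bit maps with the bitwise AND, one thread per 32-node word, as in the commonParents query:

    #include <cstdint>

    // One thread per node: set bit i if node i belongs to the subtree of k.
    __global__ void subtreeBitmap(const uint64_t *keys, int n,
                                  uint64_t k, uint32_t *bitmap) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && keys[i] % k == 0)                 // subtree test
            atomicOr(&bitmap[i / 32], 1u << (i % 32)); // set bit i
    }

    // One thread per 32-bit word: intersect two query results.
    __global__ void andBitmaps(const uint32_t *a, const uint32_t *b,
                               uint32_t *out, int words) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < words)
            out[i] = a[i] & b[i];
    }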



6     The Experiment Results

To check the developed method of representing trees, a DBMS prototype was designed supporting the basic operations of data manipulation: insertion, deletion, transfer, search. Search and transfer queries are executed on the GPU; insertion and deletion are executed on the CPU. The developed DBMS prototype does not cache data. Reading is performed directly from the disk. Queries are not cached either. The result of a search query is a bit map returned to the client in the form of a cursor. At the next cursor record request, the search for the next set bit in the map takes place, followed by reading of the document and key corresponding to the bit location in the map.
    The performance test of the developed method of representing trees was carried out in comparison with the DBMS MongoDB. MongoDB was chosen as a popular and high-performance DBMS. As an index, MongoDB uses a B+tree. For hierarchy representation in MongoDB the materialized path was chosen as the hierarchy record type.


6.1    Test Set

A tree-like database is used as a test set. As the DB base, the All-Russian Classifier of Addresses (KLADR) is used [kla].
    Data is represented as a tree with 2.5 million nodes. Each node has a different number of children. The tree is not balanced. The height of the tree is 5 levels.
    For running the experiments the KLADR reference was extended. The original reference guide has unused fields and occupies approximately 400 MB of disk space.
    Each tree node has the fields represented in Table 1.
    It is necessary to discriminate between a geographical division and the tree node to which the geographical division is attached.
    The fields represented in Table 1 are common fields for both DBMSs in the test. MongoDB uses the materialized path (field "matPath") to store the hierarchy data. In the field "matPath" the DBMS prototype stores the node code, which is the product of the parents' Ids by the node Id.
    The data sets are the same for both DBMSs tested.
    To check the performance of the method of hierarchy representation in comparison with the analogues, a series of tests was conducted.




                                  Table 1: Tree node fields

                  Field name    Data type          Description

                  name          char[32]           node name
                  matPath       varchar[128]       node materialized path
                  objName       char[16]           name of object associated with node
                  objData       varchar[10000]     data attached to object
                  gpsX          float              object coordinate X in GPS coordinate system
                  gpsY          float              object coordinate Y in GPS coordinate system
                  level         int                object nesting level in accordance with KLADR
                  objType       char[8]            object type in accordance with KLADR
                  objCode       char[16]           object code in accordance with KLADR


    A cold startup, with all the tests being run, was performed for each DBMS as a "warm-up". After that, all the tests were carried out. Before testing each DBMS the computer was rebooted.
    Each test was repeated three times. The test results represent the average received value.
    The power supply plan was set to "peak performance" and all the power-saving functions were turned off.
    For each DBMS the following tests were performed: record addition, subtree selection, selection of all the node parents, tree traversal, node code search.
    Testing of node addition time was performed as follows: the time of adding all the nodes was measured, after which the result was divided by the total number of nodes.
    The selection of all the node children assumes obtaining a cursor for the node children and reading them from the disk. 3500 queries were executed during this test.
    The selection of all the parents also assumes obtaining a cursor for the result set and reading the records from the disk. Similarly, 3500 queries were executed during this test.
    Tree traversal assumes reading the whole tree.
    The selection of children with reading assumes the execution of 10,000 queries. After obtaining the cursor for the children, all the children are read out from the disk. This test allows determination of the real DBMS performance, as MongoDB, when obtaining a cursor, returns only the first resulting records. The rest of the records are found in the index only when all the previously found ones have been read.
    The total size of the database tested is 24.4 GB without considering the normative documents.
    Two different experiments were carried out using the DB tested. The first one includes the whole database; the second one includes only 100,000 nodes (the number of queries and the data sets are left unchanged). The need for doing several experiments with one data set is caused by the necessity of determining the degree of impact of the number of records in the tree on its processing performance.
    In addition to the database tested, an experiment was set up in which 100,000 tree nodes were used as a test set; a materialized path was attached to each node as properties. The performance of this experiment enables understanding the degree of impact of the record size on the tree representation method performance.


6.2     Hardware and Software

A computer with an Intel® Core™ i5-4590 processor running the Windows 8.1 operating system was used as a computing platform. The processor is a 3.30 GHz 64-bit quad-core, with a maximum throughput of 32 GB/sec. The machine has 8 gigabytes of memory. The graphics card used is an NVIDIA GeForce GTX 760 with 1152 CUDA cores and 4 GB of global memory, supporting a maximum throughput of 192.2 GB/sec.
    The CUDA 6.5 driver is used on the computer. As for the software, MongoDB 2.6 is used. When testing, executable DBMS files downloaded from the official site were used. The developed DBMS was created using MS Visual Studio 2013 and the optimization flag O3.


6.3     Test Results

Table 2 represents the test results. All the values are average results for a set of queries.
    The results in Table 2 confirm the possibility of justified application of the developed method of hierarchy representation in databases. Such a result is observed with large data volumes and a great number of records. If a record is small, this method's performance becomes inferior to a common tree index (B+tree for MongoDB)




                            Table 2: Query execution average time

                  Query                       Represented method    MongoDB
                                                   (average time, sec/test)

                  Node addition                   768.44           14304.02
                  Subtree search                  162.01             283.25
                  Parent search                   126.08             163.00
                  Reading the whole tree          490.74            8436.37
                  Obtaining arbitrary nodes       159.90            8880.27




                Table 3: Performance comparison of tests with different data sets

  Database / test       Node addition    Subtree search    Parent search    Node search

                          2.5 million records, 10 KB/record (queries/sec)
  MongoDB                    174.78            12.36             21.47            1.13
  Developed method          3253.34            21.60             27.76           62.54

                          100,000 records, 10 KB/record (queries/sec)
  MongoDB                    162.70            23.33             93.97            5.77
  Developed method          3245.34            21.09             30.02           57.22

                          100,000 records, 20 B/record (queries/sec)
  MongoDB                   8673.03          6603.77            350000         2554.74
  Developed method            62500            30.04            108.97         1988.64



(Table 3). The difference is caused by the restriction on the maximum number of queries per second on the part of the hardware. The minimum time of configuration and startup of an empty kernel (for the card tested) in the synchronous mode is approximately 0.10 msec; consequently, the graphics card cannot execute more than 10–15 thousand kernels per second.

    Figure 3 shows the distribution of the query processing time.



[Figure 3: two pie charts of query time distribution. (a) Children search: disk 70%, CPU 20%, GPU 9.9996%, data transfer 0.0004%. (b) Parent search: disk 69%, CPU 15%, GPU 15%, data transfer 1%]

                        Figure 3: Time distribution when executing queries


    Reading data from the disk takes most of the time for both queries. The execution of kernels requires little time. On average, a kernel execution takes 4.4 msec, which provides a maximum of 227 queries per second without considering the CPU operation time (for 10 KB/record). When testing the DB with a small record size (20 bytes/record), the kernel execution time is almost the same as the testing time for large records. This means that the method of building hierarchies based on finding the remainder is predisposed to dealing with large amounts of data, whose CPU processing is a long-running operation.




7     Future Improvements

7.1      Threaded Execution

So far a multithreaded software product has not been developed (we mean CPU multithreading, not GPU multithreading). Multithreaded software is an important aspect of future developments. If a query is processed using multithreading, the maximum speedup will be achieved when executing data modification queries (record addition, transfer). The application of NoSQL-style nonlocking record functions will enable performing database modification operations in the asynchronous mode, thus improving the performance.


7.2      Multiple GPU Computing Support

Currently one graphics card performs all the calculations. The addition of several graphics cards will allow increasing either the total capacity of the data bank or the query execution speed, through processing chunks on each graphics card or storing the same data on different graphics cards.
    When using several graphics cards, the low-speed CPU-GPU bus transfer appears to be a restriction. Despite the fact that the PCIe 3.0 x16 bus has a theoretical dual-sided exchange rate of 128 GB/sec at 8 GT/s (gigatransfers per second), it is restricted by the core memory performance. In actual practice the bus transfer speed is approximately 3–9 GB/sec (on the tested PC).
    What is more, many motherboards, especially in the moderate price range, have only one PCIe slot with 16 data transmission lanes, whereas the second and the succeeding ones generally have x8 and x4.
    Even though using fewer lanes decreases multi-card configuration performance by 5–15% on average, and multi-card configurations in most cases cannot reach their full potential due to the restricted RAM speed, the benefit from using such configurations exceeds the expenses.


7.3      Hardware Restrictions

Currently an important GPU restriction is the lack of thread synchronization resources across the whole graphics card. Thread synchronization is possible only inside blocks; synchronization between blocks on the GPU is impossible. User synchronization methods are based on using flags. Hardware synchronization of the blocks would be preferable, as its performance is higher compared to software synchronization.
    Another GPU nuance is the SIMT architecture. The SIMT architecture allows the application of one and the same command to different data. The drawback of the architecture is that all the threads go through each branch of the code when branching is used, even if a thread does not execute any instructions. In other words, some threads are in operation while the others are idle.
    It is important to note the small number of registers available to each thread. This problem becomes acute with a large number of threads and a complex code (large data types). With 100,000 threads, a 16-byte data type and 512 threads per block, each thread uses 60 registers of the 63 theoretically possible (the data taken from the CUDA profiler). This results in fewer threads being executed in parallel. For example, when the number of registers used in a thread dropped to 13, the processing speed of the same amount of data increased by almost 10 times.
    When running the experiment, the reduction in the number of threads per block did not result in the expected performance improvement.


7.4      Key Compression

The tree processing method described in the article assumes repeated multiplication for defining key values. The keys obtained by multiplying primes grow rapidly, resulting in large numbers. With 100 thousand records in a tree it is necessary to use keys whose size exceeds 128 bits.
    The most frequent arithmetic operations performed on keys in a tree are division, multiplication and taking the division remainder. Hardware support of such operations on keys of this size is impossible due to CPU and GPU architecture characteristics. To perform arithmetic operations on such keys, it is necessary to use special algorithms that are less efficient than a hardware realization of these operations.
    Using nonpositional notations will allow reducing the size of the numbers used by storing a number as small-size remainders. When performing modular operations, the lack of need for considering the dependencies between the remainders opens up an opportunity for the realization of efficient parallelism. This, together with the advanced features of the Kepler architecture (bit control within a warp), will allow the efficient realization of bit mapping and of detecting a zero remainder for nonpositional notations.
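    As a rough illustration of the RNS idea (the moduli below are chosen for the example only, and the names RnsKey, toRns and mulRns are ours), a large key is stored as several small remainders modulo pairwise coprime numbers; multiplication then acts on each remainder independently, so the remainders of one key can be processed in parallel:

    #include <cstdint>
    #include <cstdio>

    static const uint32_t M[] = {4294967291u, 4294967279u, 4294967231u}; // pairwise coprime
    const int NM = 3;

    struct RnsKey { uint32_t r[NM]; };               // one small remainder per modulus

    RnsKey toRns(uint64_t x) {                       // positional -> RNS
        RnsKey k;
        for (int j = 0; j < NM; ++j) k.r[j] = (uint32_t)(x % M[j]);
        return k;
    }

    // Modular multiplication: each remainder is updated independently
    // of the others, with no carries between them.
    RnsKey mulRns(const RnsKey &a, const RnsKey &b) {
        RnsKey c;
        for (int j = 0; j < NM; ++j)
            c.r[j] = (uint32_t)(((uint64_t)a.r[j] * b.r[j]) % M[j]);
        return c;
    }

    int main() {
        RnsKey a = toRns(1234567ull), b = toRns(89ull);
        RnsKey c = mulRns(a, b);                     // represents 1234567 * 89
        printf("first remainder of the product: %u\n", c.r[0]);
        return 0;
    }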




8       Conclusion

The paper presents a method of storing tree-like structures in databases oriented towards multithreaded processing with regard to CUDA parallelization.
    This project demonstrates the use of the GPU for hierarchical data processing.
    The results of the article are as follows: a new method of representing trees in databases has been developed, and a hierarchical database prototype with GPU data processing has been realized.
    On average, the implemented software runs 19 times faster compared to the materialized path processing in the DBMS MongoDB on the CPU. Despite the query result variations, the minimum speedup was 1.29 times.


References

[AG08]     Renzo Angles and Claudio Gutierrez. Survey of graph database models. ACM Comput. Surv., 40(1):1:1–1:39, 2008.

[BELP]     U. Brandes, M. Eiglsperger, J. Lerner, and C. Pich. Graph markup language (GraphML).

[BMCN09]   Ranieri Baraglia, Gabriele Capannini, and Franco Maria Nardini. Sorting using BItonic netwoRk wIth CUDA. LSDS-IR Workshop, July 2009.

[BS10]     P. Bakkum and K. Skadron. Accelerating SQL database operations on a GPU with CUDA: Extended results. Technical report, University of Virginia Department of Computer Science, 2010.

[Cel]      Joe Celko. Trees in SQL.

[Cel04]    Joe Celko. Hierarchical SQL, 2004.

[Cel12]    J. Celko. Joe Celko's Trees and Hierarchies in SQL for Smarties. Elsevier/Morgan Kaufmann, 2012.

[Gai99]    A. T. Gainov. Number Theory. Part I. Resource book. Novosibirsk State University, Faculty of Mechanics and Mathematics, 1999.

[GBE13]    Marcel Grandpierre, Georg Buss, and Ralf Esser. In-Memory Computing technology - The holy grail of analytics? Deloitte & Touche GmbH Wirtschaftsprüfungsgesellschaft, (07), 2013.

[HKOO11]   Sungpack Hong, Sang Kyun Kim, Tayo Oguntebi, and Kunle Olukotun. Accelerating CUDA graph algorithms at maximum warp. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP '11, page 267, 2011.

[HN07]     Pawan Harish and P. J. Narayanan. Accelerating large graph algorithms on the GPU using CUDA. In Proceedings of the 14th International Conference on High Performance Computing, HiPC'07, pages 197–208, Berlin, Heidelberg, 2007. Springer-Verlag.

[kla]      KLADR - the All-Russian Classifier of Addresses.

[Kol09]    M.A. Kolosovskiy. Data structure for representing a graph: combination of linked list and hash table. CoRR, 2009.

[Kup86]    Gabriel Mark Kuper. The Logical Data Model: A New Approach to Database Logic. PhD thesis, Stanford, CA, USA, 1986. UMI order no. GAX86-08173.

[KV93]     G. M. Kuper and M. Y. Vardi. The logical data model. ACM Transactions on Database Systems, 18(3):379–413, 1993.

[MS11]     Anders Møller and Michael Schwartzbach. XML graphs in program analysis. Sci. Comput. Program., 76(6):492–515, 2011.

[MT11]     A. Malikov and A. Turyev. Nested intervals tree encoding with system of residual classes. IJCA Special Issue on Electronics, Information and Communication Engineering, ICEICE(2):19–21, 2011.

[PSHL10]   Hagen Peters, Ole Schulz-Hildebrandt, and Norbert Luttenberger. Fast in-place sorting with CUDA based on bitonic sort. In Parallel Processing and Applied Mathematics, pages 403–410, 2010.

[PT11]     Jonathan Palacios and Josh Triska. A Comparison of Modern GPU and CPU Architectures: And the Common Convergence of Both. pages 1–20, 2011.

[Tro03]    Vadim Tropashko. Trees in SQL: Nested Sets and Materialized Path. 2003.

[Tro05]    Vadim Tropashko. Nested intervals tree encoding in SQL. SIGMOD Record, 34(2), June 2005.

[vor14]    DBMS Index for Hierarchical Data Using Nested Intervals and Residue Classes. Young Sci. Int. Work. Trends Inf. Process., 2014.

[WO10]     Yangzihao Wang and John Owens. Large-scale graph processing algorithms on the GPU. Technical report, UC Davis, January 2011.



