Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                           Education" (GRID'2021), Dubna, Russia, July 5-9, 2021


 ALGORITHM FOR SOLVING PROBLEM SYNTHESIS THE
 OPTIMAL LOGICAL STRUCTURE DISTRIBUTED DATA IN
         ARCHITECTURE OF GRID SERVICE
                               Nurmatova E.V.1, Gusev V.V.2
        1
            MIREA — Russian Technological University, 20, Stromynka, Moscow, 107996, Russia
        2
         National Research Center “Kurchatov Institute” - Institute for High Energy Physics, 1,
        Science sq, Protvino, 142281, Russia

                                     E-mail: a nurmatova@mirea.ru


The questions of constructing optimal logical structure of a distributed database (DDB) are considered.
Solving these issues will make it possible to increase the speed of processing requests in DDB in
comparison with a traditional database. In particular, such tasks arise for the organization of systems
for processing huge amounts of information from the Large Hadron Collider. In these systems various
DDB are used to store information about: the system of triggers of data collection from physical
experimental installations, the geometry and the operating conditions of the detector while collecting
experimental data. Two interrelated stages in the synthesis algorithm are proposed. At the first stage,
the problem of distribution of database clusters between the server and clients, followed by the
problem of optimal distribution of data groups of each node by types of logical records are addressed.
At the second stage the problem of database localization on the nodes of the computer network is
solved, in addition to the results of the first stage, the characteristics of the DDB are taken into
account. Optimal logical structure of DDB will ensure the efficiency of the information system on
computational resources. As a result of its solution, the local network of the DDB is decomposed into
a number of clusters that have minimal information connectivity with each other. Solving the problem
of synthesis of the optimal logical structure is also of great practical importance for the automated
design of logical structures, for the automated formation of query specifications and adjustments of the
DDB.

Keywords: data warehouse, optimal logical data structure, applications, large data volume,
synthesis algorithm.

                                                                         Elena Nurmatova, Victor Gusev


                                                     Copyright © 2021 for this paper by its authors.
             Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


                                                   461
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                           Education" (GRID'2021), Dubna, Russia, July 5-9, 2021


1. Previous work
         For a grid architecture with a large number of requests, users, and large amounts of data, it is
advisable to formulate the problem of synthesizing the optimal logical structure of a DDB based on
the criterion of the minimum total time for implementing a set of user requests. Indeed, along with the
necessary information that is really required by the user, redundant information, which arises as a
result of the localization of information elements that are not required by the user in one record, is
transmitted from the database server. Excessive information "clogs up" the communication channels,
which in the future will require an increase in network bandwidth due to the power of hardware and
software.
         The variety of options for alternative solutions for the choice of not fully defined evaluation
criteria, is a rather weak side of the problem of determining the optimality of the developed logical
data structure in a distributed architecture.
        When working with quantitative criteria, which include request response time, update cost,
memory cost, time to create, reorganization cost, the contradiction of the criteria to each other can
cause the difficulty [1].
        There are optimality criteria, which are immeasurable properties, poorly represented in
quantitative terms or in the form of an objective function. The qualitative criteria for evaluating a DB
include flexibility, adaptability, availability for new users, compatibility with other systems, the ability
to convert for use on another computing platform, the possibility of recovery, the possibility of
fragmentation and expansion of the structure.
        It is expedient to consider the synthesis of the logical structure of the DDB as a sequential
solution of three particular problems, the results of which are the determination of:
    1)    optimal localization of data groups, providing a minimum of total traffic in the system and
         satisfying the specified restrictions;
    2) the structure of the optimal distribution of groups by types of logical records, providing a
         minimum of the total time of local processing of information on the servers of network nodes
         under the given restrictions;
    3) the structure of the database localization by the system nodes, providing the minimum value
         of the total time of access to the localization and DB processing nodes.
        The solution of the problem of data processing in real time requires a special organization of
the logical and physical structure of the DDB to ensure the response time of the system on the order of
1-2 seconds or less (the minimum implementation time for operational requests).


2. Stages of an approximate algorithm for synthesizing the optimal logical
structure of the DDB
         Consider an approximate algorithm for solving the problem of synthesizing the optimal logical
structure of the DDB and the structure of localization of the database according to the criterion of the
minimum total time for the implementation of a set of user requests , consisting of a sequence of
stages (figure 1).
         Stage 1. At this stage, the localization of data groups in the computing system is determined
by the criterion of the minimum total traffic. To solve this problem, an approximate algorithm for
distributing DDB clusters between the server and clients of the local network is used. At the first step
of the stage, the graph of the canonical structure of the DDB is reduced to a disconnected graph with
the calculation of the "weight" of each data group. The weight of each group consists of the weight of
the data group itself and the weight of the arcs, taking into account the requirements of users:


                                                    462
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                           Education" (GRID'2021), Dubna, Russia, July 5-9, 2021


                                 Figure 1. The main stages synthesis algorithm
                                                               гр
                                                     𝑉𝑖 = 𝑉𝑖        + 𝑉𝑖𝑖св′
                 гр
        where, 𝑉𝑖  the total weight of the data group; 𝑉𝑖𝑖св′  the weight of the arcs of the graph of
the canonical structure of the DDB.
                                                       𝑘0   𝑝0
                                               гр
                                          𝑉𝑖        = ∑ ∑ з𝑘𝑝 з𝑘𝑝 𝑝𝑖
                                                      𝑘=1 𝑝=1
                                          𝑘0 𝑝0             𝐼

                                 𝑉𝑖𝑖 ′ = ∑ ∑ 𝑘𝑝 𝑘𝑝 𝑝𝑖 ∑ 𝑝𝑖 ′ 𝑎𝑖𝑖Г ′
                                   св            з з

                                         𝑘=1 𝑝=1         𝑖 ′ 𝑖

        Then the weight of the i-th group is
                                     𝑘0    𝑝0              𝐼

                              𝑉𝑖 = ∑ ∑ 𝑘𝑝 𝑘𝑝 𝑝𝑖 (1 + ∑ 𝑝𝑖 ′ 𝑎𝑖𝑖Г ′ )
                                           з з

                                   𝑘=1 𝑝=1              𝑖 ′ 𝑖


        where, з𝑘𝑝  the frequency of usage of queries by users; з𝑘𝑝  the elements of the matrix for
using queries by DDB users; 𝑝𝑖  the matrix for using data groups when executing queries; 𝑎𝑖𝑖
                                                                                            Г
                                                                                               ′

the semantic contiguity matrix of data groups.
       At the second step of the stage, the computer network graph is transformed to a disconnected
graph with the calculation of the "weight" of each node:
                                                                     𝑅0

                                                    𝑉𝑟 = 𝑡𝑟 + ∑ 𝑡𝑟𝑟 ′
                                                                    𝑟 ′ 𝑟


                                                            463
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                           Education" (GRID'2021), Dubna, Russia, July 5-9, 2021


        where 𝑡𝑟  the total average duration of data processing in the r-th node, consisting of the
time of decomposition of the query into subqueries, route selection and connection establishment, etc.;
        𝑡𝑟𝑟 ′  the average duration of data transmission between nodes, determined based on the
matrix of logical distances between the servers of the nodes of the computer network.
        At the third step of the first stage, the matrix 𝑉 = ‖𝑣𝑖𝑟 ‖ is formed, whose elements are equal:
𝑣𝑖𝑟 = 𝑉𝑖 × 𝑉𝑟 for 𝑖 = ̅̅̅̅        ̅̅̅̅̅̅
                       1, 𝐼 ; 𝑟 = 1, 𝑅0 .

        At the fourth step of the first stage, the problem
                                                       𝐼    𝑅0

                                            min ∑ ∑ 𝑣𝑖𝑟 𝑥𝑖𝑟
                                            {𝑥𝑖𝑟 }
                                                      𝑖=1 𝑟=1
is solved under the constraints:
         by the number of data groups, the localization of which is possible on one node
                                            𝐼

                                          ∑ 𝑥𝑖𝑟 𝑁𝑟 , 𝑟 = 1,
                                                          ̅̅̅̅̅
                                                             𝑟0
                                          𝑖=1
                                                                             r0

             on the admissible duplication of groups by network nodes      x  M ,
                                                                            r 1
                                                                                   ir    i

                                            𝑟0

                                           ∑ 𝑥𝑖𝑟 𝑀𝑖 , 𝑖 = ̅̅̅̅
                                                           1, 𝐼
                                           𝑟=1


             on the amount of available external memory of the network servers for storing data
                                                 𝐼

                                            ∑ 𝑥𝑖𝑟 𝑖 𝑖 взу
                                                          𝑟
                                                𝑖=1
               where, 𝑖  the vector of group lengths in bytes; 𝑖  the vector of number of
               instances in groups; взу
                                      𝑟  the amount of available memory on the server of the 𝑟-th
               host; 𝑥𝑖𝑟 = 1, if the 𝑖-th data group is included in the r-th network node; 𝑥𝑖𝑟 = 0 –
               otherwise.
        This is a linear integer programming problem. Its solution makes it possible to determine the
optimal localization of data groups by network nodes.
        Stage 2. At this stage, the problems of optimal distribution of data groups of each node
according to the types of logical records are solved by the criterion of the minimum total time of local
data processing in each network node. The number of synthesis tasks for this stage is determined by
the number of network nodes.
        The initial data are subgraphs of the graph of the canonical structure of the DDB, as well as
the temporal and volume characteristics of the subgraphs of the canonical structure of the DDB, the set
of requests from users and network nodes [2].
        The synthesis problem for this stage is solved using exact or approximate algorithms [3]. The
following restrictions are used: restrictions on the number of groups in a logical record, on the one-
time inclusion of groups in records, on the cost of storing information, on the required level of
information security of the system, on the time of performing operational transactions on DB servers,
on the total time of servicing operational requests on servers. As a result, we determine the logical
structures of the DB for each node in the network.


                                                           464
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                           Education" (GRID'2021), Dubna, Russia, July 5-9, 2021


                         Figure 2. Summary diagram stages solving synthesis problem

         Stage 3. Localization of the DB by network nodes. Initial data of the stage: results of the
previous stages and characteristics of the DDB. Restrictions: on the total number of synthesized
logical records located on the server of the r-th node of the computer network; on the amount of
available external memory of the network servers for storing the database; on the number of copies of
logical records placed on the network.
        As a result of the proposed algorithm (figure 2), localization matrices of the set of data groups
by the types of logical records are formed (result of stage 1) and then groups of records by network
nodes (result of stage 2, table 1) are formed. The timing of the algorithms is also evaluated.
        Table 1 is an example of a matrix for localizing records by network server nodes. It is indexed
row by row by t-numbers of logical record groups, and by columns by r-numbers of network nodes.
                             Table 1. Matrix of localization of logical records by network nodes-servers
                                                         Network node
                                                          numbers, t
                                                        1     2      3
                                                  1     1     0      0
                                    Logical       2     0     1      0
                                    record        3     0     0      1
                                    group         4     0     1      0
                                    numbers, r    5     0     1      0


3. Similar solutions
        As alternative solutions for comparing the results of synthesizing the data structure according
to various criteria, we analyzed the analogue that solves the NP-hard nonlinear integer discrete
optimization problem from the DDB domain [2] and implements 3 neural network algorithms for
synthesizing the optimal logical structure DDB according to the criterion of the minimum total time of


                                                      465
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                           Education" (GRID'2021), Dubna, Russia, July 5-9, 2021


sequential processing of a set of user requests:
        1. NS-GA-algorithm (HNN) – the evolutionary optimization algorithm based on artificial
             Hopfield neural networks and genetic algorithms;
        2. TM-algorithm (ТМ) – the neural network optimization algorithm based on modified taboo
             search;
        3. RTM algorithm (DTM) – the distributed neural network optimization algorithm based on
             modified taboo search.
        Another analogue considers the formulation of the problem of synthesizing the optimal logical
structure of the DDB in fuzzy conditions according to the criterion of the minimum total loading time
of the DDB [1].


4. Conclusion and Future plans
         Further work within the framework of this topic is the development of software that
implements the search algorithm for a variant of the logical structure of the DDB, which ensures the
optimal value of the specified criterion for the efficiency of the functioning of the grid system and
satisfies the main system, network, and structural constraints.


References
[1] Kosterin E.V., Minin Yu.V., Ivanova O.G., Al-Matari N.A.Kh. Statement of the problem of
synthesizing the optimal logical structure of a network database in fuzzy conditions.– Information and
security, Voronezh, 2014. – 574 – 579 p.
[2] Nurmatova E.V., Gusev V.V., Kotliar V.V. Analysis of the features of the optimal logical structure
of distributed databases// Collection of works the 8th International Conference “Distributed
Computing and Grid-technologies in Science and Education”.— Dubna, 2018.— 167 p.
[3] Amaru L.G. New Data Structures and Algorithms for Logic Synthesis and Verification.- Springer,
2016. — 262 p. — ISBN: 9783319431734, EISBN: 9783319431741


                                                   466