=Paper=
{{Paper
|id=Vol-2267/579-584-paper-111
|storemode=property
|title=Analysis of the features of the optimal logical structure of distributed databases
|pdfUrl=https://ceur-ws.org/Vol-2267/579-584-paper-111.pdf
|volume=Vol-2267
|authors=Elena V. Nurmatova,Victor V. Gusev,Victor V. Kotliar
}}
==Analysis of the features of the optimal logical structure of distributed databases==
Proceedings of the VIII International Conference "Distributed Computing and Grid-technologies in Science and Education" (GRID 2018), Dubna, Moscow region, Russia, September 10 - 14, 2018

E.V. Nurmatova<sup>1,a</sup>, V.V. Gusev<sup>2</sup>, V.V. Kotliar<sup>2</sup>

<sup>1</sup> "MIREA - Russian Technological University", Moscow

<sup>2</sup> NRC "Kurchatov Institute" - IHEP

E-mail: <sup>a</sup> nurmatova@mirea.ru

This paper considers the construction of an optimal logical structure for a distributed database (DDB). Solving this problem makes it possible to increase the speed of query processing in a DDB compared with a traditional database. An optimal logical structure of the DDB ensures that the information system uses computational resources efficiently. The problem of constructing an optimal logical structure of a DDB is reduced to a quadratic integer programming problem. As a result of its solution, the local network of the DDB is decomposed into a number of clusters that have minimal information connectivity with each other. In particular, such tasks arise in organizing systems that process huge amounts of information from the Large Hadron Collider. These systems use various DDBs to store information about: 1) the trigger system for data collection from the experimental physics installations (ATLAS, CMS, LHCb, ALICE), 2) the geometry and the operating conditions of the detector during the collection of experimental data.

Keywords: optimal logical structure, query implementation scheme, distributed databases, grid architecture.

© 2018 Elena V. Nurmatova, Victor V. Gusev, Victor V. Kotliar

===1. Introduction===
An optimal logical structure of a DDB ensures that the information system uses computational resources efficiently. Such tasks arise in various areas of information and communication technologies where distributed systems, high-load systems, or high-reliability systems are already in use. In particular, they arise in organizing systems that process huge amounts of information from the Large Hadron Collider (LHC). These systems use various DDBs to store information about: the trigger system for collecting data from the experimental physics installations (ATLAS, CMS, LHCb, ALICE), the geometry of the detector during data collection, the metadata of events on the detectors (for example, maintenance work on the installation), and the operating conditions of the detector during data collection [1]. The last of these, the Conditions database, is the most complex in its organization. A special programming interface, CORAL [2], was written to access it.

DDBs are also used directly in LHC data processing in the distributed Grid computing environment. To optimize the transfer of the results of individual computing tasks over computer networks, a multi-level hierarchical system is used. This system keeps track of where data is stored and where it must be transferred [3].

===2. Synthesis of the optimal logical structure of a distributed database===
DDB construction technologies can be used for storing data in monitoring systems of complex computing infrastructures, which combine engineering, network, and software environments [4]. The constant search for ways to reduce the costs of data storage and processing created an objective need for deeper research on these issues and determined the relevance of this work.
Synthesis of the optimal logical structure of a distributed database is the process of finding the optimal mapping of the canonical structure of the DDB to a logical one that provides the optimal value of a given performance criterion of the information system and satisfies the basic system, network, and structural constraints [5]. When the canonical structure is mapped to a logical one, the data groups of the canonical structure of the DDB are combined into logical record types and simultaneously distributed among the nodes of the computing system.

The logical structure of a DDB is understood as an ordered set of logical records and the connections (relations) between them, distributed over the nodes of the computing system. These logical records and connections reflect the semantic and functional properties and features of the given subject area of the information system.

Let the logical structure of the database be given by the graph <math>G(N, L)</math>. To implement the q-th query on the graph <math>G(N, L)</math>, it is necessary to select a search tree <math>G_q(N_q, L_q)</math>, where <math>N_q \subseteq N</math> is a subset of vertices and <math>L_q \subseteq L</math> is the subset of connections of the graph <math>G</math> used by the search tree <math>G_q</math>. One of the vertices in <math>N_q</math> is the entry point to the logical structure of the database. Figure 1 shows an example of a search tree <math>G_q</math> and the direction of its traversal (bypass). The response time to a query can be reduced by eliminating the vertices to which the traversal must return in order to pass to the next branch. The graph <math>G_u</math> thus obtained contains only those vertices (records) whose elements form the output structures of the query q.
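The cost of returns during the bypass can be illustrated with a minimal sketch. The tree below is hypothetical (it is not the tree of Figure 1), and the choice of which vertices carry output data is an assumption made only for this example.

```python
def bypass(tree, vertex):
    """Depth-first bypass of a search tree.

    Returns the sequence of vertices touched, including each return to a
    branching vertex that is needed to pass to its next branch.
    """
    walk = [vertex]
    children = tree.get(vertex, [])
    for i, child in enumerate(children):
        walk.extend(bypass(tree, child))
        if i < len(children) - 1:
            walk.append(vertex)  # return before entering the next branch
    return walk


# Hypothetical search tree: A is the entry point, B is an intermediate record.
tree = {"A": ["B", "C"], "B": ["D", "E"]}
# Assumption: only D, E and C hold elements of the query's output structure.
output_vertices = ["D", "E", "C"]

full_walk = bypass(tree, "A")
returns = len(full_walk) - len(set(full_walk))

print(full_walk)             # vertices touched by the bypass, in order
print(returns)               # repeated accesses caused by returns
print(len(output_vertices))  # accesses if the output records were linked directly
```

In this toy case the bypass touches seven vertices, two of them returns, whereas a structure that links the three output records directly would need three accesses.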
Thus, the task of minimizing the response time to a query of the q-th type is reduced to the problem of choosing the optimal connections between the records of the graph <math>G_u</math>.

Figure 1. The scheme of the graph <math>G_q(N_q, L_q)</math> and the direction of its traversal

Figure 2. The scheme of the complete graph <math>G_u</math> with connections between all vertices of the subset <math>N_q</math>

The scheme of a complete graph (see Figure 2) can be used to pose the problem of choosing the optimal structure of connections between the records that form the output structures of the q-th query, subject to restrictions on whether a particular access path may be used.

The DDBs being created can be very large, so they are loaded and put into operation in parts. To this end, the logical structure (LS) of the DDB should be divided into a number of substructures, or clusters, that have minimal mutual connectivity under the following restrictions: on the size of the clusters, on the types of storage devices used, on the degree of semantic proximity of the logical records included in a cluster, etc. The input for formulating and solving this problem is the information obtained from the synthesis of the LS of the DDB (see Table 1).

{| class="wikitable"
|+ Table 1. The notation for source variables
! Notation !! Meaning
|-
| <math>N = \{n_j \mid j = 1, \dots, J\}</math> || the set of record types of the LS of the DB
|-
| <math>W = \lVert w_{jj'} \rVert</math> || the matrix of connections of the record structure
|-
| <math>F = \lVert f_{jj'}^{K} \rVert</math> || the matrix of access paths for implementing user queries
|-
| <math>R = \{r_1, \dots, r_j, \dots, r_J\}</math> || the vector of the set of logical records
|-
| <math>\{\cdot_1, \dots, \cdot_j, \dots, \cdot_J\}</math> || the vector of record lengths in bytes
|}

To formalize the task, we introduce the variables <math>x_{je} = 1</math> if the j-th record is included in the e-th cluster, and <math>x_{je} = 0</math> otherwise.
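To make the roles of the connection matrix W and of the inclusion variables concrete, the sketch below enumerates every assignment of records to clusters for a tiny instance and keeps the one with the least connectivity between clusters. The matrix values, the number of clusters and the size limit are hypothetical, and exhaustive search is used only because the instance is small; it is not presented as the authors' solution method.

```python
from itertools import product


def inter_cluster_connectivity(w, assignment):
    """Sum w[j][k] over ordered pairs of records placed in different clusters."""
    n = len(w)
    return sum(w[j][k]
               for j in range(n) for k in range(n)
               if assignment[j] != assignment[k])


def best_partition(w, n_clusters, max_size):
    """Exhaustively search record-to-cluster assignments (small instances only).

    assignment[j] is the cluster index of record j, so every record belongs
    to exactly one cluster by construction.
    """
    best = None
    for assignment in product(range(n_clusters), repeat=len(w)):
        sizes = [assignment.count(e) for e in range(n_clusters)]
        if max(sizes) > max_size:
            continue  # violates the limit on records per cluster
        cost = inter_cluster_connectivity(w, assignment)
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best


# Hypothetical symmetric connection matrix for four logical records.
w = [
    [0, 5, 0, 0],
    [5, 0, 1, 0],
    [0, 1, 0, 4],
    [0, 0, 4, 0],
]

cost, assignment = best_partition(w, n_clusters=2, max_size=2)
print(cost, assignment)
```

For this matrix the search groups records 0 and 1 in one cluster and records 2 and 3 in the other, cutting only the weak link between records 1 and 2. The number of assignments grows exponentially with the number of records, so realistic instances need a dedicated integer programming method.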
The task of decomposing the logical structure of the DDB into a set of clusters with minimal connectivity between them is formulated as follows:

:<math>\min_{\{x_{je}\}} \; \sum_{e, e' = 1;\; e \ne e'}^{e_0} \; \sum_{j, j' = 1}^{J} w_{jj'} \, x_{je} \, x_{j'e'}</math>

with the restriction that each logical record is included in exactly one cluster,

:<math>\sum_{e=1}^{e_0} x_{je} = 1, \quad j = 1, \dots, J,</math>

and the restriction on the total number of logical records in a cluster,

:<math>\sum_{j=1}^{J} x_{je} \le M, \quad e = 1, \dots, e_0,</math>

where <math>M</math> is the allowable number of records in a cluster and <math>e_0</math> is the number of clusters.

This is a quadratic integer programming problem. Its solution decomposes the LS of the DDB into a number of clusters that have minimal information connectivity with each other. The database of an individual cluster can then be developed taking into account the importance of its logical records in terms of user requirements, development complexity, and other factors.

Solving this task is of great practical importance for computer-aided design of the LS of a DDB and for generating specifications for DDB queries and corrections. This is especially important when the DDB is organized in a client-server architecture in which SQL is used as the interface. Query specifications for it include two main parts: query objects and search conditions. The query objects list the information elements and logical records required by the user. Constructing queries manually takes a long time, because the user needs detailed knowledge of the composition and structure of the logical records, the interconnections between them, the characteristics of their organization, etc.

===3. The logical scheme of query implementation===
Queries in the DDB are characterized by the composition of the requested data, the frequency of their use, and performance characteristics such as the average numbers of record instances analyzed and selected during a search. The logical scheme of implementing a DDB user query includes the following sequence of operations:

# Designing and initiating a request by the user at the node of the computing system to which the user is attached, in the query language of the selected DDB management system.
# Transferring the information to the database server via communication channels for implementation of the query.
# Processing the query using the appropriate methods and means of the database, and solving the following main tasks:
#* selecting from the database the data required by the query;
#* decomposing the query into subqueries (tasks), the number of which is determined by the number of required database servers;
#* selecting optimal access paths to the required database servers;
#* establishing logical connections with the nodes of the computing system on which the required database servers are located.
# Transferring the subqueries to the required database servers via communication channels.
# Servicing the subqueries on the database servers.
# Transferring the blocks of selected data from the database servers to the server of the node that initiated the request via communication channels.
# Assembling the blocks of data into the array required by the query.

Evaluation criteria, as a tool for designing a DDB, are necessary for choosing a rational database structure among several alternatives.
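The decomposition, servicing, transfer and assembly operations in the sequence above can be sketched as a scatter-gather routine. This is an illustration under stated assumptions: the two "servers" are hypothetical in-memory lists, the record fields `run` and `detector` are invented for the example, and network transfer is replaced by local function calls.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for database servers on two nodes.
SERVERS = {
    "node_a": [{"run": 1, "detector": "ATLAS"}, {"run": 2, "detector": "CMS"}],
    "node_b": [{"run": 3, "detector": "ATLAS"}, {"run": 4, "detector": "LHCb"}],
}


def decompose(condition, required_servers):
    """Split a query into one subquery per required database server."""
    return {name: condition for name in required_servers}


def serve_subquery(name, condition):
    """Service a subquery on one server and return its block of selected data."""
    return [record for record in SERVERS[name] if condition(record)]


def run_query(condition):
    subqueries = decompose(condition, sorted(SERVERS))
    with ThreadPoolExecutor() as pool:
        # Hand the subqueries to the servers and service them concurrently.
        futures = {name: pool.submit(serve_subquery, name, sub)
                   for name, sub in subqueries.items()}
        # Collect the returned blocks in a fixed server order.
        blocks = [futures[name].result() for name in sorted(futures)]
    # Assemble the blocks into the single array required by the query.
    return [record for block in blocks for record in block]


result = run_query(lambda record: record["detector"] == "ATLAS")
print([record["run"] for record in result])
```

The number of subqueries equals the number of required servers, as in the scheme; selecting access paths and establishing connections are omitted because they depend on the actual network.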
It can be said that most of the problems and failures in developing a database model arise from a vague idea of what is meant by optimal database design. At present, and in the near future, uncertainty in the choice of these criteria will remain the weakest link in the development of a database model.

===4. Time, cost and volume characteristics of DB functioning===
Difficulties in determining the criteria for choosing among alternative solutions are mainly caused by two factors [5]:

# An almost infinite number of different database structures can be built that satisfy the same set of system requirements. Selection criteria should make it possible to differentiate all currently available alternatives.
# Alternatives are difficult to evaluate, since the criteria differ in sensitivity and in the time span over which they apply.

When solving the problem of the synthesis of the LS of a DDB, the following basic time, cost and volume characteristics of the functioning of the DDB are used. The main time characteristics are the duration of the implementation of a given set of queries, <math>T^q</math>, and the duration of the implementation of a given set of transactions, <math>T^k</math>. Together these two indicators give the total duration of the "workload" of the DDB:

:<math>T = \sum_{p=1}^{p_0} T_p^q + \sum_{s=1}^{s_0} T_s^k,</math>

where <math>T_p^q</math> is the duration of the implementation of the p-th user query and <math>T_s^k</math> is the duration of the implementation of the s-th transaction.

The components <math>T_p^q</math> and <math>T_s^k</math> depend on many factors: the network and telecommunications equipment used, the technical parameters of the servers, the network and telecommunications system-wide software, the characteristics of the relational database management system (RDBMS), the bandwidth of the communication channels, the running times of programs at various levels of the network protocols, etc.
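As a small worked illustration of the total workload duration, the sketch below adds the durations of a set of queries and a set of transactions. All numeric values are hypothetical and are given in seconds only for concreteness.

```python
def total_workload_duration(query_durations, transaction_durations):
    """Total duration: sum over user queries plus sum over transactions."""
    return sum(query_durations) + sum(transaction_durations)


# Hypothetical durations in seconds.
query_durations = [0.5, 1.5, 2.0]     # durations of the individual user queries
transaction_durations = [0.25, 0.75]  # durations of the individual transactions

total = total_workload_duration(query_durations, transaction_durations)
print(total)
```

In practice each term would be measured or estimated from the factors listed above rather than supplied by hand.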
The main cost characteristics of a DDB are: the cost of storing information in the DDB, <math>E_{xp}</math>; the cost of queries and transactions over a given time interval, <math>E_{exq}</math>; and the cost of transmitting information via communication channels, <math>E_{exk}</math>. The sum of these components determines the total cost of the functioning of the DDB:

:<math>E = E_{xp} + E_{exq} + E_{exk}.</math>

The cost of storing information in the DDB is determined by the physical amount of information <math>V_{xp}</math> and the cost <math>k_{xp}</math> of storing a unit of information (one logical record) on the server. If the cost of storing information is assumed to be the same constant in all nodes of the computing system, then

:<math>E_{xp} = V_{xp} \, k_{xp}.</math>

In practice, <math>k_{xp}</math> is approximately 1.2 - 1.5.

Finally, the volume characteristics of the functioning of the DDB and of the sets of user queries and transactions are calculated under the following assumption: the amount of memory occupied by one record instance equals the amount of useful information plus the product of the length of one address reference and the number of pointers in the record.

===5. Conclusion===
The problem of constructing an optimal logical structure of a DDB is reduced to a quadratic integer programming problem. As a result of its solution, the local network of the DDB is decomposed into a number of clusters that have minimal information connectivity with each other. Further research will be related to the design of an algorithm for the synthesis of the optimal logical structure of a distributed database.
The algorithm will consist of two interrelated stages. The first stage solves the problem of distributing the clusters of the distributed database between a server and clients, followed by the problem of optimally distributing the data groups of each node among logical record types. The second stage solves the problem of localizing the database on the nodes of the computer network, taking into account the characteristics of the distributed database in addition to the results of the first stage.

===References===
[1] A. V. Vaniachine. LHC Databases on the Grid: Achievements and Open Issues. Proceedings of the IV International Conference "Distributed computing and Grid-technologies in science and education" (Grid2010), JINR, Dubna, Russia, 28 June - 3 July, 2010.

[2] Radovan Chytracek, Dirk Düllmann, Giacomo Govi, Alexander Kalkhof, Zsolt Molnár, Andrea Valassi. "Distributed Database Access in the LHC Computing Grid with CORAL". IEEE Nuclear Science Symposium conference record. Nuclear Science Symposium, 2008.

[3] D. Ciangottini, J. Balcas, M. Mascheroni, E. A. Rupeika, E. Vaandering, H. Riahi, J. M. D. Silva, J. M. Hernandez, S. Belforte, T. T. Ivanov. "A comparison of different database technologies for the CMS AsyncStageOut transfer database". J. Phys.: Conf. Ser., 898, 042048, 2017.

[4] V. Kotliar, V. Anshukov, V. Ezhovac, V. Gusev, A. Kotliar, G. Latyshev, A. Shishov. "Development of the active monitoring system for the computer center at IHEP". CEUR-WS, ISSN 1613-0073, 2017.

[5] N. A. Kuznetsov. Methods of analysis and synthesis of modular information management systems. Moscow: Fizmatlit, 800 p., 2012.