
Performance optimization algorithm of a distributed database with a hierarchical network topology

Arij Al Adel
Post-graduate of the Faculty of Radio Engineering and Cybernetics, Moscow Institute of Physics and Technology (national research institute), Moscow, Russia
National Research University “Higher School of Economics”, Moscow, Russia


This paper analyzes the data flows that arise during the operation of a distributed database. An algorithm for optimizing database replication is proposed. The two-phase commit protocol (2PC) with two levels of record locks, shared and exclusive, is considered as the data replication protocol.

Keywords: enterprise information system, distributed database, two-phase commit protocol (2PC), optimization algorithm

1 Introduction

Modern enterprise information systems (EIS) used for enterprise automation are usually built on databases. Enterprises with a distributed structure naturally face the challenge of automating the entire enterprise. Automating separate structural subdivisions by modeling their business processes usually improves the situation within each subdivision, but it creates the problem of synchronizing data between the various systems. The solution to this problem is to develop a distributed EIS. Such an EIS includes a distributed database [1] consisting of a set of local databases, server equipment, system software, client computers, and application software.

A functioning EIS may limit or even determine the speed at which business processes are performed, so the task of optimizing EIS performance becomes very important. At any given time the EIS database contains all the information about the automated business processes, and each business process should be implemented in the most efficient manner. Thus the performance of the EIS significantly affects the efficiency of enterprise business processes. There are many methods for evaluating the performance of an EIS [2, 3, 4]. The main criterion of EIS effectiveness used in these methods is economic efficiency. For many business processes related to customer service, production operations, etc., the main indicator of effectiveness is $T_{EIS}$, the time spent by the EIS on the operation:

$$T_{EIS} = T_{LC} + T_{TL} + T_{LS},$$

where $T_{LC}$ is the time taken by the application executed on the client computer to complete the business process operation (client level), $T_{TL}$ is the time taken to transfer data between the client computer and the database server (transport level), and $T_{LS}$ is the time taken by the server to complete requests and transactions (server level).

For geographically distributed enterprises, business processes are also distributed. Therefore, to increase the efficiency of a geographically distributed EIS, it is necessary to optimize the time $T_{LS}$ by constructing a replication scheme. Several methods currently exist that aim to optimize database performance. These methods can be applied at all stages of system design: knowledge domain analysis, design of the logical structure of the database, and synthesis of the physical database model. Knowledge domain analysis affects the logical structure of the database, and the logical structure in turn has a direct impact on the algorithms of software applications and the functioning of the system software [5]. Therefore, the optimization process is carried out sequentially: each stage uses the results of the previous one. Despite their undoubted theoretical and practical interest, a common disadvantage of all these methods is that solving the optimization problem requires information about the operation of all components of the EIS architecture, which is difficult to collect.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The proposed algorithm has several advantages: it places no restrictions on the number and structure of databases, it requires a minimal amount of input data, and it is easy to implement. It can serve as a basis for designing network topology optimization algorithms, and it makes it possible to evaluate a design solution in the early stages of development.

2 Problem Statement

Consider the hierarchical structure of an enterprise network. In general, database servers located at network nodes may be interconnected by multiple channels forming a graph. A tree network is a very important special case that is often used in practice. This type of network topology is used by vertically integrated companies. For example, a network of gas stations has a three-tier administration structure. Let one server be the root server. It is connected by data transfer channels to its subordinate (child) servers. Child servers are not interconnected by data transfer channels. Child servers, in turn, may have child servers of their own. The root server always contains a database, which we will call the Central Database (CDB). The remaining databases will be called Remote Databases (RDB). Figure 1 shows the hierarchical topology of a distributed database.

The CDB stores and processes all information about the activities of the enterprise. Its structure is the same as that of the remote databases. Remote databases contain a subset of the information from the CDB. Any remote database can be reconstructed from the information contained in the CDB. All servers run a special program, the database server. Each of these programs can access the server's RAM and disk storage and can receive and transmit network requests. The collaboration of these programs makes it possible to perform elementary operations on the database [6, 7].

We provide a formal description of the distributed database model. Let the central database $D$ consist of a set of tables:

$D = \{ d_i \mid i = 1, \dots, I \}$.

All records of one table have the same number of attributes. We call the number of attributes in a table the "table length".

$K = \{ k_i \mid i = 1, \dots, I \}$ is the set of table lengths.

$N = \{ n_i \mid i = 1, \dots, I \}$, where $n_i$ is the number of records in table $i$.

$Q = \{ q_r \mid r = 0, \dots, R \}$ is the set of servers (network nodes). Let $q_0$ be the root server.

$H = [h_{r_1 r_2}]$ is the matrix corresponding to the topology of the computer network, where $h_{r_1 r_2}$ takes the value 0 or 1 and distinguishes two cases: (a) $r_1 \ne r_2$ and $r_1$ is the immediate parent of server $r_2$, or $r_1 = r_2$; (b) $r_1 \ne r_2$ and $r_1$ is not the immediate parent of server $r_2$.

Depending on the replication protocol used, various elementary operations on database records can be distinguished. Let $W = \{ w_s \mid s = 1, \dots, S \}$ be the set of elementary operations on database records.

Consider the operation of a distributed database over a period $T$. Let every client be connected to one of the servers. Each client performs different tasks. The application sends requests to the server directly connected to the client's computer.

Let $\Xi = [\xi_{irs}]$, where $\xi_{irs}$ is the number of elementary operations $w_s$ to be performed on table $d_i$ at node $q_r$ over the period $T$.

We introduce a matrix of variables that describes the placement of tables on every server: $X = [x_{ir}]$, where $x_{ir} = 1$ if table $d_i$ belongs to node $q_r$ and $x_{ir} = 0$ otherwise. Since all tables are stored on the root server, we have $x_{i0} = 1$ for all $i$.

We introduce the following notation:

$B = [b_{ir}]$, where $b_{ir}$ is the total amount of data transmitted over the channel that binds node $q_r$ with its parent node when performing operations on table $d_i$ over the period $T$.

$C = [c_{ir}]$, where $c_{ir}$ is the total amount of data received over the channel that binds node $q_r$ with its parent node when performing operations on table $d_i$ over the period $T$.

The task of optimizing the performance of a distributed database is formulated as follows: find a placement of tables $X = [x_{ir}]$ at which the minimum of the transmitted data is reached:

$$\min_X \sum_{i,r} (b_{ir} + c_{ir}) \qquad (1)$$
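The formulation above can be illustrated with a minimal sketch. It assumes, purely for illustration, that $X$, $B$, and $C$ are stored as nested lists indexed as `x[i][r]`, `b[i][r]`, and `c[i][r]`, with server index 0 standing for the root $q_0$. The function names and the numbers are hypothetical and do not come from the paper.

```python
from typing import List

Matrix = List[List[float]]


def root_holds_all_tables(x: List[List[int]]) -> bool:
    """Constraint x_{i0} = 1: every table is stored on the root server q_0."""
    return all(row[0] == 1 for row in x)


def total_traffic(b: Matrix, c: Matrix) -> float:
    """Objective (1): sum over tables i and servers r of b_ir + c_ir."""
    return sum(b_ir + c_ir
               for b_row, c_row in zip(b, c)
               for b_ir, c_ir in zip(b_row, c_row))


# Illustrative numbers only: 2 tables, 3 servers (index 0 is the root).
x = [[1, 1, 0],
     [1, 0, 0]]
b = [[0.0, 4.0, 2.0],
     [0.0, 1.0, 3.0]]
c = [[0.0, 1.0, 5.0],
     [0.0, 2.0, 0.0]]

assert root_holds_all_tables(x)
print(total_traffic(b, c))  # 18.0
```

The optimization problem then amounts to choosing `x` so that the value returned by `total_traffic` is as small as possible, given that `b` and `c` themselves depend on `x`.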

3 Designing a performance optimization algorithm for a distributed database

To obtain the values $B = [b_{ir}]$ and $C = [c_{ir}]$, we analyze the data flows that result from the execution of one elementary operation on the database. Each elementary operation leads to data transfer over each channel of the network. Let $Send(d_i, r, q_{r_j}, w_s, X)$ be the amount of transmitted data and $Rcv(d_i, r, q_{r_j}, w_s, X)$ the amount of received data, respectively, on the channel binding the server $q_r$, located at the child node, with the server at the parent node, when performing an elementary operation $w_s$ on table $d_i$ that was initiated from the server $q_{r_j}$ with a given replication scheme $X$. Then

$$b_{ir} = \sum_{r_j, s} \xi_{i r_j s} \, Send(d_i, r, q_{r_j}, w_s, X), \qquad c_{ir} = \sum_{r_j, s} \xi_{i r_j s} \, Rcv(d_i, r, q_{r_j}, w_s, X) \qquad (2)$$

As the data replication protocol, consider the two-phase commit protocol (2PC) with shared and exclusive locks [8].
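The sums in (2) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the operation counts are held in a dictionary keyed by `(i, r_j, s)`, and `send` and `rcv` are caller-supplied functions with a hypothetical signature. The toy functions below are stand-ins and do not reproduce the values of Table 2.

```python
from typing import Callable, Dict, Tuple

# Hypothetical signature: amount(i, r, r_j, s, x) -> data volume on the channel
# between server r and its parent for one operation w_s on table d_i
# initiated at server r_j under replication scheme x.
Amount = Callable[[int, int, int, int, list], float]


def channel_totals(xi: Dict[Tuple[int, int, int], int],
                   send: Amount, rcv: Amount,
                   i: int, r: int, x: list) -> Tuple[float, float]:
    """Return (b_ir, c_ir) as in equation (2).

    xi maps (i, r_j, s) to the number of operations w_s on table d_i
    initiated at server r_j during the period T.
    """
    b_ir = 0.0
    c_ir = 0.0
    for (table, r_j, s), count in xi.items():
        if table != i:
            continue  # only operations on table d_i contribute
        b_ir += count * send(i, r, r_j, s, x)
        c_ir += count * rcv(i, r, r_j, s, x)
    return b_ir, c_ir


# Toy stand-ins for Send and Rcv (not the paper's Table 2 values).
def toy_send(i, r, r_j, s, x):
    return 2.0 if r_j == r else 1.0


def toy_rcv(i, r, r_j, s, x):
    return 3.0


xi = {(0, 1, 0): 10, (0, 2, 0): 5, (1, 1, 0): 7}
x = [[1, 0, 0], [1, 0, 0]]
print(channel_totals(xi, toy_send, toy_rcv, i=0, r=1, x=x))  # (25.0, 45.0)
```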

Let a transaction be initiated on one of the nodes, and let it change the state of the data located in the databases on the network nodes. The participants in a transaction are interacting processes that run on the servers holding the affected database table records. One of the processes is selected and referred to as the "Coordinator". The Coordinator sends a request to the processes involved. Each process responds "committed" or "failed", depending on whether the requested operation can be performed. Once the responses from all processes have arrived, the Coordinator takes one of two decisions, "commit the transaction" or "roll back the transaction", and sends the decision to all participants in the transaction. In the data flow analysis that follows, it is assumed that all transactions complete successfully [9]. The distributed transaction is shown in Figure 2.
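The Coordinator's decision rule described above can be sketched in a few lines. This is a simplified, single-process illustration: the names `Vote`, `Decision`, and `run_two_phase_commit` are hypothetical, participants are modeled as plain callables, and locking, timeouts, and failure recovery are left out.

```python
from enum import Enum
from typing import Callable, Dict


class Vote(Enum):
    COMMITTED = "committed"
    FAILED = "failed"


class Decision(Enum):
    COMMIT = "commit"
    ROLLBACK = "rollback"


def run_two_phase_commit(participants: Dict[str, Callable[[], Vote]],
                         notify: Callable[[str, Decision], None]) -> Decision:
    """Phase 1: ask each participant to vote. Phase 2: broadcast the decision."""
    votes = {name: prepare() for name, prepare in participants.items()}
    decision = (Decision.COMMIT
                if all(v is Vote.COMMITTED for v in votes.values())
                else Decision.ROLLBACK)
    for name in participants:
        notify(name, decision)
    return decision


log = []
ok = run_two_phase_commit(
    {"q1": lambda: Vote.COMMITTED, "q2": lambda: Vote.COMMITTED},
    lambda name, d: log.append((name, d.value)))
print(ok.value, log)  # commit [('q1', 'commit'), ('q2', 'commit')]
```

A single "failed" vote is enough to turn the decision into a rollback, which is why the data flow analysis restricts itself to transactions in which every participant answers "committed".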

[Table 1, surviving rows]
Operation 4 | Deleting a record (with the exclusive lock already applied) | Operation 9
Operation 5 | Replacing a record (with the exclusive lock already applied) | Operation 10

Consider an example with five database servers. Servers $q_0$, $q_1$, $q_2$, and $q_4$ contain the table record, and $q_3$ does not ($x_{i0} = 1$, $x_{i1} = 1$, $x_{i2} = 1$, $x_{i3} = 0$, $x_{i4} = 1$). The transaction process is shown in Figure 3. The corresponding amounts of data transmitted at each stage of the transaction are shown in Table 2.

The stages of the transaction are:

1. Request to the parent server to get a record.
2. Inquiry to the child servers about their readiness.
3. Receipt of the answers about the readiness of the child servers.
4. Commit command to the server that initiated the transaction and, if necessary, data transmission.
5. Commit command to the other child nodes.

[Table 2. Amounts of data transmitted at each stage of the transaction. Header fields: whether the server $q_r$ contains the table $d_i$; data transfer direction; column number; elementary operation (Operations 1 to 5).]

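Based on the data in Table 2, it is possible to obtain the total amount of data transmitted over each channel. All elementary database operations are treated in the same way. The results are shown in Table 3.

[Table 3. Total amounts of data transmitted over each channel for each elementary operation. The entries are combinations of the service packet length $L$ and the table length $k_i$, such as $0$, $L$, $2L$, and $L + k_i$.]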

Looking at Table 3, it is clear that the amounts of transmitted data have the form $\alpha \cdot L + \beta \cdot k_i$, where $\alpha$ and $\beta$ are non-negative integer coefficients. A convenient representation of the data amounts is given by the coefficient matrices $\mathcal{A} = [\alpha_{st}]$ and $\mathcal{B} = [\beta_{st}]$, where $s = 1, \dots, 8$ is the column number in Table 2. Let the function notlog(condition) take the value 0 if the condition is true and the value 1 if the condition is false.

Finally, the total amounts of data transmitted and received over the channel binding the node $q_r$ with its parent node are

$$b_{ir} = \sum_{\forall r_j, s} \xi_{i r_j s} \cdot (\alpha_{s+1,t} \cdot L + \beta_{s+1,t} \cdot k_i), \qquad c_{ir} = \sum_{\forall r_j, s} \xi_{i r_j s} \cdot (\alpha_{s+2,t} \cdot L + \beta_{s+2,t} \cdot k_i) \qquad (3)$$

where $t$ = notlog($q_{r_j}$ is subordinate to the node $q_r$), and the column number is determined by the values of the function arguments and by the name of the function. Essentially, $b_{ir}$ and $c_{ir}$ depend only on the value of $x_{ir}$ and do not depend on the placement of the tables on other servers.
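The linear form $\alpha \cdot L + \beta \cdot k_i$ is easy to evaluate once a coefficient pair is known. The sketch below is illustrative only: the function names, the packet length, and the table length are assumptions, and the coefficient pairs $(2, 0)$ and $(1, 1)$ are chosen because they correspond to entries of the kind $2L$ and $L + k_i$.

```python
def data_amount(alpha: int, beta: int, packet_len: float, table_len: float) -> float:
    """Amount of data of the form alpha * L + beta * k_i."""
    if alpha < 0 or beta < 0:
        raise ValueError("coefficients must be non-negative")
    return alpha * packet_len + beta * table_len


def weighted_total(counts, coeffs, packet_len, table_len):
    """Sum count * (alpha * L + beta * k_i) over paired counts and coefficients."""
    return sum(n * data_amount(a, b, packet_len, table_len)
               for n, (a, b) in zip(counts, coeffs))


L = 64      # assumed service packet length (illustrative units)
k_i = 200   # assumed table length (illustrative units)
print(data_amount(1, 1, L, k_i))                         # 264
print(weighted_total([3, 2], [(2, 0), (1, 1)], L, k_i))  # 912
```

Storing only the integer pairs $(\alpha, \beta)$ keeps the tabulated data independent of the concrete values of $L$ and $k_i$, which can then be substituted per table.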

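The following optimization algorithm is proposed.

Input data for the algorithm:

$K = \{ k_i \mid i = 1, \dots, I \}$: table lengths
$L$: service packet length
$F = [f_{r_1 r_2}]$: computer network topology
$\Xi = [\xi_{irs}]$: number of elementary operations
$\mathcal{A} = [\alpha_{st}]$ and $\mathcal{B} = [\beta_{st}]$: coefficient matrices

Result:

$X = [x_{ir}]$: table replication scheme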

The algorithm is as follows:

Step 1. The initial values of the replication scheme are set: $X = [x_{ir}]$, with $x_{i0} = 1$ and $x_{ir} = 0$ for $r \ne 0$. The set of tables still to be reviewed is initialized: $D_{lookup} = \{1, \dots, I\}$.

Step 2. If the set $D_{lookup}$ is empty, the algorithm terminates: go to step 8. Otherwise, a database table that has not yet been reviewed is selected: $i_1 \in D_{lookup}$, $D_{lookup} = D_{lookup} \setminus \{i_1\}$.

Step 3. The set of servers that need to be considered is initialized: $R_{lookup} = \{0\}$.

Step 4. If the set $R_{lookup}$ is empty, go to step 2. Otherwise, the current server is selected: $r_1 \in R_{lookup}$, $R_{lookup} = R_{lookup} \setminus \{r_1\}$.

Step 5. One of the data channels that bind the current server $r_1$, which contains database table $i_1$, with a directly subordinate server $r_2$ is selected. If all channels have been reviewed, go to step 4.

Step 6. The amount of data $b_{i_1 r_2} + c_{i_1 r_2}$ transmitted over the channel $r_1 r_2$ is calculated for two cases: when table $i_1$ is not replicated to the child server ($x_{i_1 r_2} = 0$) and when it is replicated to the child server ($x_{i_1 r_2} = 1$).

Step 7. Based on the calculated quantities, a decision is made whether to replicate table $i_1$ to server $r_2$. The table is replicated to the child server $r_2$ if this decreases the amount of data transmitted over the channel $r_1 r_2$. If $x_{i_1 r_2} = 0$, go to step 5. If $x_{i_1 r_2} = 1$, first add the server $r_2$ to the set of servers that need to be considered, $R_{lookup} = R_{lookup} \cup \{r_2\}$, and then go to step 5.

Step 8. Completion of the algorithm. The resulting values of the optimal replication scheme $X = [x_{ir}]$ are stored. With this scheme, the minimum total amount of transmitted data (1) is reached.
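The steps above can be sketched as a short top-down traversal of the server tree. This is a minimal illustration under stated assumptions rather than the authors' implementation: the tree is given as a `children` dictionary, and `channel_cost(i, r2, flag)` is a hypothetical caller-supplied function returning $b_{i r_2} + c_{i r_2}$ for $x_{i r_2}$ equal to `flag`. The cost values in the example are invented.

```python
from typing import Callable, Dict, List


def optimize_replication(num_tables: int,
                         children: Dict[int, List[int]],
                         channel_cost: Callable[[int, int, int], float]
                         ) -> List[Dict[int, int]]:
    """Greedy top-down placement following Steps 1-8.

    children[r] lists the servers directly subordinate to server r (root is 0).
    channel_cost(i, r2, flag) is a hypothetical callback returning
    b_{i r2} + c_{i r2} for the channel between r2 and its parent
    when x_{i r2} = flag.
    Returns x as a list of dicts: x[i][r] is 1 if table i is stored on server r.
    """
    servers = {0} | set(children) | {c for cs in children.values() for c in cs}
    # Step 1: every table is stored only on the root server.
    x = [{r: int(r == 0) for r in sorted(servers)} for _ in range(num_tables)]
    for i in range(num_tables):              # Step 2: next unreviewed table
        lookup = [0]                         # Step 3: servers to consider
        while lookup:                        # Step 4: take the current server
            r1 = lookup.pop()
            for r2 in children.get(r1, []):  # Step 5: each child channel
                without = channel_cost(i, r2, 0)   # Step 6: both cases
                with_copy = channel_cost(i, r2, 1)
                if with_copy < without:      # Step 7: replicate if cheaper
                    x[i][r2] = 1
                    lookup.append(r2)        # children of r2 are examined too
    return x                                 # Step 8: resulting scheme


# Invented costs: (table, server) -> (cost without copy, cost with copy).
costs = {
    (0, 1): (10.0, 4.0),
    (0, 2): (3.0, 5.0),
    (0, 3): (8.0, 2.0),
}


def toy_cost(i, r2, flag):
    return costs[(i, r2)][flag]


x = optimize_replication(1, {0: [1, 2], 1: [3]}, toy_cost)
print(x)  # [{0: 1, 1: 1, 2: 0, 3: 1}]
```

Because a child is added to the lookup set only after it receives a copy, a server can hold a table only if its parent holds it, which matches the per-channel decision in Steps 5 to 7.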

Scheme1   Scheme2
0.29      0.02
0.41      0.03
0.56      0.04
0.71      0.05
0.88      0.06
1.06      0.07
1.26      0.08

The results of numerical experiments showed the efficiency of the replication scheme between database servers when implementing business processes in the distributed EIS.

5 Conclusion

Directions for further research may include the consideration of unreliable data transfer channels, equipment failures, and the determination of the time of user requests. The proposed approach can be used in the design of distributed ledgers using blockchain technology.

References

1. Özsu, M. T., Valduriez, P.: Principles of Distributed Database Systems (3rd edition). Springer, 2011. ISBN 978-1-4419-8833-1
2. Nicolaou, A. I.: Firm performance effects in relation to the implementation and use of enterprise resource planning systems. J Information System, 18, 79-105, 2014
3. Hawary, Heeks: Explaining ERP failure in a developing country: a Jordanian case study. J Enterprise Information Manage, 23(2), 135-160, 2010
4. Guedria, Naudet, Chen: Maturity model for enterprise interoperability. Enterprise Information System, 9(1), 1-28, 2015
5. Makris, A., Tserpes, K., Anagnostopoulos, D.: A novel object placement protocol for minimizing the average response time of get operations in distributed key-value stores. In: Big Data (Big Data) 2017, IEEE International Conference on, pp. 3196-3205. IEEE, 2017
6. Bermbach, D., Kuhlenkamp, J.: Consistency in distributed storage systems. Networked Systems, pp. 175-189. Springer, 2013
7. Domaschka, J., Hauser, C. B., Erb, B.: Reliability and availability properties of distributed database systems. In: Enterprise Distributed Object Computing Conference (EDOC 2014), IEEE 18th International, pp. 226-233. IEEE, 2014
8. https://docs.oracle.com/cd/B19306_01/server.102/b14231/ds_txns.htm
9. Alom, B. M. M., Henskens, F., Hannaford, M.: Deadlock Detection Views of Distributed Database. In: International Conference on Information Technology & New Generation (ITNG-2009), Las Vegas, USA. IEEE Computer Society, 2009