Optimization of the Database Structure of a Distributed Corporate Information System Node Using the Analytic Hierarchy Process

Mykhailo Dvoretskyi (m.dvoretskiy@gmail.com), Svitlana Dvoretska, Hlib Horban (gleb.gorban@gmail.com), Yuriy Nezdoliy (nezdoliy.yura@gmail.com)

Petro Mohyla Black Sea National University, 68-Desantnykiv St. 10, 54003 Mykolaiv, Ukraine

IT&I-2020: Information Technology and Interactions, December 02-03, 2020, KNU Taras Shevchenko, Kyiv, Ukraine

Keywords: corporate information system, database management system, distributed database, SQL-query, data replication, multidimensional analysis, multicriteria problem, analytic hierarchy process

ORCID: 0000-0001-5913-6859 (M. Dvoretskyi); 0000-0001-5199-9430 (S. Dvoretska); 0000-0002-6512-3576 (H. Horban); 0000-0002-6003-5585 (Y. Nezdoliy)

Optimizing the database structure of a node in corporate information systems (CIS) is relevant because of the widespread use of multilevel, geographically dispersed computer systems, including those with distributed databases. One of the research aims is to define and build a mathematical model of the optimality criteria for the structure of a remote node of a distributed CIS database. The statistics of user SQL-query activity are taken into account and presented in the form of a multidimensional database. The efficiency criteria of the model are formulated: independence from the central node of the database, the size of the local database, and an indicator of the level of need for data synchronization. The multicriteria optimization problem is solved using the analytic hierarchy process. The features of its application here include different sets of optimality criteria for the decision makers involved, quantization of the data representation marker value into 5 alternatives, and automatic presetting of the pairwise comparison matrices at the last level of the hierarchy. Solving the multicriteria problem and choosing the best alternative makes it possible to determine the optimal level of the data representation marker and, in turn, to classify the attributes and tuples of DB relations according to their representation on the node of the distributed CIS.

Introduction

In information systems development, there is a trend of transition from local to distributed databases (DDB). Many database management systems (DBMS) make it possible to host, maintain and process data on various nodes of corporate information systems (CIS). The main task of a distributed DBMS is to control access to the data for many users and to ensure data integrity and consistency [1]. Within one company there is a need to automate different types of accounting [2,3]. The attempt to automate all of them leads to so-called "universal" corporate information systems [3], which create a single accounting environment and provide access to all data necessary for analysis and decision support. This approach has many disadvantages [2,4], which can be eliminated by using separate specialized solutions [3,5]. But this path leads to the use of several databases (and perhaps several DBMSs) that must be synchronized [6]. So, in addition to the main functions of a distributed DBMS (input, storage, processing and sharing of data), an important specific function is to support the collaboration of many users on distributed information [7,8].

Topicality

Database structure optimization is considered in [9][10][11][12][13][14][15], but insufficient attention is paid to improving the performance of automated systems by optimizing the structure of the CIS distributed database on the basis of SQL-query statistics. In [9][10][11], which cover the design of automated control and data processing systems and the building of data warehouses and multidimensional models, the use of a combined strategy of distributed data representation in CIS is not considered. In [12][13][14][15] the authors address increasing the productivity of automated systems through materialized views, database restructuring and denormalization of relations. However, the optimality of the structure of a single distributed CIS node is not addressed. A key factor influencing the reliability and accessibility of the database is the so-called localization of links [5]. If the database is distributed so that the data hosted on a node is requested exclusively by that node's users, the level of link localization is high. If such a distribution is not possible and executing user requests requires access to the data of other nodes, the level of link localization is low.

A combined data distribution strategy is the best in that it combines the benefits of the strategies with and without duplication. But when it is used, in addition to the task of synchronizing duplicate information, the task of designing the database structure of each node, depending on which data belongs to it, becomes relevant. In addition, the performance of the system directly depends on the decision to duplicate data partially or completely. Some tables of a relational database can be duplicated completely, and some only after projection and selection. That is, an optimized data representation on a remote node requires vertical and horizontal data fragmentation procedures.

Therefore, the distribution of data between the nodes of a distributed, geographically dispersed CIS is an important issue, and the task of optimizing the database structure of a geographically remote node in corporate information systems is relevant.

Purpose of publication

The purpose of the research is to create a mathematical optimization model and then to choose the best alternative for the data representation marker of a remote node of a distributed CIS. The research concerns relational databases only. The relational data model is based on a simple and at the same time powerful mathematical apparatus, drawn mainly from set theory and mathematical logic [10,16]. So, when building the mathematical model, it is considered appropriate to use the basic concepts of set theory.

The developed model should take into account the statistics of user requests to local and remote data. Filtering by selected dimensions yields the appropriate subsets of data [17]. For dimension elements, the term "data representation marker" is proposed; it determines the level of need for them on the node of the distributed corporate information system (DCIS). The values of the efficiency criteria of the model depend on the value of this marker, aggregated over the database subset that corresponds to the remote node. These criteria are independence from the central node of the database, the size of the local database and the level of data synchronization [18,19]. Therefore, one of the tasks is to represent mathematically the dependence of the optimality criteria on the value of the data representation marker.

The resulting multicriteria problem must be solved to determine the optimal level of the data representation marker. It should be noted that the optimality criteria whose models were defined are independent, monotonic, and take real values in the interval [0; 1]. The classical methods of Pareto and Slater [20,21] can give results only at the first stage of modeling. When calculating the optimal level of the data representation marker they are ineffective, because the level of one criterion decreases while the others increase. The problem is further complicated by the fact that the solution space is defined on the set of real numbers, so the set of solutions contains a large number of alternatives.

The main part

Among the well-known relational algebra operations [10], the operations "projection" (hereinafter P) and "selection" (hereinafter S) are considered here, because they correspond to the vertical and horizontal fragmentation of data on a distributed CIS node. Let tup be a tuple of the relation R, and let tup[P] be the part of this tuple that contains only the values of the attributes included in the subset P of the relation scheme $R_{shema}$ ($P \subset R_{shema}$). Then the projection of R on P is the relation consisting of the tuples of values over P that exist in R, i.e. $R[P] = \{tup[P] \mid tup \in R_{data}\}$. The scheme of the resulting set is defined by the set of attributes $R[P]_{shema} = \{A_1, …, A_m\}$, where $A_i \in R_{shema}$.

The selection filters tuples: the result is a relation containing the subset of all unique tuples of R for which a certain logical condition is true, $R[S] = \{tup \mid tup \in R_{data} \land F(tup, S) = true\}$, where S is the logical condition of the SQL-query and F(tup, S) is a function that reflects its fulfillment for the corresponding tuple. The scheme of the resulting set equals the scheme of the base relation, i.e. $R[S]_{shema} = R_{shema}$.

An SQL-query that selects data can involve a number of relations, each of which is the result of applying selection and then projection to a base relation (database table): R'' = R'[P], where R' = R[S]; this relation is written out in (1). To avoid further replication, some data required on the DDB node can be kept only on the central node of the database and take part in the query through distributed queries. So the resulting relation $R^{remote}$ is only a subset of $R''_{union}$. Because representing the data on the remote node requires elements of both vertical and horizontal data fragmentation (both projection and selection), the subset of the base relation R that describes the relation of the remote node can be represented as follows:

$$R^{remote}_{shema} = \{A \mid A \in R_{shema},\ R_{primary} \subset R^{remote}_{shema},\ A \in R_{primary} \lor F_a(Node, A) = true\} \tag{2}$$

The function $F_a(Node, A)$ is used to decide whether an attribute is represented on a node. In addition, the attributes of the primary key of the relation must be represented on the remote node in any case. The set of tuples, in turn, is determined by the formula:

$$R^{remote}_{data} = \{tup \mid tup \in R_{data},\ tup_{primary} \in R^{remote\text{-}dep}_{data} \lor F_{tup}(Node, tup) = true\} \tag{3}$$

As (3) shows, a tuple must be represented on the node if its primary key occurs in $R^{remote\text{-}dep}_{data}$, the data of the remote-node relations that depend on the current relation. Otherwise, the need for the tuple is decided by the evaluation function $F_{tup}(Node, tup)$.
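The two fragmentation rules (2) and (3) can be illustrated with a minimal Python sketch. The table `goods`, its columns, the set `dep_keys` (standing in for the primary keys referenced by dependent relations) and the two lambda predicates (standing in for Fa and Ftup) are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Set

Row = Dict[str, object]  # one tuple of a relation: attribute name -> value


def remote_schema(schema: List[str], primary: Set[str],
                  f_a: Callable[[str], bool]) -> List[str]:
    """Rule (2): keep primary-key attributes plus attributes accepted by f_a."""
    return [a for a in schema if a in primary or f_a(a)]


def remote_data(data: List[Row], primary: List[str], dep_keys: Set[tuple],
                f_tup: Callable[[Row], bool], schema: List[str]) -> List[Row]:
    """Rule (3): keep tuples whose key is referenced by dependent relations
    or that f_tup accepts, projected onto the remote schema."""
    kept = []
    for tup in data:
        key = tuple(tup[a] for a in primary)
        if key in dep_keys or f_tup(tup):
            kept.append({a: tup[a] for a in schema})
    return kept


# Hypothetical base relation with primary key "id".
goods = [
    {"id": 1, "name": "bolt", "price": 2.5, "warehouse": "Mykolaiv"},
    {"id": 2, "name": "nut", "price": 1.0, "warehouse": "Kyiv"},
    {"id": 3, "name": "washer", "price": 0.5, "warehouse": "Kyiv"},
]
schema = remote_schema(["id", "name", "price", "warehouse"], {"id"},
                       f_a=lambda a: a in {"name", "warehouse"})
rows = remote_data(goods, ["id"], dep_keys={(3,)},
                   f_tup=lambda t: t["warehouse"] == "Mykolaiv", schema=schema)
print(schema)
print(rows)
```

Here the tuple with id 3 is kept only because a dependent relation references it, which mirrors the first branch of (3).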

The model of user queries should support their further classification by workplace, location, user role and other criteria that can be added to the model. That is, the user query is defined as

$$Q = \langle Workplace, User, Application, R''_{set}, Q^{inner}_{set} \rangle, \tag{4}$$

where Workplace = <Type, Location>; User = <Role, Name>; $R''_{set} = \{R'' \mid R'' = \{tup[P] \mid tup[P] \in R[P]_{data} \land F(tup, S) = true\}\}$ is the set of resulting relations obtained from the base relations (tables) of the database by the corresponding queries; and $Q^{inner}_{set}$ is the set of nested queries of the main query Q.

When planning the database structure of a remote node of a distributed CIS, several factors are involved: availability and speed of data access, independence from the central DB node, the DB size, the level of data reliability, and the need for further synchronization.

In the first step, the simulation begins by placing a complete copy of the central node DB on the remote node. In this case, data availability and independence from the central node of the database are at their maximum. The speed of data access is usually lower than on the central node because of less powerful computing resources, but it can be increased by performing selection and projection operations and by reducing the number of data locks. The local database is large, so this criterion is not optimal. Also, all data requires synchronization with the central node, which is a rather resource-intensive operation.

The second step is to exclude all unnecessary data from the remote node. To solve this problem, a multidimensional database [22] was created on the basis of the relational model of user SQL-queries (4), with the following set of dimensions: <DateTime, WorkplaceType, Location, UserRole, Application, R, A, tup>. For the dimension elements, the term "data representation marker" is proposed. It reflects how necessary it is to represent the data on the node of the distributed CIS. For each element, the marker value is taken from the set {"necessary", "neutral", "not required"}. For the Location dimension, the marking is performed automatically: "necessary" for the corresponding remote node and "not required" for all others.

When determining the value of the representation marker for a row of the fact table [22], the max function is used, which reflects the principle of absorption. The value of the marker when consolidating rows of the fact table by the values of <R, A, tup> (for a table cell) can be determined by the moving average method. But the question of the specific influence of each dimension remains unresolved. In addition, it should be taken into account that for some subsets of dimensions a pessimistic scenario should apply (the data is needed no matter what), and for some an optimistic one (the data should not be duplicated in any case).

So, we have a model where each dimension attribute has a value, a marker and a weight: Adim = {Val, Mrk, Vol}, where Mrk = {"obligatory", "necessary", "neutral", "not required", "forbidden"} and Vol is the weight (ignored for the marker values "obligatory" and "forbidden"). After converting the non-numeric linguistic marker variable into numeric values ("obligatory" = 2, "necessary" = 1, "neutral" = 0, "not required" = -1, "forbidden" = -2), the aggregation function is defined as:

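$$\operatorname{Aggregate}_{i=1}^{n} Mrk_i = \begin{cases} 2, & \text{if } \exists\, Mrk_i = 2 \\ -2, & \text{if } \exists\, Mrk_i = -2 \land \nexists\, Mrk_i = 2 \\ \dfrac{\sum_{i=1}^{n} Mrk_i \cdot Vol_i}{\sum_{i=1}^{n} Vol_i}, & \text{otherwise} \end{cases} \tag{5}$$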
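A minimal Python sketch of the aggregation function (5) follows. The function name `aggregate` and the pair representation (marker, weight) are assumptions made for illustration; the sketch also assumes that the weights of the ordinary markers are positive, so the weighted average is well defined.

```python
from typing import List, Tuple


def aggregate(marked: List[Tuple[int, float]]) -> float:
    """Aggregate (Mrk_i, Vol_i) pairs as in formula (5).

    A marker of 2 absorbs everything; a marker of -2 absorbs everything
    except 2; otherwise the result is the weighted average of the markers.
    """
    marks = [m for m, _ in marked]
    if 2 in marks:
        return 2.0
    if -2 in marks:
        return -2.0
    total_weight = sum(v for _, v in marked)
    return sum(m * v for m, v in marked) / total_weight


# "necessary" with weight 3, "neutral" with weight 1, "not required" with weight 1
example = aggregate([(1, 3.0), (0, 1.0), (-1, 1.0)])
print(example)
```

The order of the two absorbing checks matters: testing for 2 first is what makes "obligatory" win over "forbidden" when both are present.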

When deciding on the data representation on a remote node, we consolidate the rows of the fact table by the tuple <R, A, tup> and calculate the marker value for each of its elements by formula (5). The decision on data representation is then made as follows:

$$Repr(Node, R, A, tup) = \left( \operatorname{Aggregate}(R, A, tup)_{i=1}^{n} Mrk_i > koef^{node}_{repr} \right), \tag{6}$$

where $koef^{node}_{repr}$ is the threshold coefficient of data representation on a certain node Node, defined in the range [-1, 1].

The third step is to abandon the local database completely and place all the data on the central node (or, in some cases, on other nodes) of the distributed CIS. In this case, the criterion of the need for data synchronization is at its optimum, because there is no duplication of data. The level of reliability is also at its maximum, and the size of the local database is minimal (there is no local database). But at the same time, data availability and access speed are minimal, and the operation of the CIS depends heavily on the availability of the central node.

The values of some criteria improve compared to the second step, while the values of others get worse. It is logical to assume that the optimal values of all the DCIS DB structure criteria are reached between the 2nd and 3rd steps. To analyze and find the optimal distribution of data between the remote and central nodes, it is necessary to formalize the quality criteria of the database structure.

The criterion of independence from the central database node and, accordingly, availability and access speed depend directly on whether the data of user SQL-queries is represented on the node of the distributed CIS. Using the model of the user SQL-query (4) and the resulting relation of the remote node (1, 2), we can define the availability function of the query data:

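$$F_{availab}(Node, Q) = \begin{cases} 1, & \text{if } \forall R''\ \exists R^{remote}_{shema}:\ R'' \in R^{remote}_{shema} \ \land\ \forall Q_{inner}:\ F_{availab}(Q_{inner}) = 1 \\ 0, & \text{if } \exists R''\ \nexists R^{remote}_{shema}:\ R'' \in R^{remote}_{shema} \ \lor\ \exists Q_{inner}:\ F_{availab}(Q_{inner}) = 0 \end{cases} \tag{7}$$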
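The recursive structure of (7) can be sketched in Python as below. This is a simplified reading, not the authors' implementation: a query is reduced to the attributes it reads from each relation (coverage of individual tuples is ignored), and the names `Query` and `f_availab` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Query:
    # relation name -> attributes the query reads from it (a simplified R'')
    relations: Dict[str, Set[str]]
    # nested queries of this query
    inner: List["Query"] = field(default_factory=list)


def f_availab(remote_schema: Dict[str, Set[str]], q: Query) -> int:
    """Return 1 if every relation used by q is covered by the remote schema
    and every nested query is available as well; otherwise return 0."""
    for rel, attrs in q.relations.items():
        if not attrs <= remote_schema.get(rel, set()):
            return 0
    return int(all(f_availab(remote_schema, qi) == 1 for qi in q.inner))


remote = {"goods": {"id", "name"}}
q_ok = Query({"goods": {"id"}})
q_missing = Query({"goods": {"id", "price"}})       # "price" is not on the node
q_nested = Query({"goods": {"name"}}, inner=[q_missing])
print(f_availab(remote, q_ok), f_availab(remote, q_missing), f_availab(remote, q_nested))
```

A single unavailable nested query makes the outer query unavailable, which corresponds to the second branch of (7).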

The aggregate level of data availability and independence from the central DB is defined as the average value of the availability function (7) over the user queries of the node.

The set of user SQL-queries Qnode is a subset of all user queries Qall (Qnode ⊂ Qall) for whose elements the function of belonging to the remote node equals one.

Next, we consider the criterion of the local database size. This criterion affects both the performance of queries to the local database and the computing resources required for database and CIS administration. A database managed by a relational DBMS (including a distributed one) is stored on disk as a file or a group of files [7,8]. At the same time, any modern relational DBMS has mechanisms for reporting how much disk space each relation uses. In the vast majority of cases, the total size of the relations equals the total size of the database files.

But the size of R does not make it possible to determine the size of R'', which is the result of a sequence of selection and projection operations and is part of the set $R^{remote}$. On the other hand, each DBMS provides information about the amount of disk space required to store a value of an attribute defined on a particular domain [7,8]. The size of a relation can then be estimated as

$$SizeR = SizeR0^{DBMS}_{i} + p \times \sum_{i=1}^{n} Size(Type_i), \tag{9}$$

where $A_i \in D_i \in Type_i$, p is the cardinality (power) of the relation, and $SizeR0^{DBMS}_{i}$ is the size of the i-th relation when it is empty.

However, the values obtained by (9) cannot be used directly in calculations, because SizeR almost never equals $SizeR^{dbms}$, the size reported by the DBMS. This may be due to additional data structures (indexes) related to the table, as well as other properties of data representation on disk. Therefore, for each relation we determine a correction factor $Koef_{sizeR}$. Then, when determining the size of R'' (a subset of R), we use the formula

$$SizeR'' = Koef_{sizeR} \times p' \times \sum_{i=1}^{n'} Size(Type_i), \tag{11}$$

where p' is the cardinality of R'', n' is the number of elements of the set $R^{remote}_{shema}$ (the number of attributes), and each attribute $A_i \in D_i \in Type_i$.
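The size estimates (9) and (11) can be sketched as follows. The byte sizes in `TYPE_SIZE` are illustrative placeholders, since real values depend on the DBMS, and the definition of the correction factor as the ratio of the DBMS-reported size to the estimate from (9) is an assumption of this sketch, because its formula is not given in the text above.

```python
from typing import Dict, List

# Assumed storage sizes in bytes per column type; real values are DBMS-specific.
TYPE_SIZE: Dict[str, int] = {"int": 4, "date": 3, "decimal": 8, "char(50)": 50}


def estimate_size(empty_size: int, cardinality: int, types: List[str]) -> int:
    """Formula (9): size of the empty relation plus p times the tuple width."""
    return empty_size + cardinality * sum(TYPE_SIZE[t] for t in types)


def correction_factor(reported_size: int, estimated_size: int) -> float:
    """Assumed definition: DBMS-reported size divided by the estimate from (9)."""
    return reported_size / estimated_size


def estimate_fragment_size(koef: float, cardinality: int, types: List[str]) -> float:
    """Formula (11): corrected size of a fragment R'' with p' tuples and n' attributes."""
    return koef * cardinality * sum(TYPE_SIZE[t] for t in types)


full_types = ["int", "char(50)", "decimal", "date"]
estimated = estimate_size(empty_size=8192, cardinality=10_000, types=full_types)
koef = correction_factor(reported_size=987_288, estimated_size=estimated)
fragment = estimate_fragment_size(koef, cardinality=2_000, types=["int", "char(50)"])
print(estimated, koef, fragment)
```

In this made-up example the DBMS reports a table 1.5 times larger than the raw estimate, and the same factor is carried over to the fragment that keeps two of the four attributes and a fifth of the tuples.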

But for each subject area the size (11) takes different values, so its absolute value is not meaningful. Therefore, the final value of the local database size criterion is expressed as a proportion of the size of the database on the CIS central node.

The last of the above criteria is the need for data synchronization. First, we define the subset of the remote node data on which data modification operations are performed. To do this, we define the model of an SQL-query that modifies data as $Q^{modif}$ = <Dimensions, $R''^{modif}$, type>, where $R''^{modif}$ is the subset of the relation R that changes as a result of data modification operations, and type = {insert, update, delete} is the operation type. $R''^{modif}$ is defined as

$$R''^{modif} = \{tup[P^{modif}] \mid tup[P^{modif}] \in R[P^{modif}]_{data} \land F(tup, S) = true\}, \tag{13}$$

where S is the logical condition defined in the SQL-query, F(tup, S) is a function that reflects its fulfillment for the corresponding tuple, and $P^{modif}$ is the set of attributes that are modified.

Considering the set of queries to the database, the resulting subset $R''^{modif}_{node}$ of the base relation R can be defined as the union of the subsets $R''^{modif}$ of all queries (13) received by the database from the remote node:

$$R''^{modif}_{node} = \bigcup_{i=1}^{n} R''^{modif}_{i}.$$

Similarly, we define the set $R''^{modif}_{main}$, which is modified on the central node or on other nodes that later synchronize with the central node. The intersection of the sets $R''^{modif}_{node}$ and $R''^{modif}_{main}$ determines the subset of the base relation on which data conflicts can occur. Such data requires more resource-intensive synchronization algorithms [6]:

$$R'''^{modif}_{node} = R''^{modif}_{node} \cap R''^{modif}_{main}. \tag{14}$$

Based on (14), we add to the multidimensional DB (5) the dimension SyncroFlg = {true, false}, which is defined on the tuple <R, A, tup>. Next, using the aggregate value of the representation marker $\operatorname{Aggregate}_{i=1}^{n} Mrk_i$ and the representation coefficient $koef^{node}_{repr}$, we filter the multidimensional DB by the decision on representation (6) and by SyncroFlg = true. We then aggregate the results by <R, A, tup> and count the number of queries. The ratio of the obtained value to the total number of queries selected according to (6) is an indicator of the level of need for data synchronization:

$$F_{synchro} = \frac{p^{modif}_{node}}{p_{node}}, \tag{15}$$

where $p^{modif}_{node}$ is the cardinality of the set of remote-node queries (selected according to the decision on representation) that involve values of tuple attributes (cells) also included in the set $R''^{modif}_{node}$, and $p_{node}$ is the cardinality of the set of all queries whose attributes and tuples are represented on the remote node.
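The conflict set (14) and the indicator (15) can be sketched with Python sets. A cell is modeled as a (relation, attribute, tuple key) triple, each query as the set of cells it touches, and the function names are hypothetical; returning 0.0 when no query is represented on the node is an assumption of the sketch.

```python
from typing import List, Set, Tuple

Cell = Tuple[str, str, int]  # (relation, attribute, tuple key)


def conflict_cells(node_mod: List[Set[Cell]], main_mod: List[Set[Cell]]) -> Set[Cell]:
    """Formula (14): cells modified both from the remote node and elsewhere."""
    node_union = set().union(*node_mod)
    main_union = set().union(*main_mod)
    return node_union & main_union


def f_synchro(queries: List[Set[Cell]], represented: Set[Cell],
              sync_cells: Set[Cell]) -> float:
    """Formula (15): share of node queries that touch cells needing synchronization."""
    on_node = [q for q in queries if q <= represented]
    if not on_node:
        return 0.0  # assumption: no represented queries means no synchronization need
    need_sync = [q for q in on_node if q & sync_cells]
    return len(need_sync) / len(on_node)


node_mod = [{("goods", "price", 1)}, {("goods", "price", 2), ("goods", "name", 2)}]
main_mod = [{("goods", "price", 2)}, {("goods", "name", 5)}]
conflicts = conflict_cells(node_mod, main_mod)

represented = {("goods", "price", 1), ("goods", "price", 2),
               ("goods", "name", 2), ("goods", "name", 1)}
queries = [{("goods", "price", 1)},
           {("goods", "price", 2), ("goods", "name", 2)},
           {("goods", "name", 1)},
           {("goods", "name", 5)}]
ratio = f_synchro(queries, represented, conflicts)
print(conflicts, ratio)
```

Three of the four queries are fully represented on the node and one of them touches a conflicting cell, so the indicator is one third.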

The resulting multicriteria problem must be solved to determine the optimal level of the data representation marker. As noted above, the classical Pareto and Slater methods [20,21] give results only at the first stage and are ineffective for calculating the optimal marker level, because some optimality criteria decrease while others increase, and because the solution space, defined on the set of real numbers, contains many alternatives.

The analytic hierarchy process (AHP), a general methodology for solving a wide class of decision-making problems, makes it possible to combine a relatively simple mathematical apparatus with the knowledge and experience of the decision maker. The basis of this method is the representation of the decision process as a multilevel hierarchy, which should reflect all the components of the problem to be solved. The method is based on the principles of decomposition, pairwise comparisons and hierarchical composition. Its main stages are building the hierarchy, estimating importance and priorities, checking the consistency of the priorities, and synthesizing the solution.

When compiling the hierarchy, the following relationship between the elements of the levels was used: goal - stakeholders (decision makers) - criteria - alternatives. The value of the data representation marker (alternative) is a real number in the interval [-1, 1]. This leads to a potentially large number of alternatives at the 4th level of the hierarchy, so the matrices of pairwise comparisons by criteria can become very large, which complicates the estimation process for the decision makers. It is proposed to simplify the task by reducing the number of alternatives to 5: "low" (L) = -1, "lower than medium" (LM) = -0.5, "medium" (M) = 0, "higher than medium" (HM) = 0.5, and "high" (H) = 1. The level of decision makers is represented by the elements "Owner", "Database Administrator", "Database Developer" and "CIS Operator". The resulting hierarchical model is shown in Fig. 1.

Figure 1: Hierarchical model of the distributed CIS node structure optimization problem

Note that the list of criteria differs between decision makers. All three criteria (the database size, the need for synchronization and independence from the central database) are important for the owner, because they influence both the quality of the CIS and the cost of equipment. For the database administrator, the criteria of database size and the need to organize data synchronization are important. In turn, for the database developer and the CIS operator, the criterion of database size is not critical. Clearly, the relative weight of each criterion also differs between decision makers. Using the scale of relative importance of the criteria [23] and with the involvement of the decision maker (at this stage, the owner), we build a matrix of pairwise comparisons for the decision makers (Table 1). At the third level of the hierarchy, the corresponding matrices of pairwise comparisons of the optimality criteria are formed for each decision maker. Thus, for the decision maker "owner" we have the matrix of pairwise comparisons of optimality criteria shown in Table 2. To check for conflicts between matrix elements, the consistency index (CI) is calculated. For the data in Table 2, CI = 3.2%, which indicates an acceptable level of consistency (if the value exceeds 10%, the values of the matrix need to be adjusted).
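The priority vector and the consistency index for the owner's criteria matrix (Table 2) can be checked with a short pure-Python sketch. It uses the standard AHP definitions: the priorities are the principal eigenvector of the pairwise comparison matrix, and CI = (λmax - n) / (n - 1). Power iteration is one common way to obtain the eigenvector; the text does not say which numerical method the authors used, so the function names and the method are assumptions.

```python
from typing import List


def priority_vector(m: List[List[float]], iters: int = 100) -> List[float]:
    """Principal eigenvector of m by power iteration, normalized to sum to 1."""
    n = len(m)
    w = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(m[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(nxt)
        w = [x / total for x in nxt]
    return w


def consistency_index(m: List[List[float]], w: List[float]) -> float:
    """CI = (lambda_max - n) / (n - 1), with lambda_max estimated from m*w / w."""
    n = len(m)
    lam = sum(sum(m[i][j] * w[j] for j in range(n)) / w[i] for i in range(n)) / n
    return (lam - n) / (n - 1)


# Table 2, rows and columns: DB size, independence level, need for synchronization
table2 = [[1, 1 / 7, 1 / 3],
          [7, 1, 5],
          [3, 1 / 5, 1]]
weights = priority_vector(table2)
ci = consistency_index(table2, weights)
print([round(x, 3) for x in weights], round(ci, 3))
```

The resulting index is about 0.032, which agrees with the CI = 3.2% quoted for Table 2, and independence from the central node receives by far the largest weight for the owner.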

The next step in the classical analytic hierarchy process is to fill in the matrices of pairwise comparisons of alternatives separately for each optimality criterion, similarly to Table 1 and Table 2. In our case, the mathematical models for calculating the values of the optimality criteria formulated in (8, 12, 15) make it possible to perform the initial calculation of the matrix data from the numerical values of the data representation marker for each alternative. The matrix is then submitted to the decision maker for approval. For example, the size of the local node database can change with the five alternatives as shown in Table 3. According to these data, the size of the database at the low (min) and high (max) levels of the data representation marker differs by a factor of 0.75 / 0.02 = 37.5. Following the principles of pairwise comparisons and the axiom of homogeneity, we normalize the values given in Table 3 using a slightly modified formula of natural normalization (17),

where $W_i$ is the value of the optimality criterion for the i-th alternative, and k = 9. The database size values (Table 3) normalized according to (17) are presented in Table 4. After rounding to integers by the usual mathematical rules, we build the matrix of pairwise comparisons of alternatives for the criterion of the local database size (Table 5).
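The path from Table 3 to Table 5 can be retraced in a few lines of Python. The normalization follows (17), that is, min-max scaling onto the interval [1, k]. The rule used to fill the matrix, the rounded ratio of the two integer scores with the reciprocal below the diagonal, is inferred here because it reproduces Table 5; it is an assumption and not a rule stated in the text. The helper `round_half_up` is used instead of the built-in `round`, which rounds halves to the nearest even integer.

```python
from typing import List


def normalize(values: List[float], k: int = 9) -> List[float]:
    """Formula (17): map the values linearly onto the interval [1, k]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (k - 1) + 1 for v in values]


def round_half_up(x: float) -> int:
    """Round a positive number to the nearest integer, halves going up."""
    return int(x + 0.5)


def pairwise_matrix(scores: List[int]) -> List[List[float]]:
    """Assumed rule: a[i][j] = rounded scores[j] / scores[i] for j > i
    (scores in ascending order), with reciprocals below the diagonal."""
    n = len(scores)
    m = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            ratio = max(1, round_half_up(scores[j] / scores[i]))
            m[i][j] = float(ratio)
            m[j][i] = 1.0 / ratio
    return m


db_size = [0.02, 0.24, 0.47, 0.55, 0.75]      # Table 3
norm = normalize(db_size)                      # values of Table 4
scores = [round_half_up(v) for v in norm]      # integer scores
matrix = pairwise_matrix(scores)               # entries of Table 5
print([round(v, 2) for v in norm])
print(scores)
```

Because the matrix is prefilled from the model values, the decision maker only has to review and approve it rather than estimate every pair by hand.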

According to (4) we calculate the vector of relative weights of the alternatives by the criterion of the local database size. Similarly, we calculate the priority vectors of the alternatives by the criteria of independence from the central node and the need for data synchronization. As a result, we obtain the priority vectors (18). According to (16) and (18) we calculate the global vector of priorities for the decision maker "owner" (19).

By performing the corresponding calculations, we obtain the global priority vectors for the other decision makers (20).

Using the global priority vectors obtained for the decision makers (19), (20) and the matrix of preferences of the decision makers (Table 1), we calculate the vector of global priorities of the alternatives (Table 6). These calculations provide decision support when choosing the optimal level of the data representation marker among the proposed alternatives.
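The final synthesis step is a weighted sum and can be reproduced directly from the numbers in Table 6. The dictionary names below are illustrative; the values are copied from that table.

```python
# Weights of the decision makers (first data row of Table 6).
dm_weights = {"Owner": 0.563, "CIS operator": 0.055,
              "DB Admin": 0.263, "DB developer": 0.117}

# Priorities of each alternative for each decision maker (Table 6).
local = {
    "Low":                {"Owner": 0.09, "CIS operator": 0.05, "DB Admin": 0.41, "DB developer": 0.04},
    "Lower than medium":  {"Owner": 0.09, "CIS operator": 0.08, "DB Admin": 0.15, "DB developer": 0.08},
    "Medium":             {"Owner": 0.15, "CIS operator": 0.16, "DB Admin": 0.10, "DB developer": 0.17},
    "Higher than medium": {"Owner": 0.30, "CIS operator": 0.31, "DB Admin": 0.12, "DB developer": 0.33},
    "High":               {"Owner": 0.37, "CIS operator": 0.40, "DB Admin": 0.23, "DB developer": 0.38},
}

# Global priority of an alternative: sum over decision makers of weight * local priority.
global_priority = {
    alt: sum(dm_weights[dm] * prio[dm] for dm in dm_weights)
    for alt, prio in local.items()
}
best = max(global_priority, key=global_priority.get)
for alt, value in global_priority.items():
    print(f"{alt}: {value:.2f}")
print("best alternative:", best)
```

Rounded to two decimals, the sums match the "Global priorities" column of Table 6.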

Summary and conclusion

Based on the relational data model, the concept of data slices of the set of database relations is formalized. Using the definitions of the selection and projection operations and taking into account the hierarchical structure of user queries, a model that describes their structure was built. This model includes analytical characteristics and makes it possible to define, for each base relation, a subset (node relation) consisting of the elements that are part of the result sets of a sequence of SQL-queries.

The term "data representation marker" was proposed for the elements of analytical dimensions. Using the proposed aggregation function, the level of the representation marker is calculated for each attribute and tuple of a relation. To determine the optimal value of the representation marker, several optimality criteria are introduced and a mathematical model is built for each of them. This makes it possible to calculate their values depending on the threshold level of the data representation marker on the node of the distributed CIS. Solving the multicriteria problem and finding the optimal level of data representation on a remote node can increase the level of data availability and the efficiency of the distributed CIS. Efficiency is defined as the ratio of result to resources, so we calculate it taking into account the vector of relative weights of the optimality criteria of the model (16). The comparison of the results obtained for the database of the CIS node of the subject area is given in Table 7. Thus, the results of the research make it possible to increase the efficiency of using the distributed CIS node of the subject area by 25% compared to presenting only critical data, and by 11% compared to presenting all necessary data of the central database. The research can be continued by presenting the obtained vector of global priorities in the form of fuzzy sets of one variable. Defuzzification of the obtained results can make the numerical value of the optimal level of data representation on the DCIS node more accurate.
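The formula for the efficiency value is not present in the text above. One reading that reproduces the "DB node efficiency" row of Table 7 to within rounding is the weighted independence level (the result) divided by the weighted sum of the database size and the synchronization need (the resources). The sketch below checks this reading against the table; the formula is an assumption of the sketch, not a statement of the authors' definition. The central-node-only strategy is left out because all three of its criteria are zero, so the ratio is undefined.

```python
# Criterion weights as listed in Table 7.
W_IND, W_SIZE, W_SYNC = 0.73, 0.081, 0.188

# (independence, DB size, synchro need) for each strategy, from Table 7.
STRATEGIES = {
    "critical data only": (0.35, 0.02, 0.15),
    "all necessary data": (0.97, 0.75, 0.07),
    "full duplication": (1.0, 1.0, 1.0),
    "optimal marker level": (0.97, 0.63, 0.08),
}


def efficiency(ind: float, size: float, sync: float) -> float:
    """Assumed reading: weighted result divided by weighted resources."""
    return (W_IND * ind) / (W_SIZE * size + W_SYNC * sync)


eff = {name: efficiency(*values) for name, values in STRATEGIES.items()}
best = eff["optimal marker level"]
for name, e in eff.items():
    gain = 100 * (best / e - 1)
    print(f"{name}: efficiency = {e:.2f}, gain of the optimum = {gain:.1f}%")
```

The small differences from the printed values in Table 7 are consistent with the weights in the table being rounded to two or three digits.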

$$R'' = \{tup[P] \mid tup[P] \in R[P]_{data} \land F(tup, S) = true\} \tag{1}$$

Considering the set of queries to the database, the resulting subset $R''_{union}$ of the base relation R can be defined as the union of the subsets R'' of all queries received by the database from a remote node, $R''_{union} = \bigcup_{i=1}^{n} R''_i$, or $R''_{union} = \{tup[P_{union}] \mid tup[P_{union}] \in R[P_{union}]_{data} \land F(tup, S_{union}) = true\}$, where $tup[P_{union}] = \bigcup_{i=1}^{n} tup[P_i]$ and $S_{union} = \bigvee_{i=1}^{n} S_i$.

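$$Q_{node} = \{Q \mid F_{availab}(Node, Q) = 1\},$$

where

$$F_{availab}(Node, Q) = \begin{cases} 1, & \text{if } \left(\exists R'' \in R''_{set} \to \operatorname{Aggregate}(R'')_{i=1}^{n} Mrk_i > -1\right) \lor \left(\exists Q_{inner} \in Q^{inner}_{set} \to F_{availab}(Node, Q_{inner}) = 1\right) \\ 0, & \text{if } \left(\forall R'' \in R''_{set} \to \operatorname{Aggregate}(R'')_{i=1}^{n} Mrk_i \le -1\right) \land \left(\forall Q_{inner} \in Q^{inner}_{set} \to F_{availab}(Node, Q_{inner}) = 0\right) \end{cases}$$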

$$\frac{W_i - \min_i W_i}{\max_i W_i - \min_i W_i} \cdot (k - 1) + 1, \tag{17}$$

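Table 1: The matrix of pairwise comparisons for decision makers

| | Owner | DB Admin | DB developer | CIS operator |
|---|---|---|---|---|
| Owner | 1 | 3 | 5 | 7 |
| DB Admin | 1/3 | 1 | 3 | 5 |
| DB developer | 1/5 | 1/3 | 1 | 3 |
| CIS operator | 1/7 | 1/5 | 1/3 | 1 |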

Table 2: The matrix of pairwise comparisons of optimality criteria for the decision maker "owner"

| | DB size | Independence level | Need for synchronization |
|---|---|---|---|
| DB size | 1 | 1/7 | 1/3 |
| Independence level | 7 | 1 | 5 |
| Need for synchronization | 3 | 1/5 | 1 |

Table 3: Dependence of the database size on the selected alternative

| Marker level | Low | Lower than medium | Medium | Higher than medium | High |
|---|---|---|---|---|---|
| DB size | 0.02 | 0.24 | 0.47 | 0.55 | 0.75 |

Table 4: Normalized criteria values of the database size

| Marker level | Low | Lower than medium | Medium | Higher than medium | High |
|---|---|---|---|---|---|
| DB size (normalized) | 1 | 3.41 | 5.93 | 6.81 | 9 |

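Table 5: The matrix of pairwise comparisons of alternatives for the criterion of the local database size

| DB size | Low | Lower than medium | Medium | Higher than medium | High |
|---|---|---|---|---|---|
| Low | 1 | 3 | 6 | 7 | 9 |
| Lower than medium | 0.33333 | 1 | 2 | 2 | 3 |
| Medium | 0.16667 | 0.5 | 1 | 1 | 2 |
| Higher than medium | 0.14286 | 0.5 | 1 | 1 | 1 |
| High | 0.11111 | 0.33333 | 0.5 | 1 | 1 |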

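Table 6: Calculation of global priorities of alternatives

| | Owner | CIS operator | DB Admin | DB developer | Global priorities |
|---|---|---|---|---|---|
| (weight) | 0.563 | 0.055 | 0.263 | 0.117 | |
| Low | 0.09 | 0.05 | 0.41 | 0.04 | 0.17 |
| Lower than medium | 0.09 | 0.08 | 0.15 | 0.08 | 0.1 |
| Medium | 0.15 | 0.16 | 0.1 | 0.17 | 0.14 |
| Higher than medium | 0.3 | 0.31 | 0.12 | 0.33 | 0.26 |
| High | 0.37 | 0.4 | 0.23 | 0.38 | 0.34 |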

Table 7: Comparison of the database structure effectiveness in different strategies and levels of data representation at the node of the distributed CIS

| | Weight | Using central node DB | Presenting only critical data | Presenting all necessary data | Full data duplication | Optimal level of data representation marker |
|---|---|---|---|---|---|---|
| Independence | 0.73 | 0 | 0.35 | 0.97 | 1 | 0.97 |
| DB size | 0.081 | 0 | 0.02 | 0.75 | 1 | 0.63 |
| Synchro need | 0.188 | 0 | 0.15 | 0.07 | 1 | 0.08 |
| DB node efficiency | | - | 8.5589 | 9.5892 | 2.7126 | 10.7257 |
| Efficiency increase, % | | - | 25.32% | 11.85% | 295.41% | - |

Acknowledgements

This research was partially supported by the state research projects "Development of information and communication decision support technologies for strategic decision-making with multiple criteria and uncertainty for military-civilian use" (research project no. 0117U007144, financed by the Government of Ukraine) and "Development of information-analytical system for military-civil application as an information protection factor in the conditions of multi-criteria, uncertainty and risk" (research project no. 0120U101222, financed by the Government of Ukraine).
