<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data Utility Assessment while Optimizing the Structure and Minimizing the Volume of a Distributed Database Node</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mykhailo Dvoretskyi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Svitlana Dvoretska</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuriy Nezdoliy</string-name>
          <email>nezdoliy.yura@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Svitlana Borovlova</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>The use of universal accounting systems raises a number of problems, among them central database overload and low reliability. Switching to several specialized solutions, however, produces a set of databases whose overlapping data must be synchronized. Another reason to split a database is the existence of company branches that are geographically remote from the central office, in cases where a cloud approach is not an option. All of this leads to a distributed database structure. The article highlights the difficulties of distributed queries and distributed transactions when retrieving data from a relational database hosted on several remote nodes. It is suggested to minimize the number of such queries by placing the data valuable to a node locally. To do this, an SQL query parser was created first. It presents an SQL query as a hierarchical tree whose nodes can be subqueries, tables, table columns, and tuples. Based on the parsing results, the queries to the database (DB) were analyzed along the dimensions of application, user, location, list of used tables, attributes, tuples, and other available characteristics. This makes it possible to determine the data valuable to a node from the list of required applications, the node location, and other available input parameters. The problem of classifying new and modified data, and ways of solving it, is also presented. The paper shows the results of testing the research while designing the database structure of an automated retail store.</p>
      </abstract>
      <kwd-group>
        <kwd>distributed transaction</kwd>
        <kwd>database management system</kwd>
        <kwd>consolidated accounting</kwd>
        <kwd>distributed database</kwd>
        <kwd>distributed SQL-query</kwd>
        <kwd>data replication</kwd>
        <kwd>text parsing</kwd>
        <kwd>parse tree</kwd>
        <kwd>profiling</kwd>
        <kwd>ANTLR</kwd>
        <kwd>multidimensional analysis</kwd>
        <kwd>classification</kwd>
        <kwd>neural network</kwd>
        <kwd>perceptron</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        When creating a database, the user wants to organize information by certain
characteristics, which makes it possible to retrieve the needed information quickly by
specifying a set of search criteria. The need to automate different types of company
accounting (warehouse, bookkeeping, staff, etc.) may in some cases lead to the use of
"universal" accounting systems. With this approach, one must be ready to face a set of
difficulties: central database overload, low reliability, system vulnerability, and far
from perfect accounting automation mechanisms [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Many of the mentioned shortcomings can be resolved by using specialized
accounting solutions. This approach, however, leads to several database management
system platforms coexisting within one company, so their data must be synchronized.
That is why information systems development is moving from the use of local databases
towards distributed databases (DDB). The main tasks of a distributed database
management system (DBMS) are to control data access for many consumers and to keep
the data integral and consistent across several storage nodes. Such a DBMS therefore
needs to coordinate the simultaneous work of many users with shared information [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2-4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Topicality</title>
      <p>While solving the problem of distributed database user autonomy, we can face many
issues connected with the database structure. There are several approaches to presenting data
in distributed databases; among them, the combined data-sharing strategy unites the
approaches of non-duplication and duplication of data in remote nodes to take
advantage of both. When it is used, however, besides the problem of synchronizing
duplicated information, the task of optimally designing the remote node's database
structure becomes urgent. The question is whether particular data belongs to a particular
node of the distributed database. Moreover, system performance directly depends
on the decision to duplicate data in remote nodes partially or completely.</p>
      <p>
        A review of information systems that automate different accounting types
within the same company, and of the disadvantages of universal accounting systems,
leads to the conclusion that the structure of remote database (DB) nodes needs to be
optimized by identifying useful data and minimizing the volume of
the distributed database node [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Purpose of publication</title>
      <p>The research aims to increase the level of data availability in a remote node of a
distributed database, and to increase the usage efficiency of the software that works
with the database data. All of this is achieved by reducing the number of
distributed queries through optimizing the structure of the distributed database node
and minimizing the amount of data stored in it.</p>
      <p>
        To achieve the goals of the research, several tasks need to be solved. We need to
study methods of creating and using formal grammars and, as a result, develop a
subsystem for parsing SQL query code [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We need to develop
a model and create a technology for recording user activity based on SQL query
profiling with a set of analytical characteristics [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Next, users'
SQL queries against database tables need to be detailed to the level of relation attributes and tuples. Finally, an
information technology for decision support [
        <xref ref-type="bibr" rid="ref10 ref9">9-10</xref>
        ] in designing the structure of a
distributed database node and determining the amount of data it requires needs to be
designed.
      </p>
    </sec>
    <sec id="sec-4">
      <title>The main part</title>
      <p>
        When a combined approach to presenting data in a distributed database is used, the
choice of replication type depends on several factors [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5-7</xref>
        ]. One of the most
important is whether a request is remote or distributed. In the case of a remote
request, when interaction with other nodes is required only to ensure data
replication, asynchronous replication is considered appropriate: it makes it possible
to exclude all "unnecessary" nodes from the transaction, so that
only the node that stores the data remains (see Fig. 1).
      </p>
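<p>The asynchronous variant described above can be sketched with a hypothetical in-memory "outbox" queue (all names here are illustrative, not taken from the paper's implementation): the local write commits on its own, and a separate step later pushes the queued changes to the replica, so no remote node takes part in the transaction itself.</p>

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    data: dict = field(default_factory=dict)
    outbox: list = field(default_factory=list)  # changes awaiting replication

    def commit_local(self, key, value):
        """Local write: only this node participates in the transaction."""
        self.data[key] = value
        self.outbox.append((key, value))        # queued, not sent synchronously

def replicate(source: Node, replica: Node):
    """Background step: apply queued changes to the remote copy."""
    while source.outbox:
        key, value = source.outbox.pop(0)
        replica.data[key] = value

store, mirror = Node(), Node()
store.commit_local("item:1", {"price": 10})    # commits immediately
replicate(store, mirror)                       # runs outside the transaction
```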
      <p>If data needs to be synchronized in the case of a distributed request, the interaction with
the remote nodes cannot be moved outside the transaction. This is because a
distributed transaction in any case involves pieces of data stored on other nodes and not
present at the node executing the request. Carrying out the replication process outside
the transaction therefore does not seem appropriate: the reduction in transaction
execution time is insignificant compared to the possible data conflicts caused by
non-synchronous updating (see Fig. 2).</p>
      <p>
        Let us formalize the main task on the way to increasing the overall availability
of data and the usage efficiency of the software systems that work with database data
when designing a remote node of a distributed database. Based on the above, the task
is to reduce the number of distributed queries that involve the data of multiple database
nodes and to replace them with local ones [
        <xref ref-type="bibr" rid="ref2 ref3">2-3</xref>
        ].
      </p>
      <p>To minimize the number of distributed transactions within a remote node of the
distributed database, a subsystem for recording user requests was created. It classifies
queries by the automated workplace they belong to, geographical location, user role,
and other criteria that can be added to the system dynamically, depending on the
subject area. The datalogical model of the developed technology is shown in Fig. 3.</p>
      <p>At the center of the model are the users' query log and the users' query catalog (the
QueriesLog and QueryList tables). If a query is nested or has multiple levels, its hierarchical
structure is described using the InnerQueriesList table. Queries are classified by the type of
software, workstation, user, and location from which they come, with binding to future
nodes of the distributed database (the Workstation, Places, ApplicatioList, WorksAppList,
and WorkPlaceType tables). The structure given in Fig. 3 also makes it evident that
adding the extra analytical characteristics a particular subject area may require
will not demand major changes to the database. After the SQL query code is parsed
(parsing is the process of converting source code to a structured
view), queries are also classified by the list of database tables present in the
query and, after more in-depth analysis, by the list of relation attributes and
tuples (the QueryRelations, RelationList, and FieldList tables).</p>
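<p>A trimmed-down sketch of part of the model in Fig. 3, using SQLite for illustration: the table names come from the text, but all column names are assumptions, since the figure itself is not reproduced here. A new analytical dimension is just one more table plus a foreign key, which is why extending the model needs no major changes.</p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE QueryList        (id INTEGER PRIMARY KEY, query_text TEXT);
CREATE TABLE InnerQueriesList (parent_id INTEGER REFERENCES QueryList(id),
                               child_id  INTEGER REFERENCES QueryList(id));
CREATE TABLE Workstation      (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE QueriesLog       (id INTEGER PRIMARY KEY,
                               query_id       INTEGER REFERENCES QueryList(id),
                               workstation_id INTEGER REFERENCES Workstation(id),
                               executed_at TEXT);
CREATE TABLE RelationList     (id INTEGER PRIMARY KEY, table_name TEXT);
CREATE TABLE QueryRelations   (query_id    INTEGER REFERENCES QueryList(id),
                               relation_id INTEGER REFERENCES RelationList(id));
""")
con.execute("INSERT INTO QueryList VALUES (1, 'select name, age from students')")
```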
      <p>The developed model is populated with input data using DBMS profiling mechanisms.
In this research, the SQL Profiler tool, included in the standard SQL Server
installation package, was used. This choice is due to the sufficient functionality the
application provides and its compatibility with the DBMS version used in the company
where the research results were tested. If the input conditions change, another
application may be chosen instead.</p>
      <p>The subsystem has a user interface consisting of data input and editing forms for
the main entities. There are mechanisms for importing data from text and CSV files for
interaction with third-party software (for example, when importing a list of users,
software, or workstation tasks). Some entities can also be filled in based on the list of
users' SQL queries.</p>
      <p>To analyze the accumulated statistics further, it is necessary to determine the
specific relations (as well as references to their attributes and tuples) from the text
of users' queries. Tools for parsing and linguistic analysis of text were used to solve
this problem. SQL, like every programming language, has rules that define the
syntactic structure of a program. The syntax of programming language constructions
can be described using context-free grammars or BNF (Backus-Naur Form) notation.</p>
      <p>In this research, the ANTLR parser generator was used. It generates LL(*) parsers
and has existed for more than 20 years; its 4th version was released in 2013, and its
development now continues on GitHub. At the moment, it can generate parsers in Java, C#,
Python, and JavaScript. The initial step in implementing the T-SQL query parsing
subsystem is to create a grammar. The generated lexer and parser classes then allow
presenting the query code as a tree model, which can be used to determine the list of
relations and attributes used in the query. Figure 4 shows the element
hierarchy tree for the query "select (select number from groups where id = students.id), name,
age from students", which uses multiple attributes and has one subquery.</p>
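<p>The real subsystem relies on an ANTLR-generated lexer and parser built from a T-SQL grammar, which cannot be reproduced here. The following toy parser only hints at the idea: for simple queries like the sample from Fig. 4 it extracts the tables and nested subqueries into a small tree (a real grammar handles far more of the language).</p>

```python
import re

def parse(sql):
    """Toy stand-in for the ANTLR-generated tree: returns a nested dict
    with the query's tables and subqueries."""
    sql = sql.strip()
    subqueries, depth, start = [], 0, None
    for i, ch in enumerate(sql):          # find top-level parenthesised parts
        if ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                inner = sql[start + 1:i]
                if inner.lstrip().lower().startswith("select"):
                    subqueries.append(parse(inner))   # recurse into subquery
    # mask parenthesised parts so a subquery's FROM is not matched here
    masked = re.sub(r"\([^()]*\)", "", sql)
    m = re.search(r"\bfrom\s+([\w, ]+?)(\s+where\b|$)", masked, re.I)
    tables = [t.strip() for t in m.group(1).split(",")] if m else []
    return {"tables": tables, "subqueries": subqueries}

tree = parse("select (select number from groups where id = students.id), "
             "name, age from students")
```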
      <p>Each request is then presented as an object with the following attributes: the
workstation, user, and software from which the request came; the text of the request
itself; the collection of tables the query addressed; and a collection of subqueries,
if there are any. Here the "table" (TsqlRelation) entity is a subset of a database
table and contains the table name, the collection of attributes involved in the
query, and the collection of table primary key values corresponding to the set of
tuples returned as the result of the query. A fragment of the class diagram of the SQL
query parser information technology is shown in Fig. 5.</p>
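<p>A sketch of this object model in Python: the TsqlRelation name is taken from the text, while the TsqlQuery class name and all field names are assumptions about the diagram in Fig. 5.</p>

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TsqlRelation:
    name: str                                           # database table name
    attributes: List[str] = field(default_factory=list) # columns used by the query
    key_values: List[int] = field(default_factory=list) # PK values of returned tuples

@dataclass
class TsqlQuery:
    workstation: str        # where the request came from
    user: str
    application: str
    text: str               # the request text itself
    relations: List[TsqlRelation] = field(default_factory=list)
    subqueries: List["TsqlQuery"] = field(default_factory=list)

q = TsqlQuery(workstation="WS-01", user="ann", application="retail",
              text="select name, age from students",
              relations=[TsqlRelation("students", ["name", "age"], [1, 2, 3])])
```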
      <p>At this step we thus have the set of relations with the attributes used in the query. This
is sufficient to determine the list of tables represented in the node of the distributed
database and to limit the total amount of data by applying the projection operation to
them. However, this may not be enough: an accounting system database usually contains
tables with a large number of tuples. Such tables take up a lot of disk space and
create extra workload when queried, while a particular remote node of the distributed
database actually uses only a small percentage of the data they contain.</p>
      <p>The next step is to determine the set of tuples used in a particular query. To do this,
we extend the TsqlRelation class with a collection of the TsqlField type that identifies
the relation's primary key. If there is no primary key, or the key contains a large number of
non-numeric attributes and is therefore large, a unique
auto-incremented not-null field can be created at the DB level and used later to
identify the tuple.</p>
      <p>Further, if the request has nested queries, each of them is processed separately.
For each table in the query's relation collection, an additional database query is
executed in which the attribute list is replaced with the relation's primary key. The query
result is stored as the set of the table's tuples used by the current query. For example, if
the request selects students' names together with their group numbers from two tables,
using a nested query to filter by student specialty (Fig. 6), it is divided into the
set of queries shown in Fig. 7.</p>
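<p>The per-table rewriting step can be sketched as follows; the table and column names only approximate the queries of Fig. 6 and Fig. 7, which are not reproduced here, so treat them as hypothetical.</p>

```python
def key_query(table, pk, where=None):
    """Rewrite a query so it returns only the primary key of one table,
    keeping the original filter; a simplified version of the step above."""
    q = f"SELECT {pk} FROM {table}"
    return q + (f" WHERE {where}" if where else "")

# The original request (students' names with group numbers, filtered by
# specialty through a nested query) splits into per-table key queries:
q1 = key_query("students", "id",
               "specialty_id IN (SELECT id FROM specialties WHERE name = 'CS')")
q2 = key_query("groups", "id")
```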
      <p>After the attribute sets have been identified, the subset of data valuable for the
remote database node can be determined. The subset is determined by filtering tables
by the values of the primary key. This allows most of the data the node needs to be placed
locally and, accordingly, reduces the number of distributed queries. At the same time,
the amount of data in the local node is kept minimal, which in turn minimizes the need
to synchronize different versions of the same data.</p>
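<p>A minimal illustration of this filtering step, using SQLite in place of the SQL Server DBMS the paper targets (schema and data are invented): the node database receives only the projection of the columns the node's queries use and the tuples whose primary keys the analysis marked as valuable.</p>

```python
import sqlite3

central = sqlite3.connect(":memory:")
central.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, branch TEXT)")
central.executemany("INSERT INTO students VALUES (?, ?, ?)",
                    [(1, "Ann", "north"), (2, "Bob", "south"), (3, "Eve", "north")])

valuable_keys = {1, 3}        # tuples this node's queries actually touch
placeholders = ",".join("?" * len(valuable_keys))
rows = central.execute(       # projection (id, name) + selection by key
    f"SELECT id, name FROM students WHERE id IN ({placeholders})",
    tuple(valuable_keys)).fetchall()

node = sqlite3.connect(":memory:")   # the remote node keeps only this subset
node.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
node.executemany("INSERT INTO students VALUES (?, ?)", rows)
```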
      <p>However, operating the described technology raises a problem connected with data
modification, when new data is added to the tables or existing data is modified or
deleted. The simplest solution is to replicate any new or changed data to the remote
database node (since we do not know its usefulness to the node). When the volume of
this data reaches a critical point, determined for example by the server load during
data retrieval or synchronization, the SQL queries must be re-analyzed and the base of
attributes and tuples of the database tables used by the remote node must be updated.</p>
      <p>
        It should be mentioned that the complete analysis evaluating database table
attributes and tuples is a very resource-intensive operation and cannot be
performed frequently, so an approach based on regular recalculation is
unacceptable for large and frequently changing databases. Therefore, it was suggested to
present the problem of evaluating new or modified data as a data mining
classification task [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The input is the name of the table and the list of values
of its attributes (the new or changed row); the output is the decision about its value
for the remote database node. To solve this problem, it is proposed to use a classical
neural network perceptron with one hidden layer. The network is trained on the data
obtained from the SQL query parsing analysis; once trained, it
classifies new and changed data.
      </p>
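<p>A compact, self-contained sketch of such a classifier: a one-hidden-layer perceptron trained by gradient descent on invented toy rows, where the last input flag stands for "row relates to this node's location". The actual feature encoding of table rows is not specified in the text and would be domain-dependent.</p>

```python
import math, random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class Perceptron:
    """One hidden layer of sigmoid units, trained by plain gradient descent."""
    def __init__(self, n_in, n_hidden):
        self.w1 = [[random.uniform(-1, 1) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [random.uniform(-1, 1) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        self.h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return sigmoid(sum(w * h for w, h in zip(self.w2, self.h)) + self.b2)

    def train(self, samples, epochs=2000, lr=0.5):
        for _ in range(epochs):
            for x, y in samples:
                out = self.forward(x)
                d_out = (out - y) * out * (1 - out)         # output-layer error
                for j, h in enumerate(self.h):
                    d_h = d_out * self.w2[j] * h * (1 - h)  # hidden-layer error
                    self.w2[j] -= lr * d_out * h
                    for i, xi in enumerate(x):
                        self.w1[j][i] -= lr * d_h * xi
                    self.b1[j] -= lr * d_h
                self.b2 -= lr * d_out

# (table one-hot..., node-location flag) -> is the row valuable for the node?
samples = [([1, 0, 1], 1), ([1, 0, 0], 0), ([0, 1, 1], 1), ([0, 1, 0], 0)]
net = Perceptron(n_in=3, n_hidden=4)
net.train(samples)
preds = [round(net.forward(x)) for x, _ in samples]
```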
      <p>
        Given the need to analyze the stored data along multiple
dimensions, as well as its likely large volume, the data required for the analysis was
presented in the form of a multidimensional database, in other words, an on-line analytical
processing (OLAP) cube [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. To do this, a number of views were added at the
relational database level. They implement the fact table and the dimension tables in the
form of a star model. The structure of the obtained multidimensional cube is shown in
Fig. 8.
      </p>
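<p>An illustrative star-model fragment in SQLite: query duration is the measure, and workstation, relation, and application are dimensions, as in Fig. 8. All identifiers and the sample rows are assumptions; the fact view over the log is what lets aggregate queries roll the cube up along any dimension.</p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_workstation (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_relation    (id INTEGER PRIMARY KEY, table_name TEXT);
CREATE TABLE dim_application (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE queries_log (workstation_id INT, relation_id INT,
                          application_id INT, executed_at TEXT, duration_ms INT);
-- the view implements the fact table of the star model
CREATE VIEW fact_queries AS
  SELECT workstation_id, relation_id, application_id,
         date(executed_at) AS day, duration_ms
  FROM queries_log;
""")
con.executemany("INSERT INTO queries_log VALUES (?,?,?,?,?)",
    [(1, 1, 1, "2019-10-01 09:00", 120),
     (1, 2, 1, "2019-10-01 09:05", 80),
     (2, 1, 1, "2019-10-02 10:00", 200)])
# Roll-up along the workstation dimension (one slice of the cube):
rollup = con.execute("""SELECT workstation_id, SUM(duration_ms)
                        FROM fact_queries GROUP BY workstation_id""").fetchall()
```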
      <p>The figure shows the base structure with one measure, the query execution time, and
four dimensions: workstation, relation (table), application (where the query came
from), and the standard datetime dimension. The number of dimensions can obviously be
extended, for example by adding a hierarchy of fields and tuples to the
&#8220;relation&#8221; dimension, or user, department, and branch to the workstation
dimension. Also, if the query profiler application can accumulate other useful data,
for example the I/O volume in bytes, it can be added to the fact table.</p>
      <p>The results of the research were tested while designing the database structure of an
automated retail store. They were also used in the implementation of the subsystem
for synchronization with the central database. Using the local and remote DB versions,
several indicators were collected; the results are shown in Fig. 9.</p>
      <p>The result is therefore an increase in the read/write speed of the remote node's
information system and an unloading of the central database. This result is achieved
by optimizing the DDB structure, minimizing the number of distributed transactions,
and using asynchronous data replication.</p>
    </sec>
    <sec id="sec-5">
      <title>Summary and conclusion</title>
      <p>Several results were achieved during the research. For the first time, it is
proposed to use multilevel catalogs of analytical characteristics of users' SQL queries
with estimation of the data relevance level. An SQL query parsing subsystem has been
developed that presents a user's query as a set of relations with the sets of used
attributes and tuples. For the first time, a multidimensional data model was used to
analyze users' SQL queries, and a decision support system was
implemented on its basis. When designing the remote workplace DB structure, the decision support
system allows on-line analytical analysis of SQL queries by dimensions and quick,
efficient consolidation, drill-down, rotation, and slicing operations.
When the central database data is modified, it is suggested to use a neural network to
classify new or changed data according to its value for the remote
node of the distributed database.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Mahon</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Transitioning the Enterprise to the Cloud: A Business Approach</article-title>
          , Cloudworks Publishing Company,
          <volume>178</volume>
          p. (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Özsu</surname>
            ,
            <given-names>M. Tamer</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valduriez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Principles of Distributed Database Systems</article-title>
          . 3rd ed., Springer,
          <volume>845</volume>
          p. (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kimball</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caserta</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data</article-title>
          . Wiley Publishing, Inc., 528 p. (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Fisun</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shved</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nezdoliy</surname>
            ,
            <given-names>Yu.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davydenko</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>The experience in application of information technologies for teaching of disabled students</article-title>
          .
          <source>In: Proc. of the 8th IEEE Int. Conf. on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS)</source>
          . Warsaw, Poland, pp.
          <fpage>935</fpage>
          -
          <lpage>939</lpage>
          , September
          <year>2015</year>
          . IEEE. DOI: 10.1109/IDAACS.2015.7341441.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Garcia-Molina</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ullman</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Widom</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <source>Database Systems: The Complete Book. Upper Saddle River</source>
          , New Jersey: Prentice Hall; Dep. of Computer Science Stanford University (
          <year>2002</year>
          )
          <article-title>(in Russian)</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Transaction Management in ORACLE: http://www.novsu.ru/file/96492/, last accessed 10/11/
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Dvoretskyi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Davydenko</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Creating the structure of DDB based on SQL-queries parsing results. Science works "MNU named by P.Mohyla"</article-title>
          . Series: Computer technologies, (
          <year>2016</year>
          ).
          <article-title>(in Ukrainian)</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Fisun</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dvoretskyi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shved</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davydenko</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Query Parsing in Order to Optimize Distributed DB Structure</article-title>
          .
          <source>In: Proc. of the 9th IEEE Int. Conf. on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS)</source>
          ,
          <year>2017</year>
          . Vol.
          <volume>1</volume>
          , pp.
          <fpage>172</fpage>
          -
          <lpage>178</lpage>
          . IEEE (
          <year>2017</year>
          ). DOI: 10.1109/IDAACS.2017.8095071.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Larichev</surname>
            ,
            <given-names>O. I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petrovskiy</surname>
            ,
            <given-names>A. V.</given-names>
          </string-name>
          :
          <article-title>Decision support systems. The current state and prospects for their development</article-title>
          . Vol.
          <volume>21</volume>
          ., p.
          <fpage>131</fpage>
          -
          <lpage>164</lpage>
          , (
          <year>1987</year>
          ).
          <article-title>(in Russian)</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Tikhonov</surname>
            ,
            <given-names>A. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsvetkov</surname>
            ,
            <given-names>V.Y.</given-names>
          </string-name>
          :
          <article-title>Methods and systems of decision support</article-title>
          . MAX Press,
          <volume>312</volume>
          p., (
          <year>2001</year>
          ).
          <article-title>(in Russian)</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Oreshkov</surname>
            ,
            <given-names>V.I.</given-names>
          </string-name>
          :
          <article-title>Intellectual data analysis as the most important tool for the formation of the intellectual capital of organizations</article-title>
          .
          <source>Creative Economy</source>
          ,
          <volume>12</volume>
          ,
          <fpage>84</fpage>
          -
          <lpage>89</lpage>
          (
          <year>2011</year>
          ).
          <article-title>(in Russian)</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Barsegyan</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kupriyanov</surname>
            ,
            <given-names>M. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stepanenko</surname>
            ,
            <given-names>V. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holod</surname>
            ,
            <given-names>I. I.</given-names>
          </string-name>
          :
          <article-title>Methods and models of data analysis: OLAP and Data Mining</article-title>
          .
          <source>St. Petersburg: BHV-Petersburg</source>
          ,
          <volume>336</volume>
          p, (
          <year>2004</year>
          ).
          <article-title>(in Russian)</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>