<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>COLINS-</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Computational Technology and Software Tool for Translation of Business Rules into Database Creation Scripts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrii Kopp</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmytro Orlovskyi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Technical University “Kharkiv Polytechnic Institute”</institution>
          ,
          <addr-line>Kyrpychova str. 2, Kharkiv, 61002</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>7</volume>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>Almost all modern software systems from small mobile applications to large enterprise suites are using databases and database management systems to store and manage data collections and ensure their persistence and security. However, dominating Agile software development methodologies do not assume engineers focus only on database design or any other specific field. In Agile projects, database design usually belongs to the scope of software developers responsible for server-side programming. Hence, a lack of specialized skills and experience may lead to database design shortcomings or even errors causing time and money expenses at the later stages of software development projects. When software requirements are elicited, database entities and relations are captured in business rules - brief and precise descriptions of a considered domain. Therefore, in this paper, we propose a computational technology and a software tool for the translation of business rules into database creation scripts to reduce time and cost expenses at the development or maintenance stages of software engineering projects. In this study we use the finite-state machine based approach to parse business rules, association rules learning, and naming conventions to detect data types using attribute names, and the most widely-used database management systems to perform experiments.</p>
      </abstract>
      <kwd-group>
        <kwd>1 Business Rules</kwd>
        <kwd>Database Design</kwd>
        <kwd>Relational Model</kwd>
        <kwd>Database Schema</kwd>
        <kwd>Code Generation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Today is hard to imagine software applications that do not use a database (DB) to keep user data.
In general, a database is the collection of raw facts (i.e. end-user data), which is stored in a structured
manner, and metadata that describes this structure and how end-user data is managed. Databases are
handled by the specialized system software called the Database Management System (DBMS). The
DBMS is responsible for controlling the database structure’s persistence and security [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Currently, software applications of different complexity and data storage requirements, from food
delivery mobile applications to large enterprise information systems (IS), keep their data collections
in databases managed by DBMS. As a result, database design is a crucial component of practically
any software engineering project and may call for specialized engineers with strong expertise in data
modeling and DB development. This approach may work for waterfall or iterative projects when task
delegation to various specialists or even teams and big project teams are common things [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        However, according to Agile practices dominating in software engineering projects, DB design is
done by the same software engineers responsible for the back-end (i.e. server-side) development, or
even by full-stack developers responsible for both front-end and back-end programming [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        This makes engineers responsible for DB design and development replaceable, while the whole
software engineering process is more flexible. Nevertheless, the lack of database design skills, proper
training, and experience, as well as the time and staff resources to focus more precisely on database
schema correspondence to a domain, may lead to design mistakes and further time and monetary
expenses for fixing errors made at the DB design stage. It is well-known though, that the cost of error
fixing increases exponentially during the project and may be several times higher at the development
stage than at the design stage [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Business rules are used as sources of DB design requirements, helping to detect necessary entities,
attributes, and relationships. According to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], business rules (BR) are textual explanations of policies,
procedures, and concepts within a certain organization that are compact, exact, and unambiguous. For
example, when the IS or its certain components are planned for development, or when the
out-of-thebox enterprise software is planned for customization, BR could be taken from company employees or
documents [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Database BR are definitions of data model components, including entities, attributes, relationships,
and constraints [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A similar definition of DB BR is given in Wiegers and Beatty’s book devoted to
software requirements elicitation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. According to their taxonomy, there are six types of BR, such as
facts, constraints, action enablers, inferences, computations, and terms. Whereas the other types of BR
describe the software system’s behavior in action, facts describe data model entities and relationships.
Thus, in the rest of the paper, we will consider BR as facts that describe the static structure of data
models according to Wiegers and Beatty [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Therefore, this paper aims at reducing time losses and related costs occurring at the late stages of
software engineering projects when the DB design is handled by inexperienced or untrained engineers
responsible for server-side or full-stack development. The goal could be achieved by developing the
computational technology and software tool for the translation of BR into DB creation scripts, such as
Data Definition Language (DDL) statements of Structured Query Language (SQL).</p>
      <p>The rest of this paper is structured as follows: the state-of-the-art analysis is given in sub-section
1.1, the formal problem statement is given in 1.2, materials and methods for DB schema generation
from BR are outlined in section 2 and its respective sub-sections, while the software tool development
and usage to solve the considered problem is described in section 3. The analysis of obtained results
includes validation of SQL scripts generated from BR for compliance with the most popular DBMS.
In the last section conclusion and future work in this field are formulated.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        According to the “DB-Engines Ranking” website that ranks DBMSs according to their popularity,
the most widely used are relational database management systems: Oracle, MySQL, Microsoft SQL
Server, and PostgreSQL [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Even though these DBMSs also support other data models, they still use
relational models as their primary mechanisms. It is well-known, that relational databases consist of
tables, table columns, and relationships that link tables together [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Database tables represent entities
of a certain domain, table columns represent entity attributes, primary keys uniquely identify records
in tables, and foreign keys establish references between tables [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. As can be seen, the main principles
of relational database design did not change much since they were introduced in the 1970s.
      </p>
      <p>
        Existing studies in the field of automatic DB schema generation from business rules demonstrate
different approaches. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] authors proposed the automatic generation of an extended
entityrelationship (ER) diagram using natural language processing (NLP). This approach by Šuman et. al
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] parses sentences and tags them as various parts of the speech by then mapping these words to ER
elements: entities, attributes, and relationships [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Another study [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proposes an algorithm for the
ER diagram transformation into relational database schema, however, it uses ER models partitioning
into “ER-construct-units” to validate transformations but not to produce SQL code [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. As for
NLPbased approaches, authors of [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] propose a system that detects and extracts necessary information
from a given natural language text scenario and builds the ER diagram and SQL queries. However,
the system proposed in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is focused on generating SQL queries for the data access (i.e. SELECT
statements), not database creation. Similarly, authors of [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] proposed a novel natural language
interface to databases named SQL Query Generation Engine, which is used to translate natural
language into SQL queries to read data from databases [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. One more interesting paper [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which is
focused on database creation, proposes a new method that uses NLP techniques for UML (Unified
Modeling Language) class diagram generation from requirement specifications to allow software
developers to analyze requirements in an efficient way [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Approach</title>
    </sec>
    <sec id="sec-4">
      <title>3.1. Business Rules Processing and Formalization</title>
      <p>
        Let us also represent fact business rules using controlled language structures of limited grammar
and complexity. Hence, the structure of fact business rules [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ] can be formally described using the
Extended Backus-Naur Form (EBNF) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] as following:
      </p>
      <p>
         fact :: {each}  entity _ name  {has}  attribute _ name  . (1)
Hence, a deterministic finite-state machine (DFSM) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is proposed to parse a sequence of tokens
Tokens included into each business rule:
      </p>
      <p>Tokens, S, s0 , , F , (2)
where:


</p>
      <p>Tokens is the finite non-empty set of tokens of each business rule;
S is the finite non-empty set of states;
s0 is the initial state, s0  S ;
  : S Tokens  S is the state-transition function;
 F is the set of final states, F  S .</p>
      <p>
        The proposed DFSM is given in Fig. 1 below using the UML state diagram.
Besides the initial state s0  S , the set of states S includes the following ones:
s1 is the state to which the DFSM transfers after the “each” token is passed as input;
s2 is the state to which the DFSM transfers after the token, which corresponds the following
conditions, is passed as input:
token "has"token StopWords,
(3)
where StopWords is the set of stop words, which can be filtered using the NLTK (Natural Language
Toolkit) package [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ];
 s3 is the state to which the DFSM transfers after the “has” token is passed as input;
s4 is the state to which the DFSM transfers after the token, which corresponds the following
conditions, is passed as input:
token "."token StopWords;
(4)
s5 is the final state to which the DFSM transfers after the “.” is passed as input, s5  F .
Let us describe the additional behavior of the proposed DFSM:
 when the state machine transfers to s2 , the given token is added to a set that forms the name
of entity EntityName ;

      </p>
      <p>when the state machine transfers to s4 , the given token is added to a set that forms the name
of attribute AttributeName .</p>
      <p>As given above, the proposed DFSM (Fig. 1) takes fact business rules given according to (1) and
produces ER model components: entity names and attribute names. The further problem includes the
detection of relationships between extracted entities, as well as data types assignment to attributes.</p>
      <p>Let us formally describe a relational database structure using the following equations:</p>
      <p>Entities  Entityi ,i  1,2,, n,
i  1,2,, n : Entityi  EntityNamei ,Attributeki , k  1,2,, q ,
i  1,2,, q : Attributeki  AttributeNameki , AttributeTypeki ,
 : Attributes  DataTypes  DataTypej , j  1,2,, m,</p>
      <p>AttributeTypeki  DataTypes,
 : Attributes   Entity PK , Entity FK , Entity PK  Entities  Entity FK  Entities,</p>
      <sec id="sec-4-1">
        <title>Attributes   Attributes,</title>
        <p>(5)
Entities is the set of entities;
Entityi is the i -th entity of a database structure, i 1,2,, n ;
n is the number of entities in a database structure;
Attributes is the set of attributes;
EntityName i is the name of i -th entity, i 1,2,, n ;
Attributeki is the k -th attribute of i -th entity, k  1,2,, q ;
AttributeNameki is the name of k -th attribute of i -th entity, k  1,2,, q ;
AttributeTypeki , is the data type of k -th attribute of i -th entity, k  1,2,, q ;
q is the number of attributes in a database structure;
DataTypes is the set of all data types that can be used in a database structure;</p>
        <sec id="sec-4-1-1">
          <title>DataTypej is the j -th data type, j  1,2,, m ;</title>
          <p>m is the number of data types used in a database structure;
 is the function that defines a data type DataTypej , j  1,2,, m for each attribute;
Entity PK , Entity FK  is the pair of entities that form a relationship;
Attributes  is the sub-set of attributes that form relationships, Attributes  Attributes ;</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>Entity PK is the parent entity;</title>
          <p>Entity FK is the child entity;
where:

















</p>
          <p> is the function that defines a pair of parent and child entities for each common attribute
given in the sub-set Attributes  .</p>
          <p>Using the proposed formal description of a relational database structure (5), the input data in the
form of business rules (1) processed by the DFSM (Fig. 1) will be structured to generate SQL scripts.
3.2.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Preliminary Extraction of Entities and Attributes Using DFSM</title>
      <p>Let us consider the example of the following business rules describing a domain of students and
courses:</p>
      <p>“Each student has a student ID. Each student has a full name. Each course has a course number.
Each course has a title. Each course has a number of credits. Each enrollment has a student ID. Each
enrollment has a course number. Each enrollment has a grade.”</p>
      <p>Using the proposed DFSM, we obtain the following entities and their attributes. All input tokens
combined to obtain entity names and attribute names values should be merged using the underscore
(i.e., “_”) separator to simplify the further generation of syntactically correct SQL code. Stop words
should be removed according to equations (3) and (4).</p>
      <p>The initial information on the database structure, which is extracted from the business rules given
above, is demonstrated in Table 1.</p>
      <p>Relationships between entities should be created using common attributes – primary keys (PK) in
parent entities and foreign keys (FK) in child entities. To detect relationships, we propose to check all
pairs of entities and find intersections of their attributes. An entity that firstly mentions the common
attribute should be considered as parent, while other entities – as child (Fig. 2).</p>
      <p>Using extracted entities and their respective attributes (Table 1), we can apply the algorithm given
above (Fig. 2) to detect relationships between entities (Table 2).</p>
    </sec>
    <sec id="sec-6">
      <title>Attribute Data Types Suggestion using Apriori Algorithm</title>
      <p>
        The following essential problem includes the data types suggestion for attributes using their names
only. In this study we use the “Spider 1.0 NLP Dataset” created and supported by Yale students. This
dataset includes 166 databases with various tables that cover 138 different domains [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        Using the metadata descriptions of database tables from the “Spider” dataset, we prepared the set
of attribute names and corresponding data types. These pairs are used by the association rule learning
Apriori algorithm (Fig. 3) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
where:
 SupportAttribute  DataType is the probability of observing the attribute and the data type
together, # PairsAttribute, DataType is the number of pairs containing both the attribute and the
data type, and # Pairs is the number of all pairs;
 ConfidenceAttribute  DataType is the probability of having the data type for attribute;
 LiftAttribute  DataType is the times probability of choosing the data type increases for a
given attribute.
      </p>
      <p>Hence, Lift  1 means the data type depends on the attribute and this association rule can be used
with a certain level of confidence. The data type suggestion algorithm is given in Fig. 4 below.
DataType default (if pairs with a given attribute do not exist) should be suggested:</p>
      <p>Lift max  max LiftAttribute  DataType j .</p>
      <p>j1,2,,m
The suggested data type should be chosen if the following condition is satisfied:</p>
      <p>LiftAttribute  DataTypej  Lift max   , j  1,2,, m,
where  is some tolerance limit or stopping criteria, let us use   106 .</p>
      <p>The “Spider” dataset includes the following data types assigned to different database columns:</p>
      <p>DataTypes  "number","time","text","boolean" . (11)
The default data type is chosen to be a text type, DataTypedefault "text".</p>
      <p>As it is demonstrated in Table 3, attributes “course_number” and “number_credits” are not present
in the “Spider” data set. Therefore, an alternative technique should be proposed to augment the used
association rules learning algorithm.
3.5.</p>
    </sec>
    <sec id="sec-7">
      <title>Attribute Data Types Suggestion using Naming Conventions</title>
      <p>Let us consider the following sets of keywords that will be used to augment the association rules
algorithm by checking how attribute names match naming conventions:</p>
      <p>Knum, Ktime , Kbool ,</p>
      <sec id="sec-7-1">
        <title>Knum is the set of keywords used to recognize number columns;</title>
      </sec>
      <sec id="sec-7-2">
        <title>Ktime is the set of keywords used to recognize time columns;</title>
        <p>(9)
(10)</p>
        <sec id="sec-7-2-1">
          <title>Knum</title>
        </sec>
        <sec id="sec-7-2-2">
          <title>Ktime</title>
        </sec>
        <sec id="sec-7-2-3">
          <title>Kbool</title>
          <p>Therefore, the mapping between keywords and data types they describe can be formally described
as the following function:
 : Knum  Ktime  Kbool  DataTypes .
(13)</p>
          <p>The data type suggestion algorithm based on attribute naming conventions is shown in Fig. 5.</p>
          <p>Hence, the proposed algorithm (Fig, 5) allowed to suggest the “number” data type for attributes
“course_number” and “number_credits” previously identified as text attributes (see Table 3).
3.6.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Proposed Computational Technology Design</title>
      <p>The proposed computational technology includes steps described in sub-sections above and given
formally using the following IDEF0 diagram (Fig. 6).</p>
      <p>The final step shown in Fig. 6 includes SQL code generation using the developed software tool.</p>
      <p>
        A  A1, A2 , A3 is the set of interacting algorithms [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ];
      </p>
      <sec id="sec-8-1">
        <title>A1 is the relationships detection algorithm;</title>
        <p>A2 is the attribute data types suggestion algorithm based on association rules;</p>
        <p>
          A3 is the attribute data types suggestion algorithm based on naming conventions;
 R  A  A describes connections between algorithms [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
        </p>
        <p>To perform experiments and validate the proposed computational technology, the software tool for
business rules translation into SQL data definition scripts should be developed.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>4. Results</title>
    </sec>
    <sec id="sec-10">
      <title>4.1. Software Tool Development</title>
      <p>The software tool should read a business rules text file as the input, when started by a user, and
generate the SQL scripts file as the output. Also users should be able to modify keywords depending
on used attributes naming convention (see Table 4).</p>
      <p>The UML use case diagram demonstrating the main functional capabilities of the software tool for
business rules translation into database creation scripts is demonstrated in Fig. 7 below.</p>
      <p>As it is shown in the use case diagram (Fig. 7) above, some usage scenarios are gray-colored. This
means such activities are tightly related to the software tool but executed independently using other
software:
 the business rules text file is created using any third-party text editor;
 obtained SQL scripts are adjusted and deployed to DBMS using ant third-party client tool.</p>
      <p>The developed computational technology in terms of software components is shown in Fig. 8.</p>
      <p>The component diagram above (Fig. 8) demonstrates structural elements mentioned in equations
(14) and (15), as well as describes connections between DFSM, algorithms, and SQL code generation
software module (for better perception, these components are colored differently).</p>
      <p>As for the SQL code generation module, this part of the software tool should solve two tasks:
 build statements according to the SQL syntax;
 substitute generic data types (11) by specific SQL data types.</p>
      <p>
        According to the “DB-Engines Ranking” resource [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the most popular relational DBMS are:
 Oracle – score: 1247.52;
 MySQL – score: 1195.45;
 Microsoft SQL Server – score: 929.09;
 PostgreSQL – score: 616.50.
      </p>
      <p>
        Therefore, the corresponding data types used in these DBMS [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] are given in Table 5 below.
      </p>
      <p>
        The software is created using the Python programming language and works as a command-line
tool, which takes several arguments, including the name of the business rules text file, the name of the
output SQL scripts file, and the name of the target DBMS. It also uses third-party Python modules:
Pandas [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] for data processing and Natural Language Toolkit (NLTK) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] for sentence and word
tokenization, and stop-words filtering.
      </p>
    </sec>
    <sec id="sec-11">
      <title>Software Tool Usage Analysis</title>
      <p>Let us use business rules we introduced in sub-section 2.1 and put them into the input file (Fig. 9).</p>
      <p>As the result, the software produces the SQL scripts file. The sample output is shown in Fig. 10.</p>
      <p>
        SQL code was produced for various DBMS and tested using “SQL Fiddle” service (Fig. 11) [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>As it is demonstrated in Fig. 11 above, input business rules (see Fig. 9) were translated into SQL
scripts used to successfully create database schemas for Oracle 11g R2, MySQL 5.6, Microsoft SQL
Server 2017, and PostgreSQL 9.6. Obviously, obtained scripts require adjustments of data types and,
possibly some other constraints, before being deployed to DBMS servers.</p>
      <p>The software tool performance was tested using different amounts of entities and attributes written
in business rules (Table 6).</p>
      <p>These results were also depicted in a chart (Fig. 12) of processing time (in sec.) and SQL scripts
size (in LOC) growth as the number of attributes increases.</p>
      <p>
        The exploratory data analysis measures [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] of performed experiments are given in Table 7 below.
It demonstrates minimum, first quartile (Q1), median, third quartile (Q3), and maximum measures of
entities, attributes, respective lines of code (LOC), and processing time.
      </p>
      <p>The statistical results (see Table 7) demonstrate the minimum processing time of 3 entities and 8
attributes is 0.04 sec. with the produced 22 lines of SQL code. The maximum processing time of 3000
entities and 8000 attributes is 39.13 sec. with the produced 22000 lines of SQL code.</p>
      <p>The obtained results (see Table 7) also show the processing time of 25% of business rules is below
0.17 sec. for the input data below 19 entities and 50 attributes, and the output data below 138 lines of
SQL code. The processing time of another 25% of business rules is above 5.87 sec. for the input data
above 525 entities and 1400 attributes, and the output data above 3850 lines of SQL code.</p>
      <p>The remaining 50% of business rules have the median processing time of 1.08 sec. for the input
data of 105 entities and 280 attributes, and the output data of 770 lines of SQL code.</p>
      <p>The statistical analysis of performed experiments proves the proposed computational technology
and software tool are capable of processing large volumes of business rules and generate SQL DDL
statements for various popular DBMS in reasonable time. The performance measurements were done
on the medium-tier workstation with the Intel Core i5-10210U processing unit of 1.60GHz frequency,
16 GB memory, and 64-bit Windows 10 operating system.</p>
    </sec>
    <sec id="sec-12">
      <title>5. Discussion</title>
      <p>The software tool is developed as the command-line utility, which takes the input business rules
from a text file and generates the output database schema creation scripts according to the proposed
computational technology (Fig. 6). The developed software tool is capable of generating SQL scripts
compatible with different popular DBMS using various SQL data types supported by these database
management systems (see Table 5).</p>
      <p>
        In this study, we considered the most popular relational DBMS: Oracle, Microsoft SQL Server,
MySQL, and PostgreSQL [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The experimental results (Fig. 9) outline the sample business rules used
as the input for the software tool were successfully translated into corresponding DDL code, including
table and column definitions, and primary and foreign keys (Fig. 10). The obtained DDL script was
successfully tested for database schema generation using the mentioned database engines (Fig. 11).
The performance testing of the created software tool demonstrates the capability of solving the large
computational problems of SQL code generation from business rules in a reasonable time (Fig. 12).
      </p>
      <p>
        The proposed technique has several limitations. The current implementation of the proposed idea
assumes the usage of only controlled language statements with a restricted structure, defined for fact
business rules by equation (1). Also, the detection of relationships between entities could fail if do not
mention primary and foreign keys as common entity attributes explicitly. However, it is a common
practice to name key attributes as we consider in our approach – among 166 databases present in the
“Spider” dataset [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], 163 databases (98%) are using shared names for primary and foreign keys.
      </p>
    </sec>
    <sec id="sec-13">
      <title>6. Conclusion</title>
      <p>In this study, we considered the problem of database script generation from natural language text
descriptions given as controlled language structures – fact business rules. Fact business rules are brief
and precise statements that describe entities of a certain domain. Thus, we formalized their processing
using the approach based on a deterministic finite-state machine (Fig. 1). However, extracted entities
and attributes are still not sufficient for database design. Therefore, we have proposed an algorithm
for relationship detection using the identification of common attributes among different entities (Fig.
2). This algorithm is based on basic concepts of relational databases – primary and foreign keys. To
suggest data types for each of the detected attributes, we used the algorithm based on association rules
learning (Fig. 4). In addition, we used the other algorithm (Fig. 5) based on naming conventions of
attributes to cover those attribute names did not mentioned in the training data set for association rules
learning. The proposed computational technology includes the DFSM, proposed algorithms, and the
SQL code generation toolkit, and describes the interaction of all of the mentioned components.</p>
      <p>In the future, the proposed computational technology should be augmented by algorithms that can
detect constraints in business rules (such as null arguments, unique indexes, allowed value ranges for
columns, etc.). Also, a web-based software solution should be developed for a better user experience.</p>
    </sec>
    <sec id="sec-14">
      <title>7. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Coronel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Morris</surname>
          </string-name>
          , Database Systems: Design, Implementation, &amp;
          <string-name>
            <surname>Management</surname>
          </string-name>
          , Cengage Learning,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Tsui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Karam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bernal</surname>
          </string-name>
          , Essentials of software engineering,
          <source>Jones &amp; Bartlett Learning</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>D. M. Anderson</surname>
          </string-name>
          ,
          <article-title>Design for manufacturability: How to use concurrent engineering to rapidly develop low-cost, high-quality products for lean production</article-title>
          , CRC Press,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Meyer</surname>
          </string-name>
          , J.
          <string-name>
            <surname>-M. Bruel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Ebersold</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Galinier</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Naumchev</surname>
          </string-name>
          ,
          <source>Towards an Anatomy of Software Requirements</source>
          , volume
          <volume>11771</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2019</year>
          , pp.
          <fpage>10</fpage>
          -
          <lpage>40</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>030</fpage>
          -29852-
          <issue>4</issue>
          _
          <fpage>2</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>DB-Engines</surname>
            <given-names>Ranking</given-names>
          </string-name>
          ,
          <year>February 2023</year>
          . URL: https://db-engines.com/en/ranking.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bressoud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>White</surname>
          </string-name>
          , Introduction to Data
          <source>Systems: Building from Python</source>
          , Springer Nature,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Javed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          , iMER:
          <article-title>Iterative process of entity relationship and business process model extraction from the requirements</article-title>
          ,
          <source>Information and Software Technology</source>
          <volume>135</volume>
          (
          <year>2021</year>
          )
          <article-title>106558</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.infsof.
          <year>2021</year>
          .106558
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Šuman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Candrlic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jakupovic</surname>
          </string-name>
          ,
          <article-title>A Corpus-Based Sentence Classifier for EntityRelationship Modelling</article-title>
          ,
          <source>Electronics</source>
          <volume>11</volume>
          (
          <issue>6</issue>
          ) (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          . doi:
          <volume>10</volume>
          .3390/electronics11060889
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Pieris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Wijegunesekera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. G. J.</given-names>
            <surname>Dias</surname>
          </string-name>
          ,
          <source>ER model Partitioning: Towards Trustworthy Automated Systems Development</source>
          , arXiv (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/
          <year>2007</year>
          .00999. doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>2007</year>
          .
          <volume>00999</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hettiarachchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sugandhika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kathriarachchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahangama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. T.</given-names>
            <surname>Weerasuriya</surname>
          </string-name>
          ,
          <article-title>A Scenario-based ER Diagram and Query Generation Engine</article-title>
          ,
          <source>In 2019 4th International Conference on Information Technology Research (ICITR)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICITR49409.
          <year>2019</year>
          .
          <volume>9407793</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sugandhika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahangama</surname>
          </string-name>
          ,
          <article-title>Heuristics-Based SQL Query Generation Engine</article-title>
          ,
          <source>in: 2021 6th International Conference on Information Technology Research (ICITR)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICITR54349.
          <year>2021</year>
          .
          <volume>9657317</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Almazroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Abualigah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Alqarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Houssein</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Q. M. AlHamad</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          <string-name>
            <surname>Elaziz</surname>
          </string-name>
          ,
          <article-title>Class Diagram Generation from Text Requirements: An Application of Natural Language Processing</article-title>
          ,
          <source>in: Deep Learning Approaches for Spoken and Natural Language Processing</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>79</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>030</fpage>
          -79778-
          <issue>2</issue>
          _
          <fpage>4</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>Tietz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Annighoefer</surname>
          </string-name>
          ,
          <article-title>A formally defined and formally provable EBNF-based constraint language for use in qualifiable software</article-title>
          ,
          <source>in: Proceedings of the 25th International Conference on Model Driven Engineering Languages and Systems: Companion Proceedings</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>862</fpage>
          -
          <lpage>871</lpage>
          . doi:
          <volume>10</volume>
          .1145/3550356.3561552.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>G. O'Regan</surname>
          </string-name>
          ,
          <article-title>Guide to Discrete Mathematics An Accessible Introduction to the History</article-title>
          ,
          <source>Theory, Logic and Applications</source>
          , Springer International Publishing,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>NLTK - Natural Language Toolkit</surname>
          </string-name>
          . URL: https://www.nltk.org/index.html.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <article-title>Yale University's Spider 1.0 NLP Dataset</article-title>
          . URL: https://www.kaggle.com/datasets/jeromeblanchet/yale-universitys-spider-10
          <string-name>
            <surname>-</surname>
          </string-name>
          nlp-dataset.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Kumbhare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. V.</given-names>
            <surname>Chobe</surname>
          </string-name>
          ,
          <article-title>An overview of association rule mining algorithms</article-title>
          ,
          <source>International Journal of Computer Science and Information Technologies</source>
          <volume>5</volume>
          (
          <issue>1</issue>
          ) (
          <year>2014</year>
          )
          <fpage>927</fpage>
          -
          <lpage>930</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>V. L.</given-names>
            <surname>Lysytskyi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. Y.</given-names>
            <surname>Morhun</surname>
          </string-name>
          ,
          <article-title>Development of software for effective enterprise product policy creation</article-title>
          ,
          <source>Bulletin of National Technical University KhPI Series System Analysis Control and Information Technologies</source>
          <volume>21</volume>
          (
          <year>2018</year>
          )
          <fpage>59</fpage>
          -
          <lpage>64</lpage>
          . doi:
          <volume>10</volume>
          .20998/2079-
          <fpage>0023</fpage>
          .
          <year>2018</year>
          .
          <volume>21</volume>
          .11.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>B.</given-names>
            <surname>Brumm</surname>
          </string-name>
          , SQL Data Types: Oracle, SQL Server, MySQL, PostgreSQL,
          <year>2022</year>
          . URL: https://www.databasestar.com/sql-data-types/.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>SQL</given-names>
            <surname>Fiddle</surname>
          </string-name>
          . URL: http://sqlfiddle.com/.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <article-title>pandas - Python Data Analysis Library</article-title>
          . URL: https://pandas.pydata.org/.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>D. T.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Scala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. L.</given-names>
            <surname>Goethals</surname>
          </string-name>
          , Mathematics in Cyber Research, CRC Press,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>