Natural Language Interface to Database Approach in the Task of Relational Databases Design

Natural Language Interface to Database Approach in the Task of Relational Databases Design KaterynaYalova yalovakateryna@gmail.com Dniprovsky State Technical University

Dniprobydivska str.2 51918 Kamyanske Ukraine

KseniiaYashyna yashinaksenia85@gmail.com Dniprovsky State Technical University

Dniprobydivska str.2 51918 Kamyanske Ukraine

Abdel-BadeehMSalem abmsalem@yahoo.com Ain Shams University

El-Khalyfa El-Mamoun Street Abbasya Cairo Egypt

International Workshop on Intelligent Information Technologies and Systems of Information Security

March 22-24 2023 Khmelnytskyi Ukraine

Natural Language Interface to Database Approach in the Task of Relational Databases Design F63B14BFAAE07BC2B8615F94D1E174D4 GROBID - A machine learning software for extracting information from scholarly documents Databases Natural Language Interface to Databases (NLIDB) Structured Query Language (SQL) pattern-based approach

Developing general-purpose solutions that could be migrated to a variety of databases, companies, and data domains has been a big challenge for researchers for years. Today, most of the data is stored in relational databases, and extracting useful information from these databases requires special knowledge of the structured query language. Recent advances in natural language understanding and processing have led to renewed interest in natural language interfaces to database (NLIDB) approach, which provides a simple mechanism for nontechnical users to access data from data storages. That is why NLIDB have attracted attention as useful tools that simplify interaction between users and computer systems. The paper describes the results of the NLIDB system design and software implementation. Created system allows to generate users-system dialogue and process its results to obtain formalized data about the data domain. The NLIDB pattern matching system approach that allows to set a template for control and clarification questions for users and form an algorithm for transforming knowledge about the data domain into a database structure was used in the research. Depending on the users' responses, the system outputs the domain business rules, database diagram, SQL queries, and database file to users. The results of the developed NLIDB system testing are presented in the paper. The testing of the system was carried out to create a fragment of the database for storing invoice data. In order to check the adequacy of the proposed project solutions, an entity relation diagram for the document was used. The main problem in generating a database structure is the implementation of data normalization rules and checking for anomalies in the created database. Checking and optimizing of the developed database is carried out after populating a database with an initial data, by using a dialogue with users and applying queries to the created database tables.

Introduction

Nowadays almost every software, regardless of the field of use, whether it is a mobile application for exchanging messages or services for managing companies, processes a large amount of information that needs to be stored for long periods of time. Databases are usually used to store, process and analyze these data. In order to manage the design and processing of databases, database management systems (DMS) are used. DMS is a set of software and linguistic tools for general or special purposes. Most of the data of modern information systems and programs store data in relational databases (RDB) as a set of the tables. RDB design is a time-consuming process involving domain analysis, studying the data necessary for storage, structuring and presenting data as entities and their characteristics, creating the relations between them. The result of RDB design is its logical and physical model. A logical database model is a universal model that is independent of a specific DMS, presented in terms of the data domain. A physical database model is a model developed by the means of a specific DMS. It this case the knowledge of the structured query language (SQL) is required to provide information operations of the addition, deletion, updating and data searching in RDBs. Shortcomings in the stage of RDBs design lead to the impossibility or complexity of fixing them in the next stages of software development life cycle. Despite the existence of data normalization rules, the process of RDBs developing is poorly formalized and largely depends on the experience and knowledge of the database architect, so the task of improving the efficiency of the RDB design process remains a pressing scientific and practical task.

Achievements in Natural Language Processing (NLP) enable interaction with RDBs in natural language. The idea of using natural language instead of SQL contributed to the development of a new type of NLP called Natural Language Interface to Database (NLIDB) [1]. The concept of NLIDB is to make data stores more accessible to a wide range of non-technological users, reduce the knowledge requirements of database developers [2]. The ultimate aim of the Natural Language Interface to Database is to organize the users' communication with the database (almost) as with a person. That is, NLIDB is used to simplify work and reduce the time spent to develop and process RDB data. Software tools that use the concept of NLIDB exclude the stage in which knowledge of the SQL language, data normalization rules or RDB design rules are directly used. Using their own software mechanisms, they allow users unfamiliar with database development rules [3]:

• perform queries to create, update, delete the database and its tables in the Data Definition Language (DDL); • get actual, complete and reliable information from the database as a result of executing the generated SQL queries. The main goal of the paper is to present the results of the NLIDB system development, which allows to organize a dialogue with the user in such a way that a database is designed as a result of this interaction. To achieve this goal, the following design objectives have been identified and successfully completed:

1. Development of the NLIDB system architecture.

2. Design a dialogue model as a tool to acquire data domain knowledge.

3. Creating an algorithm to transform a verbal domain description into commands to create RD. 4. Implement a NLIDB system with a high-level programming language. 5. Using a data domain document as a complex sample data to test a developed system. Integrating the NLIDB as an abstraction layer between users and databases provides the main benefit of simplified natural language-based interaction [4]. The main difficulties in solving this problem arising from linguistic failures and the impossibility of general-purpose solutions developing, motivate researchers to continue searching for the effective methodologies and approaches to create NLIDB systems applicable to various data domains.

Related works

Over the past 50 years, many attempts have been made to create an NLIDB to generate database queries. The first NLIDBs appeared in the late sixties and early seventies. Among the first authors which described an approach of NLIDB implementing were W. Woods, R. Kaplan and B. Webber who in 1972 described the Lunar Sciences Natural Language Information System. Over the next 40 years, scientists such as: І. Androutsopoulos, G. D. Ritchie, P. Thanisch, E. Buchholz, A. Düsterhöft, B. Huang, G. Zhang, J. Mark, N. Nihalani, A. Kumar, A. Kaur, I. S. Bajwa, N. Stratica, S. Maisonneuve devoted their works to the development of effective mathematical and software mechanisms and systems for translating natural language queries into SQL queries.

Two main approaches are used to implement the NLIDB concept: 1. Rules based approaches are the approaches that aim is to design models for automatically mapping natural language semantics into SQL-queries. 2. Neural approaches [5] are approaches that form the neural architecture consists of an encoder and a decoder component for translating natural language sentences into SQL-queries and verse vise. In turn, rules based approaches can be divided into:

• Pattern-based. The main idea of the approach is to use keywords that allow you to respond to complex requests in natural language. The goal of the approach is to provide the ability to process specific language templates to identify commands. The works of M. D. Agatonovic, H. Cunningham, Zheng [6], X. Xu, C. Liu, D. Song [3] are devoted to the implementation of the pattern-based approach and its application in software systems;

• Parsing-based. The main idea of this approach is to use a natural language parser to analyze the input query and analyze the grammatical structure of the query. Then the grammatical structure can be used to improve understanding of dependencies between query markers. Such scientists as F. Li, H. V. Jagadish, D. Saha, A. Floratou, K. Sankaranarayanan [7], A. Kopp, D. Orlovskyi, S. Orekhov [8] devoted their works to using the parsing-based idea.

• Grammar-Based. The main idea of this approach is to use a set of rules and grammar, to determine how the system can build, understand and answer questions. These rules can then be used to help users enter their queries by auto-completing. Scientists such as D. Song, F. Schilder, C. Smiley, C. Brew, C. Zielund, S. Ferre [9], A. K. Deshpande, P. R. Devale, B. Huang, J. Guo, Z. Zhan [2] created semantic trees to implement the Grammar-Based approach. Recently, a lot of attention has been paid to the use of conversational interfaces [4] for data analysis, which allows business owners and non-technical users to quickly get database views.

Also, an interesting approach of the NLIDB implementing is presented in [10], where authors describe a method for effectively automating of the queries conversion from a natural language to SQLquery, using data storage objects of an online analytical processing (OLAP) cube.

The works, listed above, are mainly devoted to the creation of NLIDB systems for generating queries for already created databases. The task of designing and creating RDBs based on the verbal description of the data domain transmitted by the user by the means of dialogue with the program remains relevant and has high practical value. The paper, unlike the ones mentioned above, presents the results of the development of the system based on NLIDB approach for converting user commands submitted in natural language to commands for creating RDB. For users, dialog with developed system looks like communication with a colleague about a future database, answering short questions in natural language. The results of user interaction with the system are:

• list of the data domain business rules;

Proposed methodology

NLIDB eliminates the need to know the database design rules and the specific syntax of SQL [11][12][13]. Implementation of NLIDB requires understanding the context of natural language sentences, as well as converting these sentences into meaningful executable queries. The paper suggests using rules based approach in the form of the pattern matching system to create a physical model of the RDB. Pattern-based NLIDB is an extension of rules based systems that allow to form more complex templates in native language instead of single keywords [14][15][16]. The basic idea is to provide the ability to process specific language patterns to identify and generate queries. The main advantage of the pattern matching approach is its simplicity, there is no need for complex parsing and interpretation modules, and systems can be easily implemented [17,18]. The problem with implementing NLIDB is how to deal with the ambiguity of user information submitted in natural language. Author propose to implement the pattern matching NLIDB system so that the user-system interaction is minimized. It is proposed to form a pattern of control and clarifying questions, the answers to which allow to describe the main business rules of the data domain and convert them into SQL commands. This approach is valid for creating queries to create the database needed to store domain data.

At the request of the user, the created database can be populated with data. After populating the database with an initial data it can also be checked for anomalies or violations of normalization rules. To minimize anomalies in the developed databases, data normalization rules will be automatically used by the system. The developed RDB will be normalized to the third normal form.

Results

Since the ultimate goal of the NLIDB system is to generate a RDB, it is necessary to describe the generalized mathematical model as follow. RDB is a data structure with the following mandatory components [19]

𝑅𝑅𝑅𝑅𝑅𝑅 =< 𝑇𝑇, 𝑅𝑅 >,(1)

where T={T1,…Tn} is a set of database tables; R={R1, R2} is a set of connections presented in in the database, R1 -1:1 connection, R2 -1:N connection.

Then the database table set can be described as:

𝑇𝑇 =< 𝐴𝐴, 𝑅𝑅, 𝐾𝐾 >,(2)

where А={а1,…аi} is a set of attributes of the n-th table in the database; D={d1,…,dj} is a set domains of the n-th table containing all the values for each i-th attribute; K={k1,…,km} is a set of tuples of the nth table that combine all attribute values for m-th row.

For each i-th attribute there is a corresponding j-th domain. Let us suppose that the data described by i-th attribute is defined as t(ai), then the following constraint shall be fulfilled: t(ai)∈dj.

Further, the attribute set A can be described as:

А =< 𝑁𝑁, 𝑇𝑇𝐴𝐴, 𝑇𝑇𝑅𝑅 >,(3)

where N={n1,…nl} is attribute name set of the n-th table; TA={ta1, ta2, ta3, ta4} is a set of attribute types: ta1={ta11, ta12} -key attribute: primary and foreign key, respectively; ta2 -dynamic attribute, for which it is not possible to prevent the value, ta3 -static attribute, which value can be repeatedly used in the nth table , ta4 -attribute that can be calculated on the values of other attributes, according to the normalization rules such attributes should not be stored in a database; TD={td1,…,tdc} is a set of attribute datatypes for n-th table in the database.

A synthetic attribute ak ∈ А of type ta11 is created as the primary key, so that for any data tp of attribute ak the condition tp(ak) ≠ tp+1(ak) is true, which provides the following advantages:

• unambiguous identification of table data; • prevention of duplicated key values; • speeding up database query execution; • generation of referral integrity constraints. To design a qualitative RDB, data normalization rules were used. Normalization rules are a systematic approach to develop database tables in such a way to eliminate data repeatability and avoid insertion, update, and deletion anomalies. Normalization of data in the developed system will be carried out up to the third normal form.

The first normal form (1NF) sets the following main rules:

• attributes must not be multivalued. The most common multi-valued attributes are: addresses, name, dates, phone number, etc. All these attributes can be divided into several single value attributes;

• values in the column must belong to one domain; • all columns must have unique names; • the order of data storage is not important. The second normal form (2NF) states that the table should not be partially dependent on the key fields. The rules of the 2NF allow to form rules for choosing or creating a primary key in each database table.

The third normal form (NF3) dictates the rule that tables should not have transitive dependencies. All non-key fields which contents can refer to multiple external table must be placed in a separate table. NF3 allow to form rules for dividing data between tables and creating correct relationships between them.

Each of the listed rules is important for the full functioning of the database. The relationship between the attribute and the table, as seen from the business rules perspective, describes how the table contains the attribute, for example the conjunction "has a" or "contains".

The architecture of the developed NLIDB system can be represented as a combination of two components: a user dialogue module and a SQL query generation module. Authors propose to supplement the user dialogue module with a pattern of questions, with the help of which a managed user dialogue will be formed in the system. A generalized architecture diagram is shown in Figure 1. The generalized algorithm of users' interaction with NLIDB system modules includes the following main steps:

1. Acquiring information about the desired database as a whole.

2. Getting information about the described entity as a whole.

3. Gets information about the set of attributes of the selected entity. 4. Perform a managed dialogue to find dependencies and relationships between entity attributes, establish their type and a multiplicity of these relation. On the system side, the received data is normalized, primary keys are added if necessary, new tables are formed from subsets of attributes. 5. Performing a managed dialogue, the purpose of which is to find dependencies between entities and establish a multiplicity of these relations. If it is necessary, the system adds foreign keys and additional tables. 6. Obtaining a data domain abstraction of an entry-level. 7. Check, edit, adjust business rules and inputted data. 8. Generate SQL queries and create a database file.

Dialogue model

The input to the NLIDB system is the verbal knowledge of the user about the data domain. To create a conversation with users, a glossary of terms is used to simplify mutual understanding between the user and the system. The main terms of the glossary of the system are:

• A data domain is a part of the real world to be studied in order to organize management and, ultimately, automation of it.

•

A data domain entity is the result of abstracting a real object by highlighting and fixing a set of its properties (attributes).

•

A domain entity attribute is a characteristic of an entity that can be represented as a numeric value (a quantitative property) or a description (an indicative property). The goal of a dialogue with a user is to obtain information about the data domain for which the database will be created, to specify the entities and relations between them. In computer systems where any typing errors can lead to significant problems, it is recommended to minimize the amount of text entered by the user. Instead of prompting the user to enter entire sentences, only new information, such as entity and attribute names, was left to be entered. All other interaction with the system should be done through markers, drop-down lists, and other components that do not require manual input of information [20]. Therefore, the dialogue model was implemented in the form of a question-answer interaction. The model of the dialog was based on the pattern of the control and clarifying questions and the algorithm for transitions between them, depending on the already received answers. Tree types of questions were defined to create a pattern:

1. Qcon: content question. The users' response for the content question is the text entered from the keyboard. Obtained data allows to form a set of tables T, a set of attributes A, a set of attribute names N, a set of values K of the RBD.

2. Qclas: classification questions. The answer to the classification questions allows to determine the type of connection between attributes inside entity or between entities and establish the multiplicity of such connection (R1 or R2), determine candidates for foreign keys, form a set of relationships R of the RBD, bring the data to normal forms. The system receives the users' response through drop-down lists.

3. Qchar: characteristic questions. The characteristic questions are aimed at obtaining information on the attributes data type, amount of stored data, obligation of their filling. The response to the characteristic question makes it possible to determine candidates for primary keys, allows to form a set of domains D, a set of attributes types TA and their data types TD.

Table 1 provides an abbreviated pattern of the main control questions that are used to form a usersystem dialogue. In addition to describe attributes and objects by the user, the system has a mechanism for adding key attributes, references to external entities in accordance with the data normalization rules in automatic mode. Key attributes are added depending on the answers to questions 1Qchar, 2Qchar. If no primary key candidates are found, a synthetic primary key is added to the table according to the NF2. If the answers to the questions 1Qclass = = yes & 4Qclass = = repeating value for ith attribute, the system will change the attribute name to the foreign key to external entity.

In the NLIDB system, users can choose the level of abstraction for the described data domain:

• list of data domain business rules. Business rules are the verbal definitions of database objects with their attributes, relationships and constraints.

•

database diagram. In this case, the user is given a database diagram on the screen, presented in the form of a set of tables with shown relations.

• generated SQL queries will be presented in the form of text with the ability to save as the MSSQLServer query file.

•

.mdf database file created by MSSQLServer. The entry-level of data domain abstraction is the set of business rules formed based on the users' answers to the questions presented in the Table 1. Converting acquired information into business rules, and demonstrating these business rules to the user, makes the users' interaction with the system more convenient and understandable, because it is possible to analyze business rules of the data domain and if there are some mistakes or inaccuracies user can return to the dialogue with system and make the corrections. After verifying the correctness of the generated business rules, the user can request a database diagram from the system. The system outputs the database diagram in order to allow the user to make sure that the data domain description is correct. At this stage, the database diagram is not created by DMS, but rendered on the form as a graphical element. The business rules list and the database diagram are the logical level database model. After that, the system generates queries and creates a physical database model.

NLIDB system implementation

The C # language is selected as the programming language. The system is designed in accordance with an object-oriented approach. All dialogue data will be temporary stored in the format of class instances. The SQL language and database access tool will be used to interact with the DMS and execute queries by means of the ADO.NET technology. For the interface part of the system, WPF application will be used as a flexible and universal way to create a convenient simple interface. The system generates SQL queries, as well as a database file. The MSSQLServer DMS is used for this purpose. Although the user can use the created queries in various DMSs that support relational approaches and can use SQL.

Architecture of the NLIDB system classes

The following classes are used to temporarily storage of the database structure:

•

Attribute class, that contains information about the attribute name, its data type, size, and other properties. The attribute class has a method returns string with information about the attribute to construct database queries.

•

Inherited Key attribute and Calculated attribute classes. These classes contain specific attributes and methods and provide the ability to store primary and foreign keys, as well as to store the calculated attributes.

•

Database class for working with the entire database. The class contains information about database name and the value of the connection string. The class diagram of the NLIDB system is shown in Figure 2. Information stored as instances of classes is used to create business rules, construct queries and create a database file. For example, in the last phase of the users' interaction with the system it sends queries for the DMS to create a database. An example of a database table creation function for already existed database is as follows:

1. Create a query string. To do this, use the word "use" and add the corresponding name of the database. Then add the statement "create table". Extract the name of the table from the user inputted data stored as a Tables class instance. Extract the data about attribute from the list of attributes. Complete the "create table" query statement.

2. Open the connection to the database. 3. Execute the command created in the first step. 4. Output a message to the user that the table has been created.

NLIDB system user interface

The application interface is built by means of the WPF graphics subsystem from .NET, which uses the XAML language. It is used to create GUI modules.

Since most of the system is transferred to classes, the application has one main window with different tabs. The transition between tabs is implemented by using additional buttons, based on the algorithm of interaction with the user. That is, the transition to the next step of the user interaction algorithm occurs only after filling the data in the fields, and their storage for further work with them. Direct navigation between tabs as provided in WPF is blocked to prevent unexpected pathway of the users.

Test example

To demonstrate the work of the developed NLIDB system, it is proposed to consider the result of the development of a database fragment for storing invoice document. Invoice is a primary document used in the transfer of inventory from one person/ company to another. The invoice document was selected as test data because:

• it is a composite entity that contains and references external entities; • contains data to be normalized, for example: address and phone attributes; • contains attributes of different types: static (customer name, supplier name, good name, etc.), dynamic (date of creation number, amount, etc.), calculated (tax, subtotal, total), keys (references for the external data domain entities).

In order to present the object model of the invoice document the entity-relation diagram of an invoice document was designed. The developed entity relation diagram is shown in Figure 3 and used to assess the adequacy of the data obtained from the system.

Figure 3: Entity relation diagram for an invoice document

The structure of the document is complex, so the dialogue with user will have several iterations with control and clarifying questions. It allows to test the pattern fully and check the adequacy of the system in a whole.

To generate data domain business rules, a template and algorithm converting user responses to a domain business rule were developed and used. The business rule is a brief, structural formalization of the data domain, presented verbally in natural language. The business rule template used verbs:

• has a -to describe a situation when an entity contains or references another entity. Such a verb is used if the question 1Qclass = = yes & 4Qclass = = repeating value. The answer to the question 2Qclass makes it possible to determine the type of relationship between entities: composition or aggregation. Composition is a situation when external entity cannot be performed as independent part of the data domain. Aggregation demonstrates a possibility to refer and use attributes of the external entity.

• contain -to describe all the attributes that characterize the current entity. In this case business rules are formed using answers to questions 2Qcon, 3Qcon, 5Qcon , 6Qcon, 7Qcon.

• is calculated as -to describe characteristics that can be defined based on other attributes by using specified calculation formulas. The answer to the question 3Qclass = = yes, makes it possible to form business rules with such a verb. Table 2 shows the list of business rules generated during the users-system dialogue for describing the Invoice document. Invoice product is described by <Product, Amount, Price> Attribute

The generated list of business rules provides a possible to obtain information about the structure of the data domain as the entry level of abstraction on the way to the database. If desired, the list can be saved and imported. There is no feedback in interaction with the list of business rules. If users need to make changes to the rules, it is necessary to update the entered information through the main dialogue with the system.

If the user agrees with the presented description of business rules, he can proceed to the graphical display of the future database. The NLIDB system implements TextBlok objects that are created and used to map the data of each entity to database diagrams in a separate application window. Figure 4 shows an example of a database diagram created to display the structure of invoice document in terms of the future database. This is not a database diagram generated by DMS tools, it is only a graphical representation of the database content, created by the NLIDB system to improve visibility and for the verification. After receiving information about data domain entities and attributes, the user is prompted to set values for each entity attribute. The system uses the Insert command to add values to the creating database. The entered values are also used by the system to check the correctness of the generated database structure or to check the presence of anomalies in the database. For example, the presence of multiple space characters in the values of a string data type can signal that the user has created a multivalued field. And the system will issue a clarifying question by using the pattern:

Attribute <name> + 5Qсlass. Does not it? If the user answers yes, the user is prompted to repeat the information entry process for each part of the multivalued attribute separately.

Another important clarifying question is the question with the following pattern: Attribute <name> + 1Qchar + 2Qchar. Does not it? Such a clarifying question will be generated by the system in the case when the validation mechanism finds repeated values in the field declared as a candidate for the primary key. In this case, the user will be given information about the inadmissibility of duplicating the value in the specified field, otherwise a synthetic primary key with an auto-increment mechanism will be added to the table instead of the selected candidate. The system also provides a mechanism for validating the values entered by the user in accordance with the selected data type.

Absence of system comments means the finish of the users-system dialogue. As a result, the user can save SQL commands or load the generated .mdf file. The system constructs the query to the MSSQLServer based on the CREATE, INSERT commands to create a database and its tables, as well as populate it with data if necessary. The choice of this particular DBMS is not a limitation for the system, since the generated SQL queries are universal and can be used to create a physical database model in different DMS.

If users answer the questions correctly, the system adequately generates results. The figure 4 shows a database diagram generated by the system to discuss the shortcomings of the system using a concrete example. The main barrier for the efficient use of the NLIDB system is a significant dependence on the user. Despite the implemented templates and validation mechanisms, the user may not understand the question or enter deliberately untrue data. In this case, the inaccuracies will be appeared in the results generated by the system. Such inaccuracies can be seen by comparing the entity relation diagram (Figure 3) and the constructed database diagram (Figure 4). The address and phone fields were marked by the user as indivisible attributes and only at the data validation stage, after the database population, the system identifies the problem of 1NF violation. Another example is a misunderstanding of the user when answering 6Qclas question. The incorrect user answer leads to the fact that the system does not optimize the structure of tables and in the database diagram, there is a subsets of attributes that can be brought into a separate table.

Among the advantages of proposed approach and the NLIDB system, the authors distinguish the following:

• unnecessary to have knowledge of the syntax SQL, due to the fact that the system operates only with natural language; • convenient, intuitive, simple interface; • possibility of obtaining an intermediate level of abstraction on the way to the database to make it possible to check and update inputted data;

• minimizing the amount of the entering data from the keyboard operations, by using short and clear instructions to the user through the question-answer dialogue.

Conclusion

NLIDB is a system that allows users to communicate with database by entering queries expressed in some natural language. Integrating the NLIDB as an abstraction layer between users and databases provides the main benefit of simplified natural language-based interaction. The purpose of the NLIDB is to enable users to compose questions in the natural language and to receive answers in the natural language as well. The authors propose to apply an NLIDB approach and develop the NLIDB system. In this case the aim of the user-system interaction is to acquire verbal knowledge about the data domain and transforming it into a corresponding database structure.

The developed NLIDB system allows to convert the description of the data domain presented in natural language to the commands for creating RDB. RDB is created in accordance with the data normalization rules. Users enter detailed information about the entities, attributes and relations in natural language, and template of controlling and clarifying questions allow to convert this description into components of SQL queries. In order to ensure the efficiency of the process of converting the data domain verbal description into SQL statements the intermediate-level abstractions are generated by the system. Intermediate-level abstractions are presented as a list of the data domain business rules, database diagram and SQL queries. The final result is the database physical model. Each abstraction can be evaluated and, if necessary, improved by user through the main dialogue with the system.

The presented approach, unlike the existing ones, allows reducing the threshold for users' knowledge in database theory, reducing the process of database development and increasing the efficiency of this process as a whole. Developed NLIDB system architecture, designed dialogue model with pattern for controlling and clarifying questions are displayed in the paper. The NLIDB system was implemented as desktop application with interface built by means of the WPF from .NET. ADO.NET technology and SQL are used to provide operations with database inside MSSQLServer MDS. The results of the NLIDB system are also described in the paper. As the test sample data invoice document was used. Generated results were compared with the entity relation diagram. Various system decision making results have been described to reflect the advantages and disadvantages of the proposed NLIDB system.

The scope of the following researches is to implement NLIDB module that propose users to interact with an already created database to construct queries in natural language and obtain the results of these queries execution.

Figure 1 :1Figure 1: Architecture of the NLIDB system

Figure 2 :2Figure 2: Class diagram of the NLIDB system

Figure 4 :4Figure 4: Database diagram generated by the NLIDB system

: • Table -a set of rows and columns filled with values; • Attribute -a named table column; • Domain -a set of acceptable values for one or several attributes; • Tuple -table row. In generalized form, the relational database model RBD can be presented in the form of tuple:

Table 11Pattern of the main control questions

TypeQuestionUser activity1QconWhat is the data domainЕnter2QconWhat is the data domain entity nameЕnter3QconWhat is the name of the entity attributeЕnter4QconDo you want to set a set of values?Choose: Yes/No5QconWhat is the set of values for the current attribute?Enter6QconDo you want to add another attribute of the selected entityChoose: Yes/No7QconDo you want to add another entityChoose: Yes/No1QclasWhether an entity consists of other entitiesChoose: Yes/No2QclasHow many instance of the shown entity belongs to theChoose: 1 / manyselected entity3QclasIs it possible to calculate the attribute value from values ofChoose: Yes/Noanother attribute?4QclasCan the same entity attribute values be repeatedChoose: Yes/No5QclasWhether the value of the attribute can be divided intoChoose: Yes/Noseparate parts6QclasWhether the same attribute is present in another entityChoose: Yes/No1QcharCan the attribute value be empty/missingChoose: Yes/No2QcharWhether the attribute values are unique/unrepeatedChoose: Yes/No3QcharWhether the entity characteristic is quantitative orChoose: Yes/Noindicative4QcharWhat data type can the selected attribute acquireChoose: numeric,String, DateTime5QcharCan a numeric value contain fractional valuesChoose: Yes/No6QcharWhat is the maximum number of characters can beЕntercontained7QcharWhich date format is suitable for attribute valuesСhoose DateTimeformat

•Table class, that contains the name of the table and a list of attributes of this table. The table class contains methods for interacting with the database. For example, methods that generate a query to create a table in the database.

Table 22Generated data domain business rules

RuleDescriptionTypeBR1Invoice has productsComposition 1:NBR2Invoice has a customerAggregation 1:1BR3Invoice has a supplierAggregation 1:1BR4Invoice product has a productAggregation 1:1BR5Invoice is described by <Number, Date, Customer, Supplier,AttributeList<Product>, Amount, Price, Subtotal, Tax, Total >BR6Subtotal is calculated as SUM(Amount*Price)Calculated AttributeBR7Tax is calculated as Subtotal*0,20Calculated AttributeBR8Total is calculated as Subtotal + TaxCalculated AttributeBR9Customer is described by <Name, Address, Phone, Director>AttributeBR10Supplier is described by <Name, Address, Phone, Director>AttributeBR11Product is described by <Name, Description, ISO>AttributeBR12

Introducting natural language interface to database for data-driven small and medium enterprises DRadovanovic 10.1007/978-3-658-32182-6_3 Data Science -Analytics and Applications PHaber TLampoltshammer MMayr KPlankensteiner

Wiesbaden

Springer -Vieweg 2021 Towards complex text-to-sql in cross-domain database with intermediate representation JGuo ZZhan YGao YXiao JLou TLiu DZhang 10.18653/v1/P19-1444 Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, AMACL '19, Association for Computational Linguistics the 57th Annual Meeting of the Association for Computational Linguistics, AMACL '19, Association for Computational Linguistics

Florence, Italy

2019 Sqlnet: Generating structured queries from natural language without reinforcement learning XXu CLiu DSong 10.48550/arXiv.1711.04436 Commutating and language 11 2017 Natural Language Interfaces to Data AQuamar VEfthymiou CLei FÖzcan 10.1561/1900000078 Foundations and trends in databases 11 2022 Neural approaches for natural language interfaces to databases: a survey RCIacob FBrad ESApostol CO I ATruică THosu Rebedea 10.18653/v1/2020.coling-main.34 Proceedings of the 28th International Conference on Computational Linguistics, ICCL '20, International Committee on Computational Linguistics the 28th International Conference on Computational Linguistics, ICCL '20, International Committee on Computational Linguistics

Barcelona, Spain

2020 Natural language question/answering: let users talk with the knowledge graph WZheng HCheng LZou JYu KZhao 10.1145/3132847.3132977 Proceedings of the Conference on Information and Knowledge Management, CIKM '17 the Conference on Information and Knowledge Management, CIKM '17

Singapore, Singapore

Association for Computing Machinery 2017 ATHENA: an ontology-driven system for natural language querying over relational data stores DSaha AFloratou KSankaranarayanan FMinhas ARMittal FÖzcan 10.14778/2994509.2994536 VLDB Endowment 9 12 2016 An approach and software prototype for translation of natural language business rules into database structure AMKopp DLOrlovskyi SVOrekhov Proceedings of the 5th Internation Conference on Computational Linguistics and Intelligent Systems, COLINS '21, CEUR WS the 5th Internation Conference on Computational Linguistics and Intelligent Systems, COLINS '21, CEUR WS

Lviv, Ukraine

2021 Sparklis: an expressive query builder for SPARQL endpoints with guidance in natural language AFerre 10.3233/SW-150208 Semantic web 8 3 2017 A Natural Language Interface to Relational Databases Using an Online Analytic Processing Hypercube FHHazboun MOwda AYOwda 10.3390/ai2040043 Artificial Intelligence 2 4 providing natural language interface to database using artificial intelligence AHKumar ARKunar PHarshitha DNSachin International journal of scientific and technology research 8 2019 Rule based domain specific semantic analysis for natural language interface for database PAnand ZFarooqui 10.5120/ijca2017913502 International journal of computer applications 164 2017 Natural Language Interface to database for data retrieval and processing CTapsai PMeesad 10.14416/j.asep.2020.05.003 Applied science and engineering progress 14 2021 Modelling and Designing of NLP Interface to Database for Afaan Oromoo SWalelign DTesfaye TKebebew 10.21276/ijircst.2021.9.2.2 International Journal of Innovative Research in Computer Science & Technology 9 2021 A Transfer-Learnable Natural Language Interface for Databases WWang YTian HXiong HWang W.-SKu 10.48550/arXiv.1809.02649 Computer Science 1 2018 Pattern based approach for natural language interface to database NChoudhary SGore International journal of engineering research and applications 5 1 2015 A natural language interface for database: achieving transfer-learnability using adversarial method for question understanding WWang YTian HWang WKu 10.1109/ICDE48307.2020.00016 Proceedings of the 36th International Conference on Data Engineering, ICDE '20 the 36th International Conference on Data Engineering, ICDE '20

Dalas, USA

IEE ICDE 2020 An approach for building natural language database using pattern matching technique FNartey FAmavi INyameamah 10.5120/ijca2019919720 International Journal of Computer Applications 177 2019 Natural language interface to relational database (NLI-RDB) trought object relational mapping (ORM) AAlghamda MOwda KCrockett 10.1007/978-3-319-46562-3_29 Advances in Computational Intelligence Systems 1 2017 FHazboun MOwda A YOwda settings Order Article Reprints A Comparative Survey of Recent Natural Language Interfaces for Databases KAffolter KStockinger ABernstein 10.1007/s00778-019-00567-8 Very large databases 28 2019