<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Eastern-European Journal of Enterprise Technologies</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/CSIT49958.2020.9322036</article-id>
      <title-group>
        <article-title>Method of Multi-purpose Text Analysis Based on a Combination of Knowledge Bases for Intelligent Chatbot</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrii Yarovyi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmytro Kudriavtsev</string-name>
          <email>dmytro.kudriavtsev@vntu.edu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vinnytsia National Technical University</institution>
          ,
          <addr-line>Khmelnytsky highway 95, Vinnytsia, 21021</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>1</volume>
      <issue>2</issue>
      <fpage>65</fpage>
      <lpage>75</lpage>
      <abstract>
        <p>The issues related to text message recognition for an intelligent chatbot information system were analyzed. The problem of multi-purpose text message recognition was identified and resolved, and opportunities for using multiple knowledge bases were presented. Experiments on combinations of knowledge bases from different subject areas were completed and are shown, along with a table comparing the effectiveness of the proposed approach with its analogs. Keywords: text processing, semantic text analysis, chatbot, terminological knowledge base. An intelligent chatbot is an intelligent information system (IIS), mainly in the form of a program with the ability to process incoming data and provide the necessary information on its basis. Its main area of application is support and assistance of users, mainly within one or more selected subject areas. The ability to use elements of artificial intelligence has led to the rapid development of information processing algorithms in chatbot systems. The processing of input data in such an IIS is usually implemented in the form of specially designed algorithms, machine learning technology, neural networks, and deep learning [1]. While providing user support during an IIS session, the chatbot may face certain usage restrictions that arise when the chat messenger is embedded in a more global IIS. Such chatbots require reliable and highly skilled processing of input data, which is provided by data mining tools, Big Data processing algorithms, and monitoring of the processing itself in order to improve analysis and reduce error [2]. Applying all of the above-mentioned facilities and functions, most modern chatbots have high functional potential and recognize up to 99% of incoming information in preselected subject areas. If the chatbot is considered as a fully autonomous intelligent information system, the problem becomes applying it to a wide range of tasks and ensuring compatibility with third-party software.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Among the features of such an IIS is the ability to automatically update the repository with relevant data. Based on these features, the operation of such an IIS requires detailed analysis and the use of reliable data processing tools [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Theoretical preparation</title>
      <p>The functionality of a chatbot may include support for the user of the IIS, the capabilities of third-party services used by the chatbot, as well as intellectual analysis of the user's actions. User support is the exchange of information packets between the user and the IIS in the form of text, image, audio, and video messages.</p>
      <p>Each of these types of information packets must have its own mechanism for analyzing the incoming data of the IIS and must be independent of the analysis of other data types in the created IIS. In accordance with the minimum requirements for the functioning of an IIS chatbot, the system must adequately and timely recognize the type of information packet coming from the user.</p>
      <p>Another important criterion for chatbots is the information environment in which the IIS chatbot works. By this criterion, they are divided into open chatbots, which allow the IIS to be used by several users at the same time, and closed chatbots, which can be used by only one user, regardless of the duration of the session.</p>
      <p>The possibility of simultaneous operation of the IIS with several users is at the same time the most difficult stage of implementing an IIS chatbot. It usually relies on parallel computing and distributed systems technologies. There is no session format for this type of chatbot: the duration of the user's interaction with the IIS is usually estimated from the length of the pauses between the user's inputs. By monitoring the processes of information exchange, reports and metadata are formed. Analyzing these reports makes it possible to draw conclusions about the level of information processing, about uncertain situations, when the relevance of the most relevant information in the IIS data storage does not exceed the permissible threshold, and about other aspects of the IIS chatbot's operation.</p>
      <p>Uncertain situations arise in the case of insufficient storage or its low informational value, as well as in cases where the type of incoming information is incompatible with the identification capabilities of the selected IIS chatbot. However, the most frequent cause of uncertain situations is a substantial difference between the input information and the terminology database of the chosen subject area or areas. Observing these cases, one can conclude that the implementation of additional functions is necessary.</p>
      <p>Also, the structure of the stored data must be the same for each subject area to allow fast search with minimal changes to the data retrieval logic. The data format needs to be simple and clearly understandable; the most suitable structures are the hash table and the key-value dictionary. The implementation part is related to end-user development, and attention should be paid to the size of each table in the terminology base. Several examples of terminological knowledge bases for different subject areas are presented in the experiments section.</p>
      <p>When several terminological databases of optionally related subject areas are used, a threshold of sensitivity needs to be introduced. When multiple data stores are queried at the same time to select the best result, each repository provides its own best answer with a corresponding recognition factor. In this case, a data storage response that is not related to the subject of the session should be rejected as inappropriate. The threshold is introduced precisely for this purpose: to reject such answers and to avoid such situations in general. The results of adding the threshold of sensitivity are presented in Table 1.</p>
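      <p>The selection logic described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the knowledge base names, answers, and recognition factors are invented for the example, and only the 20% threshold value comes from the experiments.</p>

```python
# Sketch: selecting the best answer across several terminological knowledge
# bases, rejecting candidates whose recognition factor falls below the
# sensitivity threshold. All names and scores below are illustrative.

SENSITIVITY_THRESHOLD = 0.20  # the paper's experiments favour ~20%

def select_answer(candidates, threshold=SENSITIVITY_THRESHOLD):
    """candidates: list of (knowledge_base, answer, recognition_factor)."""
    # Reject answers whose recognition factor is below the threshold.
    accepted = [c for c in candidates if c[2] >= threshold]
    if not accepted:
        return None  # uncertain situation: no knowledge base is confident
    # Otherwise pick the single best-scoring answer.
    return max(accepted, key=lambda c: c[2])

candidates = [
    ("cars",    "A bus is a large road vehicle...", 0.61),
    ("cooking", "Baking requires an oven...",       0.08),
]
best = select_answer(candidates)
```

      <p>An off-topic answer (here, the "cooking" candidate at 0.08) never reaches the comparison step, which is exactly the rejection behavior the threshold is meant to provide.</p>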
      <p>For the input data, several data sources with text sentences were used; the average size of each sentence is 15-17 words. The number of sentences for this comparison is 18536, and the total number of keywords is 62737. Correctness is calculated as</p>
      <p>Correctness = 2 − keywords / (keywords − phantom keywords), if keywords / (keywords − phantom keywords) &gt; 1; else Correctness = keywords / (keywords − phantom keywords)</p>
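      <p>The correctness measure can be computed directly from the two counts. The function below is a sketch of that formula; the 5% phantom rate used in the example is a hypothetical value, not a result from the paper.</p>

```python
# Sketch: the correctness measure, computed from the total keyword count
# and the number of "phantom" keywords (words accepted as keywords that
# do not belong to the semantic kernel).

def correctness(keywords: int, phantom_keywords: int) -> float:
    ratio = keywords / (keywords - phantom_keywords)
    # The ratio exceeds 1 whenever phantom keywords are present; folding it
    # back with 2 - ratio keeps correctness at 1.0 for a perfect extraction
    # and lowers it as phantom keywords accumulate.
    return 2 - ratio if ratio > 1 else ratio

# With the paper's total of 62737 keywords and a hypothetical 5% phantom rate:
c = correctness(62737, 3137)
```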
      <p>
        As presented in Table 1, the best threshold of sensitivity is near 20%. Phantom keywords belong to the group of words that are not part of the semantic kernel but are accepted as keywords by a standard semantic analysis algorithm. The semantic kernel in this case is calculated by the improved semantic text analysis [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. If a threshold below 20 percent is used, the frequency threshold for semantic text analysis must not depend on the threshold of sensitivity; in that case the correctness is too small for such a threshold to be selected.
      </p>
      <p>
        When the terminological knowledge bases of two subject areas were used instead of one, the accuracy of recognition was slightly reduced, by an average of 2-3%, with a threshold value of 20% for correspondence of information. In previous research, three terminological knowledge bases were also used, and situations were found in which information belonged to more than one terminological knowledge base, which in turn caused uncertainty about the topic of the dialogue [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In this case, the accuracy of recognition of the user's textual information decreased by 8-10% in the worst case, when the subject areas had significant similarity in their terminological knowledge bases. The datasets used in this research were taken from the Kaggle informational resource [5-7]. Due to the decrease in recognition accuracy, a problem arose of checking the quality of a terminological knowledge base and its similarity to the others used in the chatbot IIS. Additionally, it was found that the total processing time of input information requires detailed analysis and optimization if more than two terminological knowledge bases are used.
      </p>
      <p>The main purpose of this work is to continue the research on the use of multiple subject areas using a recurrent neural network, as well as to resolve the issues found and to compare the results of using different methods of text analysis.</p>
      <p>The high-level structure of the chatbot as a prototype of multi-purpose text analysis is presented below in Figure 2.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Main research</title>
      <p>Comparing the development of natural language processing with the development of artificial intelligence, it is worth emphasizing the affinity of the two directions and the direct application of artificial intelligence to the semantic parsing of text information from a stream in order to extract important elements (constructions, terms). The selection of intellectually valuable elements from the information stream is performed by searching the information for structures already identified by the repository as constructs belonging to the terminology database of the chosen subject area or areas. The described search is carried out by applying intellectual means of information analysis. For the IIS chatbot, a neural network was chosen as the tool for the intellectual analysis of natural language. For this problem, the two most common approaches were considered. The first is classical machine learning algorithms, whose feature is minimal preprocessing of the data before use [8]. The disadvantage of this approach is the complexity of learning and reduced effectiveness if the source data change in the future. The second is recurrent neural networks, which belong to the class of deep neural networks and whose tasks include recognition of natural text and speech. Unlike classical machine learning algorithms, the connections between nodes in a recurrent neural network form a directed cycle [8, 9]. This creates an internal state of the network, which allows it to exhibit dynamic behavior in time, on the basis of which internal memory is formed. Due to this internal memory, a recurrent neural network is resistant to dynamic changes of the input information stream and can process arbitrary input sequences [10-12]. Focusing on the problem of natural language recognition, the recurrent neural network was selected.</p>
      <p>The recurrent neural network (RNN) has undergone many modifications in the period from its creation to the present, leading to a whole family of neural network variants for different tasks. The most well-known RNNs are the Elman and Jordan networks, the echo-state network, the network using the long short-term memory (LSTM) method, the bidirectional RNN, the continuous-time RNN, the hierarchical RNN, the second-order RNN, the multiple-timescale RNN, and neural Turing machines. Among these modifications, the greatest attention was focused on the long short-term memory method, which has proved itself best in natural language recognition tasks. This method was developed and published in 1997 by Hochreiter and Schmidhuber [13]. It avoids the vanishing gradient problem and prevents the disappearance and spurious growth of back-propagated errors. This is due to backpropagation through an unlimited number of layers unrolled in time. Based on this propagation through an arbitrary number of layers, an RNN using the long short-term memory method can withstand time gaps and process text information of any length. Thanks to these capabilities, its widest use has been in the field of natural language recognition, where this RNN model began to outperform traditional recognition models. Examples of its effectiveness are its use by the search giant Baidu since 2014 and its use in Google Android and Google Voice Search [14, 15]. Thanks to the power of this RNN, it is successfully used to recognize context-sensitive languages, to model languages, and for multilingual language processing.</p>
      <p>Considering the RNN model using the LSTM method as a modification of the usual RNN, LSTM elements serve as its nodes, each representing a simple RNN cell with the possibility of error backpropagation.</p>
      <p>The main advantages of using the LSTM method for an RNN:
• It withstands time gaps
• It does not depend on the text length
• It has a simple structure</p>
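      <p>To make the role of the cell state concrete, the following is a minimal sketch of a single LSTM cell step in the standard Hochreiter-Schmidhuber formulation. The scalar toy weights are invented for illustration; they are not the paper's trained parameters.</p>

```python
import math

# Sketch of one LSTM cell step (standard formulation). Toy scalar weights
# stand in for trained weight matrices; values are illustrative only.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w=0.5, u=0.5, b=0.0):
    """One step of a 1-dimensional LSTM cell with shared toy weights."""
    f = sigmoid(w * x + u * h_prev + b)          # forget gate
    i = sigmoid(w * x + u * h_prev + b)          # input gate
    o = sigmoid(w * x + u * h_prev + b)          # output gate
    c_tilde = math.tanh(w * x + u * h_prev + b)  # candidate cell state
    c = f * c_prev + i * c_tilde                 # cell state: long-term memory
    h = o * math.tanh(c)                         # hidden state (output)
    return h, c

# The cell state c carries information across arbitrary time gaps,
# which is why the length of the input sequence does not matter.
h, c = 0.0, 0.0
for token_value in [0.2, -0.1, 0.7]:
    h, c = lstm_step(token_value, h, c)
```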
      <p>For the IIS chatbot experiments, this RNN was configured with the following characteristics: the number of inputs is equal to the size of the word vector used for analysis (inputs of the input layer); the number of outputs is equal to the number of terminology databases used (neurons of the last layer); and there are 3 hidden layers of 1024 neurons each. The activation function of the neurons is the sigmoid, and the number of epochs is individual for each experiment; as shown in the experiments section, the average number of training epochs is about 16-18 thousand.</p>
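      <p>The architecture just described can be sketched as a plain forward pass. This is only an untrained illustration of the layer sizing: the word vector size is an assumed value, the hidden layers are reduced from the paper's 1024 neurons so the sketch runs instantly, and the weights are random rather than trained.</p>

```python
import math
import random

# Sketch of the described architecture: input layer sized to the word
# vector, three sigmoid hidden layers (1024 neurons each in the paper,
# reduced here), and one output neuron per terminology database.

random.seed(0)

def dense(inputs, n_out):
    """One fully connected sigmoid layer with random (untrained) weights."""
    layer = []
    for _ in range(n_out):
        weights = [random.uniform(-0.1, 0.1) for _ in inputs]
        z = sum(w * x for w, x in zip(weights, inputs))
        layer.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid activation
    return layer

WORD_VECTOR_SIZE = 32    # size of the word vector (assumed value)
HIDDEN = 64              # paper: 1024 neurons per hidden layer
N_KNOWLEDGE_BASES = 3    # one output per terminology database

x = [random.random() for _ in range(WORD_VECTOR_SIZE)]
for _ in range(3):                    # three hidden layers
    x = dense(x, HIDDEN)
scores = dense(x, N_KNOWLEDGE_BASES)
best_kb = scores.index(max(scores))   # knowledge base with the best score
```

      <p>Each output score plays the role of the recognition factor compared against the threshold of sensitivity discussed earlier.</p>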
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>After performing the theoretical calculations, the results of the practical implementation were considered in the form of graphs and tables for different input data. For a detailed analysis of the results, more than 30 experiments were conducted on combined data from 5 subject areas and more than 20 terminological databases. Each of the terminological databases is presented in the figures below. A graph of the dependence of processing time on the number of subject areas is given in Figure 1. Part of the experiments is shown in sections 4.2-4.4. Each experiment contains several screens with diagrams and a table of input data and results.</p>
    </sec>
    <sec id="sec-5">
      <title>4.1. Terminological knowledge base structure</title>
      <p>The data is organized as a key-value dictionary with sub-levels of values, stored in a sharded, distributed, document-oriented database, MongoDB [17]. The performance of this database is not included as a parameter of the research experiments; to reproduce the results, another database may be used. Each terminological knowledge base includes up to five databases, which differ by the impact of the relations of the subject area. Each database includes 26 collections of words (1 collection for each letter of the English alphabet). The model of this storage is presented in Figure 4.</p>
      <p>In this structure one can quickly find the needed keywords for the selected words and build a tree-structured solution for the whole searched text, with all the cross links between keywords. The value of the analysis increases with every newfound relation between keywords of the parts of sentences. Duplication of keywords is avoided by organizing the words of the keyword set into small groups (up to ten related words, in order of decreasing compatibility).</p>
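      <p>The storage model above can be sketched with plain dictionaries standing in for MongoDB collections: one collection per English letter, each mapping a keyword to its small group of related words. The sample terms are invented for illustration and are not taken from the paper's datasets.</p>

```python
import string

# Sketch of the storage model: one dictionary per terminological database
# with 26 collections (one per English letter), each mapping a keyword to
# a small group of up to ten related words, ordered by compatibility.

def empty_database():
    return {letter: {} for letter in string.ascii_lowercase}

cars_db = empty_database()
cars_db["b"]["bus"] = ["coach", "shuttle", "minibus"]        # illustrative
cars_db["e"]["engine"] = ["motor", "turbine", "powertrain"]  # illustrative

def lookup(db, word):
    """Key-value search: the first letter selects the collection."""
    return db[word[0].lower()].get(word.lower())

related = lookup(cars_db, "Engine")
```

      <p>Because the first letter selects the collection directly, a lookup touches only a small fraction of the database, which is what makes the search fast regardless of the total number of terms.</p>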
    </sec>
    <sec id="sec-6">
      <title>4.2. Experiment (two knowledge bases)</title>
      <p>Two knowledge bases were used for this experiment: cars and cooking. Each knowledge base contains two databases, with up to 80000 terms in total for both knowledge bases. The number of epochs for training is 16396. Detailed information is presented in Table 2.</p>
      <p>Terminological databases used in Table 2: Buses, Cars, Motorcycle (cars); Baking, Salad, Vegetables (cooking).</p>
    </sec>
    <sec id="sec-7">
      <title>4.3. Experiment (three knowledge bases)</title>
      <p>Three knowledge bases were used for this experiment: cars, sports, and business. Each knowledge base contains three databases, with up to 130000 terms in total for all knowledge bases. The number of epochs for training is 17992. Detailed information is presented in Table 3.</p>
    </sec>
    <sec id="sec-8">
      <title>4.4. Experiment (four knowledge bases)</title>
      <p>Four knowledge bases were used for this experiment: cars, history, banking, and business. Each knowledge base contains three databases, with up to 215000 terms in total for all knowledge bases. The number of epochs for training is 18463. Detailed information is presented in Table 4.</p>
    </sec>
    <sec id="sec-9">
      <title>5. Comparison results</title>
      <p>The experiments with different combinations of subject areas and data sizes successfully demonstrate that the main goal of multi-purpose text analysis was reached; a comparison with similar methods of text analysis is presented below in Table 5.</p>
    </sec>
    <sec id="sec-10">
      <title>6. Conclusion</title>
      <p>During the research, the main goal, creating a method for multi-purpose text analysis, was reached. Text analysis across different subject areas can be used for checking spam messages or for scanning posts on social networks to find the necessary information. The main advantages of the multi-purpose text analysis method are finding keywords that differ between subject areas, the ability to analyze user preferences in the marketing sphere, analyzing the sense of the entered text, and providing better semantic text analysis. The experiments section showed that increasing the number of subject areas creates a search performance issue, and the best solution is to use up to four subject areas. The correctness level is 2-6 percent better than that of the popular Word2vec method.</p>
      <p>For the next research, we will make more experiments with subject areas and try to find the relation
between different groups of subject areas.</p>
    </sec>
    <sec id="sec-11">
      <title>7. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Serban</surname>
            <given-names>I. V.</given-names>
          </string-name>
          et al. “
          <article-title>A deep reinforcement learning chatbot</article-title>
          ”,
          <source>arXiv preprint arXiv:1709.02349</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Werder</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heckmann</surname>
            <given-names>C. S.</given-names>
          </string-name>
          <article-title>Ambidexterity in Information Systems Research: Overview of Conceptualizations, Antecedents, and Outcomes</article-title>
          ,
          <source>Journal of Information Technology Theory and Application</source>
          . (
          <year>2019</year>
          )
          <volume>20</volume>
          , no. 1.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Pearlson</surname>
            <given-names>K. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saunders</surname>
            <given-names>C. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Galletta</surname>
            <given-names>D. F.</given-names>
          </string-name>
          <article-title>Managing and using information systems: A strategic approach</article-title>
          . John Wiley &amp; Sons,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Andrii</given-names>
            <surname>Yarovyi</surname>
          </string-name>
          , Dmytro Kudriavtsev, Serhii Baraban, Volodymyr Ozeranskyi, Liudmyla Krylyk, Andrzej Smolarz, and Gayni Karnakova, “
          <article-title>Information technology in creating intelligent chatbots</article-title>
          ”
          ,
          <source>Proc. SPIE 11176</source>
          , Photonics Applications in Astronomy, Communications, Industry, and
          <source>High-Energy Physics Experiments</source>
          <year>2019</year>
          ,
          <volume>1117627</volume>
          (
          <issue>6 November 2019</issue>
          ); https://doi.org/10.1117/12.2537415.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>