Method of Multi-purpose Text Analysis Based on a Combination
of Knowledge Bases for Intelligent Chatbot
Andrii Yarovyi and Dmytro Kudriavtsev
Vinnytsia National Technical University, Khmelnytsky highway 95, Vinnytsia, 21021, Ukraine


                 Abstract
                 The issues related to text message recognition in an intelligent chatbot information system
                 are analyzed. The problem of multi-purpose text message recognition is identified and resolved.
                 Opportunities for using multiple knowledge bases are presented. Experiments on combinations
                 of knowledge bases from different subject areas were carried out, and their results are shown.
                 A table comparing the effectiveness of the method with analogous approaches is presented.

                 Keywords
                 Text processing, semantic text analysis, chatbot, terminological knowledge base, intelligent
                 systems.

1. Introduction
    A chatbot is an intelligent information system (IIS), usually implemented as a program with the
ability to process incoming data and provide the necessary information on its basis. Its main area of
application is the support and assistance of users within one or more selected subject areas. The ability
to use elements of artificial intelligence has led to the rapid development of algorithms for processing
information in chatbot systems. Input data processing in an IIS is usually implemented in the form of
specially designed algorithms, machine learning technology, neural networks, and deep learning [1].
When providing user support during an IIS session, the chatbot may face certain usage restrictions that
arise when the chat messenger is embedded in a more global IIS. Such chatbots require reliable and
highly skilled processing of input data, which is ensured by data mining tools, Big Data processing
algorithms, and monitoring of the processing itself to improve analysis and reduce errors [2]. Using all
of the above facilities and functions, most modern chatbots have high functional potential and recognize
up to 99% of incoming information in preselected subject areas. If the chatbot is considered as a fully
autonomous intelligent information system, the problem becomes its applicability to a wide range of
tasks and its compatibility with third-party software.
    When choosing the type of chatbot for a more global IIS, the focus is on the subject area the chatbot
should understand. Accordingly, the degree of understanding and the complexity of implementing the
data storage are directly proportional to the size of the terminology base of the subject area. When
several subject areas are combined, or an interdisciplinary domain is chosen, the degree of understanding
falls significantly and the complexity of keeping the chatbot's data storage up to date increases. Using
several repositories simultaneously or in parallel is a promising solution to this problem, since it divides
the terminology database into smaller aggregates, which are easier to maintain and to match against the
input requests of the IIS user [3]. Despite these benefits, the key disadvantage of this method is the
increased complexity of processing and providing information to the user due to possible conflicts
between data storages. For a stand-alone IIS chatbot, its characteristics are largely independent of other
IISs and of the


COLINS-2021: 5th International Conference on Computational Linguistics and Intelligent Systems, April 22–23, 2021, Kharkiv, Ukraine
EMAIL: a.yarovyy@vntu.edu.ua (A. Yarovyi); dmytro.kudriavtsev@vntu.edu.ua (D. Kudriavtsev);
ORCID: 0000-0002-6668-2425 (A. Yarovyi); 0000-0001-7116-7869 (D. Kudriavtsev);
              © 2021 Copyright for this paper by its authors.
              Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
              CEUR Workshop Proceedings (CEUR-WS.org)
ability to automatically update the repository with relevant data. Given these features, the processing in
such an IIS requires detailed analysis and the use of reliable data processing tools [4].

2. Theoretical background
    The functionality of a chatbot may include support of the IIS user, the capabilities of third-party
services used by the chatbot, and intellectual analysis of the user's actions. User support is the exchange
of information packets between the user and the IIS in the form of text, image, audio, and video
messages.
    Each of these packet types must have its own facility for analyzing the incoming IIS data and must
be independent of the analysis of other data types in the created IIS. In accordance with the minimum
requirements for the functioning of an IIS chatbot, the system must adequately and timely recognize the
type of information packet coming from the user.
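    As an illustration of this requirement, a minimal sketch of packet-type dispatch is given below; the
handler and type names are hypothetical and not taken from the paper.

```python
# Hypothetical dispatcher: route an incoming packet to the analyzer for its
# media type, keeping the analyzers independent of one another.
from typing import Callable, Dict

def analyze_text(payload: bytes) -> str:
    return "text analysis result"    # placeholder for the text analyzer

def analyze_image(payload: bytes) -> str:
    return "image analysis result"   # placeholder for the image analyzer

ANALYZERS: Dict[str, Callable[[bytes], str]] = {
    "text/plain": analyze_text,
    "image/png": analyze_image,
    # audio and video analyzers would be registered the same way
}

def dispatch(mime_type: str, payload: bytes) -> str:
    analyzer = ANALYZERS.get(mime_type)
    if analyzer is None:
        raise ValueError(f"unsupported packet type: {mime_type}")
    return analyzer(payload)
```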
    Another important criterion for a chatbot is the information environment in which the IIS chatbot
works. By this criterion, chatbots are divided into open ones, which can be used by several users at the
same time, and closed ones, which can be used by only one user, regardless of the session duration.
    The possibility of simultaneous operation of the IIS with several users is at the same time the most
difficult stage of implementing an IIS chatbot. It usually relies on parallel computing and distributed
systems technologies. There is no session format for this type of chatbot: the duration of the user's work
with the IIS is usually estimated from the lengths of the pauses between the user's inputs, as sketched
below. By monitoring the information exchange processes, reports and metadata are formed. Analyzing
these reports, one can draw conclusions about the level of information processing, about uncertain
situations, when the relevance of the best matching information in the IIS data storage does not exceed
the permissible threshold, and about other aspects of the IIS chatbot's work.
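    A minimal sketch of such pause-based session segmentation is shown below; the paper does not
specify a timeout value, so the one used here is an assumption.

```python
# Sketch: group message timestamps into sessions by inactivity pauses.
from typing import List

SESSION_TIMEOUT_S = 600.0  # assumed 10-minute inactivity threshold

def split_sessions(timestamps: List[float]) -> List[List[float]]:
    """Split ascending message timestamps (seconds) into sessions."""
    sessions: List[List[float]] = []
    for t in timestamps:
        if sessions and t - sessions[-1][-1] <= SESSION_TIMEOUT_S:
            sessions[-1].append(t)   # same session: pause within the timeout
        else:
            sessions.append([t])     # long pause: a new session starts
    return sessions
```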
    Uncertain situations arise in case of insufficient storage content or its low informational value, as
well as in cases where the type of incoming information is incompatible with the recognition capabilities
of the selected IIS chatbot. The most frequent cause of uncertain situations, however, is an essential
difference between the input information and the terminology database of the chosen subject area or
areas. Observing these cases, one can conclude that the implementation of additional functions is
necessary.
    Also, the structure of the stored data must be the same for each subject area to allow fast search with
minimal changes to the data retrieval logic. The data format needs to be simple and clearly
understandable; the most suitable structures are the hash table and the key-value dictionary, as illustrated
below. The implementation details are left to end-user development; attention should be paid to the size
of each table in the terminology base. Several examples of terminology knowledge bases for different
subject areas are presented in the experiments section.
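    A minimal sketch of such a key-value structure is shown below; the field names and sample terms
are illustrative only.

```python
# Illustrative in-memory form of a terminology knowledge base entry:
# a key-value dictionary mapping a term to its related keyword group.
from typing import Optional

terminology_kb = {
    "engine": {
        "subject_area": "cars",
        # up to ten related words, in decreasing order of compatibility
        "related": ["motor", "cylinder", "piston", "turbine"],
    },
    "gearbox": {
        "subject_area": "cars",
        "related": ["transmission", "clutch", "shaft"],
    },
}

def lookup(term: str) -> Optional[dict]:
    """O(1) average-case lookup, as with a hash table."""
    return terminology_kb.get(term)
```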
    When several terminological databases of optionally related subject areas are used, a sensitivity
threshold needs to be introduced. When multiple data stores are queried at the same time to select the
best result, each repository provides its own best answer with a corresponding recognition factor. In this
case, a data storage response that is not related to the subject of the session should be rejected as
inappropriate. The threshold is introduced precisely for this purpose: to reject such answers and to avoid
such situations in general. The results of adding the sensitivity threshold are presented in Table 1.
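    A minimal sketch of this selection rule is given below, assuming each repository returns its best
answer with a recognition factor in [0, 1]; the function and variable names are hypothetical.

```python
# Sketch: choose the best repository answer, rejecting answers whose
# recognition factor falls below the sensitivity threshold.
from typing import List, Optional, Tuple

SENSITIVITY_THRESHOLD = 0.20  # 20%, the best value found in Table 1

def select_answer(candidates: List[Tuple[str, float]]) -> Optional[str]:
    """candidates: (answer, recognition_factor) pairs, one per repository."""
    accepted = [(a, f) for a, f in candidates if f >= SENSITIVITY_THRESHOLD]
    if not accepted:
        return None  # uncertain situation: no repository is confident enough
    return max(accepted, key=lambda af: af[1])[0]
```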
    As input data, several data sources with text sentences were used; the average size of each sentence
is 15-17 words. The number of sentences in this comparison is 18536, and the total number of keywords
is 62737. Correctness is calculated as

$$C = \begin{cases} 2 - \dfrac{\mathit{Keywords} - \mathit{Phantom}}{\mathit{Total\ keywords}}, & \text{if } \dfrac{\mathit{Keywords} - \mathit{Phantom}}{\mathit{Total\ keywords}} > 1 \\[6pt] \dfrac{\mathit{Keywords} - \mathit{Phantom}}{\mathit{Total\ keywords}}, & \text{otherwise} \end{cases} \qquad (1)$$
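    A direct transcription of Eq. (1) is given below; it reproduces, for example, the 20% row of Table 1.

```python
# Correctness per Eq. (1): the ratio is mirrored around 1, so that
# over-detection (ratio > 1) is penalized symmetrically.
def correctness(keywords: int, phantoms: int, total_keywords: int) -> float:
    ratio = (keywords - phantoms) / total_keywords
    c = 2 - ratio if ratio > 1 else ratio
    return 100 * c  # percent, as reported in Table 1

# Example (20% threshold row of Table 1):
# correctness(56343, 259, 62737) -> 89.396...
```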
Table 1
Influence of adding the sensitivity threshold
       Threshold, %        Keywords found, N          Phantoms found, N            Correctness, %
             5                 120528                       8920                        22.101
            10                  94877                       3105                        53.719
            15                  81059                       1007                        72.401
            20                  56343                        259                        89.396
            25                  53952                        227                        85.635
            35                  49013                        193                        77.817

    As presented in Table 1, the best sensitivity threshold is near 20%. Phantom keywords are words
that do not belong to the semantic kernel but are accepted as keywords by the standard semantic analysis
algorithm; the semantic kernel in this case is calculated by the improved semantic text analysis [4]. If a
threshold below 20 percent is used, the frequency threshold of the semantic text analysis must not depend
on the sensitivity threshold; in this case, the correctness is too small for such a threshold to be selected.
    When the terminological knowledge bases of two subject areas were used instead of one, the
accuracy of recognizing user input decreased slightly, by 2-3% on average, with a threshold value of
20% for correspondence of information. In the previous research, three terminological knowledge bases
were also used, and situations were found where information belonged to two terminological knowledge
bases at once, which in turn made the topic of the dialogue uncertain [4]. In that case, the accuracy of
recognizing the user's textual information decreased by 8-10% in the worst case, when the subject areas
had significant similarity in their terminological knowledge bases. The datasets used in this research
were taken from the Kaggle informational resource [5-7]. Due to the decrease in recognition accuracy,
the problem arose of checking the quality of a terminological knowledge base and its similarity to the
others used in the chatbot IIS. Additionally, it was found that the total processing time of input
information requires detailed analysis and optimization if more than two terminological knowledge
bases are used.




Figure 1: Model of a user's session in the chat-bot IIS
   The main purpose of this work is to continue the research on the use of multiple subject areas using
a recurrent neural network, as well as to resolve the issues found and to compare the results of using
different methods of text analysis.
   The high-level structure of the chatbot as a prototype of multi-purpose text analysis is presented
below in Figure 2.




Figure 2: Architecture model of IIS chatbot

3. Main research
    Comparing the development of natural language processing with the development of artificial
intelligence, it is worth emphasizing the affinity of the directions and the direct application of artificial
intelligence for the semantic parsing of text information from the flow for the allocation of important
elements (constructions, terms). The selection of intellectually valuable elements from the information
stream is formed by searching among the information of structures already identified by the repository
as constructs which belonging to the terminology database of the chosen subject area or regions. The
described search is carried out by applying the intellectual means of information analysis. For IIS
chatbot such a tool was chosen neural network for the task of intellectual analysis of natural language.
For the problem of intellectual analysis of the text, two most common types of the neural network were
considered. The first type is a machine learning algorithms, the feature of which is the minimum pre-
processing of the data before use [8]. The disadvantage of this type is the complexity of learning and
minor effectiveness if the source data will be changed in the future. The second type is recurrent neural
networks, which also belong to the class of deep neural networks and whose tasks include recognition
of natural text and speech recognition. Unlike machine learning algorithms, the connection between
nodes in recurrent neural networks forms an oriented cycle [8, 9]. This creates the internal state of the
network, which allows it to manifest a dynamic behavior in time, based on which internal memory is
formed. Due to the effect of internal memory, the recurrent neural network is resistant to the dynamic
change of the input stream of information and the processing of arbitrary input sequences [10-12].
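    For illustration, this internal state can be written as the standard recurrence of a simple RNN (a
textbook form, not an equation from the paper):

$$h_t = \sigma\left(W_x x_t + W_h h_{t-1} + b\right)$$

where $x_t$ is the current input and $h_{t-1}$ is the previous hidden state that carries the internal memory.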
Focusing on the problem of natural language recognition, the recurrent neural network was selected.
    The recurrent neural network (RNN) has undergone many modifications since its creation, leading
to a whole family of neural network variants for different tasks. The most well-known RNNs are the
Elman and Jordan networks, the echo-state network, networks using the long short-term memory
(LSTM) method, the bidirectional RNN, the continuous-time RNN, the hierarchical RNN, the second-
order RNN, the multiple-timescale RNN, and neural Turing machines. Among the presented
modifications of the RNN, the greatest attention was focused on the long short-term memory method,
which has proved itself best in natural language recognition tasks. This method was developed and
published in 1997 by Hochreiter and Schmidhuber [13]. It avoids the vanishing gradient problem and
prevents the vanishing and explosion of back-propagated errors. This is due to backpropagation through
an unlimited number of layers unrolled in time. Owing to backpropagation through an arbitrary number
of layers, an RNN using the long short-term memory method withstands time gaps and can process text
information of any length. Thanks to these capabilities, such RNNs found their widest use in the field of
natural language recognition, where this model began to outperform traditional recognition models.
Examples of its effectiveness are its use by the search giant Baidu since 2014 and its use in Google
Android and Google Voice Search [14, 15]. Thanks to the power of this RNN, it is successfully used to
recognize context-sensitive languages, to model languages, and for multilingual language processing.
   Considering the RNN model with the LSTM method as a modification of the usual RNN, LSTM
cells serve as nodes, each representing a simple recurrent unit with the possibility of error
backpropagation.
   The main advantages of using the LSTM method for an RNN are:
   •     it withstands long time gaps;
   •     it does not depend on the text length;
   •     it has a simple structure.
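   For reference, a standard formulation of the LSTM cell shown in Figure 3 is given below (following
[13, 16]; the notation varies slightly between sources):

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Here the forget gate $f_t$ decides what to discard from the cell state $c_t$, which is what lets the cell
withstand long time gaps.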




Figure 3: Model of a LSTM cell in RNN [16]

    For the IIS chatbot experiments, an RNN with the following characteristics was selected: the number
of input-layer neurons equals the size of the word vector under analysis; the number of output neurons
(last layer) equals the number of terminology databases used; and there are 3 hidden layers of 1024
neurons each. The activation function of the neuron is a sigmoid, and the number of epochs is individual
for each experiment; as shown in the experiments section, the average number of training epochs is
about 16-18 thousand.
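    A sketch of this architecture in Keras is shown below; the paper does not name a framework, so the
framework, loss, optimizer, and input dimensions are assumptions.

```python
# Sketch of the described RNN: three hidden LSTM layers of 1024 neurons,
# outputs equal to the number of terminology databases, sigmoid output.
from tensorflow.keras import layers, models

VECTOR_SIZE = 300   # assumed word-vector dimensionality
SEQ_LEN = 17        # assumed: sentences average 15-17 words
NUM_KBS = 3         # outputs = number of terminology databases used

model = models.Sequential([
    layers.LSTM(1024, return_sequences=True,
                input_shape=(SEQ_LEN, VECTOR_SIZE)),   # hidden layer 1
    layers.LSTM(1024, return_sequences=True),          # hidden layer 2
    layers.LSTM(1024),                                 # hidden layer 3
    layers.Dense(NUM_KBS, activation="sigmoid"),       # sigmoid, per the text
])
model.compile(optimizer="adam", loss="binary_crossentropy")  # assumed
```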
4. Experiments
   After the theoretical work, the results of the practical implementation are considered in the form of
graphs and tables for different input data. For a detailed analysis of the results, more than 30 experiments
were conducted on combined data from 5 subject areas and more than 20 terminological databases. Each
of the terminological databases is presented in the figures below. A graph of the dependence of
processing time on the number of subject areas is given in Figure 1. Part of the experiments is shown in
sections 4.2-4.4. Each experiment includes several diagrams and a table of input data and results.

4.1.    Terminological knowledge base structure
    The data are organized as a key-value dictionary with sub-levels of values, stored in a sharded,
distributed, document-oriented database, MongoDB [17]. The performance of this database is not
considered a significant parameter of the research experiments; to reproduce the results, another database
may be used. Each terminology knowledge base includes up to five databases, which differ by the impact
of the relations of the subject area. Each database includes 26 collections of words (one collection for
each letter of the English alphabet). The model of this storage is presented in Figure 4.




Figure 4: Example of storage the terminology knowledge base

    This structure allows fast search of the required keywords for selected words and the creation of a
tree-structured solution for the entire searched text with all cross-links between keywords. The value of
the analysis increases with every newly found relation between keywords of sentence parts. Duplication
of keywords is avoided by organizing the words in the keyword set into small groups (up to ten related
words, in decreasing order of compatibility).
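    A lookup sketch with pymongo is given below; the database and field names are illustrative, not
taken from the paper.

```python
# Hypothetical lookup in the structure of Figure 4: one database per
# knowledge base, one collection per first letter of the word.
from typing import Optional
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

def find_keyword(kb_name: str, word: str) -> Optional[dict]:
    """Find a word's document in its per-letter collection."""
    collection = client[kb_name][word[0].lower()]
    return collection.find_one({"word": word})

# Example: find_keyword("cars_kb", "gearbox")
```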
4.2.    Experiment (two knowledge bases)
    Two knowledge bases were used for this experiment: cars and cooking. Each knowledge base
contains two databases, with up to 80000 terms in total for both knowledge bases. The number of training
epochs is 16396. Detailed information is presented in Table 2.

Table 2
Results of text analysis for two subject areas
         Title               Average found     Average processing    Terms, N    Correctness, %    Total processed
                             times per text    time, s                                             texts, N
     Buses (Car)                 0.193              0.487             11034          78.3              40000
      Cars (Car)                 0.375              0.450             23006          82.0              40000
  Motorcycle (Car)               0.036              0.503              5974          74.9              40000
  Baking (Cooking)               0.241              0.488             22406          71.4              40000
   Salad (Cooking)               0.156              0.421              3244          81.2              40000
 Vegetables (Cooking)            0.289              0.447             15779          93.2              40000




Figure 5: Diagrams of training the RNN with LSTM
4.3.    Experiment (three knowledge bases)
    Three knowledge bases were used for this experiment: cars, sports, and business. Each knowledge
base contains three databases, with up to 130000 terms in total for all knowledge bases. The number of
training epochs is 17992. Detailed information is presented in Table 3.

Table 3
Results of text analysis for three subject areas
         Title               Average found     Average processing    Terms, N    Correctness, %    Total processed
                             times per text    time, s                                             texts, N
     Buses (Car)                 0.032              0.604             11034          76.1             100000
      Cars (Car)                 0.194              0.725             23006          79.5             100000
  Motorcycle (Car)               0.015              0.536              5974          74.2             100000
   Tennis (Sports)               0.083              0.703              4306          80.3             100000
 Cybersport (Sports)             0.252              0.596             31053          79.6             100000
  Biathlon (Sports)              0.099              0.540              6210          76.3             100000
 Documents (Business)            0.288              0.782             27083          82.2             100000
   Laws (Business)               0.221              0.746             14007          78.7             100000
 Government (Business)           0.186              0.658              5384          85.8             100000




Figure 6: Diagrams of training the RNN with LSTM
4.4.    Experiment (four knowledge bases)
   Four knowledge bases were used for this experiment: cars, history, banking, and business. Each
knowledge base contains three databases, with up to 215000 terms in total for all knowledge bases. The
number of training epochs is 18463. Detailed information is presented in Table 4.

Table 4
Results of text analysis for four subject areas
         Title               Average found     Average processing    Terms, N    Correctness, %    Total processed
                             times per text    time, s                                             texts, N
     Buses (Car)                 0.057              0.735             11034          76.1             150000
      Cars (Car)                 0.152              0.791             23006          79.5             150000
  Motorcycle (Car)               0.013              0.701              5974          78.6             150000
   Rome (History)                0.183              0.853             25796          85.2             150000
  Austria (History)              0.297              0.820             16402          79.3             150000
  Germany (History)              0.268              0.874             38671          84.7             150000
   Banks (Banking)               0.186              0.815             12057          81.4             150000
  Credit (Banking)               0.239              0.916             18539          77.8             150000
 Transactions (Banking)          0.106              0.975             18663          79.1             150000
 Documents (Business)            0.183              0.862             27083          81.6             150000
   Laws (Business)               0.174              0.819             14007          69.2             150000
 Government (Business)           0.153              0.825              5384          82.1             150000




   Figure 7: Diagrams of training the RNN with LSTM
5. Comparison results
   The experiments with different combinations of subject areas and data sizes show that the main goal
of multi-purpose text analysis was reached; a comparison with similar methods of text analysis is
presented below in Table 5.

Table 5
Comparison of different text analysis methods
        Title                                          Average processing     Found          Found          Correctness, %    Tests, N
                                                       time, s                keywords, N    phantoms, N
  Semantic text analysis                                   0.602                381064          50973            74.22          150000
  Word2vec [18]                                            0.449                389082          29407            83.43          150000
  Multi-purpose text analysis (two subject areas)          0.376                393176          17704            85.37          150000
  Multi-purpose text analysis (three subject areas)        0.481                395362          14056            86.52          150000
  Multi-purpose text analysis (four subject areas)         0.619                395163          12953            89.03          150000



6. Conclusion
   During the research, its main goal, to create a method of multi-purpose text analysis, was reached.
Text analysis for different subject areas can be used for checking spam messages or for checking posts
on a social network to find the necessary information. The main advantages of the multi-purpose text
analysis method are finding keywords that may differ between subject areas, the ability to scan user
preferences in the marketing sphere, analyzing the sense of the entered text, and providing better
semantic text analysis. The experiments section showed that increasing the number of subject areas
creates a search performance issue, and the best solution is to use up to four subject areas. The
correctness level is 2-6 percent better than that of the popular Word2vec method.
   In the next research, we will conduct more experiments with subject areas and try to find relations
between different groups of subject areas.

7. References
[1] Serban I. V. et al. “A deep reinforcement learning chatbot” arXiv preprint arXiv:1709.02349
    (2017)
[2] Werder K., Heckmann C. S. Ambidexterity in Information Systems Research: Overview of
    Conceptualizations, Antecedents, and Outcomes, Journal of Information Technology Theory and
    Application, 2019, Vol. 20, No. 1.
[3] Pearlson K. E., Saunders C. S., Galletta D. F. Managing and using information systems: A strategic
    approach. John Wiley & Sons, 2019.
[4] Andrii Yarovyi, Dmytro Kudriavtsev, Serhii Baraban, Volodymyr Ozeranskyi, Liudmyla Krylyk,
    Andrzej Smolarz, and Gayni Karnakova "Information technology in creating intelligent chatbots",
    Proc. SPIE 11176, Photonics Applications in Astronomy, Communications, Industry, and High-
    Energy        Physics    Experiments         2019,    1117627        (6    November        2019);
    https://doi.org/10.1117/12.2537415.
[5] Bhat G., “Chatbot Data” dataset, 2018, URL: https://www.kaggle.com/fungusamongus/chatbot-
     data
[6] Liling Tan, “Old Newspapers” dataset, 2018, URL: https://www.kaggle.com/alvations/old-
     newspapers
[7] Jeet J., “US Financial News Articles” dataset, 2018, URL: https://www.kaggle.com/jeet2016/us-
     financial-news-articles
[8] Polhul, T., & Yarovyi, A. Development of a method for fraud detection in heterogeneous data
     during installation of mobile applications. Eastern-European Journal of Enterprise Technologies,
     2019, Vol. 1(2), 65–75. https://doi.org/10.15587/1729-4061.2019.155060
[9] Guthrie D. Unsupervised detection of anomalous text, University of Sheffield, 2008.
[10] The multiple dimensions of information quality, 2017, URL:
     https://www.researchgate.net/publication/242929284_The_Multiple_Dimensions_of_Information_Quality
[11] Andrii Yarovyi, Raisa Ilchenko, Ihor Arseniuk, Yevhene Shemet, Andrzej Kotyra, Saule Smailova,
     "An intelligent system of neural networking recognition of multicolor spot images of laser beam
     profile," Proc. SPIE 10808, Photonics Applications in Astronomy, Communications, Industry, and
     High-Energy        Physics     Experiments      2018,    108081B       (1    October      2018),
     URL: https://doi.org/10.1117/12.2501691;
[12] A. Yarovii, D. Kudriavtsev and O. Prozor, "Improving the Accuracy of Text Message Recognition
     with an Intelligent Chatbot Information System," 2020 IEEE 15th International Conference on
     Computer Sciences and Information Technologies (CSIT), Zbarazh, Ukraine, 2020, pp. 76-79, doi:
     10.1109/CSIT49958.2020.9322036.
[13] S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, 1997, URL:
     https://www.researchgate.net/publication/13853244_Long_Short-term_Memory
[14] Google AI Blog, Neural Networks behind Google Voice, 2015, URL:
     https://ai.googleblog.com/2015/08/the-neural-networks-behind-google-voice.html
[15] M. Patwary, S. Narang, E. Undersander, J. Hestness, and G. Diamos, Neural Networks in Baidu
     search engine, 2018, URL: http://research.baidu.com/Blog/index-view?id=103
[16] C. Olah, Understanding LSTM Networks, 2015, URL:
     https://colah.github.io/posts/2015-08-Understanding-LSTMs/
[17] Mongo DB Official website, Mongo DB Features and documentation, URL:
     https://docs.mongodb.com/
[18] V. Bhanawat, Word2vec algorithm, 2019, URL: https://medium.com/@vishwasbhanawat/the-
     architecture-of-word2vec-78659ceb6638