<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Eastern-European Journal of Enterprise Technologies</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/CSIT49958.2020.9322036</article-id>
      <title-group>
        <article-title>Method of Multi-purpose Text Analysis Based on a Combination of Knowledge Bases for Intelligent Chatbot</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrii Yarovyi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmytro Kudriavtsev</string-name>
          <email>dmytro.kudriavtsev@vntu.edu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vinnytsia National Technical University</institution>
          ,
          <addr-line>Khmelnytsky highway 95, Vinnytsia, 21021</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>1</volume>
      <issue>2</issue>
      <fpage>65</fpage>
      <lpage>75</lpage>
      <abstract>
        <p>The issues related to text message recognition for an intelligent chatbot information system were analyzed. The problem of multi-purpose text message recognition was identified and resolved, and opportunities for using multiple knowledge bases were presented. Experiments on combinations of knowledge bases from different subject areas were completed and are shown, along with a table comparing the effectiveness of the proposed approach with its analogs. Keywords: text processing, semantic text analysis, chatbot, terminological knowledge base. An intelligent chatbot is an intelligent information system (IIS), mainly in the form of a program with the ability to process incoming data and provide the necessary information on its basis. Its main area of application is support and assistance of users, mainly within one or more selected subject areas. The ability to use elements of artificial intelligence has led to the rapid development of information processing algorithms in chatbot systems. The processing of input data in such an IIS is usually implemented in the form of specially designed algorithms, machine learning technology, neural networks, and deep learning [1]. While providing user support during an IIS session, the chatbot may face certain usage restrictions that arise when the chat messenger is embedded in a more global IIS. Such chatbots require reliable and highly skilled processing of input data, which is provided by data mining tools, Big Data processing algorithms, and monitoring of the processing itself in order to improve analysis and reduce error [2]. Applying all of the above-mentioned facilities and functions, most modern chatbots have high functional potential and recognize up to 99% of incoming information in preselected subject areas. If the chatbot is considered as a fully autonomous intelligent information system, the problem becomes applying it to a wide range of tasks and ensuring compatibility with third-party software.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Among the features of such an IIS is the ability to automatically update the repository with relevant data. Based on these features, the operation of such an IIS requires detailed analysis and the use of reliable data processing tools [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Theoretical preparation</title>
      <p>The functionality of a chatbot may include support for the user of the IIS, the capabilities of third-party services used by the chatbot, as well as intellectual analysis of the user's actions. User support is the exchange of information packets between the user and the IIS in the form of text, image, audio, and video messages.</p>
      <p>Each of these types of information packets must have its own mechanism for analyzing the incoming data of the IIS and must be independent of the analysis of other data types in the created IIS. In accordance with the minimum requirements for the functioning of an IIS chatbot, the system must adequately and timely recognize the type of information packet coming from the user.</p>
      <p>Another important criterion for chatbots is the information environment in which the IIS chatbot works. By this criterion, they are divided into open chatbots, which allow the IIS to be used by several users at the same time, and closed chatbots, which can be used by only one user, regardless of the duration of the session.</p>
      <p>The possibility of simultaneous operation of the IIS with several users is at the same time the most difficult stage of implementing an IIS chatbot. It usually relies on parallel computing and distributed systems technologies. There is no session format for this type of chatbot: the duration of the user's interaction with the IIS is usually estimated from the length of the pauses between the user's inputs. By monitoring the processes of information exchange, reports and metadata are formed. Analyzing these reports makes it possible to draw conclusions about the level of information processing, about uncertain situations, when the relevance of the most relevant information in the IIS data storage does not exceed the permissible threshold, and about other aspects of the IIS chatbot's operation.</p>
      <p>Uncertain situations arise in the case of insufficient storage or its low informational value, as well as in cases where the type of incoming information is incompatible with the identification capabilities of the selected IIS chatbot. However, the most frequent cause of uncertain situations is a substantial difference between the input information and the terminology database of the chosen subject area or areas. Observing these cases, one can conclude that the implementation of additional functions is necessary.</p>
      <p>Also, the structure of the stored data must be the same for each subject area to allow fast search with minimal changes to the data retrieval logic. The data format needs to be simple and clearly understandable; the most suitable structures are the hash table and the key-value dictionary. The implementation part is related to end-user development, and attention should be paid to the size of each table in the terminology base. Several examples of terminological knowledge bases for different subject areas are presented in the experiments section.</p>
      <p>When several terminological databases of optionally related subject areas are used, a threshold of sensitivity needs to be introduced. When multiple data stores are queried at the same time to select the best result, each repository provides its own best answer with a corresponding recognition factor. In this case, a data storage response that is not related to the subject of the session should be rejected as inappropriate. The threshold is introduced precisely for this purpose: to reject such answers and to avoid such situations in general. The results of adding the threshold of sensitivity are presented in Table 1.</p>
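      <p>The selection logic described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the knowledge base names, answers, and recognition factors are invented for the example, and only the 20% threshold value comes from the experiments.</p>

```python
# Sketch: selecting the best answer across several terminological knowledge
# bases, rejecting candidates whose recognition factor falls below the
# sensitivity threshold. All names and scores below are illustrative.

SENSITIVITY_THRESHOLD = 0.20  # the paper's experiments favour ~20%

def select_answer(candidates, threshold=SENSITIVITY_THRESHOLD):
    """candidates: list of (knowledge_base, answer, recognition_factor)."""
    # Reject answers whose recognition factor is below the threshold.
    accepted = [c for c in candidates if c[2] >= threshold]
    if not accepted:
        return None  # uncertain situation: no knowledge base is confident
    # Otherwise pick the single best-scoring answer.
    return max(accepted, key=lambda c: c[2])

candidates = [
    ("cars",    "A bus is a large road vehicle...", 0.61),
    ("cooking", "Baking requires an oven...",       0.08),
]
best = select_answer(candidates)
```

      <p>An off-topic answer (here, the "cooking" candidate at 0.08) never reaches the comparison step, which is exactly the rejection behavior the threshold is meant to provide.</p>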
      <p>For the input data, several data sources with text sentences were used; the average size of each sentence is 15-17 words. The number of sentences for this comparison is 18536, and the total number of keywords is 62737. Correctness is calculated as</p>
      <p>Correctness = 2 − keywords / (keywords − phantom keywords), if keywords / (keywords − phantom keywords) &gt; 1; else Correctness = keywords / (keywords − phantom keywords)</p>
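      <p>The correctness measure can be computed directly from the two counts. The function below is a sketch of that formula; the 5% phantom rate used in the example is a hypothetical value, not a result from the paper.</p>

```python
# Sketch: the correctness measure, computed from the total keyword count
# and the number of "phantom" keywords (words accepted as keywords that
# do not belong to the semantic kernel).

def correctness(keywords: int, phantom_keywords: int) -> float:
    ratio = keywords / (keywords - phantom_keywords)
    # The ratio exceeds 1 whenever phantom keywords are present; folding it
    # back with 2 - ratio keeps correctness at 1.0 for a perfect extraction
    # and lowers it as phantom keywords accumulate.
    return 2 - ratio if ratio > 1 else ratio

# With the paper's total of 62737 keywords and a hypothetical 5% phantom rate:
c = correctness(62737, 3137)
```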
      <p>
        As presented in Table 1, the best threshold of sensitivity is near 20%. Phantom keywords belong to the group of words that are not part of the semantic kernel but are accepted as keywords by a standard semantic analysis algorithm. The semantic kernel in this case is calculated by the improved semantic text analysis [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. If a threshold below 20 percent is used, the frequency threshold for semantic text analysis must not depend on the threshold of sensitivity; in that case the correctness is too small for such a threshold to be selected.
      </p>
      <p>
        When the terminological knowledge bases of two subject areas were used instead of one, the accuracy of recognition was slightly reduced, by an average of 2-3%, with a threshold value of 20% for correspondence of information. In previous research, three terminological knowledge bases were also used, and situations were found in which information belonged to more than one terminological knowledge base, which in turn caused uncertainty about the topic of the dialogue [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In this case, the accuracy of recognition of the user's textual information decreased by 8-10% in the worst case, when the subject areas had significant similarity in their terminological knowledge bases. The datasets used in this research were taken from the Kaggle informational resource [5-7]. Due to the decrease in recognition accuracy, a problem arose of checking the quality of a terminological knowledge base and its similarity to the others used in the chatbot IIS. Additionally, it was found that the total processing time of input information requires detailed analysis and optimization if more than two terminological knowledge bases are used.
      </p>
      <p>The main purpose of this work is to continue the research on the use of multiple subject areas using a recurrent neural network, as well as to resolve the issues found and to compare the results of using different methods of text analysis.</p>
      <p>The high-level structure of the chatbot as a prototype of multi-purpose text analysis is presented below in Figure 2.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Main research</title>
      <p>Comparing the development of natural language processing with the development of artificial intelligence, it is worth emphasizing the affinity of the two directions and the direct application of artificial intelligence to the semantic parsing of text information from a stream in order to extract important elements (constructions, terms). The selection of intellectually valuable elements from the information stream is performed by searching the information for structures already identified by the repository as constructs belonging to the terminology database of the chosen subject area or areas. The described search is carried out by applying intellectual means of information analysis. For the IIS chatbot, a neural network was chosen as the tool for the intellectual analysis of natural language. For this problem, the two most common approaches were considered. The first is classical machine learning algorithms, whose feature is minimal preprocessing of the data before use [8]. The disadvantage of this approach is the complexity of learning and reduced effectiveness if the source data change in the future. The second is recurrent neural networks, which belong to the class of deep neural networks and whose tasks include recognition of natural text and speech. Unlike classical machine learning algorithms, the connections between nodes in a recurrent neural network form a directed cycle [8, 9]. This creates an internal state of the network, which allows it to exhibit dynamic behavior in time, on the basis of which internal memory is formed. Due to this internal memory, a recurrent neural network is resistant to dynamic changes of the input information stream and can process arbitrary input sequences [10-12]. Focusing on the problem of natural language recognition, the recurrent neural network was selected.</p>
      <p>The recurrent neural network (RNN) has undergone many modifications in the period from its creation to the present, leading to a whole family of neural network variants for different tasks. The most well-known RNNs are the Elman and Jordan networks, the echo-state network, the network using the long short-term memory (LSTM) method, the bidirectional RNN, the continuous-time RNN, the hierarchical RNN, the second-order RNN, the multiple-timescale RNN, and neural Turing machines. Among these modifications, the greatest attention was focused on the long short-term memory method, which has proved itself best in natural language recognition tasks. This method was developed and published in 1997 by Hochreiter and Schmidhuber [13]. It avoids the vanishing gradient problem and prevents the disappearance and spurious growth of back-propagated errors. This is due to backpropagation through an unlimited number of layers unrolled in time. Based on this propagation through an arbitrary number of layers, an RNN using the long short-term memory method can withstand time gaps and process text information of any length. Thanks to these capabilities, its widest use has been in the field of natural language recognition, where this RNN model began to outperform traditional recognition models. Examples of its effectiveness are its use by the search giant Baidu since 2014 and its use in Google Android and Google Voice Search [14, 15]. Thanks to the power of this RNN, it is successfully used to recognize context-sensitive languages, to model languages, and for multilingual language processing.</p>
      <p>Considering the RNN model using the LSTM method as a modification of the usual RNN, LSTM elements serve as its nodes, each representing a simple RNN cell with the possibility of error backpropagation.</p>
      <p>The main advantages of using the LSTM method for an RNN:
• It withstands time gaps
• It does not depend on the text length
• It has a simple structure</p>
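      <p>To make the role of the cell state concrete, the following is a minimal sketch of a single LSTM cell step in the standard Hochreiter-Schmidhuber formulation. The scalar toy weights are invented for illustration; they are not the paper's trained parameters.</p>

```python
import math

# Sketch of one LSTM cell step (standard formulation). Toy scalar weights
# stand in for trained weight matrices; values are illustrative only.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w=0.5, u=0.5, b=0.0):
    """One step of a 1-dimensional LSTM cell with shared toy weights."""
    f = sigmoid(w * x + u * h_prev + b)          # forget gate
    i = sigmoid(w * x + u * h_prev + b)          # input gate
    o = sigmoid(w * x + u * h_prev + b)          # output gate
    c_tilde = math.tanh(w * x + u * h_prev + b)  # candidate cell state
    c = f * c_prev + i * c_tilde                 # cell state: long-term memory
    h = o * math.tanh(c)                         # hidden state (output)
    return h, c

# The cell state c carries information across arbitrary time gaps,
# which is why the length of the input sequence does not matter.
h, c = 0.0, 0.0
for token_value in [0.2, -0.1, 0.7]:
    h, c = lstm_step(token_value, h, c)
```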
      <p>For the IIS chatbot experiments, this RNN was configured with the following characteristics: the number of inputs is equal to the size of the word vector used for analysis (inputs of the input layer); the number of outputs is equal to the number of terminology databases used (neurons of the last layer); and there are 3 hidden layers of 1024 neurons each. The activation function of the neurons is the sigmoid, and the number of epochs is individual for each experiment; as shown in the experiments section, the average number of training epochs is about 16-18 thousand.</p>
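      <p>The architecture just described can be sketched as a plain forward pass. This is only an untrained illustration of the layer sizing: the word vector size is an assumed value, the hidden layers are reduced from the paper's 1024 neurons so the sketch runs instantly, and the weights are random rather than trained.</p>

```python
import math
import random

# Sketch of the described architecture: input layer sized to the word
# vector, three sigmoid hidden layers (1024 neurons each in the paper,
# reduced here), and one output neuron per terminology database.

random.seed(0)

def dense(inputs, n_out):
    """One fully connected sigmoid layer with random (untrained) weights."""
    layer = []
    for _ in range(n_out):
        weights = [random.uniform(-0.1, 0.1) for _ in inputs]
        z = sum(w * x for w, x in zip(weights, inputs))
        layer.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid activation
    return layer

WORD_VECTOR_SIZE = 32    # size of the word vector (assumed value)
HIDDEN = 64              # paper: 1024 neurons per hidden layer
N_KNOWLEDGE_BASES = 3    # one output per terminology database

x = [random.random() for _ in range(WORD_VECTOR_SIZE)]
for _ in range(3):                    # three hidden layers
    x = dense(x, HIDDEN)
scores = dense(x, N_KNOWLEDGE_BASES)
best_kb = scores.index(max(scores))   # knowledge base with the best score
```

      <p>Each output score plays the role of the recognition factor compared against the threshold of sensitivity discussed earlier.</p>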
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>After performing the theoretical calculations, the results of the practical implementation were considered in the form of graphs and tables for different input data. For a detailed analysis of the results, more than 30 experiments were conducted on combined data from 5 subject areas and more than 20 terminological databases. Each of the terminological databases is presented in the figures below. A graph of the dependence of processing time on the number of subject areas is given in Figure 1. Part of the experiments is shown in sections 4.2-4.4. Each experiment contains several screens with diagrams and a table of input data and results.</p>
    </sec>
    <sec id="sec-5">
      <title>4.1. Terminological knowledge base structure</title>
      <p>The data is organized as a key-value dictionary with sub-levels of values, stored in a sharded, distributed, document-oriented database, MongoDB [17]. The performance of this database is not included as a parameter of the research experiments; to reproduce the results, another database may be used. Each terminological knowledge base includes up to five databases, which differ by the impact of the relations of the subject area. Each database includes 26 collections of words (1 collection for each letter of the English alphabet). The model of this storage is presented in Figure 4.</p>
      <p>In this structure one can quickly find the needed keywords for the selected words and build a tree-structured solution for the whole searched text, with all the cross links between keywords. The value of the analysis increases with every newfound relation between keywords of the parts of sentences. Duplication of keywords is avoided by organizing the words of the keyword set into small groups (up to ten related words, in order of decreasing compatibility).</p>
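      <p>The storage model above can be sketched with plain dictionaries standing in for MongoDB collections: one collection per English letter, each mapping a keyword to its small group of related words. The sample terms are invented for illustration and are not taken from the paper's datasets.</p>

```python
import string

# Sketch of the storage model: one dictionary per terminological database
# with 26 collections (one per English letter), each mapping a keyword to
# a small group of up to ten related words, ordered by compatibility.

def empty_database():
    return {letter: {} for letter in string.ascii_lowercase}

cars_db = empty_database()
cars_db["b"]["bus"] = ["coach", "shuttle", "minibus"]        # illustrative
cars_db["e"]["engine"] = ["motor", "turbine", "powertrain"]  # illustrative

def lookup(db, word):
    """Key-value search: the first letter selects the collection."""
    return db[word[0].lower()].get(word.lower())

related = lookup(cars_db, "Engine")
```

      <p>Because the first letter selects the collection directly, a lookup touches only a small fraction of the database, which is what makes the search fast regardless of the total number of terms.</p>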
    </sec>
    <sec id="sec-6">
      <title>4.2. Experiment (two knowledge bases)</title>
      <p>Two knowledge bases were used for this experiment: cars and cooking. Each knowledge base contains two databases, with up to 80000 terms in total for both knowledge bases. The number of epochs for training is 16396. Detailed information is presented in Table 2.</p>
      <p>Terminological databases used in Table 2: Buses, Cars, Motorcycle (cars); Baking, Salad, Vegetables (cooking).</p>
    </sec>
    <sec id="sec-7">
      <title>4.3. Experiment (three knowledge bases)</title>
      <p>Three knowledge bases were used for this experiment: cars, sports, and business. Each knowledge base contains three databases, with up to 130000 terms in total for all knowledge bases. The number of epochs for training is 17992. Detailed information is presented in Table 3.</p>
    </sec>
    <sec id="sec-8">
      <title>4.4. Experiment (four knowledge bases)</title>
      <p>Four knowledge bases were used for this experiment: cars, history, banking, and business. Each knowledge base contains three databases, with up to 215000 terms in total for all knowledge bases. The number of epochs for training is 18463. Detailed information is presented in Table 4.</p>
    </sec>
    <sec id="sec-9">
      <title>5. Comparison results</title>
      <p>The experiments with different combinations of subject areas and data sizes successfully demonstrate that the main goal of multi-purpose text analysis was reached; a comparison with similar methods of text analysis is presented below in Table 5.</p>
    </sec>
    <sec id="sec-10">
      <title>6. Conclusion</title>
      <p>During the research, the main goal, creating a method for multi-purpose text analysis, was reached. Text analysis across different subject areas can be used for checking spam messages or for scanning posts on social networks to find the necessary information. The main advantages of the multi-purpose text analysis method are finding keywords that differ between subject areas, the ability to analyze user preferences in the marketing sphere, analyzing the sense of the entered text, and providing better semantic text analysis. The experiments section showed that increasing the number of subject areas creates a search performance issue, and the best solution is to use up to four subject areas. The correctness level is 2-6 percent better than that of the popular Word2vec method.</p>
      <p>For the next research, we will make more experiments with subject areas and try to find the relation
between different groups of subject areas.</p>
    </sec>
    <sec id="sec-11">
      <title>7. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Serban</surname>
            <given-names>I. V.</given-names>
          </string-name>
          et al. “
          <article-title>A deep reinforcement learning chatbot</article-title>
          ”,
          <source>arXiv preprint arXiv:1709.02349</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Werder</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heckmann</surname>
            <given-names>C. S.</given-names>
          </string-name>
          <article-title>Ambidexterity in Information Systems Research: Overview of Conceptualizations, Antecedents, and Outcomes</article-title>
          ,
          <source>Journal of Information Technology Theory and Application</source>
          . (
          <year>2019</year>
          )
          <volume>20</volume>
          , no. 1.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Pearlson</surname>
            <given-names>K. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saunders</surname>
            <given-names>C. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Galletta</surname>
            <given-names>D. F.</given-names>
          </string-name>
          <article-title>Managing and using information systems: A strategic approach</article-title>
          . John Wiley &amp; Sons,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Andrii</given-names>
            <surname>Yarovyi</surname>
          </string-name>
          , Dmytro Kudriavtsev, Serhii Baraban, Volodymyr Ozeranskyi, Liudmyla Krylyk, Andrzej Smolarz, and Gayni Karnakova, “
          <article-title>Information technology in creating intelligent chatbots</article-title>
          ”
          ,
          <source>Proc. SPIE 11176</source>
          , Photonics Applications in Astronomy, Communications, Industry, and
          <source>High-Energy Physics Experiments</source>
          <year>2019</year>
          ,
          <volume>1117627</volume>
          (
          <issue>6 November 2019</issue>
          ); https://doi.org/10.1117/12.2537415.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>