<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Development of a method and software system for dialogue in real time</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrey Tarasiev</string-name>
          <email>andrew4800@mail.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Egor Talancev</string-name>
          <email>i.spyric@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Konstantin Aksyonov</string-name>
          <email>bpsim.dss@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olga Aksyonova</string-name>
          <email>wiper99@mail.ru</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Igor Kalinin</string-name>
          <email>igor_kalinin@hotmail.com</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Margarita Filippova</string-name>
          <email>rituly_22@mail.ru</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Radioelectronics and Information Technologies - RTF, Ural Federal University named after the first President of Russia B.N. Yeltsin</institution>
          ,
          <addr-line>Yekaterinburg, Russian Federation</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Radioelectronics and Information Technologies - RTF, Ural Federal University named after the first President of Russia B.N. Yeltsin</institution>
          ,
          <addr-line>Yekaterinburg, Russian Federation</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Radioelectronics and Information Technologies - RTF, Ural Federal University named after the first President of Russia B.N. Yeltsin</institution>
          ,
          <addr-line>Yekaterinburg, Russian Federation</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Institute of Radioelectronics and Information Technologies - RTF, Ural Federal University named after the first President of Russia B.N. Yeltsin</institution>
          ,
          <addr-line>Yekaterinburg, Russian Federation</addr-line>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Institute of Radioelectronics and Information Technologies - RTF, Ural Federal University named after the first President of Russia B.N. Yeltsin</institution>
          ,
          <addr-line>Yekaterinburg, Russian Federation</addr-line>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>LLC "UralInnovation"</institution>
          ,
          <addr-line>Yekaterinburg, Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we propose a method for recognizing an audio stream in real time that combines noise cleaning of the input signal with speech recognition by several third-party services. We present the results of testing and analyzing the speech recognition quality of these services and, based on the test results, propose improvements and modifications to the recognition system.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Automatic speech recognition is one of the key tasks in the construction of human-machine interaction systems based
on the speech interface.</p>
      <p>Developing the theoretical foundations and applied implementations of question-answering systems, as well as of intelligent
systems with a voice interface, is a pressing scientific and technical task. The theoretical and practical approaches used in
question-answering systems are widely applied in search engines and application software, and in tasks that require
maintaining the context of a dialogue with the end user.</p>
      <p>Various software systems for developing and operating voice robots (Akkulab, Zvonbot, CallOffice, Infobot, IVONA)
are available on the market. The main disadvantages of these systems are the lack of integration with the enterprise
corporate system, the use of a static dialogue scenario, and the high cost of maintenance and operation.</p>
      <p>The main goal of this project is to develop a flexible and adaptive platform for building voice and text agents
that can use various proprietary and third-party services to solve the following tasks of automatic calling
and information support for context-sensitive dialogues:
1. Collection, processing and storage of dialogues;
2. Recognition of Russian speech using Google and Yandex services;
3. Construction of a dialogue script;
4. Dynamic learning by a neural network from the imported history of dialogues;
5. Control of the progress of the voice robot's dialogue.</p>
      <p>Another pressing task of this project is to develop an integrated approach that combines various methods for improving
the quality of question-answering systems, from the stages of speech recognition and synthesis to conducting a flexible and
context-sensitive dialogue.</p>
      <p>The systems most in demand today, and the most difficult to implement, are those for recognizing
spontaneous speech. Building such systems is complicated by significant variability
in the rate of speech and in the psychophysical state of the speaker (the manner of pronouncing phrases, emotional state,
coughing, stuttering), by accents, and by the large number of word forms in use.</p>
      <p>The task is further complicated by pauses, repetitions, non-lexical insertions, filler words, etc. To date, a large
number of speech recognition methods have been developed that take these features of spontaneous
speech into account, and many open-source and commercial speech engines are available that can serve as the basis for
such systems.</p>
      <p>However, existing speech recognition systems have disadvantages in recognizing both speech in general and individual
language units.</p>
      <p>Therefore, when constructing the speech recognition module of the real-time voice dialogue system “TWIN”,
we decided to use, adapt and refine existing, well-established approaches rather
than create conceptually new algorithms.</p>
      <p>
        Today, the Google and Yandex systems demonstrate the highest recognition accuracy for continuous Russian speech, about
85% [
        <xref ref-type="bibr" rid="ref1 ref2">1-2</xref>
        ]. This recognition quality is achieved, first, through huge sets of acoustic training data (thousands of hours of
speech) and, second, through the large number of phrases and word forms from text search queries on which the
language models were trained. Our own speech recognition system was implemented by integrating these two systems.
      </p>
      <p>The recognition module of the TWIN system consists of three main subsystems:
1. Virtual PBX: places calls and routes traffic to the speech recognition subsystem;
2. Speech recognition subsystem: a software package whose main task is to redirect traffic to the required recognition
system;
3. Decision-making module: a software package of proprietary algorithms for processing text information, equipped with
a decision-making routine and a speech synthesis routine.</p>
      <p>The choice of recognition system can be pre-configured, or determined dynamically using the decision support module.</p>
      <p>For this, the incoming audio stream must first undergo minimal preparation in the form of noise
cleaning.</p>
    </sec>
    <sec id="sec-2">
      <title>Audio Stream Preparation</title>
      <p>
        Using telephone voice signals as a direct source for recognition degrades the quality of the speech
recognition module, which significantly reduces the effectiveness of the dialogue system. The limitations of the telephone
channel include a narrow bandwidth, the presence of stationary noise (for example, white and pink noise) and non-linear
distortions, as well as loss of information when the speech signal is encoded. In addition, if the person receiving the call
is on the street or in a moving car, the audio signal may contain a large amount of extraneous noise, which reduces the quality
of utterance recognition. Therefore, to reduce recognition errors, a noise cleaning system was introduced. To separate
the useful signal in difficult acoustic conditions, we used tools developed at the Center for Speech Technologies
LLC (STC) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and described in [
        <xref ref-type="bibr" rid="ref4 ref5">4-5</xref>
        ]. The main component of noise reduction is the voice activity detection (VAD) algorithm (a modification of
the algorithm based on the statistics of the fundamental tone [
        <xref ref-type="bibr" rid="ref4 ref5">4-5</xref>
        ]), which extracts the voiced portions of speech [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The
main idea is to detect these portions of speech by their vowels and nasal consonants. The
disadvantage is the loss of some consonants; however, plosive consonants and affricates carry less identification
value. It can therefore be assumed that the loss of a small part of insignificant speech material is compensated by the
reliable removal of non-speech sections.
      </p>
      <p>
        This makes it possible, for example, to reduce the dependence of speaker identification quality on channel distortions in pauses.
The developed VAD algorithm is based on spectral analysis of the speech signal. In each frame of the spectrogram, the
algorithm searches for the maxima that correspond to the harmonics of the fundamental tone and uses their positions to
estimate its frequency. The signal may lack the lower harmonics of the fundamental tone, which is
typical for a telephone channel with a frequency band of 300-3400 Hz. As noted in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], such a detector has the following
advantages: the speech signal is extracted even in relatively noisy segments (signal-to-noise ratio of 10 dB and below),
and the detector takes into account both the continuity of the fundamental tone and whether its value falls within the
frequency range typical of speech.
      </p>
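      <p>The frame-level decision described above can be illustrated with a minimal sketch. It is not the detector cited in the text but a simplified harmonic-comb estimate written under stated assumptions: the function names, the 80-400 Hz search range for the fundamental tone, the 2 Hz search step and the voicing threshold of 1.5 are all illustrative choices, and the continuity check between neighbouring frames is omitted. Only harmonics inside the 300-3400 Hz band are inspected, so the fundamental itself may be absent from the signal. The sampling rate is assumed to be at least 8 kHz.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import cmath
import math


def magnitude_spectrum(frame):
    """Return DFT magnitudes (bins 0..N/2) of a Hann-windowed frame.

    A naive O(N^2) DFT keeps the sketch dependency-free; a production
    system would use an FFT library instead.
    """
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    twiddle = [cmath.exp(-2j * math.pi * m / n) for m in range(n)]
    return [abs(sum(windowed[i] * twiddle[(k * i) % n] for i in range(n)))
            for k in range(n // 2 + 1)]


def detect_voiced_frame(frame, sample_rate, band=(300.0, 3400.0),
                        f0_range=(80.0, 400.0), threshold=1.5):
    """Return (f0_estimate_hz, is_voiced) for a single frame.

    All default parameter values are assumptions made for illustration.
    """
    spectrum = magnitude_spectrum(frame)
    bin_width = sample_rate / len(frame)
    lo = math.ceil(band[0] / bin_width)
    hi = int(band[1] // bin_width)
    band_mean = sum(spectrum[lo:hi + 1]) / (hi - lo + 1)
    if band_mean == 0.0:  # digital silence
        return 0.0, False

    def mean_at(freqs):
        # Average spectral magnitude at the bins nearest to the given frequencies.
        return sum(spectrum[round(f / bin_width)] for f in freqs) / len(freqs)

    best_f0, best_score = 0.0, 0.0
    f0 = f0_range[0]
    while f0 <= f0_range[1]:
        # Harmonics of the candidate that fall inside the telephone band.
        harmonics = [k * f0 for k in range(1, int(band[1] // f0) + 1)
                     if k * f0 >= band[0]]
        # Midpoints between harmonics, where a voiced frame has little energy.
        gaps = [h + f0 / 2 for h in harmonics if h + f0 / 2 <= band[1]]
        if harmonics and gaps:
            score = mean_at(harmonics) - mean_at(gaps)
            if score > best_score:
                best_f0, best_score = f0, score
        f0 += 2.0  # assumed search step in Hz
    voiced = best_score / band_mean >= threshold
    return (best_f0 if voiced else 0.0), voiced
```
]]></preformat>
      <p>Subtracting the level between the harmonics penalizes candidates at twice the true frequency, and averaging rather than summing penalizes candidates at half of it. A practical detector would additionally smooth the estimate over time, which is where the continuity property mentioned above comes in.</p>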
      <p>To verify that the module works, we used recordings of earlier conversations in which the system had given an
incorrect answer during recognition. It should be noted that in some operating conditions
(low signal quality, external noise or extraneous conversations) even these methods of dealing with
interference may not provide acceptable speech recognition quality.</p>
      <p>To further improve recognition quality, we identify several methods that will be implemented in the
system and that, in our opinion, can improve the quality of the developed product:
1. Dividing the audio stream into segments by speaker;
2. Recognition based on context (topic and history of the conversation, emotional state of the speaker, etc.);
3. Accounting for semantic errors (errors in the meaning of the spoken phrase as a whole) rather than for the number of
incorrectly recognized words.</p>
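      <p>The difference between counting misrecognized words and checking the meaning of a phrase (item 3 above) can be shown with a small sketch. The word error rate is the standard word-level edit distance divided by the reference length. The keyword-based meaning check, the function names and the example phrases are assumptions made for illustration and do not describe the TWIN implementation.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                           # deletion
                           cur[j - 1] + 1,                        # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)


def same_meaning(reference, hypothesis, keywords):
    """Hypothetical semantic check: both phrases must contain the same
    subset of the keywords that the dialogue scenario reacts to."""
    ref_keys = keywords & set(reference.lower().split())
    hyp_keys = keywords & set(hypothesis.lower().split())
    return ref_keys == hyp_keys
```
]]></preformat>
      <p>With the keyword set {yes, no, today, tomorrow}, the pair "well yes call me back tomorrow" and "yes call me tomorrow" has a word error rate of 1/3 but the same meaning, whereas "call me today" and "call me tomorrow" also has a word error rate of 1/3 but a different meaning. Only the second error would send the dialogue down the wrong branch.</p>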
    </sec>
    <sec id="sec-3">
      <title>Recognition Module Description</title>
      <p>The decision to use both popular speech recognition systems was driven by several factors:
1. Both systems are closed, so the recognition quality of either one cannot be relied on unconditionally.
2. The recognition quality of individual language structures differs between the systems.
3. The systems use different internal recognition mechanisms and may therefore produce the final result in
different ways, which can be exploited in different subject areas (when the recognition system is set explicitly at
the stage of designing dialogue scenarios).
4. The systems offer different additional functionality, which can likewise be applied selectively in different areas of use.
5. The systems differ in cost of use, which makes it possible to use cheaper solutions with a simpler
infrastructure in some subject areas.</p>
      <p>Of the factors listed, the most debatable, and the one that requires closer attention, is the claim that the quality of
recognition differs for various language structures. It is therefore necessary to test the recognition quality of some basic
speech structures for both systems.</p>
      <p>The traditional approach of analyzing the recognition quality of isolated, speaker-independent phrases
cannot be used for this testing, since its results would not be representative. This is primarily due to the peculiarities
of language models, paronyms, variable pronunciation of words in different situations or by different people,
noise, long and difficult phrases, emotional coloring, etc.</p>
      <p>Thus, to solve the problem, it is necessary to use an integrated approach based on a large number of simulation
experiments.</p>
      <p>In these experiments, real dialogue scenarios are simulated as they unfold over time.</p>
    </sec>
    <sec id="sec-4">
      <title>Experiment setup</title>
      <p>Based on the foregoing, the following criteria for the quality of speech recognition can be distinguished for tests:
• Percentage of recognized short emotional phrases.
• Percentage of recognized long phrases.
• Percentage of recognized domain-specific terms.
• Percentage of recognized proper names.
• Percentage of recognized simple numbers.
• Percentage of recognized complex numerals.
• Percentage of recognized dates, addresses, and other audio information containing numerals.
• Percentage of recognized speech in noise and other distortions.</p>
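      <p>Each criterion above is a percentage of correctly recognized items within one category, so the test statistics reduce to a simple per-category tally. The sketch below assumes that every simulation run is logged as a pair of a category label and a success flag. This record format, the function name and the category labels are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def recognition_rate_by_category(runs):
    """Return {category: percentage of recognized items}.

    runs is an iterable of (category, recognized) pairs, where recognized
    is truthy when the phrase was recognized correctly (assumed log format).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, recognized in runs:
        totals[category] += 1
        hits[category] += bool(recognized)
    return {category: 100.0 * hits[category] / totals[category]
            for category in totals}
```
]]></preformat>
      <p>For example, four logged runs in a "numeral" category with three successes give 75.0 for that category.</p>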
      <p>The difficulty in this task lies in building simulation models that reproduce specific
dialogue scenarios. Traditional modeling systems do not provide such functionality because the problem is narrowly
focused.</p>
      <p>The TWIN system includes a module for the visual configuration of dialogue scripts (Figure 1) [7].</p>
      <p>To organize the simulation experiments, a specialized toolset for generating (replaying) dialogues was developed
on the basis of technologies already present in the system.</p>
      <p>The idea was to organize automatic dialogues between robots according to previously described possible scenarios. That
is, several pairs of scenarios must be composed to study each selected criterion.</p>
      <p>Based on these technologies, several scripts have been developed for modeling and testing selected lexical units.
Particular attention was paid to the quality of recognition of numerals, dates and addresses.</p>
      <p>The various question-answer speech recognition systems under consideration were connected to this module.</p>
      <p>The modeling process consisted of multiple automatic runs in which the first script played back pre-prepared audio
files that experts had previously checked against the selected lexical categories. Recognition settings were varied between
runs. As the dialogue progressed, depending on how the second participant recognized each utterance, the dialogue either
followed the expected scenario branch or did not. The results were recorded from the statistics collected at the end of the
simulation.</p>
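      <p>The replay procedure can be sketched as follows. The sketch assumes a script represented as a mapping from a node to its keyword-triggered branches and a recognizer passed in as a plain function from an audio file name to text. The data layout, the function names and the keyword matching rule are illustrative assumptions, not the TWIN script format.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def run_dialogue(script, start, answers, recognize):
    """Replay one dialogue and return the list of visited script nodes.

    script    -- {node: {keyword: next_node}} (assumed representation)
    answers   -- audio items played back by the first robot, in order
    recognize -- callable that maps an audio item to recognized text
    """
    node, path = start, [start]
    for audio in answers:
        words = recognize(audio).lower().split()
        branches = script.get(node, {})
        node = next((target for keyword, target in branches.items()
                     if keyword in words), None)
        if node is None:  # nothing matched: the dialogue left the scenario
            break
        path.append(node)
    return path


def branch_hit_rate(script, start, cases, recognize):
    """Percentage of cases whose replayed path equals the expected path."""
    hits = sum(run_dialogue(script, start, answers, recognize) == expected
               for answers, expected in cases)
    return 100.0 * hits / len(cases)
```
]]></preformat>
      <p>Swapping the recognize argument between recognition services, or between recognition settings, while keeping the same cases yields comparable statistics for each configuration.</p>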
    </sec>
    <sec id="sec-5">
      <title>Discussion of the results</title>
      <p>Based on the results obtained, it can be concluded that the Yandex system recognizes short
expressive phrases and numerals better, while the Google API recognizes long phrases and terms better.</p>
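      <p>This finding suggests a simple routing rule for the dynamic choice of recognition system. The sketch below is one possible reading of it: the category labels, the default engine and the function names are assumptions, and the engines are represented by plain callables rather than by the real Yandex or Google client libraries.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Preference table derived from the test results reported above.
# The category labels themselves are hypothetical.
PREFERRED_ENGINE = {
    "short_phrase": "yandex",
    "numeral": "yandex",
    "long_phrase": "google",
    "term": "google",
}


def choose_engine(expected_category, configured=None, default="google"):
    """Pick a recognition engine for the next utterance.

    A value configured in the dialogue script takes priority; otherwise the
    engine is chosen by the category of the expected answer. The default
    for unknown categories is an arbitrary assumption.
    """
    if configured is not None:
        return configured
    return PREFERRED_ENGINE.get(expected_category, default)


def recognize(audio, expected_category, engines, configured=None):
    """Dispatch audio to the chosen engine.

    engines maps an engine name to a callable that returns recognized text.
    """
    name = choose_engine(expected_category, configured)
    return name, engines[name](audio)
```
]]></preformat>
      <p>Keeping the preference table as data makes it easy to update when the tests are repeated for new categories or new versions of the services.</p>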
      <p>At the same time, both systems have difficulty with noisy audio, an uneven speech rate and speech defects of the
interlocutor.</p>
      <p>
        This is because both systems recognize phonemes and phrases better than individual sounds, especially
in the presence of noise and other factors that degrade the quality of the transmitted audio message. This observation is
confirmed by the conclusions of other independent researchers [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Based on the data obtained, an operating algorithm was defined for the speech recognition system that is part of the TWIN
complex. The system also includes a module for post-processing the recognized text (normalization,
keyword extraction, etc.). This module greatly simplifies the robot's interpretation of the phrase uttered
by the interlocutor and the choice of subsequent actions (pronouncing the corresponding replies) provided for by the given
script [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
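      <p>A post-processing step of this kind can be sketched as follows. The source does not describe the actual normalization rules, so everything here is an assumption made for illustration: lowercasing, stripping punctuation, replacing number words with digits (with a tiny English table standing in for a Russian one) and filtering words against a keyword vocabulary supplied by the script.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Tiny illustrative table; a real system would need a full numeral parser.
NUMBER_WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
                "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
                "ten": 10}


def normalize(text):
    """Lowercase, strip punctuation and replace number words with digits."""
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    words = [str(NUMBER_WORDS[w]) if w in NUMBER_WORDS else w
             for w in cleaned.split()]
    return " ".join(words)


def extract_keywords(text, vocabulary):
    """Return the normalized words that the dialogue script reacts to."""
    return [w for w in normalize(text).split() if w in vocabulary]
```
]]></preformat>
      <p>For instance, normalize("Yes, call me at FIVE!") gives "yes call me at 5", and with the vocabulary {yes, no, 5} the extracted keywords are "yes" and "5".</p>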
      <p>The proposed method was implemented in modules that perform pre-processing before, and post-processing after,
third-party recognition. The resulting speech recognition module was then tested using the same models as before.</p>
      <p>Recognition quality improved on the tested indicators. Table 1 shows phrase
recognition statistics for the TWIN system by category.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>The speech recognition module of the TWIN system integrates the two most mature existing
solutions, Yandex SpeechKit and Google Speech API.</p>
      <p>The recognition systems used by the module were tested by means of simulation.
For this purpose, an additional dialogue playback module was implemented.</p>
      <p>Based on the results obtained, we can conclude that the Yandex system recognizes short expressive
phrases and numerals better, whereas the Google API recognizes longer phrases and terms better.</p>
      <p>Based on these results, modules for pre-processing, dynamic selection of the recognition system, and
post-processing of the recognized text were created. The recognition quality of the integrated solution, the speech
recognition module of the TWIN system, increased significantly.</p>
      <p>Further development of the system involves implementing additional functions, such as sending statistics
and generating hints when a script is compiled. Analysis of the statistics will help identify priority areas for improving
the interface.</p>
      <p>The flexibility of the initial design makes it possible to extend the range of uses of the system.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. “Speech Kit Cloud”,
          <source>Speech Kit Cloud</source>
          ,
          <year>2019</year>
          . [Online]. URL: https://tech.yandex.ru/speechkit/cloud/ (Accessed: 03.11.2019).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. “SpeechKit”,
          <source>Tech.yandex.ru</source>
          ,
          <year>2018</year>
          . [Online]. URL: https://tech.yandex.ru/speechkit/ (Accessed: 03.11.2019).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.S.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>I.P.</given-names>
            <surname>Medennikov</surname>
          </string-name>
          .
          <article-title>Methods, algorithms, and software for recognizing Russian telephone spontaneous speech</article-title>
          : dissertation of a candidate of technical sciences: 05.13.11 / Medennikov Ivan Pavlovich. Place of defense: St. Petersburg State University,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>I.B.</given-names>
            <surname>Tampel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Karpov</surname>
          </string-name>
          . Auto Speech Recognition,
          <volume>138</volume>
          , St. Petersburg: ITMO University,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>E.D.</given-names>
            <surname>Loseva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.V.</given-names>
            <surname>Lipinsky</surname>
          </string-name>
          .
          <article-title>Recognition of human emotions by spoken using intelligent data analysis methods</article-title>
          .
          <source>Actual problems of aviation and astronautics</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>K.</given-names>
            <surname>Aksyonov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Antipin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Afanaseva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kalinin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Evdokimov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shevchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karavaev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Aksyonova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Chiryshev</surname>
          </string-name>
          .
          <article-title>Testing of the speech recognition systems using Russian language models</article-title>
          .
          <source>5th International Young Scientists Conference on Information Technologies, Telecommunications and Control Systems, ITTCS 2018</source>
          . Yekaterinburg, Russian Federation, December
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>K.</given-names>
            <surname>Aksyonov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Aksyonova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goncharova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nevolina</surname>
          </string-name>
          .
          <article-title>Extension of the multi-agent resource conversion processes model: Implementation of agent coalitions</article-title>
          .
          <source>5th International Conference on Advances in Computing, Communications and Informatics</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>K.</given-names>
            <surname>Aksyonov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sysoletin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Aksyonova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nevolina</surname>
          </string-name>
          .
          <article-title>Integration of the Real-time Simulation Systems with the Automated Control System of an Enterprise</article-title>
          .
          <source>International Conference on Social Science, Management and Economics</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>K.</given-names>
            <surname>Aksyonov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kalinin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tabatchikova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Chiryshev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Aksyonova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Talancev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tarasiev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kanev</surname>
          </string-name>
          .
          <article-title>Development of decision making software agent for efficiency indicators system of IT-specialists</article-title>
          .
          <source>5th International Young Scientists Conference on Information Technologies, Telecommunications and Control Systems, ITTCS 2018</source>
          . Yekaterinburg, Russian Federation, December
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>D.V.</given-names>
            <surname>Bobkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.Y.</given-names>
            <surname>Zhigalov</surname>
          </string-name>
          .
          <article-title>The study of the reliability of speech recognition by the system Google Voice Search</article-title>
          .
          <source>Cloud of Science</source>
          , Volume 2,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>A.</given-names>
            <surname>Tarasiev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Talancev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Aksyonov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kalinin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Chiryshev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Aksyonova</surname>
          </string-name>
          .
          <article-title>Development of an Intelligent Automated System for Dialogue and Decision-Making in Real Time</article-title>
          .
          <source>2nd European Conference on Electrical Engineering &amp; Computer Science (EECS 2018)</source>
          . Bern, Switzerland, December
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>