<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluation of syllable intelligibility through recognition in speech rehabilitation of cancer patients</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tomsk</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lenina</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Russia</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomsk</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lenina</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomsk</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kooperativniy</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>balatskaya@oncology.tomsk.ru</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>This work applies an approach to evaluating speech quality through recognition of spoken syllables during the speech rehabilitation of patients after combined treatment of tongue cancer. A deep learning neural network is used to assess the pronunciation of syllables during rehabilitation. The structure of the neural network has been selected. Estimates of recognition quality are obtained for normal speech (syllables) and for speech after surgical intervention during the rehabilitation process. A conclusion about the applicability of the approach is drawn, and specific recommendations are given on the choice of neural network parameters, taking into account the limited volume of recordings available for training and the evident speaker dependence.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Copyright © by the paper's authors. Copying permitted for private and academic purposes.</p>
      <p>This work presents one of the options for obtaining objective quantitative characteristics by replacing the
auditors with a deep learning neural network. The network can operate either in a simple recognition mode (in effect, a direct
automated analog of the algorithm from GOST R 50840-95 "Voice over paths of communication. Methods
for assessing the quality, legibility and recognition" [GOST95]) or by using specific output values for quality
assessment.</p>
    </sec>
    <sec id="sec-2">
      <title>Current state of the problem</title>
      <p>The technique from GOST R 50840-95 is considered in more detail here, since this work is in effect its direct
automation. In comparison with GOST R 51061-97 [GOST97], which also uses tables from GOST R 50840-95,
the standard allows the speech therapist to use more "understandable" estimates. Within the
study of speech quality assessment, methods for evaluating syllabic and phrase intelligibility were selected and
implemented. The estimates are the proportion of correctly heard syllables and, correspondingly, phrases chosen
from special tables. These tables are constructed to cover all combinations of
phonemes that arise in real speech. The technique can also be used to assess the quality of a speech source
if the influence of the communication channel is absent or negligible. To exclude the technical communication
channel, the speech therapist conducting the lesson evaluates the heard syllables directly.
Syllable intelligibility is chosen as the criterion of pronunciation quality because it characterizes
the quality of phoneme pronunciation independently of context [Kos14].</p>
      <p>This work aims to replace the auditor in this technique with a deep neural network [Kip16]. The network is
trained on reference recordings made before the operation. The trained network is then used to
recognize syllables during rehabilitation. As a result, the system is speaker dependent,
which in this case is an advantage: all the characteristic features of a
particular speaker's utterances and the parameters of the vocal tract are preserved.</p>
      <p>Previous work [Kos17] considered the time-domain form of the speech signal for parameter extraction.
This work implements a fundamentally different approach to quality assessment, proposed, in particular, in
[Nik02]. Combining these approaches to obtain the characteristics most sensitive to changes in
speech quality is planned for the future.</p>
    </sec>
    <sec id="sec-3">
      <title>Description of the proposed approach</title>
      <p>The application of deep neural networks for assessing the quality of speech</p>
      <p>In this work, speech quality is assessed through recognition, using deep neural networks for image
analysis applied to the spectrogram. The recognition
procedure uses a Fourier spectrogram on a log-Bark scale (40 bands), which takes the features of perception into account.
The spectrogram is built from a speech signal with a sampling frequency of 16,000 Hz; the analysis
window cut out from the spectrogram is 10 ms (120 samples). Since the speed of the algorithm is
of no fundamental interest in this work and only the applicability of the approach matters, the step between windows
is chosen to be 1 sample. The resulting 120 × 40 matrix is fed to the input of the neural network
for training and subsequent recognition.</p>
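      <p>As a rough illustration of the windowing described above, the following Python sketch cuts patches of 120 frames by 40 bands out of a spectrogram with a step of 1 frame and flattens one patch into an input vector. The function names sliding_patches and flatten, the list-of-frames layout, and the toy input are assumptions made for illustration only; the original work was carried out in MATLAB and its code is not reproduced here.</p>
      <preformat><![CDATA[
```python
def sliding_patches(spectrogram, width=120, step=1):
    """Yield patches of `width` consecutive frames from a spectrogram.

    `spectrogram` is a list of frames, each frame a list of band energies
    (40 log-Bark bands in the text). This layout is an assumption.
    """
    last_start = len(spectrogram) - width
    for start in range(0, last_start + 1, step):
        yield spectrogram[start:start + width]


def flatten(patch):
    """Flatten one patch into a single input vector for the network."""
    return [value for frame in patch for value in frame]


if __name__ == "__main__":
    bands = 40
    # Toy spectrogram: 125 frames of zeros, not real speech data.
    toy = [[0.0] * bands for _ in range(125)]
    patches = list(sliding_patches(toy))
    print(len(patches))              # number of patches with step 1
    print(len(flatten(patches[0])))  # 120 * 40 values per patch
```
]]></preformat>
      <p>With a step of 1, neighbouring patches overlap almost entirely, which yields many training examples from a short recording at the cost of computation time.</p>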
      <p>The principal features of training and limitations under the problem to be solved</p>
      <p>The approach implemented within the framework of this task has several limitations, some of which are introduced
artificially.</p>
      <p>1. Dependence on the speaker. The model for assessing the quality of speech is built anew for each specific
speaker. The task is not to improve speech beyond the patient's established manner
of pronouncing phonemes or existing speech defects. The goal of rehabilitation is to
bring speech as close as possible to the existing reference: the speech of this particular patient before
surgical treatment. This limitation significantly simplifies the task from the
point of view of speech recognition, since there is no need for a large database of recordings from many speakers
for training.</p>
      <p>2. Limited number of phonemes. The primary interest is the quality of pronunciation of the phonemes that
are most susceptible to change after the operation. For this reason, a table of syllables oriented specifically
toward these problematic phonemes was chosen. The list of these phonemes was compiled at the first stage of the
study [Kos16]. It would be possible to use the complete classical table of syllables from GOST R 50840-95
(5 tables, 250 syllables according to the method of evaluating syllabic intelligibility). However, recording 250
syllables per session is quite demanding for the patient. Therefore, in agreement with the physicians engaged in
speech rehabilitation, it was decided to limit the table to 90 syllables focused specifically on the main problematic
phonemes ([k], [s], [t] and their soft variants). The most problematic phoneme, [r], is excluded from
consideration because the mechanism of its articulation changes in principle and direct comparison with the
reference is meaningless.</p>
      <p>3. Orientation toward obtaining a quality assessment quickly enough not to cause problems for the patient
during a session. Quality evaluation currently takes 3 seconds per syllable; the training time of the network is not
critical but is less than one hour.</p>
      <p>4. Within the framework of this work, syllable intelligibility refers to the proportion of correctly recognized
syllables. In the future, the values of the output layer of the neural network will be used to assess the degree
of proximity to the correct phoneme, in order to implement a biofeedback mechanism in the rehabilitation
process. In this paper, however, it is precisely the applicability of the approach to evaluating the quality of
phoneme pronunciation during speech rehabilitation that is examined.</p>
      <p>5. It is known in advance which syllable is pronounced. There is no need to interpret the sequence of recognized
phonemes and transform it into a syllable; it is only necessary to estimate the proportion of correct phonemes
in this sequence.</p>
      <p>The current state of the database for learning and assessing pronunciation quality</p>
      <p>To assess the applicability of this approach, two databases are used. The first is a database of healthy speakers
who pronounce syllables with and without the use of the tongue. It contains recordings of 3 speakers,
each participating in 3 recording sessions (2 sessions using the tongue, to assess the variation in pronunciation,
and 1 session without using the tongue). The number of speakers in this database is small because it serves only to verify
the applicability of the approach. The second is a database of real patients, on which the approach was subsequently tested.
The number of patients with recordings both before and after the operation was 79.</p>
      <p>Construction of a deep neural network and its training</p>
      <p>To implement the deep neural network for syllable recognition in the framework of assessing the quality
of pronunciation, the MATLAB 2018a computing environment [Mat18] with the Neural Network
Toolbox package was used; it allows flexible deep neural networks to be designed without delving into their low-level implementation.
The internal architecture of the neural network (30 layers) was chosen following the recommendations of the
MATLAB example for command recognition: the input layer; 2 × (Convolutional Layer, Batch
Normalization Layer, Rectified Linear Unit Layer, Max Pooling Layer); 2 × (Dropout Layer, Convolutional Layer,
Batch Normalization Layer, Rectified Linear Unit Layer); Max Pooling Layer; 2 × (Dropout Layer, Convolutional
Layer, Batch Normalization Layer, Rectified Linear Unit Layer); Max Pooling Layer; Fully Connected Layer;
Softmax Layer; and Weighted Cross Entropy Layer.</p>
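      <p>The layer sequence listed above can be checked against the stated total of 30 layers with a short Python sketch. The strings below are descriptive labels only, not MATLAB function names, and the name of the final layer is read here as a weighted cross-entropy classification layer, which is an assumption.</p>
      <preformat><![CDATA[
```python
def build_layer_list():
    """Return descriptive layer labels in the order given in the text."""
    conv_pool = ["Convolutional", "BatchNormalization", "ReLU", "MaxPooling"]
    drop_conv = ["Dropout", "Convolutional", "BatchNormalization", "ReLU"]
    layers = ["Input"]
    layers += conv_pool * 2          # 2 x (conv, batch norm, ReLU, max pool)
    layers += drop_conv * 2          # 2 x (dropout, conv, batch norm, ReLU)
    layers += ["MaxPooling"]
    layers += drop_conv * 2          # 2 x (dropout, conv, batch norm, ReLU)
    layers += ["MaxPooling"]
    # Final layer name is an assumed reading of the text.
    layers += ["FullyConnected", "Softmax", "WeightedCrossEntropy"]
    return layers


layers = build_layer_list()
print(len(layers))
```
]]></preformat>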
      <p>The outputs have the following structure: a vocalization output, a softness output, and 21 classes for phoneme
identification, 23 outputs in total. The input layer contains 4800 neurons.</p>
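      <p>One possible layout of the 23 outputs is sketched below in Python. The text does not state how targets are encoded, so the ordering (vocalization, softness, then 21 phoneme classes), the function name encode_target, and the use of a one-hot class vector are all assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
NUM_PHONEME_CLASSES = 21  # from the text


def encode_target(phoneme_index, voiced, soft):
    """Build a 23-element target: [vocalization, softness, 21 one-hot classes].

    The layout is hypothetical; the source gives only the output counts.
    """
    if not 0 <= phoneme_index < NUM_PHONEME_CLASSES:
        raise ValueError("phoneme_index out of range")
    one_hot = [0.0] * NUM_PHONEME_CLASSES
    one_hot[phoneme_index] = 1.0
    return [float(voiced), float(soft)] + one_hot


# Example with a made-up class index: voiced, not soft, class 3.
target = encode_target(3, voiced=True, soft=False)
print(len(target))
```
]]></preformat>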
      <p>The total data volume was more than 28,480 sets, of which 25,000 were selected for the training sample.
This split follows the example; the question of selecting the best structure and sampling will be considered in the future.</p>
      <p>The final training accuracy after 25 epochs and 4875 iterations was 95.75%. Accuracy was calculated as
the proportion of problematic phonemes correctly detected by the neural network on the validation part of the dataset.</p>
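      <p>The reported epoch and iteration counts can be related by simple arithmetic, sketched here in Python. The mini-batch size of 128 is not stated in the text; it is an assumption inferred from the figures, under the further assumption that an incomplete final batch in each epoch is discarded.</p>
      <preformat><![CDATA[
```python
def iterations_per_epoch(num_samples, batch_size):
    """Number of full mini-batches in one pass over the training set."""
    return num_samples // batch_size


TRAIN_SAMPLES = 25000  # from the text
EPOCHS = 25            # from the text
BATCH_SIZE = 128       # assumption, not stated in the text

per_epoch = iterations_per_epoch(TRAIN_SAMPLES, BATCH_SIZE)
total = per_epoch * EPOCHS
print(per_epoch, total)  # 195 4875
```
]]></preformat>
      <p>Under these assumptions the total matches the 4875 iterations reported for 25 epochs.</p>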
      <p>The results obtained for the evaluation of syllable intelligibility</p>
      <p>During testing at this stage, a phoneme is considered correctly recognized if more than 50% of its samples were classified correctly.
Table 1 presents the intelligibility results obtained by experts and by the proposed approach for healthy speakers,
with and without the use of the tongue in pronunciation, and for patients before and after surgery.
The table was compiled for 3 speakers and 3 patients. PersonN denotes healthy speakers and PatientN denotes patients
undergoing rehabilitation. "Normal" denotes standard speech for a healthy speaker and speech before the operation for
a patient. "Without tongue" denotes speech produced without using the tongue for a healthy speaker and speech
after the operation for a patient. The recordings contain syllables with the problematic phonemes ([t], [k], [s], [t'], [k'], [s'])
[Kos16]. The list of recordings contains 90 syllables.</p>
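      <p>The 50% decision rule can be sketched in Python as follows. The function name phoneme_recognized and the representation of the network output as one predicted label per analysis window are assumptions made for illustration; the labels in the example are made up.</p>
      <preformat><![CDATA[
```python
def phoneme_recognized(window_labels, expected, threshold=0.5):
    """Return True if more than `threshold` of the windows carry the expected label.

    `window_labels` holds one predicted label per analysis window
    (a hypothetical representation of the network output).
    """
    if not window_labels:
        return False
    hits = sum(1 for label in window_labels if label == expected)
    return hits / len(window_labels) > threshold


# Made-up example: two of three windows are labelled "k".
print(phoneme_recognized(["k", "k", "t"], "k"))
```
]]></preformat>
      <p>Because the comparison is strict, exactly half of the windows being correct does not count as a correct recognition.</p>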
      <p>The expert score is the ratio of the number of correctly recognized syllables (phonemes) to the total number of
pronounced syllables (phonemes). The network score is the same quantity obtained with the neural network in place of the expert. The
training progress of the neural network is shown in Figure 1, which shows the typical increase in accuracy
(and decrease in loss) over time.</p>
      <p>1. Even for a healthy speaker, the intelligibility measured by the network fell noticeably short of 100%, thus diverging from the
opinion of the experts. However, the mistakes arose mostly in "non-problematic" phonemes, which is explained by
their small share in the syllable table; in particular, some phoneme variants are
missing from the table, since recognition was not the ultimate goal of the system.</p>
      <p>2. For the problematic phonemes, on the other hand, whose results are presented in Table 2, the difference
is statistically insignificant according to Student's t-test at a confidence level of 0.95. In the future,
this value can be increased by varying the structure of the neural network and
adapting it to the problem being solved.</p>
      <p>3. In qualitative comparisons at the level of "more" versus "less", the intelligibility estimates agree with the expert
estimates, which supports the applicability of the proposed approach to
assessing the quality of speech during speech rehabilitation. This confirms rank-level consistency
between the previously used classical expert method for estimating syllabic intelligibility and the
proposed method using neural networks.</p>
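      <p>The score definition and the "more" versus "less" comparison can be illustrated with a small Python sketch. The function names and all numeric values in the example are hypothetical and are not results from the study.</p>
      <preformat><![CDATA[
```python
def intelligibility(flags):
    """Proportion of correctly recognized syllables.

    `flags` is a list of booleans, one per pronounced syllable.
    """
    return sum(flags) / len(flags) if flags else 0.0


def same_direction(expert_a, expert_b, network_a, network_b):
    """True if expert and network agree on which recording scores higher."""
    expert_sign = (expert_a > expert_b) - (expert_b > expert_a)
    network_sign = (network_a > network_b) - (network_b > network_a)
    return expert_sign == network_sign


# Made-up flags for four syllables, three recognized correctly.
print(intelligibility([True, True, False, True]))
# Made-up scores for two recordings of one speaker.
print(same_direction(0.9, 0.5, 0.8, 0.4))
```
]]></preformat>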
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>This paper considers the application of speech recognition based on a deep neural network to the problem of estimating
syllabic intelligibility according to the method of GOST R 50840-95 "Voice over paths of communication. Methods
for assessing the quality, legibility and recognition". Within this method, the resulting deep
neural network can act as an auditor and produce a corresponding quantitative estimate at its output. The
values obtained show no obvious contradictions with the estimates
given by experts. In addition, obtaining correct estimates by the expert method requires the opinions of 5 experts,
which significantly reduces the practical applicability of the method with direct expert participation. Using
a neural network instead of experts solves this problem. Several directions can also be formulated for a
more thorough study of the proposed approach, aimed at improving the results obtained, additionally confirming
their reliability, and implementing the approach in a version of the speech quality assessment complex used
in speech rehabilitation.</p>
      <p>1. Verification of the system using several trained neural networks that act as separate
auditors (in accordance with GOST R 50840-95, it is planned to use 5 neural networks).</p>
      <p>2. Use of the fraction of correctly recognized phonemes over a time interval, as well as the quantitative
outputs of the neural network, to obtain finer-grained values than the current level of a
correctly / incorrectly recognized syllable.</p>
      <p>3. Verification of the approach on the full extent of the available data from the rehabilitation of
real patients.</p>
      <p>Acknowledgements</p>
      <p>Supported by a grant from the Russian Science Foundation (project 16-15-00038).</p>
      <p>[Kap18] A. Kaprin, V. Starinskiy, G. Petrova. Status of cancer care for the population of Russia in 2016. P. A. Hertsen Moscow Oncology Research Center - branch of FSBI NMRRC of the Ministry of Health of
Russia, Moscow, 2018.</p>
      <p>[Kap17] A. Kaprin, V. Starinskiy, G. Petrova. Malignancies in Russia in 2014 (Morbidity and mortality). P. A. Hertsen Moscow Oncology Research Center - branch of FSBI NMRRC of the Ministry of Health of
Russia, Moscow, 2017.</p>
      <p>[Kor15] N. Korotkikh, N. Mitin, D. Mishin, E. Ponomarev. The Speech Rehabilitation of Patients After Surgical
Operations. Modern problems of science and education. 1(1), 2015.</p>
      <p>[GOST95] Standard GOST R 50840-95. Voice over paths of communication. Methods for assessing the quality,
legibility and recognition. Publishing Standards, Moscow, 1995.</p>
      <p>[Kos14] E. Kostyuchenko, R. Meshcheryakov, L. Balatskaya, E. Choinzonov. Structure and database of software
for speech quality and intelligibility assessment in the process of rehabilitation after surgery in the
treatment of cancers of the oral cavity and oropharynx, maxillofacial area. SPIIRAS Proceedings 1(32):
116-124, 2014.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [GOST97] Standard GOST R 51061-97.
          <article-title>Low bit rate speech transmission systems. Speech quality characteristics and their evaluation</article-title>
          .
          <source>Publishing Standards</source>
          , Moscow,
          <year>1997</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Kip16]
          <string-name>
            <given-names>I.</given-names>
            <surname>Kipyatkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karpov</surname>
          </string-name>
          ,
          <article-title>Variants of Deep Artificial Neural Networks for Speech Recognition Systems</article-title>
          .
          <source>SPIIRAS Proceedings</source>
          .
          <volume>6</volume>
          (
          <issue>49</issue>
          ):
          <fpage>80</fpage>
          -
          <lpage>103</lpage>
          ,
          <year>2016</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Kos17]
          <string-name>
            <given-names>E.</given-names>
            <surname>Kostyuchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Meshcheryakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ignatieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pyatkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Choynzonov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Balatskaya</surname>
          </string-name>
          .
          <article-title>Correlation normalization of syllables and comparative evaluation of pronunciation quality in speech rehabilitation</article-title>
          .
          <source>19th International Conference on Speech and Computer (SPECOM 2017)</source>
          , LNCS
          <volume>10458</volume>
          :
          <fpage>262</fpage>
          -
          <lpage>271</lpage>
          ,
          <year>2017</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Nik02]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolaev</surname>
          </string-name>
          <article-title>Mathematical models and a complex of programs for automatic evaluation of the quality of a speech signal</article-title>
          .
          <source>Cand. Tech. Sci. thesis: 05.13.18</source>
          , Ekaterinburg,
          <year>2002</year>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Kos16]
          <string-name>
            <given-names>E.</given-names>
            <surname>Kostyuchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ignatieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Meshcheryakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pyatkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Choynzonov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Balatskaya</surname>
          </string-name>
          .
          <article-title>Model of system quality assessment pronouncing phonemes</article-title>
          .
          <source>2016 Dynamics of Systems, Mechanisms and Machines</source>
          , Omsk,
          <year>2016</year>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Mat18]
          <source>MathWorks Homepage</source>
          , https://www.mathworks.com/. Last accessed 30 Apr 2018
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>