<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Analysis of the Intelligibility of Phonemes at Different Mid-Frequency Intervals</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anton Konev</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Evgeny Kostyuchenko</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Shelupanov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Evgeny Choynzonov</string-name>
          <email>choynzonov@tnimc.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrey Nikolenko</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cancer Research Institute of Tomsk National Research Medical Center of the Russian Academy of Sciences (Tomsk NRMC)</institution>
          ,
          <addr-line>5 Kooperativny Street, Tomsk, 634009</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Tomsk State University of Control Systems and Radioelectronics</institution>
          ,
          <addr-line>40 Lenina Prospect, Tomsk, 634050</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1800</year>
      </pub-date>
      <fpage>36</fpage>
      <lpage>42</lpage>
      <abstract>
        <p>The article analyzes the intelligibility of phonemes at various mid-frequency intervals. Six vowel phonemes were considered, and a table of their intelligibility was compiled and analyzed. Correctness was assessed by having 7 listeners audit the modified audio tracks. The results identify the parts of the spectrum that most influence the intelligibility of phonemes. These parts can be used to assess speech quality and phoneme intelligibility during the rehabilitation of patients after surgical treatment of the organs of the speech-forming tract.</p>
      </abstract>
      <kwd-group>
        <kwd>Speech intelligibility</kwd>
        <kwd>speech quality</kwd>
        <kwd>frequency ranges</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        To date, automatic speech recognition systems have achieved significant results, and investing in
and promoting commercial applications in this area is profitable. According to BCC Research, by 2021
the world market for speech recognition technologies will be worth an estimated $184.9 billion [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Continuous speech recognition [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and person identification by voice [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] are especially difficult
problems for an open set of speakers. The main disadvantages of existing software products are the
need for long-term training of the system and insufficient quality of work with spontaneous speech. It
is known that a recorded speech signal differs to one degree or another from the original one. This
difference is explained, first of all, by the presence of interference and distortions in the speech
signal recorded at the source.
      </p>
      <p>
        Another important area where speech recognition can be used is the assessment of its quality, in
particular, intelligibility. To assess the quality, standard approaches [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] can be used, provided that
they replace the auditor with a recognition system [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ].
      </p>
      <p>One of the important aspects of speech recognition is data preparation. Analysis methods can be
resource-intensive, so preparing a data set that is optimal in volume is an urgent task. In
terms of optimizing the size of the parameters, a potential reduction of the frequency ranges under
analysis is important. Filtering out areas that do not affect intelligibility can significantly improve
system performance or reduce the amount of resources the system consumes.</p>
      <p>
        It is also important that many existing methods for assessing intelligibility require the
participation of experts and, as a result, cannot claim complete objectivity. Such methods
approach objectivity by increasing the number of experts, for example, up to five people [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The question arises: are signal preprocessing and the formation of a data set for building
and evaluating a speech recognition and intelligibility system necessary, or can expert assessments
be used without problems in individual tasks (for example, assessing intelligibility during speech
rehabilitation, when the flow of patients is relatively small, 1-2 per day, and the number of records
does not exceed two hundred [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ])?
      </p>
      <p>Based on these considerations, preliminary data analysis with the aim of forming the most
informative features for further analysis using machine learning methods is an urgent task.</p>
      <p>In order to increase the capabilities of continuous speech systems, it is necessary to consider
sounds separately. In this work, the behavior of phonemes at various mid-frequency intervals is
considered and analyzed.</p>
      <p>
        The basic unit of the phonetic level of the language is the phoneme. The concept of a phoneme is
associated with the development of understanding of language as an integral system. Professor of
Kazan University I. A. Baudouin de Courtenay, who was the first to develop the concept of a
phoneme, emphasized that the allocation of a phoneme is possible only when the entire system of
phonemes of a given language is taken into account [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>The phoneme is the minimal meaningful unit of the language, which does not independently have
lexical or grammatical meaning, but serves to distinguish morphemes and words.</p>
      <p>The phoneme as an abstract unit of language corresponds to the sound of speech as a concrete unit.</p>
      <p>
        The spectrum of a speech sound can be decomposed into tonal (periodic) and noise (non-periodic)
components. Tonal sounds are formed with the participation of the vocal cords; noise sounds, by
obstacles in the oral cavity. By the presence of these components, a first classification of speech
sounds can be made:
- vowels: tone;
- voiceless consonants: noise;
- sonorous consonants: tone with a slight admixture of noise;
- voiced consonants: noise with tone participation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
      </p>
      <p>Differential features of phonemes are associated with differences in the acoustic features of sounds,
which, in turn, are associated with differences in their articulation, that is, in the work of the
speech organs. Voicedness is the presence in the sound not only of noise but also of the tone
created by the work of the vocal cords; softness is a higher pitch caused by a change in the
shape of the oral cavity as a result of additional articulation, the raising of the middle part of the
back of the tongue toward the hard palate.</p>
      <p>But for distinguishing phonemes, it is not the articulatory and acoustic aspects of these features
themselves that are important, but their opposition, their use to distinguish other linguistic units. The
vowel sounds [a], [o], [i] can be pronounced in Russian with different durations (compare the
lengthening of vowels in words expressing surprise, doubt, indignation, etc.: [ta:m?], [kn'iga?],
[vo:n!] and the like), but duration of pronunciation is not used in Russian to distinguish between
words and word forms, and therefore the differences between the sounds [o] and [o:], [i] and [i:] do
not develop into phonemic differences.</p>
      <p>Phonemes that differ in only one differential feature are called paired. Paired in Russian are the
phonemes [b] and [b'], [b] and [p], [d] and [d'], [d] and [t], and so on. In contrast, the phonemes [ts]
and [ch] are unpaired, since there is no phoneme that would differ from [ts] or from [ch] by just one
feature.</p>
      <p>
        Different languages have different types of syllables. Syllable types differ according to the
arrangement of the syllabic (G) and non-syllabic (S) elements. With respect to the end of a syllable,
one distinguishes open syllables, which end with the syllabic element (SG), and closed syllables,
which end with a non-syllabic element (GS). With respect to the beginning of a syllable, one
distinguishes covered syllables (whose first sound is non-syllabic) and naked syllables (whose first
sound is syllabic). In the word [o\kno], both syllables are open, but the first is naked and the second
is covered; in the word [go\rod], both syllables are covered, but the first is open and the second is closed [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>For the Russian language, open syllables, consisting of a consonant and a vowel, are more
characteristic; they make up more than half of all syllables found in speech. From this we can conclude
that the analysis of vowel phonemes is an important component of speech recognition and of assessing
its quality.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Description of the data used</title>
      <p>In this work, audio recordings were taken from a database of vowel phonemes recorded by a male
and a female speaker, and also from a database of audio recordings of syllables by a female and a male
speaker. Audio files in Russian were used for the analysis.</p>
      <p>Number of syllables (recordings) from a speaker: 50.</p>
      <p>Vowel phonemes: [a], [i], [o], [u], [ɨ], [e].</p>
      <p>Before use, audio files were converted to wav format, 16 bit, mono.</p>
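      <p>As an illustrative sketch only (the paper does not give its conversion code), the stereo-to-mono, 16-bit preparation step might look as follows in Python with NumPy; the function name and peak-normalization scaling are assumptions.</p>

```python
# Hypothetical sketch of the "wav format, 16 bit, mono" preparation step.
import numpy as np

def to_mono_16bit(samples: np.ndarray) -> np.ndarray:
    """Average stereo channels and rescale the peak to the int16 range."""
    if samples.ndim == 2:                      # stereo: average the two channels
        samples = samples.mean(axis=1)
    samples = samples.astype(np.float64)
    peak = np.max(np.abs(samples)) or 1.0      # avoid division by zero on silence
    return (samples / peak * 32767).astype(np.int16)
```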
      <p>The number of auditors is 7 people.</p>
      <p>Total number of assessments received from auditors: 350.</p>
      <p>
        Since one of the goals of this work is to demonstrate the problems of assessing intelligibility by
expert methods even on small amounts of data, forming a large set intended solely for the application
of automated analysis based on machine learning methods was not a goal. It was necessary to identify
and evaluate the problems associated precisely with the accuracy and objectivity of the intelligibility
assessments obtained when using standard expert assessments [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Research methods</title>
      <p>
        The Butterworth filter [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] was applied to the recordings to extract the frequency range of interest. All the data were then
assessed in the form of a questionnaire; for this, 7 listeners were selected who had not previously
heard these audio recordings, for a more accurate assessment. The results were compared with the
baseline data to assess intelligibility.
      </p>
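      <p>The band-limiting step can be sketched with SciPy's Butterworth design; the filter order and sample rate below are assumptions, since the paper does not specify them.</p>

```python
# Sketch: Butterworth band-pass used to isolate one frequency sub-range.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(signal: np.ndarray, low_hz: float, high_hz: float,
             fs: int = 16000, order: int = 5) -> np.ndarray:
    """Zero-phase Butterworth band-pass keeping only [low_hz, high_hz]."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)            # filtfilt avoids phase distortion
```

      <p>For example, <monospace>bandpass(x, 400, 600)</monospace> would keep only the 400-600 Hz sub-range used in the listening tests.</p>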
      <p>The total frequency range of 200-3000 Hz was divided into the sub-ranges presented in
tables 1-3.</p>
      <p>
        This division follows the distribution of the resonant frequencies of the speech-forming
tract (formants): the first formant (table 1) and the second formant of various phonemes (tables 2
and 3) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The last frequency range was not subdivided because, even when it was used in full,
intelligibility turned out to equal 0.
      </p>
      <p>For each of the sub-ranges, intelligibility was assessed as the proportion of correctly identified
phonemes, averaged over all speakers.</p>
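      <p>A minimal sketch of this score, with made-up auditor responses: the intelligibility of one filtered clip is the share of auditors who identified the target phoneme.</p>

```python
# Hypothetical helper: proportion of correct phoneme identifications.
def intelligibility(responses, target):
    """responses: phoneme labels given by the auditors for one filtered clip."""
    return sum(r == target for r in responses) / len(responses)
```

      <p>For example, if 6 of the 7 auditors hear [a], the score is 6/7.</p>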
      <p>Then, on the basis of the obtained intelligibility values, the parts of the spectrum of vowel
phonemes that most influence the intelligibility of vowels were found.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Description of the experiment</title>
      <p>Let us consider the order of research using the phoneme [a] as an example.
1. Select the phonemes of interest.
2. Filter to extract the frequency range of interest.
3. Save the resulting file.
4. Submit the file to the experts for assessing intelligibility.</p>
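      <p>The steps above might be chained as follows; file paths, filter order, and the concrete band edges are illustrative assumptions, not taken from the paper.</p>

```python
# Hypothetical end-to-end sketch: load a phoneme recording, band-limit it,
# and save the result for listening by the auditors.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

def filter_and_save(in_path: str, out_path: str,
                    low_hz: float, high_hz: float, order: int = 5) -> None:
    fs, samples = wavfile.read(in_path)                       # steps 1-2: load phoneme
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, samples.astype(np.float64))   # step 2: band-limit
    wavfile.write(out_path, fs, filtered.astype(np.int16))    # step 3: save
    # step 4: the saved file is then submitted to the experts
```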
      <p>Signal spectra for the ranges 200-600 Hz, 200-400 Hz, 300-500 Hz and 400-600 Hz are shown in
Figures 1 a-d.</p>
      <p>The intelligibility assessments for the given fragments by the first auditor were 1, 1, 0, 0.</p>
      <p>Similar assessments were carried out for all auditors, phonemes and spectrum regions. The final
results for assessing intelligibility for isolated phonemes are presented in Tables 4-6.</p>
      <p>Similar characteristics for male and female voices are presented in Table 7; the total is averaged
over all vowels.</p>
      <p>Once the intelligibility scores have been obtained, you can analyze and compare them.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Analysis of results</title>
      <p>It can be noted that in the interval from 1800 to 3000 Hz, intelligibility is not preserved for any
vowel phoneme. The phonemes [a] and [i] can be singled out as exceptions: their ranges are close to
the high frequencies, but to maintain intelligibility a wide band is required at intervals of 400-800 Hz.
It can also be seen that the intelligibility of the male voice in this frequency range is lower than that
of the female voice, which is most likely due to the lower value of the pitch frequency.</p>
      <p>It can also be seen that some of the listeners noted the intelligibility of the phoneme [a] both in
the range from 400 to 600 Hz and in the band from 600 to 800 Hz; therefore, the frequency range from
500 to 700 Hz should also be examined. Additional analysis of this range confirmed intelligibility
within its limits equal to 1.</p>
      <p>The phoneme [a] is partially intelligible at 1000-1800 Hz and 1000-1400 Hz, which means that for
intelligibility it may be necessary to partially expand the range toward 200-1000 Hz. With the
extended range, it turned out that intelligibility was preserved in the range 800-1400 Hz, while the
significance of the 400-600 Hz and 600-800 Hz ranges cannot be denied.</p>
      <p>For the phoneme [i] in the frequency range from 700 to 900 Hz, 14% of listeners noted that
intelligibility was preserved. Since intelligibility is preserved in the range from 800 to 1000 Hz, the
two ranges partially coincide, but the main band lies higher, and this assessment is related to
individual differences between listeners.</p>
      <p>Next, consider the phoneme [i]: as with the phoneme [a], we shift the frequency range and
evaluate the results. Intelligibility appeared in the 1400 to 2200 Hz range; as the range narrows, the
intelligibility begins to fade, but the significance of these ranges cannot be denied.</p>
      <p>For the phoneme [o], the range was extended in the same way as for the phoneme [a]. It can be
concluded that a wider range is required for intelligibility and that the scores reflect individual
listeners, but the significance of this range cannot be denied.</p>
      <p>For the phoneme [u] in the range 300-500 Hz, 14% of the listeners noted intelligibility; this is due
to individual differences between listeners, and the significance of this range cannot be denied.</p>
      <p>Analyzing the results for syllables, a clear difference can be distinguished: a wider band is
required to preserve audibility, and audibility is preserved not for a single phoneme but for the
syllable as a whole. A notable feature is that the intelligibility of syllables stands out in the
aggregate: during listening, individual dropped syllables did not stand out, so intelligibility, or its
absence, applies to the entire audio file as a whole.</p>
      <p>An intelligibility table was compiled for all syllables over the mid-frequency ranges. These data
can then be used to develop a module for assessing the quality and intelligibility of speech.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>
        From the study it can be concluded that the frequency ranges corresponding to the most
informative set of features can be matched to the formant frequencies of the sounds and to the
pitch frequency of the speaker. This dependence is indirectly affected by the gender of the
speaker through its influence on the pitch frequency [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The data obtained can be used to
identify the most informative areas of the phoneme spectrum when solving speech recognition
problems and assessing the quality of phoneme pronunciation.
      </p>
      <p>
        In addition, the studies carried out clearly confirmed that the use of existing expert methods
for assessing intelligibility can introduce significant contradictions due to differences in the
perception of messages by five (the recommended number [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) experts. These results clearly
substantiate the need to form a dataset for creating a machine-learning-based system for assessing
syllabic, verbal and phrasal speech intelligibility, in particular for assessing the quality of
speech in speech rehabilitation problems using machine learning for recognition [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgements</title>
      <p>This research was funded by a grant from the Russian Science Foundation (project 16-15-00038).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>Understand me. "Promobot" engaged in speech recognition technologies</article-title>
          ,
          <year>2019</year>
          . URL: https://www.kommersant.ru/doc/3960716
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Kipyatkova</surname>
            <given-names>I.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karpov</surname>
            <given-names>A.A.</given-names>
          </string-name>
          <article-title>Analytical review of Russian speech recognition systems with a large dictionary</article-title>
          .
          <source>Trudy SPIIRAN - SPIIRAS Proceedings</source>
          ,
          <year>2010</year>
          , vol.
          <volume>12</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>7</fpage>
          -
          <lpage>20</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Rakhmanenko</surname>
            <given-names>I.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shelupanov</surname>
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kostyuchenko</surname>
            <given-names>E.Y.</given-names>
          </string-name>
          <article-title>Automatic text-independent speaker verification using convolutional deep belief network</article-title>
          .
          <source>Computer Optics</source>
          .
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          Standard GOST 50840-95.
          <article-title>Voice over paths of communication (1995). Methods for Assessing the Quality, Legibility and Recognition</article-title>
          .
          <source>Publishing Standards</source>
          , Moscow, January 01,
          <year>1997</year>
          , p.
          <fpage>234</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Kostyuchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Novokhrestova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tirskaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nemirovich-Danchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Choynzonov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Balatskaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelupanov</surname>
          </string-name>
          <article-title>The evaluation process automation of phrase and word intelligibility using speech recognition systems</article-title>
          .
          <source>Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          , Volume
          <volume>11658</volume>
          ,
          LNAI
          ,
          <year>2019</year>
          , pp.
          <fpage>237</fpage>
          -
          <lpage>246</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.S.</given-names>
            <surname>Kipyatkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.A.</given-names>
            <surname>Karpov</surname>
          </string-name>
          <article-title>Variants of Deep Artificial Neural Networks for Speech Recognition Systems</article-title>
          .
          <source>Trudy SPIIRAN - SPIIRAS Proceedings</source>
          ,
          <year>2016</year>
          , vol.
          <volume>6</volume>
          , no.
          <issue>49</issue>
          , pp.
          <fpage>80</fpage>
          -
          <lpage>103</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.A.</given-names>
            <surname>Baudouin de Courtenay</surname>
          </string-name>
          <article-title>Experience of phonetic alternations</article-title>
          ,
          <year>1895</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.V.</given-names>
            <surname>Bondarko</surname>
          </string-name>
          <article-title>Phonetic system of the modern Russian language</article-title>
          . Moscow: Prosveshcheniye,
          <year>1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.V.</given-names>
            <surname>Zlatoustova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.K.</given-names>
            <surname>Potapova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.V.</given-names>
            <surname>Potapov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.N.</given-names>
            <surname>Trunin-Donskoi</surname>
          </string-name>
          <article-title>General and applied phonetics</article-title>
          . Moscow:
          <source>Izdatel'stvo Moskovskogo universiteta</source>
          ,
          <year>1997</year>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.R.</given-names>
            <surname>Rabiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.W.</given-names>
            <surname>Schafer</surname>
          </string-name>
          .
          <source>Digital Processing of Speech Signals</source>
          . - Paramus, NJ: Prentice-Hall,
          <year>1978</year>
          . - ISBN 0-13-213603-1.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.V.</given-names>
            <surname>Bondarko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.A.</given-names>
            <surname>Verbitskaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.V.</given-names>
            <surname>Gordina</surname>
          </string-name>
          <article-title>Fundamentals of general phonetics</article-title>
          . - 4th ed.,
          <source>St. Petersburg: Academy</source>
          ,
          <year>2004</year>
          , 160 p.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.A.</given-names>
            <surname>Salah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karpov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Frolova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Grigorev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lyakso</surname>
          </string-name>
          <article-title>Emotion, age, and gender classification in children's speech by humans and machines</article-title>
          .
          <source>Computer Speech and Language</source>
          , Volume
          <volume>46</volume>
          , November
          <year>2017</year>
          , Pages
          <fpage>268</fpage>
          -
          <lpage>283</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>