Analysis of the Intelligibility of Phonemes at Different Mid-frequency Intervals

Anton Konev a, Evgeny Kostyuchenko a, Alexander Shelupanov a, Evgeny Choynzonov a,b and Andrey Nikolenko a

a Tomsk State University of Control Systems and Radioelectronics, 40 Lenina Prospect, Tomsk, 634050, Russia
b Cancer Research Institute of Tomsk National Research Medical Center of the Russian Academy of Sciences (Tomsk NRMC), 5 Kooperativny Street, Tomsk, 634009, Russia

Abstract
The article analyzes the intelligibility of phonemes at various mid-frequency intervals. Six vowel phonemes were considered, and a table of their intelligibility was formed and analyzed. The correctness of the work was assessed by having 7 persons listen to the modified sound tracks. The results obtained make it possible to identify the parts of the spectrum that most influence the intelligibility of phonemes. These parts can be used to assess speech quality and phoneme intelligibility during the rehabilitation of patients after surgical treatment of the organs of the speech-forming tract.

Keywords
Speech intelligibility, speech quality, frequency ranges

1. Introduction

To date, automatic speech recognition systems have achieved significant results, and investing in and promoting commercial applications in this area is beneficial. According to BCC Research, by 2021 the world market for speech recognition technologies will be estimated at $184.9 billion [1]. Continuous speech recognition [2] and person identification by voice [3] are especially difficult problems for an open set of speakers. The main disadvantages of existing software products are the need for long-term training of the system and insufficient quality of work with spontaneous speech. It is known that the recorded speech signal differs to one degree or another from the original one.
This difference is explained, first of all, by the presence of interference and distortions in the speech signal recorded at the source. Another important area where speech recognition can be used is the assessment of speech quality, in particular intelligibility. To assess the quality, standard approaches [4] can be used, provided that the auditor is replaced with a recognition system [5, 6]. One of the important aspects of speech recognition is data preparation. Analysis methods can be resource-intensive; therefore, preparing a data set that is optimal in terms of volume is an urgent task. With respect to optimizing the size of the parameter set, a potential reduction of the frequency analysis ranges is important. Filtering out areas that do not affect intelligibility can have a significant impact on system performance or reduce the amount of resources it consumes. It is also important that the application of many existing methods for assessing intelligibility requires the participation of experts and, as a result, cannot claim to be completely objective. However, such methods approach objectivity by increasing the number of experts, for example, up to five people [4]. The question arises: is there a need for signal preprocessing, the formation of a data

YRID-2020: International Workshop on Data Mining and Knowledge Engineering, October 15-16, 2020, Stavropol, Russia
EMAIL: kaa1@fb.tusur.ru (Anton Konev); key@fb.tusur.ru (Evgeny Kostyuchenko); saa@tusur.ru (Alexander Shelupanov); choynzonov@tnimc.ru (Evgeny Choinzonov); office@fb.tusur.ru (Andrey Nikolenko)
ORCID: 0000-0002-3222-9956 (Anton Konev); 0000-0001-8000-2716 (Evgeny Kostyuchenko); 0000-0003-2393-6701 (Alexander Shelupanov); 0000-0002-9084-683X (Evgeny Choinzonov); 0000-0002-7580-5853 (Andrey Nikolenko)
2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

set for building and evaluating a speech recognition and intelligibility system, or is it possible, in individual tasks (for example, those related to assessing intelligibility during speech rehabilitation, when the flow of patients is relatively small, 1-2 per day, and the number of records does not exceed two hundred [4]), to use expert assessments without problems? Based on these considerations, preliminary data analysis with the aim of forming the most informative features for further analysis using machine learning methods is an urgent task.

In order to increase the capabilities of continuous speech systems, it is necessary to consider sounds separately. In this work, the behavior of phonemes at various mid-frequency intervals is considered and analyzed.

The basic unit of the phonetic level of a language is the phoneme. The concept of a phoneme is associated with the development of the understanding of language as an integral system. Professor of Kazan University I. A. Baudouin de Courtenay, who was the first to develop the concept of a phoneme, emphasized that the identification of a phoneme is possible only when the entire system of phonemes of a given language is taken into account [7]. The phoneme is the minimal meaningful unit of the language; it does not independently carry lexical or grammatical meaning, but serves to distinguish morphemes and words. The phoneme, as an abstract unit of language, corresponds to the sound of speech as a concrete unit. The spectrum of a speech sound can be decomposed into tone (periodic) and noise (non-periodic) components. Tone sounds are formed with the participation of the vocal cords; noise sounds, by obstacles in the oral cavity.
By the presence of these components, a first classification of speech sounds can be made:
• Vowels – tone
• Voiceless consonants – noise
• Sonorous consonants – tone with a slight admixture of noise
• Voiced consonants – noise with tone participation [8]

Differential features of phonemes are associated with differences in the acoustic features of sounds, which, in turn, are associated with differences in their articulation, that is, with differences in the work of the speech organs. Voicedness is the presence in the sound of not only noise but also the tone created by the work of the vocal cords; softness is a higher pitch of sound caused by a change in the shape of the oral cavity as a result of additional articulation: the raising of the middle part of the back of the tongue towards the hard palate. But for distinguishing phonemes, it is not the articulatory and acoustic aspects of these features themselves that are important, but their opposition, their use to distinguish other linguistic units. The vowel sounds [a], [o], [i] can be pronounced in Russian with different durations (compare the lengthening of vowels in words expressing surprise, doubt, indignation, etc.: [ta:m?], [kn'iga?], [vo:n!]), but the duration of pronunciation in Russian is not used to distinguish between words and word forms; therefore, the differences between the sounds [o] and [o:], [i] and [i:] do not develop into phonemic differences. Phonemes that differ in only one differential feature are called paired. Paired in Russian are the phonemes [b] and [b'], [b] and [p], [d] and [d'], [d] and [t], and so on. By contrast, the phonemes [ts] and [ch] are unpaired, since there is no phoneme that would differ from [ts] or from [ch] in just one feature. Different languages have different types of syllables. The types of syllables differ according to the arrangement of the syllabic (G) and non-syllabic (S) elements.
With respect to the end of a syllable, syllables are distinguished as open (the syllable ends with a syllabic element, SG) and closed (the syllable ends with a non-syllabic element, GS). With respect to the beginning of a syllable, covered (the first sound is non-syllabic) and uncovered (the first sound is syllabic) syllables are distinguished. In the word [o\kno], both syllables are open, but the first is uncovered and the second is covered; in the word [go\rod], both syllables are covered, but the first is open and the second is closed [9]. For the Russian language, open syllables, consisting of a consonant and a vowel, are more characteristic; they make up more than half of all syllables found in speech. Based on this, we can conclude that the analysis of vowel phonemes is an important component of speech recognition and of the assessment of its quality.

2. Description of the data used

In this work, audio recordings of vowel phonemes are taken from a base of vowel phonemes of a male and a female speaker, and recordings of syllables are taken from a database of audio recordings of syllables of a female and a male speaker. Audio files in Russian were used for the analysis. Number of syllables (recordings) from a speaker: 50. Vowel phonemes: [a], [i], [o], [u], [ɨ], [e]. Before use, audio files were converted to WAV format, 16 bit, mono. The number of auditors is 7 people. Total number of assessments received from auditors: 350. Since one of the goals of the work is to demonstrate the problems of assessing intelligibility by expert methods even on small amounts of data, it was not a goal to form a large set intended solely for the application of automated analysis methods based on machine learning. It was necessary to identify and evaluate the problems associated precisely with the accuracy and objectivity of the obtained intelligibility assessments when using standard expert assessments [4].
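As a minimal illustration of the preprocessing format described above (WAV, 16 bit, mono), the sketch below writes and verifies such a file using Python's standard wave module. The paper does not specify the conversion tool; the sampling rate, file name, and synthetic test tone here are illustrative assumptions.

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # Hz; assumed, the paper does not state the rate

def write_mono_wav(path, samples, rate=SAMPLE_RATE):
    """Write float samples in [-1, 1] as a 16-bit mono WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16 bit = 2 bytes per sample
        wf.setframerate(rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wf.writeframes(frames)

# Synthetic 440 Hz tone standing in for a recorded phoneme (1 second).
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)]
write_mono_wav("phoneme_a.wav", tone)

# Verify the format required in the paper: 16 bit, mono.
with wave.open("phoneme_a.wav", "rb") as wf:
    assert wf.getnchannels() == 1
    assert wf.getsampwidth() == 2
    assert wf.getframerate() == SAMPLE_RATE
```

Any tool producing the same parameters (e.g. a batch converter) would serve equally well; only the resulting format matters for the subsequent filtering.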
3. Research methods

The Butterworth filter [10] was applied to the recordings to isolate the frequency range of interest. Then all the data were assessed in the form of a questionnaire; for this, 7 persons who had not previously listened to these audio recordings were selected, for a more accurate assessment. The results were compared with the baseline data to assess intelligibility. The total frequency range of 200-3000 Hz was divided into the sub-ranges presented in Tables 1-3.

Table 1
Splitting the range 200-1000 Hz
200-600 Hz    600-1000 Hz
200-400 Hz    600-800 Hz
300-500 Hz    700-900 Hz
400-600 Hz    800-1000 Hz

Table 2
Splitting the range 1000-1800 Hz
1000-1400 Hz    1400-1800 Hz
1000-1200 Hz    1400-1600 Hz
1100-1300 Hz    1500-1700 Hz
1200-1400 Hz    1600-1800 Hz

Table 3
Splitting the range 1800-3000 Hz
1800-2200 Hz    2200-2600 Hz    2600-3000 Hz
1800-2000 Hz    2200-2400 Hz
1900-2100 Hz    2300-2500 Hz
2000-2200 Hz    2400-2600 Hz

This division is associated with the distribution of the resonant (formant) frequencies of the speech-forming tract: the first formant (Table 1) and the second formant of various phonemes (Tables 2 and 3) [11]. The last frequency range was not subdivided further because even when it was used in full, the intelligibility turned out to be equal to 0.

For each of the sub-ranges, intelligibility was assessed as the proportion of correctly identified phonemes, averaged over all speakers. Then, on the basis of the obtained intelligibility values, the parts of the spectrum of vowel phonemes that most influence the intelligibility of vowels were found.

4. Description of the experiment

Let us consider the order of the research using the example of the phoneme [a]:
1. Select the phonemes of interest.
2. Carry out filtering to select the frequency range of interest.
3. Save the resulting file.
4. Submit the file to the experts for assessing intelligibility.

Signal spectra for the ranges 200-600 Hz, 200-400 Hz, 300-500 Hz and 400-600 Hz are shown in Figures 1 a-d. The results of the intelligibility assessment for these fragments by the first auditor were: 1, 1, 0, 0. Similar assessments were carried out for all auditors, phonemes and spectrum regions. The final intelligibility results for isolated phonemes are presented in Tables 4-6.

Figure 1: Cutting out spectra for signal synthesis in various frequency ranges (200-600 Hz, 200-400 Hz, 300-500 Hz, 400-600 Hz)

Table 4
Intelligibility for the range 200-1000 Hz
              [a]    [i]    [o]    [u]    [ɨ]    [e]
200-1000 Hz   1      1      1      1      1      1
200-600 Hz    1      1      0.148  1      1      1
600-1000 Hz   1      1      1      1      1      1
200-400 Hz    0      0      0      0      1      1
300-500 Hz    0      0      0      0.149  0      0
400-600 Hz    0.148  0      0.148  1      0      0
600-800 Hz    0.444  0      0.147  0      0      0
700-900 Hz    0      0.147  0      0      0      0
800-1000 Hz   0      1      0      0      0      0

Table 5
Intelligibility for the range 1000-1800 Hz
              [a]    [i]    [o]    [u]    [ɨ]    [e]
1000-1800 Hz  0.297  1      0      0      0      0
1000-1400 Hz  0.296  1      0      0      0      0
1400-1800 Hz  0      0.444  0      0      0      0
1000-1200 Hz  0      1      0      0      0      0
1100-1300 Hz  0      0      0      0      0      0
1200-1400 Hz  0      0      0      0      0      0
1400-1600 Hz  0      0      0      0      0      0
1500-1700 Hz  0      0      0      0      0      0
1600-1800 Hz  0      0      0      0      0      0

Table 6
Intelligibility for the range 1800-3000 Hz
              [a]    [i]    [o]    [u]    [ɨ]    [e]
1800-2600 Hz  0      1      0      0      0      0
1800-2200 Hz  0      0.149  0      0      0      0
2200-2600 Hz  0      0      0      0      0      0
2600-3000 Hz  0      0      0      0      0      0
1800-2000 Hz  0      0      0      0      0      0
1900-2100 Hz  0      0      0      0      0      0
2000-2200 Hz  0      0      0      0      0      0
2200-2400 Hz  0      0      0      0      0      0
2300-2500 Hz  0      0      0      0      0      0
2400-2600 Hz  0      0      0      0      0      0

Similar characteristics for the male and female voices are presented in Table 7; the total is averaged over all vowels.
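The filtering step in the procedure above can be sketched with a Butterworth band-pass design. The paper does not state the sampling rate, filter order, or implementation; the values below, and the use of SciPy with a synthetic test signal, are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 16000  # sampling rate, Hz (assumed; the paper does not state it)

def bandpass(signal, low_hz, high_hz, fs=FS, order=4):
    """Butterworth band-pass filter; the order is illustrative."""
    b, a = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs)
    return filtfilt(b, a, signal)  # zero-phase filtering

def band_power(sig, lo, hi, fs=FS):
    """Spectral power of sig inside the band [lo, hi] Hz."""
    spec = np.abs(np.fft.rfft(sig)) ** 2
    freqs = np.fft.rfftfreq(len(sig), 1 / fs)
    return spec[(freqs >= lo) & (freqs <= hi)].sum()

# Synthetic test signal with components inside (400 Hz) and outside
# (1500 Hz) the 200-600 Hz band used in the phoneme [a] examples.
t = np.arange(FS) / FS
x = np.sin(2 * np.pi * 400 * t) + np.sin(2 * np.pi * 1500 * t)
y = bandpass(x, 200, 600)

# The in-band component survives; the out-of-band one is attenuated.
assert band_power(y, 300, 500) > 0.5 * band_power(x, 300, 500)
assert band_power(y, 1400, 1600) < 0.01 * band_power(y, 300, 500)
```

Applied to each recording with the band limits from Tables 1-3, this yields the filtered fragments submitted to the auditors.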
Table 7
Intelligibility for the male (m) and female (w) voice, averaged over all vowels
              m    w
200-1000 Hz   1    1
200-600 Hz    1    1
600-1000 Hz   1    1
200-400 Hz    0    0
300-500 Hz    0    0
400-600 Hz    0    0
600-800 Hz    0    0
700-900 Hz    0    0
800-1000 Hz   0    0
1000-1800 Hz  1    1
1000-1400 Hz  1    1
1400-1800 Hz  1    1
1000-1200 Hz  0    0
1100-1300 Hz  0    0
1200-1400 Hz  0    0
1400-1600 Hz  0    0
1500-1700 Hz  0    0
1600-1800 Hz  0    0
1800-2600 Hz  1    1
1800-2200 Hz  0    1
2200-2600 Hz  0    0
2600-3000 Hz  0    0
1800-2000 Hz  0    0
1900-2100 Hz  0    0
2000-2200 Hz  0    0
2200-2400 Hz  0    0
2300-2500 Hz  0    0
2400-2600 Hz  0    0

Once the intelligibility scores have been obtained, they can be analyzed and compared.

5. Analysis of results

It can be noted that in the interval from 1800 to 3000 Hz, intelligibility is not preserved for all vowel phonemes. The phonemes [a] and [i] can be distinguished as exceptions: their ranges extend towards the high frequencies, but to maintain intelligibility a wide band is required in the 400-800 Hz interval. It can also be seen that the intelligibility of the male voice is lower across the frequency range than that of the female voice, which is most likely due to the lower value of the pitch frequency. Furthermore, some of the listeners noted intelligibility of the phoneme [a] in the range from 400 to 600 Hz and in the band from 600 to 800 Hz; therefore, the frequency range from 500 to 700 Hz should also be examined. Additional analysis of this range confirmed intelligibility within it equal to 1. The phoneme [a] is partially intelligible at 1000-1800 Hz and 1000-1400 Hz, which means that for intelligibility it may be necessary to partially extend the range towards 200-1000 Hz. As a result, with the extended range, it turned out that intelligibility was preserved in the range 800-1400 Hz, while the significance of the 400-600 Hz and 600-800 Hz ranges cannot be denied.
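The intelligibility values discussed here and reported in Tables 4-7 are proportions of correct identifications over the 7 auditors (additionally averaged over speakers in the paper). A minimal sketch of that computation, with hypothetical auditor responses for two filtered fragments:

```python
# Hypothetical auditor responses: which phoneme each of the 7 auditors
# reported hearing for a given filtered fragment (data are illustrative,
# not taken from the paper's experiments).
responses = {
    ("a", "400-600 Hz"): ["a", "o", "o", "o", "o", "o", "o"],
    ("a", "200-600 Hz"): ["a", "a", "a", "a", "a", "a", "a"],
}

def intelligibility(target, answers):
    """Proportion of auditors who correctly identified the target phoneme."""
    return sum(1 for ans in answers if ans == target) / len(answers)

scores = {key: intelligibility(phoneme, answers)
          for (phoneme, band), answers in responses.items()
          for key in [(phoneme, band)]}
# One correct answer out of 7 gives about 0.14, matching the "14% of
# listeners" wording used in the analysis below.
```

With scores like these per phoneme and band, the tables above follow directly.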
For the phoneme [i], in the frequency range from 700 to 900 Hz, 14% of the listeners noted that intelligibility was preserved. Since intelligibility is fully preserved in the range from 800 to 1000 Hz, it can be seen that the two ranges partially overlap but the main band lies higher, and this assessment is related to the individual differences of the listener. Next, for the phoneme [i], as with the phoneme [a], we shift the frequency range and evaluate the results. Intelligibility appeared in the 1400 to 2200 Hz range, but as the range narrows, the intelligibility begins to fade; nevertheless, the significance of these ranges cannot be denied. For the phoneme [o], the range was extended in the same way as for the phoneme [a]. It can be concluded that a wider range is required for intelligibility and that the data are due to the individual listener, but the significance of this range cannot be denied. For the phoneme [u], in the range 300-500 Hz, 14% of the listeners noted intelligibility; this is due to the individual differences of the listener, and the significance of this range cannot be denied.

After analyzing the results for syllables, a clear difference can be identified: a wider band is required to preserve intelligibility, and intelligibility is preserved not for a single phoneme but for the syllable as a whole. Notably, the intelligibility of the syllables stands out in the aggregate: during the listening, no individual syllables dropped out, so intelligibility (or its absence) holds for the entire audio file as a whole. An intelligibility table was compiled for all syllables at mid-frequencies. These data can then be used to develop a module for assessing the quality and intelligibility of speech.

6. Conclusion

From the study it can be concluded that the frequency ranges corresponding to the most informative set of features can be matched to the formant frequencies of sounds and to the pitch frequency of the speaker.
Indirectly, this dependence is related to the gender of the speaker through its influence on the pitch frequency [12]. The data obtained can be used to identify the most informative areas of the phoneme spectrum when solving speech recognition problems and assessing the quality of phoneme pronunciation. In addition, the studies carried out have clearly confirmed that the use of existing expert methods for assessing intelligibility can introduce significant contradictions due to differences in the perception of messages by five (the recommended number [4]) experts. These results clearly substantiate the need to form a dataset for creating a machine-learning-based system for assessing syllabic, verbal and phrasal speech intelligibility, in particular when assessing speech quality in speech rehabilitation problems using machine learning for recognition [5].

7. Acknowledgements

This research was funded by a grant from the Russian Science Foundation (project 16-15-00038).

8. References

[1] Understand me. "Promobot" engaged in speech recognition technologies, 2019. URL: https://www.kommersant.ru/doc/3960716
[2] I.S. Kipyatkova, A.A. Karpov, Analytical review of Russian speech recognition systems with a large dictionary, Trudy SPIIRAN – SPIIRAS Proceedings, 2010, vol. 12, no. 1, pp. 7-20.
[3] I.A. Rakhmanenko, A.A. Shelupanov, E.Y. Kostyuchenko, Automatic text-independent speaker verification using convolutional deep belief network, Computer Optics, 2020.
[4] Standard GOST 50840-95. Voice over paths of communication: Methods for assessing the quality, legibility and recognition. Publishing Standards, Moscow, January 01, 1997, p. 234.
[5] E. Kostyuchenko, D. Novokhrestova, M. Tirskaya, M. Nemirovich-Danchenko, E. Choynzonov, L. Balatskaya, A. Shelupanov, The evaluation process automation of phrase and word intelligibility using speech recognition systems,
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11658 LNAI, 2019, pp. 237-246.
[6] I.S. Kipyatkova, A.A. Karpov, Variants of deep artificial neural networks for speech recognition systems, Trudy SPIIRAN – SPIIRAS Proceedings, 2016, vol. 6, no. 49, pp. 80-103.
[7] I.A. Baudouin de Courtenay, Experience of phonetic alternations, 1895.
[8] L.V. Bondarko, Phonetic system of the modern Russian language, Moscow: Prosveshcheniye, 1977.
[9] L.V. Zlatoustova, R.K. Potapova, V.V. Potapov, V.N. Trunin-Donskoi, General and applied phonetics, Moscow: Izdatel'stvo Moskovskogo universiteta, 1997.
[10] L.R. Rabiner, R.W. Schafer, Digital Processing of Speech Signals, Paramus, NJ: Prentice-Hall, 1978. ISBN 0-13-213603-1.
[11] L.V. Bondarko, L.A. Verbitskaya, M.V. Gordina, Fundamentals of general phonetics, 4th ed., St. Petersburg: Academy, 2004, 160 p.
[12] H. Kaya, A.A. Salah, A. Karpov, O. Frolova, A. Grigorev, E. Lyakso, Emotion, age, and gender classification in children's speech by humans and machines, Computer Speech and Language, vol. 46, November 2017, pp. 268-283.