Automatic Speech Recognition System with Dynamic Time Warping and Mel-Frequency Cepstral Coefficients

Kateryna Yalova, Mykhailo Babenko and Kseniia Yashyna

Dniprovsky State Technical University, Dniprobydivska str. 2, Kamyanske, 51918, Ukraine

COLINS-2023: 7th International Conference on Computational Linguistics and Intelligent Systems, April 20–21, 2023, Lviv, Ukraine
EMAIL: yalovakateryna@gmail.com (K. Yalova); mvbab130973@gmail.com (M. Babenko); yashinaksenia85@gmail.com (K. Yashyna)
ORCID: 0000-0002-2687-5863 (K. Yalova); 0000-0003-1013-9383 (M. Babenko); 0000-0002-8817-8609 (K. Yashyna)

Abstract
The approach to speech recognition presented in this paper is used to create a system for automatic recognition of user commands for a graphical editor. The automatic speech recognition system serves as the recognition module of a plug-in for the graphical editor. The proposed system has a limited dictionary size, is speaker-dependent, and recognizes isolated speech uttered by the user in the form of short voice commands. The command list contains 20 commands in Ukrainian, whose names correspond to the names of the pictograms in the graphical editor. The user's voice command, transmitted through a microphone, serves as the input; it is processed, recognized and converted into a command for the graphical editor. The system processes user commands in real time, and isolated words are used as commands. The stages of voice command recognition are as follows: analysis of the analog signal, its transformation into a digital signal, formation of a filter bank, and comparison of the processed command with a template. To analyze the sound wave, the Fourier transform is used. The Hamming window function is applied to reduce spectrum blurring. For feature extraction from voice commands, the Mel-frequency cepstral coefficients algorithm is used. Matching voice commands against templates is carried out with the dynamic time warping method. The use of Mel-frequency cepstral coefficients and dynamic time warping is justified by the fact that the vocabulary is limited and the commands are short. The accuracy of command recognition was evaluated for various speakers; the average recognition accuracy is 93%.

Keywords
Automatic speech recognition, dynamic time warping (DTW), Mel-frequency cepstral coefficients (MFCC)

1. Introduction

Speaking is the most natural form of human communication, and therefore the implementation of an interface based on the analysis of speech information is a promising direction for the development of intelligent control systems. One of the current unsolved problems in information and measurement systems is the construction of systems for automatic recognition of speech signals that are invariant to the speaker [1]. Its solution would make it possible to expand the range of users of such systems and significantly increase the efficiency of information exchange in man-machine systems.
Speech analysis includes a wide range of tasks. Traditionally, they are divided into three subclasses: identification, classification, and diagnostic tasks. Identification tasks include the verification and identification of speakers. Classification tasks include keyword recognition, continuous speech recognition, and semantic speech analysis. Diagnostic tasks include determining the psychophysical state of the speaker.
Speech interfaces can be used in systems for various purposes: voice control for people with disabilities, answering machines, automatic call processing, and “smart home” commands.
However, despite rapidly increasing computing power, the creation of speech recognition systems remains an extremely difficult problem. This is due both to its interdisciplinary nature (knowledge of linguistics, digital signal processing, acoustics, pattern recognition, etc. is required) and to the high computational complexity of the developed algorithms [2]. The latter imposes significant limitations on automatic speech recognition systems: on the volume of the processed dictionary, the speed of obtaining an answer, and its accuracy. Optimizing the quality and level of recognition in automatic speech recognition systems is a relevant scientific and practical task, the solution of which improves the quality of SILK (speech, image, language, knowledge) interfaces of software systems and applications.
The purpose of the article is to present the results of applying the dynamic time warping (DTW) method, the fast Fourier transform (FFT) and Mel-frequency cepstral coefficients (MFCC) to the problem of developing a system for automatic speech signal recognition. In the course of achieving this goal, the authors carried out the following tasks:
- the peculiarities of applying the DTW algorithm to speech signal recognition were analyzed; the fast Fourier transform was applied to the analysis of the input signal, and MFCCs were used to construct the input feature vector;
- a model of an automatic speech recognition system was designed with the following characteristics: the system is command-based and speaker-dependent, and its structural unit is the speaker's command phrase, which sets a command for working with text within the text editor;
- a speech recognition software module was developed in the form of a plug-in for text editors, which allows a user to evaluate the quality of the proposed solutions and establish the recognition error.

1.1. Related works

Various methods are used for recognizing an incoming speech signal: hidden Markov models (HMM), decision trees, linear predictive coding, DTW, and methods based on neural networks of various architectures. The earliest attempts to create systems for automatic speech recognition began in the 1950s, when the first speaker-dependent system that recognized numbers was developed [3]; resonances of vowel sounds in words were used as characteristics of the input signal. In the 1970s, DTW and linear predictive coding (LPC) were introduced. Despite the rapid development of the neural network approach to automatic speech recognition, DTW remains a popular method. Such scientists as K. Chakraborty, A. Talele, S.R. Suralkar, A.C. Wani, M. Mahajan, A. Katuri, A.G. Siva and many others devoted their works to the use of DTW in automatic speech recognition. Methods for using DTW and MFCC in automatic speech recognition systems for particular native languages were presented by M.S. Nguyen, L. Muda Awad, H. Omar and Y. Farghaly. In the works of X. Sun, Y. Miyanaga and B. Sai, it is proposed to use multi-reference DTW to reduce the computational cost.
S. Joshi, S. Nagar, A. Ismail and S. Abdlerazek proposed applying the DTW algorithm within language-dependent and speaker-dependent speech signal recognition systems. In the papers of A.S. Haq, C. Setianingsih, R. Martinek, J. Vanus, J. Nedoma, M. Fridrich, J. Frnda and T. Desot, the effectiveness of using DTW and MFCC for building automatic speech recognition systems for home automation and for interacting with smart home devices is substantiated. E. Principi, V. Prasad, M.A. Anusuya, K. Sharma, S.D. Dhingra and B.J. Mohan developed automatic speech recognition systems that transform an acoustic signal into text. The versatility of DTW and MFCC application areas testifies to the effectiveness of their use. Solutions to the speech recognition problem have a wide range of applications, from voice control of digital devices to identifying the owner of a voice and even recognizing bird species by their sounds.
Despite the significant number of scientific works devoted to the problem of speech signal recognition, improving the quality of speech recognition and developing new approaches to the implementation of automatic speech signal recognition systems remain relevant scientific and practical tasks [4].

2. Proposed methodology

The input data for the proposed system are the user's voice commands entered through a microphone. Since the system must process voice signals in real time, the FFT is used to analyze the speech signal. To reduce the blurring of the spectrum obtained after the FFT, the Hamming window is applied.
In this paper, MFCCs are used as the features extracted from the acoustic signal received from the speaker. The method was introduced by Davis and Mermelstein in 1980 and has been relevant ever since, supplanting linear prediction coefficients (LPCs) and linear predictive cepstral coefficients (LPCCs), which were previously the main features for automatic speech recognition, especially with HMM classifiers. The disadvantage of this algorithm is its significant dependence on the correctness of the analog-to-digital conversion. The recognition process is also negatively affected by extraneous noise and speech defects of the speaker. After building the filter bank, the processed commands can be compared with templates.
DTW is an automatic speech recognition method based on pattern matching. It makes it possible to find the difference between two time series of voice commands that have different durations. Although the accuracy of speech recognition using DTW is lower than that of methods using neural networks, it is still popular and is used in speech recognition systems with a limited vocabulary. The expediency of using DTW when developing a plug-in for a graphical editor is justified by the characteristics of the automatic speech recognition system, namely the limited size of the dictionary and the recognition of isolated speech uttered by the user in the form of short voice commands.
The adequacy of the proposed methods was evaluated by determining the level of recognition accuracy for different speakers and voice commands.

3. Results

Speech recognition is the process of transforming a speech signal into digital information [5].
The automatic speech recognition system is an information system that converts an incoming speech signal into a recognized message. The message can be presented as the text of the message or immediately transformed into a form convenient for further processing [6]. Most automatic speech recognition (ASR) systems consist of an analog signal analysis and processing stage and a recognition stage. When the analog speech signal is analyzed, properties are extracted that are then used in the recognition stage to determine what was said.
Automatic speech recognition systems are classified according to the following characteristics: dictionary size (a limited set of words or a large dictionary), speaker dependence (speaker-dependent or speaker-independent), speech type (continuous or isolated), purpose (dictation systems, command systems), the recognition algorithm used, the type of structural unit (phrases, words, phonemes, etc.), and the principles of selecting structural units [7]. The general scheme of the speech signal recognition process is as follows: receiving the acoustic signal from the user's microphone, digitizing the sound signal, obtaining the characteristics of the signal, and comparing the feature vector of the input signal with templates.

3.1. Speech signal analysis

To analyze a sound wave, let's use Fourier's theorem, which states that any complex periodic oscillation can be decomposed into a sum of simple harmonic oscillations. As a result, we obtain a set of amplitudes, phases and frequencies, one for each sinusoidal component:

$X_k = \sum_{n=0}^{N-1} x_n e^{-\frac{2\pi i}{N}kn}$, (1)

where N – the number of signal values, K – the number of frequencies, $x_n$ – the values of the signal at certain points in time, $X_k$ – the complex amplitudes of the sinusoidal signals that make up the initial signal, k = 0,…,K-1 – the frequency index, and n = 0,…,N-1 – the discrete time points at which the signal was measured. The set of frequencies and phases taken together with the amplitudes is called the spectrum. To restore a discrete signal from its spectrum, the inverse Fourier transform is used:

$x_n = \frac{1}{K} \sum_{k=0}^{K-1} X_k e^{\frac{2\pi i}{K}kn}$. (2)

To monitor changes in the spectrum of a signal over time, a spectrogram can be used – a visualization of the changes of the spectrum over the entire sound segment. It is constructed with a windowed Fourier transform: the spectrum is calculated on successive windows of the signal, each of which overlaps a part of the previous window. To significantly speed up the construction of the spectrogram, the FFT algorithm was used, which works with complex numbers and transform sizes that are powers of two [8].
If the frequency of a tone coincides with one of the frequencies of the FFT grid, the spectrum looks "perfect": a single sharp peak indicates the frequency and amplitude of the tone. If the frequency of the tone does not coincide with any frequency of the grid, the FFT "collects" the tone from the frequencies available in the grid, combined with different weights. It is worth bearing in mind that the FFT decomposes the signal not over the frequencies that are actually present in the signal, but over a fixed uniform frequency grid. Such blurring is usually undesirable, as it can mask weaker sounds at nearby frequencies.
To reduce the effect of spectrum blurring, the signal is multiplied before the FFT by a weight window – a function that falls off towards the edges of the interval [8]. Weight windows reduce the blurring of the spectrum at the cost of some deterioration of the frequency resolution. In this work, the Hamming window is used to reduce the blurring of the spectrum obtained after applying the FFT. The Hamming function can be written in the form [9]:

$w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right)$, (3)

where N – the total number of samples in a single frame and n = 0,…,N-1 – the sample index. Applying a Hamming window reduces the level of spectral blurring by about 40 dB relative to the main peak. The input signal was divided into intervals of 20–40 ms, since an interval of this size is sufficient to obtain a reliable spectral estimate. To compensate for peak broadening when applying weight windows, longer FFT windows can be used: for example, 8192 samples instead of 4096.
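To make Section 3.1 concrete, the following sketch in C# (the paper's implementation language) applies the Hamming window of equation (3) to one frame and computes its magnitude spectrum with a radix-2 FFT realizing equation (1). This is a minimal illustration under our own naming, not the authors' FFT class; it assumes the frame has already been zero-padded to a power-of-two length.

```csharp
using System;
using System.Numerics;

static class SpectrumSketch
{
    // Hamming window per equation (3): w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)).
    static double[] HammingWindow(int length)
    {
        var w = new double[length];
        for (int i = 0; i < length; i++)
            w[i] = 0.54 - 0.46 * Math.Cos(2.0 * Math.PI * i / (length - 1));
        return w;
    }

    // In-place radix-2 Cooley-Tukey FFT; the length must be a power of two,
    // matching the paper's note that the FFT works with power-of-two sizes.
    static void Fft(Complex[] a)
    {
        int n = a.Length;
        for (int i = 1, j = 0; i < n; i++)            // bit-reversal permutation
        {
            int bit = n >> 1;
            for (; (j & bit) != 0; bit >>= 1) j ^= bit;
            j ^= bit;
            if (i < j) (a[i], a[j]) = (a[j], a[i]);
        }
        for (int len = 2; len <= n; len <<= 1)        // butterfly stages
        {
            double ang = -2.0 * Math.PI / len;        // e^{-2*pi*i*kn/N} of (1)
            var wLen = new Complex(Math.Cos(ang), Math.Sin(ang));
            for (int i = 0; i < n; i += len)
            {
                Complex w = Complex.One;
                for (int k = 0; k < len / 2; k++)
                {
                    Complex u = a[i + k], v = a[i + k + len / 2] * w;
                    a[i + k] = u + v;
                    a[i + k + len / 2] = u - v;
                    w *= wLen;
                }
            }
        }
    }

    // Magnitude spectrum of one frame (frame.Length = FFT size, power of two).
    static double[] FrameSpectrum(double[] frame)
    {
        var w = HammingWindow(frame.Length);
        var buf = new Complex[frame.Length];
        for (int i = 0; i < frame.Length; i++) buf[i] = frame[i] * w[i];
        Fft(buf);
        var mag = new double[frame.Length / 2 + 1];   // keep the non-redundant half
        for (int i = 0; i < mag.Length; i++) mag[i] = buf[i].Magnitude;
        return mag;
    }
}
```

Squaring the returned magnitudes gives the power spectrum consumed by the filter bank of the next section.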
3.2. Building a bank of filters

The sound signal changes constantly, so for simplicity let's assume that it hardly changes over a short period of time. For this reason, the paper divides the signal into intervals of 20–40 ms: if the frame is much shorter, there are not enough samples to obtain a reliable spectral estimate; if it is longer, the signal changes too much over the frame. The spectral estimate determines which frequencies are present in the frame.
The spectral estimate still contains a lot of information that is not needed for automatic speech recognition. In particular, the human ear cannot distinguish between two closely spaced frequencies, and this effect becomes more pronounced as the frequency increases. To construct the feature vector, it is therefore advisable to use the MFCC algorithm, which divides the spectrum into sections represented by frequency projections onto the corresponding ranges of the Mel scale [10]:

$M(f) = 1127 \ln\left(1 + \frac{f}{700}\right)$, (4)

where f – the frequency that is projected onto the Mel scale. The obtained values then need to be converted back to the frequency domain:

$h(m) = 700\left(\exp\left(\frac{m}{1127}\right) - 1\right)$, (5)

where m – the frequency projection on the Mel scale, and mapped onto FFT bin numbers:

$f(i) = \left\lfloor (n+1)\,\frac{h(i)}{R} \right\rfloor$, (6)

where n – the FFT window size; R – the signal sampling rate. The filter bank is formed by the equation:

$H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \le k \le f(m+1) \\ 0, & k > f(m+1) \end{cases}$ (7)

where m – the filter index (the number of filters equals the number of MFCCs); k – the current frequency bin. The Mel filter bank contains overlapping triangular filters [11], as shown in Figure 1.

Figure 1: View of the filter bank

After calculating the energies of the filter bank, it is necessary to take their logarithms, since humans do not perceive loudness on a linear scale. This operation makes the Mel coefficients more similar to human perception of sound. The last step is to calculate the discrete cosine transform (DCT) of the logarithms of the filter bank energies:

$X_k = \sum_{n=0}^{N-1} x_n \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right)k\right]$. (8)

In the process of speech recognition, the most difficult part is the procedure of comparing two speech elements, which are also characterized by their duration in time; therefore, quite a lot of such procedures and methods exist.
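The sketch below assembles equations (4)–(8) into code: triangular Mel filters are built from uniformly spaced points on the Mel scale, the logarithms of the filter-bank energies are taken, and the DCT of equation (8) yields the cepstral coefficients. The method names, the uniform spacing of the filter edges and the small floor added before the logarithm are our assumptions for a self-contained example, not the authors' MFCC class.

```csharp
using System;

static class MelFilterBankSketch
{
    // Equation (4): frequency in Hz projected onto the Mel scale.
    static double HzToMel(double f) => 1127.0 * Math.Log(1.0 + f / 700.0);

    // Equation (5): Mel value converted back to Hz.
    static double MelToHz(double m) => 700.0 * (Math.Exp(m / 1127.0) - 1.0);

    // Equations (6)-(7): build triangular filters over a spectrum of
    // (fftSize/2 + 1) bins for a signal sampled at sampleRate Hz.
    static double[][] BuildFilterBank(int filterCount, int fftSize, double sampleRate)
    {
        int bins = fftSize / 2 + 1;
        double melMax = HzToMel(sampleRate / 2.0);          // up to Nyquist
        var f = new int[filterCount + 2];                   // filter edge bins
        for (int i = 0; i < f.Length; i++)
        {
            double hz = MelToHz(melMax * i / (filterCount + 1));
            f[i] = (int)Math.Floor((fftSize + 1) * hz / sampleRate);  // equation (6)
        }
        var bank = new double[filterCount][];
        for (int m = 1; m <= filterCount; m++)
        {
            var h = new double[bins];
            double up = Math.Max(1, f[m] - f[m - 1]);       // guards against zero width
            double down = Math.Max(1, f[m + 1] - f[m]);
            for (int k = f[m - 1]; k < f[m]; k++)
                h[k] = (k - f[m - 1]) / up;                 // rising slope of (7)
            for (int k = f[m]; k <= f[m + 1] && k < bins; k++)
                h[k] = (f[m + 1] - k) / down;               // falling slope of (7)
            bank[m - 1] = h;
        }
        return bank;
    }

    // Log filter-bank energies followed by the DCT of equation (8) give the MFCCs.
    static double[] Mfcc(double[] powerSpectrum, double[][] bank, int coefficientCount)
    {
        int m = bank.Length;
        var logEnergy = new double[m];
        for (int i = 0; i < m; i++)
        {
            double e = 1e-10;                               // floor avoids log(0)
            for (int k = 0; k < powerSpectrum.Length; k++)
                e += bank[i][k] * powerSpectrum[k];
            logEnergy[i] = Math.Log(e);
        }
        var c = new double[coefficientCount];
        for (int k = 0; k < coefficientCount; k++)          // equation (8), DCT-II
            for (int n = 0; n < m; n++)
                c[k] += logEnergy[n] * Math.Cos(Math.PI / m * (n + 0.5) * k);
        return c;
    }
}
```

The filter bank depends only on the FFT size and the sampling rate, so it can be precalculated once at startup, in the spirit of the Precalculated class described in Section 4.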
3.3. Dynamic time warping application

DTW makes it possible to find the minimum distance between two sequences or time series that depend on some variable, such as time, and is effectively used in automatic speech recognition systems [12]. Let's assume there are two numerical sequences P = (p1, p2, …, pn) and Q = (q1, q2, …, qm); the alignment of the sequences in time can be represented schematically as shown in Figure 2.

Figure 2: Scheme of time alignment of two sequences

To calculate the local deviations between elements of the two sequences, the absolute deviation of the values of two elements (the Euclidean distance) can be used [13,14]. As a result, the deviation matrix d is obtained:

$d(p,q) = \sqrt{\sum_{i=1}^{n}(p_i - q_i)^2}$, (9)

Next, it is necessary to calculate the matrix of minimum distances between the sequences. Its elements are computed by the following formula:

$md_{ij} = d_{ij} + \min(md_{i-1,j-1},\; md_{i-1,j},\; md_{i,j-1})$, (10)

where $md_{ij}$ – the minimum distance between the i-th and the j-th elements of sequences P and Q. After that, we find the minimum path in the obtained matrix, which can be traced from the element $md_{nm}$ to $md_{00}$ by following these rules [15,16]:
1. The path is laid in one direction only – the indices i and j are never increased.
2. Each index is decremented by at most one per iteration.
3. The path starts in the lower right corner and ends in the upper left corner of the matrix.
Based on the obtained path, the global deformation is estimated:

$GC = \frac{1}{k}\sum_{i=1}^{k} w_i$, (11)

where $w_i$ – the elements of the minimum deformation path; k – the number of elements of the deformation path.
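A compact C# sketch of equations (9)–(11): the accumulated-distance matrix of equation (10) is filled over two MFCC sequences of different lengths, and the final cost is normalized to approximate the path average of equation (11). This is an illustrative implementation, not the authors' DTW class; backtracking the explicit warping path is omitted for brevity.

```csharp
using System;

static class DtwSketch
{
    // Local deviation between two feature vectors, equation (9) (Euclidean distance).
    static double Distance(double[] a, double[] b)
    {
        double s = 0;
        for (int i = 0; i < a.Length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.Sqrt(s);
    }

    // p and q are sequences of MFCC vectors of (generally) different lengths.
    static double DtwCost(double[][] p, double[][] q)
    {
        int n = p.Length, m = q.Length;
        var md = new double[n, m];
        md[0, 0] = Distance(p[0], q[0]);
        for (int i = 1; i < n; i++) md[i, 0] = md[i - 1, 0] + Distance(p[i], q[0]);
        for (int j = 1; j < m; j++) md[0, j] = md[0, j - 1] + Distance(p[0], q[j]);
        for (int i = 1; i < n; i++)
            for (int j = 1; j < m; j++)                    // equation (10)
                md[i, j] = Distance(p[i], q[j])
                         + Math.Min(md[i - 1, j - 1],
                           Math.Min(md[i - 1, j], md[i, j - 1]));
        // Dividing by n + m approximates the per-element path average of
        // equation (11) without backtracking the explicit path.
        return md[n - 1, m - 1] / (n + m);
    }
}
```

A spoken command is then recognized as the template whose DtwCost against the input feature sequence is smallest.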
4. Implementation

To evaluate the quality of the proposed solutions for the speech signal recognition process, a command-based, speaker-dependent automatic speech recognition system was designed that converts the speaker's input speech signal into a text formatting command within the text editor. C# was selected as the programming language. For the software implementation, an object-oriented approach to the analysis of the data domain and to the construction of the architecture of the internal classes of the system was used. Testing of the program code was carried out manually. The developed system provides the following functions: formation of a dictionary of speaker commands, training of the system for a specific speaker, and execution of recognized commands.
The automatic speech recognition system is used as the software module of a graphical editor plug-in, which makes the functionality of the graphical editor available through voice commands. The developed voice recognition system has a limited command dictionary, aimed at recognizing 20 isolated command words. The developed application has the following functional requirements:
- the possibility of training the system for a specific speaker;
- recognition of received voice commands;
- conversion of voice commands into commands of the graphical editor.
The scheme of applying DTW and MFCC in the developed system of automatic voice command recognition is presented in Figure 3.

Figure 3: Application scheme of DTW and MFCC

The use of an object-oriented approach made it possible to organize the program code as a set of reusable classes. The architecture of the developed application is presented as a class diagram in Figure 4.

Figure 4: Architecture of software application classes

The class diagram was generated automatically by the VisualStudio.net environment during the software implementation. The class architecture does not contain inheritance relations. To increase the flexibility of the software, composition relations were implemented for interaction between classes. A composition relation is a relation in which a data consumer class has a field of the data provider class type. This type of relation is not displayed in the class diagram generated by VisualStudio.net.
The SoundRecording class is responsible for receiving the digitized sound signal that comes from the user's microphone. Its OnDataAvailable method is called every 100 ms and passes the digitized signal to the InputProcessing method of the SoundProcessing class. This class manages all stages of recognition: obtaining the spectrum of the signal window by the Fourier transform (the FFT class), calculating the Mel-frequency cepstral coefficients (the MFCC class), and comparing the input signal feature vector with templates (the DTW class). Auxiliary classes are also used: the Constant class stores constant values common to the project, and the Precalculated class stores precalculated values, which significantly speeds up computation because these values are calculated only once, when the program starts.
The result of executing the developed program code is a new menu item, Addins, displayed in the main toolbar of the graphical editor. After clicking the Addins button, the Start button is displayed, which activates the command recognition mode, and the Preferences button, which starts the application settings mode. This part of the software is implemented as a Windows Forms desktop application in the C# programming language. The functionality of the developed plug-in is divided into two parts:
- system training with the ability to create and save a command template;
- command execution mode.
By pressing the Start button and starting the voice command recognition mode, the user can speak commands and evaluate the result of their execution. The recognition mode stops when the Stop button is pressed. In the program settings mode, the user can change the voice signal input device, add a voice equivalent for a command to the dictionary, and analyze the sound wave and its spectrum. Figure 5 shows the application settings window.

Figure 5: Screen form of software application settings

When the Start Training button is pressed, the user needs to say the command, after which the training results window opens, where it is possible to see which command the spoken signal resembles. If necessary, the command can be added to the command dictionary with the "Save as template" button.
The software application was tested in the following mode: 20 test commands, 10 speakers, 10 requests to pronounce each command, a speech signal sampling frequency of 44.1 kHz, a sample resolution of 16 bits, and 2 channels. The templates of the test commands coincide with the names of the commands presented on the Microsoft Word quick access panel in the form of icon buttons. In order to establish the adequacy of the proposed solutions and estimate the speech signal recognition error, a transaction log was developed, which recorded a description of each command recognition operation for each speaker. Analysis of the transaction log data made it possible to obtain the average recognition accuracy for each command. The recognition accuracy can be determined as [8,17]:

$Accuracy = \dfrac{\text{correctly detected commands}}{\text{number of commands in the data set}}$. (12)
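For completeness, a few lines show how equation (12) can be evaluated over a transaction log; the log entries here are purely illustrative, not data from the experiment.

```csharp
using System;
using System.Linq;

// Hypothetical transaction-log excerpt: each entry records whether one
// spoken command was detected correctly.
bool[] log = { true, true, false, true };
double accuracy = log.Count(ok => ok) / (double)log.Length;   // equation (12)
Console.WriteLine($"Accuracy = {accuracy:P0}");
```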
Table 1 shows the average recognition accuracy measured under the described experimental conditions.

Table 1
The results of the conducted experiment

Command            | Accuracy Rate
-------------------|--------------
Align left         | 95%
Align right        | 93%
Justify            | 98%
Center             | 96%
Bold               | 93%
Strikethrough      | 90%
Italic             | 97%
Bullets            | 92%
Multilevel list    | 91%
Superscript        | 91%
Numbering          | 93%
Cancel input       | 98%
Display all signs  | 94%
Repeat the input   | 96%
Subscript          | 89%
Underline          | 91%
Page break         | 95%
Save               | 99%
Increase font size | 94%
Decrease font size | 94%

Based on the data presented in Table 1, the average level of command recognition accuracy can be determined. The average recognition error does not exceed 7%. It should be noted that the main factors of erroneous recognition are the diction of the user and the similarity of voice commands [18], for example "Superscript" and "Subscript". The obtained recognition result and its comparison with the recognition accuracy of other systems using DTW and MFCC allow conclusions to be drawn regarding the adequacy and effectiveness of the proposed design solutions.

5. Conclusions

Automatic speech recognition allows computers, machines and digital devices to understand natural speech and perform appropriate actions. The generalized speech recognition algorithm includes the following steps:
1. Receiving an analog signal.
2. Transformation of the received signal into a digital one.
3. Spectrogram formation using the FFT and the Hamming window.
4. Construction of the filter bank used for MFCC.
5. Application of DTW to compare the received command with a template.
There are various techniques, methodologies and algorithms that can be used at each stage of the speech recognition process, and combining different approaches makes it possible to achieve the desired level of recognition accuracy. The paper presents the results of the development of an automatic speech recognition system using DTW and MFCC. DTW, FFT and MFCC are well-known methods that are commonly used in speech recognition, but even systems implemented with these methods have different levels of recognition accuracy. Recognition accuracy varies significantly, for example from 46% [19] to 96.4% [20], or even 99.5% for a dictionary of 8 isolated words [21], and largely depends on the implementation of the recognition system, the volume of the dictionary, the input signal transmission method, and the language used for recognition.
The proposed system is speaker-dependent, has a limited dictionary, and is designed to recognize 20 short user commands spoken in Ukrainian into a microphone. The developed system is used as a program module of a plug-in for a graphical editor. The plug-in allows the user to control the functionality of the graphical editor through voice commands. The language of the software implementation is C#. The software was developed using an object-oriented approach, as a Windows Forms desktop application for setting options and recording the results of the experiments, and as a plug-in for the graphical editor. A feature of the developed automatic speech recognition system is that it works with the user in real time, unlike other developments where an audio recording of a certain format must be supplied as input. The system response time (recognition time) does not exceed 2 seconds on average.
The DTW speech command recognition algorithm is an effective method for accounting for temporal variations when comparing related time series in automatic speech recognition and can be effectively applied in systems with a limited dictionary, as presented in this paper. Using (12), the accuracy of command recognition was estimated; the average accuracy over all commands and recognition variants was approximately 93%. Comparing the obtained level of accuracy with the results of other authors is not conclusive, since different systems use different dictionaries, different commands or isolated words, and different natural languages. However, the obtained level of recognition indicates a high efficiency of the proposed solutions and demonstrates the promise of combining the DTW, FFT and MFCC approaches for speech recognition in the Ukrainian language. The methods, principles and algorithms used in the research, as well as the results of testing the developed system, make it possible to state that the results of the research are valid and reproducible.

6. References

[1] C. Agarwal, P. Chakraborty, S. Barai, V. Goyal, Quantitative analysis of feature extraction techniques for isolated word recognition, Advances in Computing and Data Sciences (2019) 618–627. doi: 10.1007/978-981-13-9942-8_58.
[2] I. Sultana, N.K. Pannu, Automatic speech recognition system, International Journal of Advance Research, Ideas and Innovations in Technology 4 (2018) 277–279.
[3] D. Pandey, K.K. Singh, Implementation of DTW algorithm for voice recognition using VHDL, in: Proceedings of the International Conference on Inventive Systems and Control, ICISC '17, Coimbatore, 2017, pp. 1–4. doi: 10.1109/ICISC.2017.8068638.
[4] M. Sood, S. Jain, Speech recognition employing MFCC and dynamic time warping algorithm, Innovations in Information and Communication Technologies (2021) 235–242. doi: 10.1007/978-3-030-66218-9_27.
[5] H. Pawar, N. Gaikwad, A. Kulkarni, A study of techniques and processes involved in speech recognition system, International Journal of Engineering and Technology 7 (2020) 1905–1911.
[6] S.K. Ali, Z.M. Mahdi, Arabic voice system to help illiterate or blind for using computer, Journal of Physics 1804 (2021) 1–11. doi: 10.1088/1742-6596/1804/1/012137.
[7] L. Lerato, T. Niesler, Feature trajectory dynamic time warping for clustering of speech segments, EURASIP Journal on Audio, Speech, and Music Processing 6 (2019) 1–9. doi: 10.1186/s13636-019-0149-9.
[8] H.F.C. Chuctaya, R.N.M. Mercado, J.J.G. Gaona, Isolated automatic speech recognition of Quechua numbers using MFCC, DTW and KNN, International Journal of Advanced Computer Science and Applications 9 (2018) 24–29. doi: 10.14569/IJACSA.2018.091003.
[9] S. Lokesh, M.R. Devi, Speech recognition system using enhanced mel frequency cepstral coefficient with windowing and framing method, Cluster Computing 22 (2019) 1669–1679. doi: 10.1007/s10586-017-1447-6.
[10] I.D. Jokić, S.S. Jokić, V.D. Delić, Z.H. Perić, One solution of extension of mel-frequency cepstral coefficients feature vector for automatic speaker recognition, Information Technology and Control 49 (2020) 224–236. doi: 10.5755/j01.itc.49.2.22258.
[11] A. Awad, H. Omar, Y. Ahmed, Y. Farghaly, Speech recognition system using MFCC and DTW 4 (2018).
[12] A.S. Haq, C. Setianingsih, M. Nasrun, M.A. Murti, Speech recognition implementation using MFCC and DTW algorithm for home automation, Proceeding of the Electrical Engineering Computer Science and Informatics 7 (2020) 78–85. doi: 10.11591/eecsi.v7.2041.
[13] B. Kurniadhani, S. Hadiyoso, S. Aulia, R. Magdalena, FPGA-based implementation of speech recognition for robocar control using MFCC, TELKOMNIKA 17(4) (2019) 1914–1922. doi: 10.12928/telkomnika.v17i4.12615.
[14] Y. Permanasari, E.H. Harahap, E.P. Ali, Speech recognition using dynamic time warping (DTW), Journal of Physics 1366 (2019) 1–6. doi: 10.1088/1742-6596/1366/1/012091.
[15] R.G. Kanke, R.M. Gaikwad, M.R. Baheti, Enhanced Marathi speech recognition using double delta MFCC and DTW, International Journal of Digital Technologies 2(1) (2023) 49–58.
[16] I. Wibawa, I. Darmawan, Implementation of audio recognition using mel frequency cepstrum coefficient and dynamic time warping in wirama praharsini, Journal of Physics 1722 (2021) 1–8. doi: 10.1088/1742-6596/1722/1/012014.
[17] B. Paul, R. Paul, S. Bera, S. Phadikar, Isolated Bangla spoken digit and word recognition using MFCC and DTW, Engineering Mathematics and Computing 1042 (2022) 235–246. doi: 10.1007/978-981-19-2300-5_16.
[18] R. Harshavardhini, P. Jahnavi, S.K. Zaiba Afrin, S. Harika, Y. Annapa, N. Naga Swathi, MFCC and DTW based speech recognition, International Research Journal of Engineering and Technology 5 (2018) 1937–1940.
[19] I. Khine, C. Su, Speech recognition system using MFCC and DTW, International Journal of Electrical, Electronics and Data Communication 6 (2018) 29–34.
[20] S. Riyaz, B.L. Bhavani, S.V. Kumar, Automatic speaker recognition system in Urdu using MFCC and HMM, International Journal of Recent Technology and Engineering 7 (2019) 109–113.
[21] N. Adnene, B. Sabri, B. Mohammed, Design and implementation of an automatic speech recognition based voice control system, EasyChair Preprint 5116 (2021) 1–7.