<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Analysis of modern problems in automatic speech recognition: solutions and practical examples⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Samat Mukhanov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikolai Komarov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Orken Mamyrbayev</string-name>
          <email>morkenj@mail.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daryn Amrin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhiger Bolatov</string-name>
          <email>zh.bolatov@iitu.edu.kz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ikram Bazarbekov</string-name>
          <email>i.bazarbekov@iitu.edu.kz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Information and Computational Technologies</institution>
          ,
          <addr-line>Shevchenko str. 28 050010 Almaty</addr-line>
          ,
          <country country="KZ">Kazakhstan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>International Information Technology University</institution>
          ,
          <addr-line>Manas st 34/1 050040 Almaty</addr-line>
          ,
          <country country="KZ">Kazakhstan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Automatic Speech Recognition (ASR) has made significant progress, nearing human-level accuracy. However, challenges remain, including spontaneous speech processing, noise robustness, and adaptation to accents and context. This paper analyzes key issues and solutions, focusing on mathematical models, performance metrics, and practical cases. ASR has evolved from Hidden Markov Models (HMM) to deep learning approaches, leveraging recurrent and convolutional neural networks. Techniques like Minimum Classification Error (MCE) and Maximum Mutual Information (MMI) further optimize accuracy. Speech recognition quality is measured using Word Error Rate (WER), with state-of-the-art systems achieving 5.8-6.8%, compared to 5.1% for human transcription. Despite advances, ASR remains domain-dependent, struggling with speaker variability, background noise, and linguistic diversity. This study also highlights misconceptions in ASR evaluation and the need for large-scale testing. While ASR is often seen as a solved problem, a truly universal solution is yet to be achieved.</p>
      </abstract>
      <kwd-group>
        <kwd>Automatic Speech Recognition (ASR)</kwd>
        <kwd>speech processing</kwd>
        <kwd>noise robustness</kwd>
        <kwd>accents</kwd>
        <kwd>contextual information</kwd>
        <kwd>Hidden Markov Models (HMM)</kwd>
        <kwd>deep learning</kwd>
        <kwd>neural networks</kwd>
        <kwd>word error rate (WER)</kwd>
        <kwd>Bayesian methods</kwd>
        <kwd>error optimization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Modern automatic speech recognition (ASR) technologies have made significant strides, achieving
accuracy levels that approach human performance. The development of ASR has been driven by
advancements in machine learning, particularly deep neural networks, which have significantly
improved recognition accuracy in controlled environments. However, ASR systems still face
fundamental challenges when dealing with spontaneous speech, background noise, diverse accents,
and context-dependent variations [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        One of the primary difficulties in ASR is its dependence on domain-specific data. Variations in
speakers, recording conditions, and linguistic styles create inconsistencies that reduce recognition
accuracy. Additionally, ASR systems struggle with spontaneous speech, where hesitations, filler
words, and non-standard grammatical structures are common [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. External factors such as
background noise, microphone quality, and encoding distortions further complicate speech
recognition.
      </p>
      <p>
        This paper explores key challenges in ASR development, analyzes mathematical models and
techniques used to improve recognition accuracy, and examines practical examples to highlight
existing limitations. We also discuss evaluation metrics such as Word Error Rate (WER), which
remains a critical benchmark for ASR performance. While ASR is often perceived as a solved problem
in controlled environments, this study emphasizes that achieving a truly universal solution remains
an open challenge [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature review</title>
      <p>
        Automatic Speech Recognition (ASR) has been a subject of extensive research for decades, evolving
from early rule-based systems to modern deep learning approaches. This section reviews key
developments in ASR technology, including traditional statistical models, deep neural networks, and
optimization techniques aimed at improving recognition accuracy [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Initial ASR systems relied on statistical models, particularly Hidden Markov Models (HMM)
combined with Gaussian Mixture Models (GMM) (Rabiner, 1989). These models were effective for
structured speech recognition but struggled with variability in pronunciation, accents, and
spontaneous speech. Over time, discriminative training techniques, such as Maximum Mutual
Information (MMI) and Minimum Classification Error (MCE), were introduced to improve accuracy
(Woodland &amp; Povey, 2002) [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ].
      </p>
      <p>
        The emergence of deep learning significantly transformed ASR capabilities. Recurrent Neural
Networks (RNNs) and Long Short-Term Memory (LSTM) networks (Graves et al., 2013) improved
sequential data processing, enabling better recognition of speech patterns. Later, Convolutional
Neural Networks (CNNs) were applied to spectrogram-based speech recognition (Abdel-Hamid et al.,
2014), enhancing feature extraction [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ].
      </p>
      <p>
        The introduction of Transformer-based architectures, such as wav2vec 2.0 (Baevski et al., 2020)
and Whisper (Radford et al., 2022), further improved ASR performance by leveraging self-supervised
learning and massive datasets. These models demonstrated superior generalization across languages,
accents, and noisy environments [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        Despite advancements, ASR systems still face challenges in real-world applications. Studies
highlight difficulties with noisy environments (Kim et al., 2017), accent adaptation (Sun et al., 2018),
and domain-specific speech recognition (Li et al., 2020). Researchers continue exploring novel
approaches, including Bayesian optimization for error reduction (Sak et al., 2015) and hybrid ASR
models combining statistical and neural network-based methods (Hinton et al., 2012) [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ].
      </p>
      <p>
        This review highlights that while ASR has reached near-human accuracy in controlled settings,
challenges persist in spontaneous speech processing, noise robustness, and adaptation to diverse
linguistic contexts. Future research aims to develop more robust, context-aware ASR systems capable
of real-time adaptation to varying speech conditions [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Basic concepts</title>
      <p>
        Automatic Speech Recognition (ASR) is a technology that converts spoken language into text. It relies
on complex mathematical models and machine learning algorithms to process and interpret audio
signals [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The key components of ASR systems include:
      </p>
      <p>1. Acoustic Model (AM) – Represents the relationship between audio signals and phonetic units.
Traditional ASR systems use Hidden Markov Models (HMM), while modern approaches
incorporate deep neural networks (DNNs) for higher accuracy.
2. Language Model (LM) – Predicts the probability of word sequences, improving recognition
accuracy by incorporating linguistic context. Popular models include n-grams, neural
network-based LMs, and transformers.
3. Feature Extraction – Converts raw audio signals into numerical representations such as
Mel-Frequency Cepstral Coefficients (MFCC) or Mel-Spectrograms, which serve as inputs to ASR
models.
4. Decoding and Post-Processing – Utilizes search algorithms like the Viterbi decoder to find the
most probable word sequence based on the acoustic and language models. Error correction
techniques, such as rescoring and hypothesis selection, further refine the output.</p>
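      <p>As an illustration of the feature-extraction step above, the following is a minimal sketch (not part of
the original system description) that computes MFCC features with the open-source librosa library; the
file name "speech.wav" and the 16 kHz sampling rate are assumptions made only for the example.</p>
      <preformat># Minimal sketch: MFCC feature extraction with librosa (assumed input file).
import librosa

y, sr = librosa.load("speech.wav", sr=16000)        # waveform resampled to 16 kHz
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
print(mfcc.shape)</preformat>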
      <p>
        These fundamental concepts form the basis of modern ASR systems, enabling applications in
virtual assistants, transcription services, and voice-controlled interfaces [
        <xref ref-type="bibr" rid="ref14 ref15 ref16">14, 15, 16</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Materials and methods</title>
      <p>Key Methods of Automatic Speech Recognition. The evolution of Automatic Speech Recognition
(ASR) includes several key stages:
 Statistical Models: Early systems were based on dynamic programming methods and Hidden
Markov Models (HMM), enabling recognition of limited vocabularies.
 Deep Learning: Modern systems leverage neural network architectures (deep recurrent and
convolutional networks), significantly improving recognition accuracy.
 Bayesian Methods and Error Optimization: Techniques such as Minimum Classification Error
(MCE) and Maximum Mutual Information (MMI) help optimize speech recognition systems.</p>
      <p>WER is calculated using the formula:</p>
      <p>WER = 100 × (I + S + D) / N
(1)
where S = number of substitutions, D = number of deletions, I = number of insertions, and N = total
number of words in the reference transcription.</p>
      <p>Example:
Original text: “Hello, how are you?”
Recognized text: “Hello how you are?”
Errors: the words “are” and “you” are transposed, which a standard alignment counts as 2
substitutions, giving WER = 2/4 = 50%.</p>
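      <p>For clarity, the following is a minimal sketch of the WER computation in formula (1): a word-level
Levenshtein alignment counts insertions, substitutions, and deletions against the reference. The example
strings reproduce the transposition above; the implementation is illustrative and not taken from any
particular toolkit.</p>
      <preformat># Minimal WER sketch: word-level edit distance divided by the reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimal edits turning the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(wer("hello how are you", "hello how you are"))  # 50.0</preformat>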
      <p>Challenges in Speech Recognition</p>
      <p>There is a common belief that speech recognition is a solved problem, but this is not entirely true.
While the problem is resolved for specific scenarios, a universal solution does not yet exist due to
several challenges:
 Domain Dependence
 Different speakers
 “Acoustic” recording channel: codecs, distortions
 Various environments: phone noise, city noise, background speakers
 Different speech pace and preparedness
 Different speech styles and topics
 Large and "inconvenient" datasets
 Quality metrics that are not always intuitive</p>
      <sec id="sec-4-1">
        <title>4.1. Quality Metric</title>
        <p>Speech recognition quality is measured using the Word Error Rate (WER) metric.</p>
        <p>Insertions – words that were added but do not exist in the original audio recording.
Substitutions – words that were incorrectly replaced with other words.
Deletions – words that were not recognized, resulting in omissions.</p>
        <p>Example Calculation
Original sentence:
“The stationary (unintelligible speech) phone rang late at night.”
Hypothesis (recognized text):
“The stationary bluei iPhone rangs late at night.”</p>
        <p>WER = 100 × (I + S + D) / N = 60% (the error rate is 60%)
(2)</p>
        <sec id="sec-4-1-1">
          <title>Common Misconceptions in WER Calculation</title>
          <p>There are both simple and complex errors that can occur when calculating WER. Examples of
simple misconceptions:</p>
          <p>Diacritical marks (e.g., dots on the letter “Ё” in Russian) – If the speech-to-text system outputs
“Е” instead of “Ё” while the reference text includes “Ё”, this can artificially increase
Substitutions, leading to a 1% increase in WER.</p>
          <p>Different spellings of the same word – For example, “hello” vs. “hallo” vs. “hallé.”
Uppercase vs. lowercase letters.</p>
          <p>Averaging WER across separate texts instead of calculating it globally – If one text has WER
= 0.6, another has WER = 0.5, and a third has WER = 0.8, it is incorrect to average these values
arithmetically. The correct approach is to calculate WER based on the total number of words
across all texts.</p>
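          <p>A minimal numerical sketch of this point is given below; the error and word counts are hypothetical
and chosen only so that the per-text WER values match 0.6, 0.5, and 0.8.</p>
          <preformat># Averaging per-text WER vs. pooling errors over all words (hypothetical counts).
texts = [
    {"errors": 30, "words": 50},    # per-text WER = 0.6
    {"errors": 10, "words": 20},    # per-text WER = 0.5
    {"errors": 160, "words": 200},  # per-text WER = 0.8
]

averaged = sum(t["errors"] / t["words"] for t in texts) / len(texts)       # ~0.633 (incorrect)
pooled = sum(t["errors"] for t in texts) / sum(t["words"] for t in texts)  # ~0.741 (correct)
print(averaged, pooled)</preformat>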
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Complex Misconceptions in WER Calculation</title>
        <p>WER varies across different test samples – Some solutions may perform better or worse on
specific audio recordings compared to competitors. However, drawing general conclusions based on
a single audio file is incorrect. A large dataset is needed for meaningful evaluation.</p>
        <p>Types of Speech Recognition Systems. Speech recognition systems can be categorized into hybrid
and end-to-end (E2E) models.</p>
        <p> End-to-end (E2E) systems directly convert a sequence of sounds into a sequence of letters.
 Hybrid systems consist of separate acoustic and language models, which operate
independently. The Amvera Speech solution is built on a hybrid architecture.</p>
        <p>Structure of a Hybrid Speech Recognition System
How the Hybrid Speech Recognition System Works:
1. A neural network classifies each individual sound frame.
2. The Hidden Markov Model (HMM) models dynamics, lexicon, and word structure, based on
posterior probabilities from the neural network.
3. The Viterbi algorithm (Viterbi decoder, beam search) searches the HMM for the optimal path,
considering classifier posteriors.</p>
        <p>Processing Pipeline:</p>
        <p>Incoming Sound → Preprocessing → Feature Extraction → Classification Model &amp; Language
Model → Prediction</p>
        <p>The first step in a hybrid speech recognition model is feature extraction, typically using
MFCC (Mel-Frequency Cepstral Coefficients).</p>
        <p>The acoustic model then classifies audio frames.</p>
        <p>The Viterbi decoder (beam search) predicts the most probable words by combining acoustic
model predictions with statistical language model data (e.g., n-gram probabilities).</p>
        <p>Finally, rescoring is applied to generate the most likely word output.</p>
        <p>Imagine that in the first frame, the classifier detects the phoneme "д", and this continues for 10
consecutive frames. The loop will keep running until the phoneme "а" is detected. If the word exists
in the dictionary, it will be recorded. Similarly, the algorithm will process the word "нет" ("no") and will
terminate when silence ("SIL") is detected in the audio channel.</p>
        <p>Let's combine the visualization of the search graph with frames for better clarity.</p>
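        <p>To make the search concrete, the following is a toy sketch (not the Amvera Speech implementation) of
Viterbi decoding over frame-level phoneme posteriors; the posterior and transition probabilities are
hypothetical and stand in for the acoustic model and the HMM/lexicon search graph.</p>
        <preformat># Toy Viterbi decoding over hypothetical frame posteriors.
import numpy as np

phonemes = ["д", "а", "SIL"]
# posteriors[t, s] = P(phoneme s | frame t) from the neural-network classifier
posteriors = np.array([[0.90, 0.05, 0.05],
                       [0.80, 0.15, 0.05],
                       [0.10, 0.85, 0.05],
                       [0.05, 0.15, 0.80]])
# trans[i, j] = P(state j | state i): self-loops dominate, as in the search graph
trans = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.7, 0.2],
                  [0.1, 0.1, 0.8]])

T, S = posteriors.shape
delta = np.log(posteriors[0])              # best log-score ending in each state
back = np.zeros((T, S), dtype=int)         # backpointers for path recovery
for t in range(1, T):
    scores = delta[:, None] + np.log(trans)            # candidate scores (prev x cur)
    back[t] = scores.argmax(axis=0)
    delta = scores.max(axis=0) + np.log(posteriors[t])

path = [int(delta.argmax())]               # trace back the best state sequence
for t in range(T - 1, 0, -1):
    path.append(int(back[t, path[-1]]))
print([phonemes[s] for s in reversed(path)])   # ['д', 'д', 'а', 'SIL']</preformat>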
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Training a Hybrid Speech Recognition System</title>
        <p>This section covers the general principles of training an acoustic model classifier, the process of
mapping frames to phonemes, the number of classes, and ways to improve the system.</p>
        <p>Training an Acoustic Model Classifier
1. Start with graphemes.
Example: М, А, Ш, А
2. Convert them into phonemes.
Result: m i1 sh a0</p>
        <p>Problem 1: A Large Number of Classes
Biphones: 57 phonemes² = 3,249 classes
Two-state biphones: (2 states × 57 phonemes)² = 12,996 classes
Triphones: 57 phonemes³ = 185,193 classes</p>
        <p>Problem 2: Unbalanced Class Distribution
Solution – Clustering:
 Groups similar classes (5,000–10,000 clusters).
 Balances class sizes.
 The clustered unit is called a senone (for phoneme-based models) or chenone (for
grapheme-based models).</p>
        <p>Using MMI (Maximum Mutual Information) or MPE/sMBR (Minimum Phone Error / State-level
Minimum Bayes Risk):
1. Train a Cross-Entropy (CE) model.
2. Build a numerator – a set of recognition variants that lead to the correct answer.
3. Build a denominator – a set of incorrect recognition variants.
4. Compute Loss = f(numerator) / f(denominator) → “boost correct predictions, suppress
incorrect ones.”</p>
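        <p>The discriminative objective can be summarized by the following minimal sketch; the numerator and
denominator scores are hypothetical placeholders for the summed path scores described above.</p>
        <preformat># MMI-style objective: boost correct paths, suppress the rest (hypothetical scores).
import math

def mmi_loss(numerator_score: float, denominator_score: float) -> float:
    # Loss shrinks as the correct hypotheses take a larger share of the total mass.
    return -(math.log(numerator_score) - math.log(denominator_score))

print(mmi_loss(numerator_score=0.8, denominator_score=1.0))   # ~0.22 (mostly correct)
print(mmi_loss(numerator_score=0.1, denominator_score=1.0))   # ~2.30 (mostly incorrect)</preformat>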
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Mathematical Formulas and Sample Calculations for ASR Methods</title>
        <p>Each ASR method uses different mathematical models for speech recognition. Below, we describe the
key formulas used by each method.</p>
        <p>HMM-GMM (Hidden Markov Model + Gaussian Mixture Model)</p>
        <p>P(O | λ) = Σ_{q1, q2, …, qT} P(q1) ∏_{t=1}^{T} P(Ot | qt) P(qt | qt−1)
(3)
O = (O1, O2, …, OT) is the observed speech sequence.
qt are the hidden states representing phonemes.</p>
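        <p>A minimal numerical sketch of formula (3) is shown below: the forward algorithm sums over all state
sequences for a toy two-state HMM with discrete observations; all probabilities are hypothetical.</p>
        <preformat># Forward algorithm for P(O | λ) on a toy 2-state HMM (hypothetical parameters).
import numpy as np

pi = np.array([0.6, 0.4])                  # initial state probabilities P(q1)
A = np.array([[0.7, 0.3],                  # transition probabilities P(qt | qt-1)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],                  # emission probabilities P(Ot | qt), 2 symbols
              [0.2, 0.8]])
O = [0, 0, 1]                              # observed symbol indices

alpha = pi * B[:, O[0]]                    # forward variable at t = 1
for o in O[1:]:
    alpha = (alpha @ A) * B[:, o]          # propagate states, weight by the emission
print(alpha.sum())                         # P(O | λ), summed over final states</preformat>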
        <p>Wav2Vec 2.0 (End-to-End Transformer-Based ASR)</p>
        <p>L = − Σt Σc yt,c log P(c | xt, θ)
(5)
x is the raw speech input.
P(c | xt, θ) is the probability of predicting character c at timestep t.
yt,c is the true label (1 if correct, 0 otherwise).
θ are the model parameters.</p>
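        <p>A minimal numerical sketch of the cross-entropy objective in (5) follows; the per-frame character
distributions and one-hot targets are hypothetical.</p>
        <preformat># Cross-entropy over two frames and three characters (hypothetical values).
import numpy as np

p = np.array([[0.7, 0.2, 0.1],     # P(c | xt, θ) for frame 1
              [0.1, 0.8, 0.1]])    # P(c | xt, θ) for frame 2
y = np.array([[1, 0, 0],           # true character of frame 1 is c0
              [0, 1, 0]])          # true character of frame 2 is c1
loss = -(y * np.log(p)).sum()
print(loss)                        # -(log 0.7 + log 0.8) ≈ 0.58</preformat>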
        <p>Whisper (Transformer-Based ASR by OpenAI)</p>
        <p>P(Y | X) = ∏_{t=1}^{T} P(yt | y1:t−1, X, θ)
(6)
X is the input speech signal.
Y is the predicted text sequence.
P(yt | y1:t−1, X, θ) is the probability of each token yt at step t, given the previous tokens and the
audio features.</p>
        <p>The model is trained using a sequence-to-sequence transformer approach.</p>
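        <p>As a practical illustration (not an experiment reported here), the open-source whisper package exposes
this model through a simple API; the model size "base" and the file name "speech.wav" are assumptions
made only for the example.</p>
        <preformat># Transcribing a file with the open-source whisper package (assumed input file).
import whisper

model = whisper.load_model("base")       # downloads the "base" checkpoint on first use
result = model.transcribe("speech.wav")  # returns a dict containing the decoded text
print(result["text"])</preformat>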
        <sec id="sec-4-4-6">
          <title>Example 3 Sample Calculations:</title>
          <p>4.9
4.8
5. Results and discussion
The evaluation of modern Automatic Speech Recognition (ASR) systems demonstrates significant
progress but also reveals persistent challenges. Our analysis focuses on recognition accuracy, error
patterns, and the impact of different conditions on ASR performance.</p>
          <p>Experimental results indicate that human transcription maintains a Word Error Rate (WER) of
5.1% on benchmark datasets such as Switchboard. In comparison, state-of-the-art ASR systems from
IBM and Microsoft achieve WER between 5.8% and 6.8%, demonstrating near-human performance.
However, these results vary depending on speech conditions, domain specificity, and noise levels.</p>
      <sec id="sec-4-5">
        <title>5.1. Domain-Specific Challenges</title>
        <p>Despite the advances in ASR, recognition accuracy heavily depends on domain factors, including:
 Speaker variability: Different accents and speaking styles impact recognition accuracy.
 Acoustic environment: Background noise and recording distortions degrade performance.
 Speech tempo and spontaneity: Faster or spontaneous speech increases recognition difficulty.
 Contextual adaptation: Current models struggle with specialized terminology and new words.</p>
      </sec>
      <sec id="sec-4-6">
        <title>5.2. Comparing Hybrid and End-to-End ASR Architectures</title>
        <p>Hybrid ASR systems, incorporating Hidden Markov Models (HMM) and Neural Networks (NN),
achieve better generalization across different domains. They allow separate optimization of
acoustic and language models.</p>
        <p>End-to-End ASR models (such as Transformer-based architectures) directly map audio to text
but require vast amounts of labeled data for effective performance.</p>
      </sec>
      <sec id="sec-4-7">
        <title>5.3. WER Metric and Evaluation Challenges</title>
        <p>The accuracy of WER as a metric is often debated due to:
 Ambiguities in word normalization (e.g., variations in capitalization and punctuation).
 Incorrect phoneme-to-text alignments affecting error rates.
 Unfair comparisons across datasets due to varying speech complexity.</p>
        <p>While ASR systems continue to advance, they still struggle with domain dependency, spontaneous
speech, and error correction. Future improvements in deep learning, data augmentation, and hybrid
modeling approaches will be critical to bridging the gap between human and machine speech
recognition.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion</title>
      <p>Despite significant advancements, automatic speech recognition (ASR) is still an evolving field with
numerous challenges. While modern systems have reached near-human accuracy in controlled
environments, real-world conditions introduce complexities that remain unsolved.</p>
      <p>One of the biggest obstacles is spontaneous speech, which includes hesitations, disfluencies,
interruptions, and informal expressions. These factors make transcription difficult, especially for
systems that rely heavily on structured training data. Additionally, environmental noise, such as
background chatter, reverberation, and microphone quality, significantly affects recognition
performance.</p>
      <p>Another major challenge is speaker variability. Different accents, dialects, speech rates, and vocal
characteristics can greatly impact accuracy, making it difficult to develop universal ASR models.
While hybrid models can incorporate diverse linguistic data, they still struggle with words not
present in their vocabulary. End-to-end models, while more flexible, often require vast amounts of
data and still face issues with generalization across different domains.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dahl</surname>
            ,
            <given-names>G. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohamed</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jaitly</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , ... &amp;
          <string-name>
            <surname>Kingsbury</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups</article-title>
          .
          <source>IEEE Signal Processing Magazine</source>
          ,
          <volume>29</volume>
          (
          <issue>6</issue>
          ),
          <fpage>82</fpage>
          -
          <lpage>97</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohamed</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Speech recognition with deep recurrent neural networks</article-title>
          .
          <source>In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          (pp.
          <fpage>6645</fpage>
          -
          <lpage>6649</lpage>
          ). IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Rabiner</surname>
            ,
            <given-names>L. R.</given-names>
          </string-name>
          (
          <year>1989</year>
          ).
          <article-title>A tutorial on hidden Markov models and selected applications in speech recognition</article-title>
          .
          <source>Proceedings of the IEEE</source>
          ,
          <volume>77</volume>
          (
          <issue>2</issue>
          ),
          <fpage>257</fpage>
          -
          <lpage>286</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Povey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghoshal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boulianne</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burget</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glembek</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goel</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , ... &amp;
          <string-name>
            <surname>Veselý</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>The Kaldi speech recognition toolkit</article-title>
          .
          <source>In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)</source>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Saon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuo</surname>
            ,
            <given-names>H. K. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rennie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Picheny</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>The IBM 2016 English conversational telephone speech recognition system</article-title>
          .
          <source>In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          (pp.
          <fpage>5085</fpage>
          -
          <lpage>5089</lpage>
          ). IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Xiong</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alleva</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Droppo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Stolcke</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>The Microsoft 2017 conversational speech recognition system</article-title>
          .
          <source>In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          (pp.
          <fpage>5934</fpage>
          -
          <lpage>5938</lpage>
          ). IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Hannun</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Case</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Casper</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Catanzaro</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Diamos</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elsen</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , ... &amp;
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A. Y.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Deep Speech: Scaling up end-to-end speech recognition</article-title>
          .
          <source>arXiv preprint arXiv:1412</source>
          .
          <fpage>5567</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Amodei</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ananthanarayanan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anubhai</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bai</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Battenberg</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Case</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , ... &amp;
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Deep Speech 2: End-to-end speech recognition in English and Mandarin</article-title>
          .
          <source>In Proceedings of the 33rd International Conference on Machine Learning (ICML)</source>
          (pp.
          <fpage>173</fpage>
          -
          <lpage>182</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Baevski</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohamed</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Auli</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>wav2vec 2.0: A framework for self-supervised learning of speech representations</article-title>
          .
          <source>Advances in Neural Information Processing Systems (NeurIPS)</source>
          ,
          <volume>33</volume>
          ,
          <fpage>12449</fpage>
          -
          <lpage>12460</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Belcic</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <article-title>Hyperparameter Tuning: Approaches and Best Practices</article-title>
          .
          <source>In IBM</source>
          ,
          <year>2024</year>
          . Available at: https://www.ibm.com/think/topics/hyperparameter-tuning
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kenshimov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mukhanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Merembayev</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Yedilkhan</surname>
          </string-name>
          , “
          <article-title>A comparison of convolutional neural networks for Kazakh sign language recognition</article-title>
          ,
          <source>” EEJET</source>
          , vol.
          <volume>5</volume>
          , no.
          <volume>2</volume>
          (
          <issue>113</issue>
          ), pp.
          <fpage>44</fpage>
          -
          <lpage>54</lpage>
          , Oct.
          <year>2021</year>
          , doi: 10.15587/1729-4061.2021.241535.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Mukhanov</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Uskenbayeva</surname>
          </string-name>
          , “
          <article-title>Pattern Recognition with Using Effective Algorithms and</article-title>
          Methods of Computer Vision Library,” in
          <source>Optimization of Complex Systems: Theory, Models, Algorithms and Applications</source>
          , vol.
          <volume>991</volume>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Le Thi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Le</surname>
          </string-name>
          , and T. Pham Dinh, Eds.,
          <source>in Advances in Intelligent Systems and Computing</source>
          , vol.
          <volume>991</volume>
          . , Cham: Springer International Publishing,
          <year>2020</year>
          , pp.
          <fpage>810</fpage>
          -
          <lpage>819</lpage>
          . doi: 10.1007/978-3-030-21803-4_81.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Mukhanov</surname>
            ,
            <given-names>Samat</given-names>
          </string-name>
          &amp; Uskenbayeva, Raissa &amp; Rakhim, Abd &amp; Akim, Akbota &amp; Mamanova,
          <string-name>
            <surname>Symbat.</surname>
          </string-name>
          (
          <year>2024</year>
          ).
          <article-title>Gesture recognition of the Kazakh alphabet based on machine and deep learning models</article-title>
          .
          <source>Procedia Computer Science</source>
          .
          <volume>241</volume>
          .
          <fpage>458</fpage>
          -
          <lpage>463</lpage>
          .
          doi: 10.1016/j.procs.2024.08.064.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Bazarbekov</surname>
            <given-names>I.М.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ipalakova</surname>
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daineko</surname>
            <given-names>E.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mukhanov S.B. DEVELOPMENT AND DATA ANALYSIS OF A ROBO-PEN FOR ALZHEIMER'S DISEASE DIAGNOSIS: PRELIMINARY</surname>
          </string-name>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>