<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing Domain-Specific ASR Performance Using Finetuning and Zero-Shot Prompting: A Study in the Medical Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Utsav Bandyopadhyay Maulik</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pabitra Mitra</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sudeshna Sarkar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. of Computer Science and Engineering</institution>
          ,
          <addr-line>IIT Kharagpur, West Bengal</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Domain Adaptation has emerged as an important development in Speech Recognition systems for improving the transcription accuracy of the input audio. This study explores the enhancement of Domain-specific Automatic Speech Recognition performance through finetuning and postprocessing using Large Language Models, focusing specifically on the medical domain. We investigate how domain-specific finetuning and advanced text postprocessing techniques can significantly improve transcription accuracy in medical contexts, reducing errors in specialized terminology, acronyms, and abbreviations. Our findings highlight the benefits of integrating Large Language Model based postprocessing with Automatic Speech Recognition systems to achieve better results in complex domains.</p>
      </abstract>
      <kwd-group>
        <kwd>ASR</kwd>
        <kwd>Finetuning</kwd>
        <kwd>LLM</kwd>
        <kwd>Postprocessing</kwd>
        <kwd>Medical Domain</kwd>
        <kwd>Domain Adaptation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Automatic Speech Recognition (ASR) enables computers to transcribe spoken language into text and
has evolved significantly from statistical methods [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ] to advanced end-to-end deep learning
models. Key advancements in this shift include works by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which highlighted deep learning’s
role in end-to-end ASR systems. Despite these improvements, challenges persist in domain-specific
fields like medicine, where specialized vocabulary and jargon pose difficulties. Medical ASR faces
constraints such as limited labeled data, complex terminology, accents, dialect variations, previously unheard
terms, and privacy concerns, causing state-of-the-art models to underperform. Domain Adaptation
(DA) is therefore essential to address these limitations effectively.
      </p>
      <p>
        Domain Adaptation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] involves tailoring a machine learning model to perform effectively on data
from a domain different from its training domain. In speech recognition, domain adaptation is crucial.
For example, an ASR system trained on conversational English may struggle with legal proceedings or
technical support calls, where language and context deviate significantly from the training data.
Similarly, conversations in specialized fields like medicine contain complex terms, acronyms, and
unique phrases that general ASR models may fail to transcribe accurately. For instance, a standard ASR
system may incorrectly transcribe medical terms like “hypertension” or “tachycardia”, leading to
confusion or errors. Domain adaptation addresses these challenges by integrating domain-specific
knowledge into the model.
      </p>
      <p>Even with fine-tuning, domain-specific ASR systems may produce imperfect outputs, making
postprocessing crucial. Postprocessing improves ASR results by correcting errors, refining text, and
enhancing accuracy and readability. In medical transcription, for instance, minor errors can lead to
significant misunderstandings, highlighting the importance of postprocessing to identify mistakes,
ensure proper formatting, and refine clinical notes or prescriptions.</p>
      <p>Large Language Models (LLMs), trained on extensive text datasets, excel at learning patterns, context,
and relationships between words. They have transformed postprocessing for ASR systems by
intelligently improving raw transcriptions through context understanding, error correction, and word
prediction. For example, in medical contexts, phrases like “high tension” can be accurately corrected to
“hypertension” or “pencil in” to “penicillin” based on the context.</p>
      <p>Traditional language models, such as n-grams, rely on fixed word sequences and statistical probabilities
to predict text. These models analyze word frequency and patterns but struggle with complex or less
frequent combinations. In contrast, LLMs like GPT-3, trained on massive datasets, capture not only
word sequences but also deeper semantic meaning and context across sentences or paragraphs. This
allows them to handle diverse tasks, from answering questions to generating detailed, coherent text,
with greater versatility and accuracy. For instance, while a classical model might predict “he is going to”
based on frequency, an LLM could predict “he is going to the hospital for surgery” by fully grasping the
context.</p>
      <p>In this work, we address the aforementioned challenges of domain-specific ASR in the medical field.
We focus on developing a method using open-source, publicly available models and datasets, making it
ideally suited for use by the entire community. Pre-trained ASR models are used, which are further
fine-tuned on the domain-specific datasets without hampering their generalization. LLMs are then
integrated with these fine-tuned ASR models to further enhance domain-specific word recognition.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Most modern ASR systems leverage deep learning, particularly Recurrent Neural Networks (RNNs) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and Transformer models [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Tools like Google’s Speech-to-Text API, Microsoft’s Azure Speech
Services, and OpenAI’s Whisper model have made ASR scalable and accessible.
      </p>
      <p>
        Earlier, RNNs, particularly Long Short-Term Memory (LSTM) networks [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], marked an early
advancement in handling sequential data by maintaining context over longer text spans. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] explores
the use of LSTM-based models for such applications. More recently, Transformer-based models [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
have revolutionized ASR by allowing for parallel processing and capturing much larger contexts in
both directions. This self-attention mechanism used in transformers enables the model to consider the
entire sentence or even multiple sentences when predicting the next word, thus significantly improving
the system’s ability to handle complex or domain-specific language. Advances in speech recognition
have been driven by the rise of self-supervised and unsupervised pre-training methods, such as
Wav2Vec 2.0 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Radford et al., in their work on the Whisper ASR model [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], for instance,
employ such architectures to improve context understanding and achieve high transcription accuracy
across various languages and domains.
      </p>
      <p>
        Recent work on domain-specific ASR has focused on techniques like transfer learning, where a general
ASR model is adapted to a specific domain by fine-tuning it on smaller, domain-specific datasets. Gulati
et al. in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] propose Conformer models, demonstrating the effectiveness of transfer learning for domain
adaptation. Similarly, Chen et al. in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] explored fine-tuning pre-trained models like Wav2Vec 2.0,
showing that this technique can significantly improve recognition accuracy, particularly for emotion
detection in speech. Liu et al. in their paper [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] “Exploration of Whisper Fine-Tuning Strategies for
Low-Resource ASR”, explored various strategies for fine-tuning the Whisper ASR model in
low-resource environments.
      </p>
      <p>
        While finetuning can improve the performance of ASR models, errors still occur, especially in the
transcription of medical jargon, acronyms, and abbreviations. Postprocessing techniques may be used
to refine the ASR outputs to address this. Traditional rule-based correction systems and classical
language models, such as n-gram models, were used to detect and correct errors in transcription. The
advent of Large Language Models (LLMs) like GPT-3 [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], GPT-4, and LLAMA (Large Language Model
Meta AI) [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] has opened up new possibilities for postprocessing ASR outputs. For example, [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], in
their paper “The Sound of Healthcare: Improving Medical Transcription ASR Accuracy with Large
Language Models” (Google Cloud), explored how integrating large language models can significantly
enhance the accuracy of medical transcription produced by ASR systems. However, they used
only commercially available models.
      </p>
      <p>The integration of ASR systems in the medical domain is not new. ASR technology has been used in
radiology, electronic health record (EHR) documentation, and telemedicine. However, achieving high
accuracy remains a challenge due to the complexities of medical speech, which often includes technical
language, accents, poor-quality audio, and noisy environments.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>As shown in Figure 1, our approach involves integrating Automatic Speech Recognition (ASR) models
with Large Language Models (LLMs) to improve transcription accuracy, particularly in medical
conversations. Initially, the pre-trained ASR models were used to generate raw text predictions from
the speech inputs. These predictions were then passed through the LLMs, leveraging prompt
engineering techniques to refine the outputs.</p>
      <p>Subsequently, we fine-tuned the pre-trained ASR models on 80% of the dataset, leaving the remaining
20% as unseen data for evaluation. The fine-tuned ASR models were then used to generate predictions
on the unseen test data. These ASR outputs were passed into the LLMs, where prompt engineering
techniques were applied once again. This final step enabled us to generate more accurate, contextually
refined predictions for medical conversations, improving the system’s overall performance.</p>
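      <p>The overall flow can be illustrated with a minimal sketch in Python. The transcribe and refine callables are placeholders for the (fine-tuned) ASR model and the Zero-Shot-prompted LLM described in the following subsections; the 80/20 split mirrors the setup above, and nothing here is the exact implementation used in this study.</p>
      <preformat><![CDATA[
# Minimal sketch of the pipeline described above; `transcribe` and `refine`
# are placeholders for the (fine-tuned) ASR model and the zero-shot-prompted
# LLM, respectively, not the exact functions used in this study.
import random
from typing import Callable, List, Tuple

def run_pipeline(
    samples: List[dict],
    transcribe: Callable[[dict], str],   # ASR: audio sample -> raw transcript
    refine: Callable[[str], str],        # LLM postprocessing: raw -> refined text
    train_ratio: float = 0.8,
    seed: int = 42,
) -> Tuple[List[str], List[str]]:
    """Split the consultations 80/20, keep 20% unseen, transcribe that split,
    then refine the raw transcripts with the LLM."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    train_split, test_split = shuffled[:cut], shuffled[cut:]
    # train_split is used for fine-tuning (Section 3.3); evaluation runs on test_split.
    raw = [transcribe(sample) for sample in test_split]
    refined = [refine(text) for text in raw]
    return raw, refined
]]></preformat>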
      <sec id="sec-3-1">
        <title>3.1. Automatic Speech Recognition</title>
        <p>
          This study evaluates four prominent ASR models: wav2vec2-base-960h and wav2vec2-large-960h from
Facebook [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], and whisper-small and whisper-large from OpenAI [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>
          Wav2Vec 2.0 [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], developed by Facebook AI, is a state-of-the-art self-supervised ASR model designed
to learn speech representations from large amounts of unlabeled audio data. The wav2vec 2.0 model
learns general speech representations by masking random portions of the input waveform and training
the model to predict the masked regions. This approach is akin to BERT-style pre-training for natural
language processing.
        </p>
        <p>• Base (wav2vec2-base-960h): The base model has 95 million parameters and is trained on the
960-hour Librispeech dataset. It consists of a convolutional feature encoder followed by
Transformer layers.
• Large (wav2vec2-large-960h): The large model has 317 million parameters, offering more
capacity and improved performance due to the deeper Transformer architecture. It’s also trained
on the 960-hour Librispeech dataset.</p>
        <p>
          Whisper [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] is a model developed by OpenAI, designed as a general-purpose speech recognition
system. Unlike many ASR models, Whisper is capable of multilingual transcription and translation
tasks, making it highly versatile. Whisper is trained in a fully supervised manner on an extensive
dataset of 680,000 hours of labeled speech data sourced from the web.
        </p>
        <p>• Whisper Small: This model variant contains 244 million parameters, using an encoder-decoder
architecture similar to those found in sequence-to-sequence models.
• Whisper Large: This model variant contains approximately 1.5 billion parameters, utilizing an
encoder-decoder architecture akin to those used in sequence-to-sequence models. Its extensive
parameter count enables it to capture more intricate patterns in audio, resulting in improved
accuracy and robustness in diverse acoustic environments.</p>
        <p>All of the ASR models we used are open source and were downloaded using the Hugging Face transformers library.</p>
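        <p>As a minimal illustration, the checkpoints above can be loaded through the Hugging Face transformers pipeline; the audio file name below is a placeholder and the decoding options are simplified compared to a full evaluation script.</p>
        <preformat><![CDATA[
# Sketch: load the four open-source checkpoints from the Hugging Face hub and
# produce raw transcripts. "consultation_01.wav" is an illustrative file name.
from transformers import pipeline

CHECKPOINTS = [
    "facebook/wav2vec2-base-960h",
    "facebook/wav2vec2-large-960h",
    "openai/whisper-small",
    "openai/whisper-large",
]

def transcribe_file(checkpoint: str, audio_path: str) -> str:
    asr = pipeline(
        "automatic-speech-recognition",
        model=checkpoint,
        chunk_length_s=30,  # consultations are long; decode in 30-second chunks
    )
    return asr(audio_path)["text"]

if __name__ == "__main__":
    for ckpt in CHECKPOINTS:
        print(ckpt, "->", transcribe_file(ckpt, "consultation_01.wav")[:80])
]]></preformat>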
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Large Language Model</title>
        <p>In our study, two primary variants of LLaMA 3 (Large Language Model Meta AI) were utilized for
postprocessing tasks: LLaMA 3 (8 billion parameters) and LLaMA 3 (70 billion parameters). These
models, developed by Meta AI, are part of the LLaMA series, which are state-of-the-art
transformer-based language models. The 8 billion and 70 billion parameter variants of LLaMA differ
primarily in their scale and capacity. While both models leverage the same underlying architecture
based on the Transformer model, the larger 70B variant is more capable of understanding complex
relationships in language due to its greater number of parameters.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Fine-tuning</title>
        <p>Fine-tuning ASR models, like Whisper and Wav2Vec 2.0, involves adapting the pre-trained model to a
specific dataset by continuing the training process on a smaller, domain-specific dataset. In our study,
we used 80% of the data for fine-tuning and kept 20% for testing.</p>
        <p>We set specific training arguments such as the learning rate, batch size, and the number of epochs,
using the TrainingArguments class from the transformers library. In our case, learning rates are
typically set in the range of 1e-5 to 4e-5 to avoid overfitting, while batch sizes are tuned based on the
model and hardware limitations. We used mini-batch gradient descent with a batch size of 8, 16, or 32,
depending on GPU capacity. We also used warm-up steps, where the learning rate is gradually
increased; save steps and eval steps were kept between 500 and 1000, controlling how frequently
checkpoints were saved and evaluated. The number of training epochs was set to 30. We used
regularization by setting the weight decay to 0.005. Gradient checkpointing was enabled to reduce
the memory requirements during the backpropagation phase of training.</p>
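        <p>A configuration in this spirit is sketched below; the output directory name and the specific values picked from the stated ranges are illustrative rather than the exact settings of every run.</p>
        <preformat><![CDATA[
# Sketch of the training configuration described above; exact values within the
# reported ranges (batch size, learning rate, save/eval frequency) varied by run.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-base-960h-medical",  # illustrative checkpoint directory
    per_device_train_batch_size=16,           # 8, 16 or 32 depending on GPU capacity
    learning_rate=3e-5,                       # chosen from the 1e-5 to 4e-5 range
    warmup_steps=500,                         # learning rate is ramped up gradually
    num_train_epochs=30,
    weight_decay=0.005,                       # regularization
    evaluation_strategy="steps",
    eval_steps=500,                           # evaluate every 500-1000 steps
    save_steps=500,                           # save checkpoints every 500-1000 steps
    gradient_checkpointing=True,              # reduce memory use in backpropagation
)
]]></preformat>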
        <p>During fine-tuning, it is common to freeze certain layers or parameters of the model, particularly those
that capture general language knowledge. This helps speed up training and prevents overfitting on the
smaller, domain-specific dataset. In our case, the feature encoder of the ASR model is typically frozen
during fine-tuning. This component of the model processes raw audio input into latent speech
representations. When fine-tuning the ASR model, the freeze_feature_encoder() function
freezes the parameters of the feature encoder, which means these layers are not updated during the
training process. Fine-tuning focuses instead on the top transformer layers that map the latent
representations to text outputs, allowing the model to specialize in a specific task or domain without
drastically changing the pre-trained checkpoints.</p>
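        <p>A minimal sketch of this freezing step for a wav2vec 2.0 checkpoint is shown below; the method name follows the transformers API, and the same idea applies to freezing the Whisper encoder.</p>
        <preformat><![CDATA[
# Sketch: freeze the convolutional feature encoder of a wav2vec 2.0 checkpoint
# so that only the transformer layers and the CTC head are updated during
# fine-tuning.
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.freeze_feature_encoder()  # feature-encoder parameters get no gradient updates

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} of {total:,}")
]]></preformat>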
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Prompt Engineering</title>
        <p>In our approach with LLAMA 3, various prompt engineering strategies were explored to optimize
model performance. Initially, we experimented with Zero-Shot prompting. Simple generalized prompts
were given, followed by enhancing the prompt, pushing it further towards the domain, and the type of
input, using key phrases like “medical consultations”, “doctor-patient conversations” and “health-based
discussions” to provide domain-specific contextual guidance.</p>
        <p>In addition, we experimented with passing different chunk sizes to the LLMs. We processed one
sentence at a time and compared this approach with larger 5-line and 10-line chunks to determine
whether longer inputs helped the model grasp the context of medical conversations more effectively,
especially in doctor-patient interactions.</p>
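        <p>A sketch of this chunking and Zero-Shot prompting step is given below; the prompt text is abridged from the full prompt shown in Section 4.3, and the generate callable stands in for whichever LLaMA 3 chat interface is used.</p>
        <preformat><![CDATA[
# Sketch: split an ASR transcript into chunks of n sentences and send each chunk
# to the LLM with a zero-shot, domain-specific instruction. `generate(system, user)`
# is a placeholder for the LLaMA 3 chat call.
import re
from typing import Callable, Iterator

ZERO_SHOT_PROMPT = (
    "You are a text refining model. The input is the transcribed text of a medical "
    "consultation between a doctor and a patient. Refine only misspelt or inaccurate "
    "words according to their context; do not reorder words, make grammatical changes, "
    "or generate anything new. Return only the refined text."
)

def chunk_sentences(transcript: str, n: int) -> Iterator[str]:
    """Yield chunks of n sentences from the raw ASR transcript."""
    sentences = re.split(r"(?<=[.?!])\s+", transcript.strip())
    for i in range(0, len(sentences), n):
        yield " ".join(sentences[i:i + n])

def refine_transcript(transcript: str, generate: Callable[[str, str], str], n: int = 10) -> str:
    refined = [generate(ZERO_SHOT_PROMPT, chunk) for chunk in chunk_sentences(transcript, n)]
    return " ".join(refined)
]]></preformat>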
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>
          Our research utilized the PriMock57 [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] dataset by Babylon Health, comprising 57 mock medical
consultations totaling 9 hours of recorded speech. These consultations span diverse medical scenarios
typical of clinical practice, with an average of 1500 spoken words per session. The dataset is balanced
by gender between clinicians and patient actors, with participants aged 25-45 years. It includes various
accents: clinicians primarily speak British English, while patients represent Indian and European
dialects, reflecting the linguistic diversity of UK healthcare.
        </p>
        <p>To simulate real-world clinical settings, we combined separate audio tracks for doctors and patients
into a single file. This step was essential to capture the natural flow of medical dialogues and evaluate
ASR performance in noisy healthcare environments.</p>
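        <p>A minimal sketch of this track-mixing step, using the pydub package, is shown below; the file names follow the PriMock57 naming scheme but are illustrative.</p>
        <preformat><![CDATA[
# Sketch: merge the separate clinician and patient tracks of one consultation into
# a single audio file. File names are illustrative.
from pydub import AudioSegment

def mix_consultation(doctor_path: str, patient_path: str, out_path: str) -> None:
    doctor = AudioSegment.from_wav(doctor_path)
    patient = AudioSegment.from_wav(patient_path)
    # overlay() keeps the length of the first track; the two sides of a consultation
    # are recorded simultaneously, so their durations match closely.
    mixed = doctor.overlay(patient)
    mixed.export(out_path, format="wav")

mix_consultation("day1_consultation01_doctor.wav",
                 "day1_consultation01_patient.wav",
                 "day1_consultation01_mixed.wav")
]]></preformat>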
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation Metric</title>
        <p>In this work, we use Word Error Rate (WER) as the primary evaluation metric to measure the
performance of the transcription system. WER is a common metric used to assess the accuracy of ASR
systems. It calculates the minimum number of word-level edits (insertions, deletions, and substitutions)
required to transform the system’s transcription into the reference text.</p>
        <p>The formula for WER is as follows:</p>
        <p>WER = (S + D + I) / N</p>
        <p>S: Number of substitutions
D: Number of deletions
I: Number of insertions
N: Total number of words in the reference text</p>
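        <p>The metric can be computed with a standard word-level edit-distance routine, as in the small self-contained sketch below; packages such as jiwer implement the same computation.</p>
        <preformat><![CDATA[
# Sketch of the WER computation defined above (equivalent results can be
# obtained with the `jiwer` package).
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words: d[i][j] is the minimum number of
    # substitutions + deletions + insertions needed to turn hyp[:j] into ref[:i].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("patient has hypertension", "patient has high tension"))
]]></preformat>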
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results and Comparison</title>
        <p>Model Used</p>
      </sec>
      <sec id="sec-4-4">
        <title>Fine-Tuning</title>
        <p>None
None
None
None</p>
        <p>WER
• The wav2vec2-base-960h model, without any fine-tuning, achieves a WER of 47.90, indicating
moderate performance. Its larger variant, wav2vec2-large-960h, slightly improves this with a
WER of 44.92, demonstrating the impact of increased model capacity on performance.
• Fine-tuning the wav2vec2-base model with 80% of the data leads to a significant improvement,
reducing the WER to 29.70. This highlights the effectiveness of fine-tuning in enhancing the
model’s ability to generalize to the specific data it is trained on.
• Whisper-small and whisper-large models, which are not fine-tuned, achieve WERs of 36.70
and 34.70, respectively. The larger model benefits from greater capacity.
• Fine-tuning the whisper-small model using 80% of the data brings about the largest
improvement in performance, reducing the WER to 20.30. This further emphasizes the
importance of model fine-tuning for domain-specific tasks.</p>
        <p>We have seen how fine-tuning substantially enhances model performance. However, fine-tuning is dataset
dependent and at times may force the model weights to overfit the training data. Let us now examine
the effects of using an LLM to postprocess the outputs provided by the ASR models.
We tested providing input as single sentences, chunks of n sentences, and all sentences together. Single
sentences performed poorly, offering no improvement in ASR output. Chunks of 10–20 sentences
performed better, with the best results at n = 20, as medical consultation context improves error
correction. Beyond n = 20, performance declines. Due to LLAMA token limits, all sentences cannot be
input at once.
• Each of the wav2vec2 models performs significantly better after the postprocessing step. Using
the Zero-Shot prompt, the wav2vec2 base model has a reduced WER of 35.5 from 47.90, which
represents a reduction of 25.8%. The wav2vec2 large model has an improved WER of 28.7 from
44.92, which accounts for a reduction of 36.11%. The wav2vec2-base-finetuned model achieves
the lowest WER across all models, with an improved score of 21.9 from 29.70. This suggests that
LLM postprocessing after the fine-tuning process significantly enhances the model’s ability to
understand and transcribe spoken language accurately.
• Conversely, the whisper-small model now exhibits the highest WER. As we saw, Whisper
was producing much better results than the wav2vec models, but the LLM postprocessing setup
does not provide any improvement here. At times, it works adversely, showing that, although
some errors are corrected by the LLM, it also changes many words that were originally correct.
Also, Whisper already transcribes most of the domain-specific words correctly, and most of the errors are due to
the informal nature of the consultations and filler words, which are hard for the LLM to refine.</p>
        <p>Moreover, Whisper produces a lot of punctuation, which also contributes to the WER.
Zero-Shot Prompt Example:</p>
        <p>You are a text refining model. There are medical consultation audios between doctors and patients
and the transcribed text of the speech is your input. You are expected to use your advanced
understanding of medical terminology, conversational context and sentence structure to refine
the input text. Note that you just need to refine misspelt or inaccurate words according to its
context. Do not make any grammatical changes. We need to calculate Word Error Rate, hence
you are expected to refine a word but not change its position or generate anything new. Do not
ask for confirmation. Do not give any reasoning or justification in the output, just the refined
sentence is expected. If it is not possible to understand the context or meaning or language of the
input sentence, just return the sentence as it is. Do not return empty sentences or make drastic
changes.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Conclusion</title>
      <p>In this study, we extensively utilized pre-trained open-source ASR models, open-source LLM models,
and their combinations. For transcribing domain-specific audio conversations, we observed that the
best results were achieved by fine-tuning the Whisper ASR model. For wav2vec 2.0, the lowest Word
Error Rate was obtained using a combined pipeline of fine-tuning followed by Zero-Shot-prompted
LLM postprocessing. The present study considers only Zero-Shot prompts. In future work, few-shot,
chain-of-thought, and several other prompting techniques need to be explored to further improve the
performance of the proposed model.</p>
      <p>There are certain limitations of this work. Firstly, fine-tuning is highly dataset-specific and can result in
overfitting. Therefore, as our results indicate, if fine-tuning is not feasible, the most effective
performance is achieved using the wav2vec 2.0 large version followed by Zero-Shot-prompted LLM
postprocessing. Although Whisper Large significantly outperforms wav2vec 2.0 Large, this study has
not yet observed the enhancement of Whisper’s domain-specific transcripts using LLM postprocessing.
On the contrary, we observed adverse effects when attempting to use LLMs to refine the Whisper
transcripts. These findings underscore the need for further research aimed at reducing Word Error
Rates more effectively and reliably.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly to rephrase and
perform grammar and spelling checks. After using these tools, the authors reviewed and edited the
content as needed. The authors take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Online Resources</title>
      <p>• PriMock57 - Medical Conversation Dataset,
• Wav2vec2 - Pretrained ASR Model by Facebook,
• Whisper - Pretrained ASR Model by OpenAI,
• LLAMA - Large Language Model</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Jelinek</surname>
          </string-name>
          ,
          <article-title>Statistical methods for speech recognition</article-title>
          , MIT press,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Young</surname>
          </string-name>
          , et al.,
          <article-title>The application of hidden markov models in speech recognition</article-title>
          ,
          <source>Foundations and Trends® in Signal Processing</source>
          <volume>1</volume>
          (
          <year>2008</year>
          )
          <fpage>195</fpage>
          -
          <lpage>304</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghoshal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Boulianne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Burget</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Glembek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hannemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Motlicek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          , et al.,
          <article-title>The kaldi speech recognition toolkit</article-title>
          , in: IEEE 2011 workshop
          <article-title>on automatic speech recognition and understanding</article-title>
          ,
          <source>IEEE Signal Processing Society</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Dahl</surname>
          </string-name>
          , A.-r. Mohamed,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jaitly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Senior</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. N.</given-names>
            <surname>Sainath</surname>
          </string-name>
          , et al.,
          <article-title>Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups</article-title>
          ,
          <source>IEEE Signal processing magazine</source>
          <volume>29</volume>
          (
          <year>2012</year>
          )
          <fpage>82</fpage>
          -
          <lpage>97</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ananthanarayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Anubhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Battenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Case</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Casper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Catanzaro</surname>
          </string-name>
          , Q. Cheng, G. Chen, et al.,
          <article-title>Deep speech 2: End-to-end speech recognition in english and mandarin</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>173</fpage>
          -
          <lpage>182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ben-David</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Blitzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Crammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kulesza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Vaughan</surname>
          </string-name>
          ,
          <article-title>A theory of learning from different domains</article-title>
          ,
          <source>Machine learning 79</source>
          (
          <year>2010</year>
          )
          <fpage>151</fpage>
          -
          <lpage>175</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Rumelhart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <article-title>Learning representations by back-propagating errors</article-title>
          ,
          <source>nature</source>
          <volume>323</volume>
          (
          <year>1986</year>
          )
          <fpage>533</fpage>
          -
          <lpage>536</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L. R.</given-names>
            <surname>Medsker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jain</surname>
          </string-name>
          , et al.,
          <source>Recurrent neural networks, Design and Applications</source>
          <volume>5</volume>
          (
          <year>2001</year>
          )
          <article-title>2</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          ,
          <article-title>Long short-term memory</article-title>
          , Neural Computation MIT-Press (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <article-title>Long short-term memory</article-title>
          ,
          <source>Supervised sequence labelling with recurrent neural networks</source>
          (
          <year>2012</year>
          )
          <fpage>37</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <article-title>wav2vec 2.0: A framework for self-supervised learning of speech representations</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>12449</fpage>
          -
          <lpage>12460</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          , T. Xu,
          <string-name>
            <given-names>G.</given-names>
            <surname>Brockman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>McLeavey</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <article-title>Robust speech recognition via large-scale weak supervision</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>28492</fpage>
          -
          <lpage>28518</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gulati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-C. Chiu</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Parmar</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            , W. Han,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
          </string-name>
          , et al.,
          <article-title>Conformer: Convolution-augmented transformer for speech recognition</article-title>
          ,
          <source>arXiv preprint arXiv:2005.08100</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>L.-W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rudnicky</surname>
          </string-name>
          ,
          <article-title>Exploring wav2vec 2.0 fine tuning for improved speech emotion recognition</article-title>
          ,
          <source>in: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <article-title>Exploration of whisper fine-tuning strategies for low-resource asr</article-title>
          ,
          <source>EURASIP Journal on Audio, Speech, and Music Processing</source>
          <year>2024</year>
          (
          <year>2024</year>
          )
          <fpage>29</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>T. B. Brown</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>arXiv preprint arXiv:2005.14165</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.-A. Lachaux</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Lacroix</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Rozière</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Hambro</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Azhar</surname>
          </string-name>
          , et al.,
          <article-title>Llama: Open and efficient foundation language models</article-title>
          ,
          <source>arXiv preprint arXiv:2302.13971</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Adedeji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Doohan</surname>
          </string-name>
          ,
          <article-title>The sound of healthcare: Improving medical transcription asr accuracy with large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2402.07658</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Korfiatis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Moramarco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sarac</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Savkov,</surname>
          </string-name>
          <article-title>Primock57: A dataset of primary care mock consultations</article-title>
          ,
          <source>arXiv preprint arXiv:2204.00333</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>