Enhancing Domain-Specific ASR Performance Using Finetuning and Zero-Shot Prompting: A Study in the Medical Domain

Utsav Bandyopadhyay Maulik1,†, Pabitra Mitra1,*,† and Sudeshna Sarkar1,†
1 Dept. of Computer Science and Engineering, IIT Kharagpur, West Bengal, India
The 2024 Sixth Doctoral Symposium on Intelligence Enabled Research (DoSIER 2024), November 28–29, 2024, Jalpaiguri, India
* Corresponding author. † These authors contributed equally.
utsav2000@gmail.com (U. B. Maulik); pabitra@gmail.com (P. Mitra); shudeshna@gmail.com (S. Sarkar)

Abstract
Domain adaptation has emerged as an important development in speech recognition systems for improving the transcription accuracy of input audio. This study explores the enhancement of domain-specific Automatic Speech Recognition performance through finetuning and postprocessing with Large Language Models, focusing specifically on the medical domain. We investigate how domain-specific finetuning and advanced text postprocessing techniques can significantly improve transcription accuracy in medical contexts, reducing errors in specialized terminology, acronyms, and abbreviations. Our findings highlight the benefits of integrating Large Language Model based postprocessing with Automatic Speech Recognition systems to achieve better results in complex domains.

Keywords
ASR, Finetuning, LLM, Postprocessing, Medical Domain, Domain Adaptation

1. Introduction

Automatic Speech Recognition (ASR) enables computers to transcribe spoken language into text and has evolved significantly from statistical methods [1, 2, 3] to advanced end-to-end deep learning models. Key advancements in this shift include the works of [4] and [5], which highlighted deep learning's role in end-to-end ASR systems. Despite these improvements, challenges persist in domain-specific fields like medicine, where specialized vocabulary and jargon pose difficulties. Medical ASR faces constraints such as limited labeled data, complex terminology, accents, dialect variations, unheard terms, and privacy concerns, causing state-of-the-art models to underperform. Domain Adaptation (DA) is therefore essential to address these limitations effectively.

Domain Adaptation [6] involves tailoring a machine learning model to perform effectively on data from a domain different from its training domain. In speech recognition, domain adaptation is crucial. For example, an ASR system trained on conversational English may struggle with legal proceedings or technical support calls, where language and context deviate significantly from the training data. Similarly, conversations in specialized fields like medicine contain complex terms, acronyms, and unique phrases that general ASR models may fail to transcribe accurately. For instance, a standard ASR system may incorrectly transcribe medical terms like "hypertension" or "tachycardia", leading to confusion or errors. Domain adaptation addresses these challenges by integrating domain-specific knowledge into the model.

Even with fine-tuning, domain-specific ASR systems may produce imperfect outputs, making postprocessing crucial. Postprocessing improves ASR results by correcting errors, refining text, and enhancing accuracy and readability.
In medical transcription, for instance, minor errors can lead to significant misunderstandings, highlighting the importance of postprocessing to identify mistakes, ensure proper formatting, and refine clinical notes or prescriptions.

Large Language Models (LLMs), trained on extensive text datasets, excel at learning patterns, context, and relationships between words. They have transformed postprocessing for ASR systems by intelligently improving raw transcriptions through context understanding, error correction, and word prediction. For example, in medical contexts, phrases like "high tension" can be accurately corrected to "hypertension", or "pencil in" to "penicillin", based on the context. Traditional language models, such as n-grams, rely on fixed word sequences and statistical probabilities to predict text. These models analyze word frequency and patterns but struggle with complex or less frequent combinations. In contrast, LLMs like GPT-3, trained on massive datasets, capture not only word sequences but also deeper semantic meaning and context across sentences or paragraphs. This allows them to handle diverse tasks, from answering questions to generating detailed, coherent text, with greater versatility and accuracy. For instance, while a classical model might predict "he is going to" based on frequency, an LLM could predict "he is going to the hospital for surgery" by fully grasping the context.

In this work, we address the aforementioned challenges of domain-specific ASR in the medical field. We focus on developing a method using open-source, publicly available models and datasets, making it well suited for use by the entire community. Pre-trained ASR models are used, which are further fine-tuned on domain-specific datasets without hampering their generalization. LLMs are then integrated with these fine-tuned ASR models to further enhance domain-specific word recognition.

2. Related Work

Most modern ASR systems leverage deep learning, particularly Recurrent Neural Networks (RNNs) [7, 8] and Transformer models [9]. Tools like Google's Speech-to-Text API, Microsoft's Azure Speech Services, and OpenAI's Whisper model have made ASR scalable and accessible. Earlier, RNNs, particularly Long Short-Term Memory (LSTM) networks [10], marked an early advancement in handling sequential data by maintaining context over longer text spans; [11] explores the use of LSTM-based models for such applications. More recently, Transformer-based models [9] have revolutionized ASR by allowing for parallel processing and capturing much larger contexts in both directions. The self-attention mechanism used in Transformers enables the model to consider the entire sentence, or even multiple sentences, when predicting the next word, thus significantly improving the system's ability to handle complex or domain-specific language.

Advances in speech recognition have also been driven by the rise of self-supervised and unsupervised pre-training methods, such as Wav2Vec 2.0 [12]. The Whisper ASR model of Radford et al. [13], for instance, employs such architectures to improve context understanding and achieve high transcription accuracy across various languages and domains. Recent work on domain-specific ASR has focused on techniques like transfer learning, where a general ASR model is adapted to a specific domain by fine-tuning it on smaller, domain-specific datasets. Gulati et al. [14] propose Conformer models, demonstrating the effectiveness of transfer learning for domain adaptation.
Similarly, Chen et al. [15] explored fine-tuning pre-trained models like Wav2Vec 2.0, showing that this technique can significantly improve recognition accuracy, particularly for emotion detection in speech. Liu et al. [16], in their paper "Exploration of Whisper Fine-Tuning Strategies for Low-Resource ASR", explored various strategies for fine-tuning the Whisper ASR model in low-resource environments.

While finetuning can improve the performance of ASR models, errors still occur, especially in the transcription of medical jargon, acronyms, and abbreviations. Postprocessing techniques may be used to refine the ASR outputs to address this. Traditionally, rule-based correction systems and classical language models, such as n-gram models, were used to detect and correct errors in transcription. The advent of Large Language Models (LLMs) like GPT-3 [17], GPT-4, and LLaMA (Large Language Model Meta AI) [18] has opened up new possibilities for postprocessing ASR outputs. For example, Adedeji et al. [19], in their paper "The Sound of Healthcare: Improving Medical Transcription ASR Accuracy with Large Language Models" (Google Cloud), explored how integrating large language models can significantly enhance the accuracy of medical transcriptions produced by ASR systems. However, they used only commercially available models.

The integration of ASR systems in the medical domain is not new. ASR technology has been used in radiology, electronic health record (EHR) documentation, and telemedicine. However, achieving high accuracy remains a challenge due to the complexities of medical speech, which often includes technical language, accents, poor-quality audio, and noisy environments.

3. Methodology

As shown in Figure 1, our approach involves integrating Automatic Speech Recognition (ASR) models with Large Language Models (LLMs) to improve transcription accuracy, particularly in medical conversations. Initially, the pre-trained ASR models were used to generate raw text predictions from the speech inputs. These predictions were then passed through the LLMs, leveraging prompt engineering techniques to refine the outputs. Subsequently, we fine-tuned the pre-trained ASR models on 80% of the dataset, leaving the remaining 20% as unseen data for evaluation. The fine-tuned ASR models were then used to generate predictions on the unseen test data. These ASR outputs were passed into the LLMs, where prompt engineering techniques were applied once again. This final step enabled us to generate more accurate, contextually refined predictions for medical conversations, improving the system's overall performance.

Figure 1: Method Pipeline
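To make the two-stage pipeline of Figure 1 concrete, the following minimal Python sketch transcribes one audio file with a pre-trained ASR checkpoint and then hands the raw transcript to an LLM-based refinement step. The checkpoint identifier is one of the models evaluated in this study, but the audio file name is hypothetical, and refine_with_llm is only a placeholder for the zero-shot prompted LLaMA 3 postprocessing described in Sections 3.2 and 3.4; this is an illustrative sketch, not the authors' exact implementation.

from transformers import pipeline

# Stage 1: raw transcription with a pre-trained (or fine-tuned) ASR checkpoint.
# chunk_length_s lets the pipeline handle consultation recordings longer than 30 s.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small",
               chunk_length_s=30)
raw_transcript = asr("consultation_01.wav")["text"]  # hypothetical audio file

# Stage 2: LLM postprocessing with a zero-shot prompt (see Section 3.4).
def refine_with_llm(text: str) -> str:
    # Placeholder for the zero-shot prompted LLaMA 3 call described later.
    ...

refined_transcript = refine_with_llm(raw_transcript)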
3.1. Automatic Speech Recognition

This study evaluates four prominent ASR models: wav2vec2-base-960h and wav2vec2-large-960h from Facebook [12], and whisper-small and whisper-large from OpenAI [13].

Wav2Vec 2.0 [12], developed by Facebook AI, is a state-of-the-art self-supervised ASR model designed to learn speech representations from large amounts of unlabeled audio data. The wav2vec 2.0 model learns general speech representations by masking random portions of the input waveform and training the model to predict the masked regions. This approach is akin to BERT-style pre-training for natural language processing.

• Base (wav2vec2-base-960h): The base model has 95 million parameters and is trained on the 960-hour Librispeech dataset. It consists of a convolutional feature encoder followed by Transformer layers.
• Large (wav2vec2-large-960h): The large model has 317 million parameters, offering more capacity and improved performance due to its deeper Transformer architecture. It is also trained on the 960-hour Librispeech dataset.

Whisper [13] is a model developed by OpenAI, designed as a general-purpose speech recognition system. Unlike many ASR models, Whisper is capable of multilingual transcription and translation tasks, making it highly versatile. Whisper is trained in a fully supervised manner on an extensive dataset of 680,000 hours of labeled speech data sourced from the web.

• Whisper Small: This model variant contains 244 million parameters, using an encoder-decoder architecture similar to those found in sequence-to-sequence models.
• Whisper Large: This model variant contains approximately 1.5 billion parameters, utilizing an encoder-decoder architecture akin to those used in sequence-to-sequence models. Its extensive parameter count enables it to capture more intricate patterns in audio, resulting in improved accuracy and robustness in diverse acoustic environments.

All of the ASR models we used are open source and were downloaded from the Hugging Face library.

3.2. Large Language Model

In our study, two primary variants of LLaMA 3 (Large Language Model Meta AI) were utilized for postprocessing tasks: LLaMA 3 (8 billion parameters) and LLaMA 3 (70 billion parameters). These models, developed by Meta AI, are part of the LLaMA series, which are state-of-the-art Transformer-based language models. The 8 billion and 70 billion parameter variants of LLaMA differ primarily in their scale and capacity. While both models leverage the same underlying architecture based on the Transformer model, the larger 70B variant is more capable of understanding complex relationships in language due to its greater number of parameters.

3.3. Fine-tuning

Fine-tuning ASR models like Whisper and Wav2Vec 2.0 involves adapting the pre-trained model to a specific dataset by continuing the training process on a smaller, domain-specific dataset. In our study, we used 80% of the data for fine-tuning and kept 20% for testing. We set specific training arguments such as the learning rate, batch size, and the number of epochs, using the TrainingArguments class from the transformers library. In our case, learning rates are typically set in the range of 1e-5 to 4e-5 to avoid overfitting, while batch sizes are tuned based on the model and hardware limitations. We used mini-batch gradient descent with a batch size of 8, 16, or 32, depending on GPU capacity. We also used warm-up steps, during which the learning rate is gradually increased; save steps and eval steps were kept between 500 and 1000, controlling how frequently checkpoints were saved and evaluated. The number of training epochs was set to 30. We applied regularization by setting the weight decay to 0.005. Gradient checkpointing was enabled to reduce the memory requirements during the backpropagation phase of training.

During fine-tuning, it is common to freeze certain layers or parameters of the model, particularly those that capture general language knowledge. This helps speed up training and prevents overfitting on the smaller, domain-specific dataset. In our case, the feature encoder of the ASR model is typically frozen during fine-tuning. This component of the model processes raw audio input into latent speech representations. When fine-tuning the ASR model, the model.freeze_feature_encoder() function freezes the parameters of the feature encoder, which means these layers are not updated during the training process. Fine-tuning focuses instead on the top Transformer layers that map the latent representations to text outputs, allowing the model to specialize in a specific task or domain without completely changing the pre-trained checkpoint.
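As a concrete illustration of this training setup, the sketch below configures such a fine-tuning run with the Hugging Face transformers library. The specific values (output directory, learning rate of 3e-5, batch size of 16) are illustrative choices within the ranges stated above, and the dataset preparation, processor, and data collator are omitted; it is a sketch under those assumptions, not the authors' exact training script.

from transformers import Wav2Vec2ForCTC, TrainingArguments, Trainer

# Load the pre-trained checkpoint and freeze the convolutional feature encoder
# so that only the upper Transformer layers adapt to the medical data.
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.freeze_feature_encoder()

training_args = TrainingArguments(
    output_dir="wav2vec2-base-medical",   # hypothetical output path
    learning_rate=3e-5,                   # within the 1e-5 to 4e-5 range
    per_device_train_batch_size=16,       # 8, 16, or 32 depending on GPU memory
    num_train_epochs=30,
    weight_decay=0.005,
    warmup_steps=500,
    save_steps=500,                       # checkpoint frequency (500-1000 in our runs)
    eval_steps=500,
    evaluation_strategy="steps",
    gradient_checkpointing=True,          # reduce memory during backpropagation
)

# The Trainer would then be built with the processed dataset and a CTC data
# collator (omitted here):
# trainer = Trainer(model=model, args=training_args, train_dataset=train_ds,
#                   eval_dataset=test_ds, data_collator=data_collator)
# trainer.train()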
3.4. Prompt Engineering

In our approach with LLaMA 3, various prompt engineering strategies were explored to optimize model performance. Initially, we experimented with Zero-Shot prompting. Simple, generalized prompts were given first, and the prompt was then progressively enriched with cues about the domain and the type of input, using key phrases like "medical consultations", "doctor-patient conversations", and "health-based discussions" to provide domain-specific contextual guidance. In addition, we experimented with passing different chunk sizes to the LLMs. We processed one sentence at a time and compared this approach with larger 5-line and 10-line chunks to determine whether longer inputs helped the model grasp the context of medical conversations more effectively, especially in doctor-patient interactions.

4. Experimental Results

4.1. Dataset

Our research utilized the PriMock57 [20] dataset by Babylon Health, comprising 57 mock medical consultations totaling 9 hours of recorded speech. These consultations span diverse medical scenarios typical of clinical practice, with an average of 1500 spoken words per session. The dataset is balanced by gender between clinicians and patient actors, with participants aged 25-45 years. It includes various accents: clinicians primarily speak British English, while patients represent Indian and European dialects, reflecting the linguistic diversity of UK healthcare. To simulate real-world clinical settings, we combined the separate audio tracks for doctors and patients into a single file. This step was essential to capture the natural flow of medical dialogues and evaluate ASR performance in noisy healthcare environments.

4.2. Evaluation Metric

In this work, we use Word Error Rate (WER) as the primary evaluation metric to measure the performance of the transcription system. WER is a common metric used to assess the accuracy of ASR systems. It calculates the minimum number of word-level edits (insertions, deletions, and substitutions) required to transform the system's transcription into the reference text. The formula for WER is as follows:

WER = (S + D + I) / N

where S is the number of substitutions, D the number of deletions, I the number of insertions, and N the total number of words in the reference text.
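For illustration, the following short Python function computes WER via word-level edit distance; it is a generic implementation of the formula above rather than the specific evaluation code used in this study.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, computed via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: minimum edits to turn the first j hypothesis words
    # into the first i reference words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,  # substitution or match
                           dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1)        # insertion
    return dp[len(ref)][len(hyp)] / len(ref)

# Example: one substitution and one insertion over four reference words -> 0.5
print(word_error_rate("the patient has hypertension",
                      "the patient has high tension"))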
4.3. Results and Comparison

Table 1 compares the Word Error Rate (WER) of several speech recognition models, namely both versions of wav2vec2 and the small and large versions of Whisper, both pre-trained and fine-tuned using the domain-specific dataset.

Table 1: Effects of Fine-Tuning Pretrained ASR Models

Model Used                 Fine-Tuning           WER
wav2vec2-base-960h         None                  47.90
wav2vec2-large-960h        None                  44.92
wav2vec2-base-finetuned    ft. using 80% data    29.70
whisper-small              None                  36.70
whisper-large              None                  34.70
whisper-small fine-tuned   ft. using 80% data    20.30

• The wav2vec2-base-960h model, without any fine-tuning, achieves a WER of 47.90, indicating moderate performance. Its larger variant, wav2vec2-large-960h, slightly improves this with a WER of 44.92, demonstrating the impact of increased model capacity on performance.
• Fine-tuning the wav2vec2-base model with 80% of the data leads to a significant improvement, reducing the WER to 29.70. This highlights the effectiveness of fine-tuning in enhancing the model's ability to generalize to the specific data it is trained on.
• The whisper-small and whisper-large models, which are not fine-tuned, achieve WERs of 36.70 and 34.70, respectively. The larger model benefits from greater capacity.
• Fine-tuning the whisper-small model using 80% of the data brings about the largest improvement in performance, reducing the WER to 20.30. This further emphasizes the importance of model fine-tuning for domain-specific tasks.

We have seen that fine-tuning greatly enhances model performance. However, fine-tuning is dataset-dependent and at times may force the model weights to overfit the training data. We therefore examine the effect of using an LLM to postprocess the outputs produced by the ASR models.

We tested providing input as single sentences, chunks of n sentences, and all sentences together. Single sentences performed poorly, offering no improvement in ASR output. Chunks of 10-20 sentences performed better, with the best results at n = 20, as the broader medical consultation context improves error correction. Beyond n = 20, performance declines. Due to LLaMA token limits, all sentences cannot be input at once.

Table 2 summarizes the performance of the various ASR models based on their Word Error Rate after postprocessing the raw outputs using LLaMA 3 with Zero-Shot prompting.

Table 2: Comparison of WER across Models with Zero-Shot Prompt Post-Processing

Model          Variant                    WER (after LLM postprocessing)
Wav2vec 2.0    wav2vec2-base-960h         35.5
               wav2vec2-large-960h        28.7
               wav2vec2-base-finetuned    21.9
Whisper        whisper-small              36.3
               whisper-large              34.7
               whisper-small finetuned    22.7

Model Comparison:

• Each of the wav2vec2 models performs significantly better after the postprocessing step. Using the Zero-Shot prompt, the wav2vec2-base model has a reduced WER of 35.5 from 47.90, which represents a reduction of 25.8%. The wav2vec2-large model has an improved WER of 28.7 from 44.92, which accounts for a reduction of 36.11%. The wav2vec2-base-finetuned model achieves the lowest WER across all models, with an improved score of 21.9 from 29.70. This suggests that LLM postprocessing after the fine-tuning process significantly enhances the model's ability to understand and transcribe spoken language accurately.
• Conversely, the whisper-small model now exhibits the highest WER. Whisper was already producing much better results than the wav2vec models, but the LLM postprocessing setup does not provide any improvement here. At times it even works adversely, showing that, although some errors are corrected by the LLM, it also changes many words that were originally correct. Moreover, Whisper already transcribes most of the domain-specific words correctly, and most of the remaining errors are due to the informal nature of the consultations and filler words, which are hard for the LLM to refine. Whisper also produces a lot of punctuation, which contributes to the WER as well.
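To show how this postprocessing step can be wired together, the sketch below splits an ASR transcript into chunks of n = 20 sentences and sends each chunk to a LLaMA 3 instruct model along with the zero-shot instruction reproduced after the sketch. The checkpoint name, the use of the transformers text-generation pipeline with chat-style messages, and the generation settings are assumptions made for illustration; the paper does not specify how LLaMA 3 was served.

from transformers import pipeline

# Zero-shot system instruction (abridged); the full prompt is reproduced below.
ZERO_SHOT_PROMPT = ("You are a text refining model. ... just the refined "
                    "sentence is expected.")

def chunk_sentences(sentences, n=20):
    """Group ASR output sentences into chunks of n sentences (best reported n = 20)."""
    for i in range(0, len(sentences), n):
        yield " ".join(sentences[i:i + n])

# Assumed checkpoint; access to the gated Meta-Llama-3 weights is required.
llm = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

def refine_transcript(sentences):
    refined = []
    for chunk in chunk_sentences(sentences):
        messages = [
            {"role": "system", "content": ZERO_SHOT_PROMPT},
            {"role": "user", "content": chunk},
        ]
        # Recent transformers versions accept chat-style message lists directly.
        out = llm(messages, max_new_tokens=1024, do_sample=False)
        refined.append(out[0]["generated_text"][-1]["content"])
    return " ".join(refined)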
Zero-Shot Prompt Example:

"You are a text refining model. There are medical consultation audios between doctors and patients and the transcribed text of the speech is your input. You are expected to use your advanced understanding of medical terminology, conversational context and sentence structure to refine the input text. Note that you just need to refine misspelt or inaccurate words according to its context. Do not make any grammatical changes. We need to calculate Word Error Rate, hence you are expected to refine a word but not change its position or generate anything new. Do not ask for confirmation. Do not give any reasoning or justification in the output, just the refined sentence is expected. If it is not possible to understand the context or meaning or language of the input sentence, just return the sentence as it is. Do not return empty sentences or make drastic changes."

5. Discussion and Conclusion

In this study, we extensively utilized pre-trained open-source ASR models, open-source LLM models, and their combinations. For transcribing domain-specific audio conversations, we observed that the best results were achieved by fine-tuning the Whisper ASR model. For wav2vec 2.0, the lowest Word Error Rate was obtained using a combined pipeline of fine-tuning followed by Zero-Shot-prompted LLM postprocessing. The present study considers only zero-shot prompts. In future work, few-shot, chain-of-thought, and other prompting techniques need to be explored to further improve the performance of the proposed approach.

There are certain limitations to this work. Firstly, fine-tuning is highly dataset-specific and can result in overfitting. Therefore, as our results indicate, if fine-tuning is not feasible, the most effective performance is achieved using the wav2vec 2.0 large version followed by Zero-Shot-prompted LLM postprocessing. Although Whisper Large significantly outperforms wav2vec 2.0 Large, this study has not observed any enhancement of Whisper's domain-specific transcripts through LLM postprocessing. On the contrary, we observed adverse effects when attempting to use LLMs to refine the Whisper transcripts. These findings underscore the need for further research aimed at reducing Word Error Rates more effectively and reliably.

Declaration on Generative AI

During the preparation of this work, the authors used ChatGPT and Grammarly to rephrase text and perform grammar and spelling checks. After using these tools, the authors reviewed and edited the content as needed. The authors take full responsibility for the publication's content.

References

[1] F. Jelinek, Statistical methods for speech recognition, MIT Press, 1998.
[2] M. Gales, S. Young, et al., The application of hidden Markov models in speech recognition, Foundations and Trends in Signal Processing 1 (2008) 195–304.
[3] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al., The Kaldi speech recognition toolkit, in: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, IEEE Signal Processing Society, 2011.
[4] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, IEEE Signal Processing Magazine 29 (2012) 82–97.
[5] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al., Deep Speech 2: End-to-end speech recognition in English and Mandarin, in: International Conference on Machine Learning, PMLR, 2016, pp. 173–182.
[6] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, J. W. Vaughan, A theory of learning from different domains, Machine Learning 79 (2010) 151–175.
[7] D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning representations by back-propagating errors, Nature 323 (1986) 533–536.
[8] L. R. Medsker, L. Jain, et al., Recurrent neural networks, Design and Applications 5 (2001) 2.
[9] A. Vaswani, et al., Attention is all you need, Advances in Neural Information Processing Systems (2017).
[10] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation (1997).
[11] A. Graves, Long short-term memory, in: Supervised Sequence Labelling with Recurrent Neural Networks (2012) 37–45.
[12] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, Advances in Neural Information Processing Systems 33 (2020) 12449–12460.
[13] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust speech recognition via large-scale weak supervision, in: International Conference on Machine Learning, PMLR, 2023, pp. 28492–28518.
[14] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al., Conformer: Convolution-augmented transformer for speech recognition, arXiv preprint arXiv:2005.08100 (2020).
[15] L.-W. Chen, A. Rudnicky, Exploring wav2vec 2.0 fine-tuning for improved speech emotion recognition, in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023, pp. 1–5.
[16] Y. Liu, X. Yang, D. Qu, Exploration of Whisper fine-tuning strategies for low-resource ASR, EURASIP Journal on Audio, Speech, and Music Processing 2024 (2024) 29.
[17] T. B. Brown, et al., Language models are few-shot learners, arXiv preprint arXiv:2005.14165 (2020).
[18] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., LLaMA: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).
[19] A. Adedeji, S. Joshi, B. Doohan, The sound of healthcare: Improving medical transcription ASR accuracy with large language models, arXiv preprint arXiv:2402.07658 (2024).
[20] A. P. Korfiatis, F. Moramarco, R. Sarac, A. Savkov, PriMock57: A dataset of primary care mock consultations, arXiv preprint arXiv:2204.00333 (2022).

A. Online Resources

• PriMock57 - Medical Conversation Dataset
• Wav2Vec2 - Pretrained ASR Model by Facebook
• Whisper - Pretrained ASR Model by OpenAI
• LLAMA - Large Language Model by Meta AI