Better Transcription of UK Supreme Court Hearings

Hadeel Saadany (1), Catherine Breslin (2), Constantin Orăsan (3) and Sophie Walker (4)
(1) Centre for Translation Studies, University of Surrey, United Kingdom
(2) Kingfisher Labs Ltd, United Kingdom
(3) Centre for Translation Studies, University of Surrey, United Kingdom
(4) Just Access, United Kingdom

Workshop on Artificial Intelligence for Access to Justice (AI4AJ 2023), June 19, 2023, Braga, Portugal

Abstract
Transcription of legal proceedings is very important for enabling access to justice. However, manual speech transcription is an expensive and slow process. In this paper we describe part of a combined research and industrial project for building an automated transcription tool designed specifically for the justice sector in the UK. We explain the challenges involved in transcribing court room hearings and the Natural Language Processing (NLP) techniques we employ to tackle these challenges. We show that fine-tuning a generic off-the-shelf pre-trained Automatic Speech Recognition (ASR) system with an in-domain language model, as well as infusing common phrases extracted with a collocation detection model, can not only improve the Word Error Rate (WER) of the transcribed hearings but also avoid critical errors specific to the legal jargon and terminology commonly used in British courts.

Keywords: Legal Transcription, UK Supreme Court, Automatic Speech Recognition

1. Introduction

There has been a recent interest in employing NLP techniques to aid the textual processing of the legal domain [1, 2, 3, 4]. In contrast, processing spoken court hearings has not received the same attention as understanding legal text documents. In the UK legal system, court hearing sessions have a unique tradition of verbal argument. Moreover, these hearings crucially aid in new case preparation, provide guidance for court appeals, help in legal training and even guide future policy. However, the audio material for a case typically spans several hours, which makes it both time- and effort-consuming for legal professionals to extract the information relevant to their needs. Currently, the existing need for legal transcriptions (covering 449K cases p.a. in the UK across all court tribunals [5]) is largely met by human transcribers.

Although there are several current speech-to-text (STT) technology providers which could be used to transcribe this data automatically, most of these systems are trained on general domain data, which may result in domain-specific transcription errors when they are applied to a specialised domain. One way to address this problem is for end-users to train their own ASR engines using their in-domain data. However, in most cases the amount of data available is too low to enable them to train a system which can compete with well-known cloud-based ASR systems trained on much larger datasets. At the same time, in commercial scenarios, using generic cloud-based ASR systems to transcribe a specialised domain may result in sub-optimal transcription quality for clients who require this service.

This holds particularly true for British court room audio. When applying a generic cloud-based ASR system (in our case Amazon Transcribe) to British court room recordings, the Word Error Rate (WER) remains relatively high due to the hearings' length, the multiplicity of speakers, complex speech patterns and, more crucially, unique pronunciations and domain-specific vocabulary. The examples in Table 1 show some common problems we faced when transcribing UK court hearings with off-the-shelf ASR systems such as Amazon Web Services (AWS) Transcribe (https://aws.amazon.com/transcribe/). The references are taken from human-generated ground-truth transcripts of real UK Supreme Court hearings (https://www.supremecourt.uk/decided-cases/index.html) created by the legal editors in our project's team.

Table 1: Examples of errors produced by Amazon Transcribe for legal hearings (reference vs. ASR output).

Model       Transcript
Reference   So my lady um it is difficult to...
AWS ASR     So melody um it is difficult to...
Reference   All rise ...
AWS ASR     All right ...
Reference   it makes further financial order
AWS ASR     it makes further five natural

The first error is due to a special pronunciation of the phrase 'my lady' in British court rooms: it is pronounced like 'mee-lady' when barristers address a female judge. Similarly, in the second example, the error relates to the linguistic etiquette of UK court hearings, which the ASR system consistently fails to recognise. The error in the third example, on the other hand, is related to legal terminology critical to the specific transcribed case. Errors similar to the third example are numerous in our dataset and also affect named entities such as numbers and names that are vital to understanding the legal argument in the transcribed cases. These errors can lead to serious information loss and cause confusion.

In this paper, we describe a joint research and commercial effort to perform domain adaptation of a generic ASR system to mitigate the errors in automated UK court transcription services. We propose to minimise legal-specific errors by fine-tuning off-the-shelf ASR systems with a custom language model (CLM) trained on legal documents as well as 139 hours of human-edited transcriptions of UK Supreme Court hearings. We also employ NLP techniques to automatically build a custom vocabulary of common multi-word expressions and word n-gram collocations that are critical in court hearings. We infuse this custom vocabulary into the CLM at transcription time. In this research, we evaluate the benefits of our proposed domain adaptation methods by comparing the WER of the CLM output with two off-the-shelf ASR systems: AWS Transcribe (commercial) and the OpenAI Whisper model (open-source) [6]. We also compare the general improvement in the ASR system's ability to correctly transcribe legal entities with and without our proposed methods. In addition, we discuss the transcription time with different ASR settings, since transcription time is critical for the commercial pipeline implemented by the industrial partner of the project.
2. Related Work

Automatic speech recognition (ASR) models convert audio input to text, and they perform best when used to transcribe data similar to the data they were trained on. However, performance degrades when there is a mismatch between the data used for training and the data being transcribed. Additionally, some types of audio material are intrinsically harder for speech recognition systems to transcribe. In practice, this means that speech recognition performance degrades when, for example, there is background noise [7], non-native accents [8, 9], young or elderly speakers [8], or a shift in domain [10].

Performance degradation is typically mitigated by adapting or fine-tuning ASR models towards the domain of the targeted data using a domain-specific dataset [11, 12, 13]. Some methods for domain adaptation adopt NLP techniques, such as using machine translation models to learn a mapping from out-of-domain ASR errors to in-domain terms [14]. An alternative approach is to build a large ASR model with a substantially varied training set, so that the model is more robust to data shifts. An example of this latter approach is the recently released OpenAI Whisper model, which is trained on 680k hours of diverse domain data to generalise well on a range of unseen datasets without the need for explicit adaptation [6].

Moreover, ASR models are evaluated using Word Error Rate (WER), which treats each incorrect word equally. However, ASR models do not perform equally on different categories of words. Performance is worse for categories like names of people and organisations than for categories like numbers or dates [15]. ASR research has therefore targeted improving specific errors, such as those affecting named entities, using NLP techniques [16, 17].
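For reference, WER is the word-level edit distance between hypothesis and reference (substitutions + deletions + insertions) divided by the number of reference words. A minimal illustration, using the first example from Table 1 and the open-source jiwer package (chosen here purely for illustration; it is not part of the paper's pipeline):

```python
# Minimal WER illustration with the open-source jiwer package.
import jiwer

reference = "so my lady um it is difficult to proceed"
hypothesis = "so melody um it is difficult to proceed"

# "my lady" -> "melody" aligns as one substitution plus one deletion,
# so WER = 2 errors / 9 reference words ~= 0.22
print(jiwer.wer(reference, hypothesis))
```

Because every word counts equally, the modest 0.22 score here hides the fact that both errors fall exactly on the legally meaningful form of address.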
In this paper, we propose simple techniques to mitigate the effect of the domain mismatch between a generic ASR model and the specialised domain of British court room hearings. Our proposed method improves both the system's WER and its ability to capture case-specific terms and entities. In the next section, we present the setup of our experiments and the evaluation results.

3. Experiment Setup

Figure 1 illustrates our proposed pipeline to improve ASR system performance with legal domain-adaptation techniques. First, we build a custom language model (CLM) by fine-tuning the base AWS ASR system, using two types of training data: 1) textual data from the legal domain, and 2) a corpus of human-generated legal transcriptions. Second, we use NLP techniques to extract domain-specific phrases and legal entities from the in-domain data to create a vocabulary list. We use both the CLM and the vocabulary list for transcribing legal proceedings. The following sections explain the details of our experiment, where we implemented this pipeline on the AWS Transcribe base model. We compare the performance of our CLM model with different settings to the AWS Transcribe base ASR system and the OpenAI Whisper open-source ASR system when transcribing ≈ 12 hours of UK Supreme Court hearings.

[Figure 1: Pipeline for Improving ASR Output for Legal-Specific Errors]

3.1. Fine-tuning the ASR system

AWS Transcribe improves the quality of speech recognisers by employing an architecture known as the recurrent neural network-transducer (RNN-T) [18]. It is an end-to-end model for automatic speech recognition which has gained popularity in recent years as a way to fold the separate components of a conventional ASR system (i.e., acoustic, pronunciation and language models) into a single neural network [19]. The AWS Transcribe platform allows fine-tuning of this ASR architecture by building custom language models to improve transcription accuracy for domain-specific speech. Creating a robust custom language model requires a significant amount of text data, which must contain spoken domain-specific vocabulary.

For training our CLM, we use two datasets from the legal domain. The first consists of the Supreme Court written judgements of 43 cases, comprising 3.26M tokens, scraped from the official site of the UK Supreme Court (https://www.supremecourt.uk/decided-cases/). The second dataset consists of ≈ 81 hours of gold-standard transcripts of 10 Supreme Court hearings. The gold-standard transcripts are created by a team of legal professionals post-editing the AWS Transcribe output of the court hearings using a specially designed interface. We use both datasets to train a CLM that fine-tunes the base AWS ASR architecture to the UK legal domain.
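The CLM training and adapted transcription steps run against the AWS Transcribe API. The sketch below shows the general shape of these calls with boto3, assuming in-domain text staged in S3; every resource name, S3 location and IAM role ARN is a hypothetical placeholder, not a dump of our production configuration.

```python
# Sketch of CLM training and adapted transcription via AWS Transcribe.
# All names, buckets and ARNs below are hypothetical placeholders.
import boto3

transcribe = boto3.client("transcribe", region_name="eu-west-2")

# 1) Train a custom language model on the in-domain text
#    (judgements + gold-standard transcripts) staged in S3.
transcribe.create_language_model(
    LanguageCode="en-GB",
    BaseModelName="WideBand",  # for audio sampled at 16 kHz or higher
    ModelName="uk-supreme-court-clm",
    InputDataConfig={
        "S3Uri": "s3://example-bucket/legal-training-text/",
        "DataAccessRoleArn": "arn:aws:iam::123456789012:role/TranscribeAccess",
    },
)

# 2) Register a custom vocabulary (extracted by the phrase and entity
#    models of Section 3.2); AWS hyphenates multi-word phrases.
transcribe.create_vocabulary(
    VocabularyName="uk-legal-vocab",
    LanguageCode="en-GB",
    Phrases=["my-lady", "Lord-Phillips", "financial-order"],
)

# 3) Transcribe a hearing with both the CLM and the vocabulary infused.
transcribe.start_transcription_job(
    TranscriptionJobName="hearing-case1",
    LanguageCode="en-GB",
    Media={"MediaFileUri": "s3://example-bucket/audio/case1.mp3"},
    ModelSettings={"LanguageModelName": "uk-supreme-court-clm"},
    Settings={"VocabularyName": "uk-legal-vocab"},
)
```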
3.2. Phrase Extraction Model

For the vocabulary list, we use a dataset of ≈ 139 hours of gold-standard transcriptions of Supreme Court hearings along with the Supreme Court judgements used for training the CLM. To extract the vocabulary from this dataset, we implement two methods. First, we use this dataset to train a phrase detection model that collocates bigrams based on Pointwise Mutual Information (PMI) scoring of the words in context [20]. PMI is a measure of association between words; it compares the probability of two words occurring together with the probability that would be expected if the two words were independent.
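Formally, for a word pair (x, y) with marginal probabilities p(x), p(y) and joint probability p(x, y):

```latex
\mathrm{PMI}(x,y) = \log \frac{p(x,y)}{p(x)\,p(y)}, \qquad
\mathrm{NPMI}(x,y) = \frac{\mathrm{PMI}(x,y)}{-\log p(x,y)}
```

NPMI is the normalised variant introduced in [20]; it is bounded in [-1, 1], with 1 indicating words that only ever occur together, which makes a fixed score threshold easier to interpret.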
We train the collocation model using the Gensim Python library [21], with PMI as the probability scoring method and the minimum score threshold for a bigram to be taken into account set to 1. The collocation model is trained on the textual data of the Supreme Court transcriptions and the Supreme Court judgements; it is then used to extract a list of the most common bigrams in this dataset.
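A minimal sketch of this step is shown below, assuming the gensim 4.x API; the corpus file, pre-processing and threshold are illustrative placeholders ('npmi' is Gensim's normalised-PMI scorer, whose scores lie in [-1, 1]).

```python
# Sketch of the bigram collocation model (gensim 4.x assumed).
from gensim.models.phrases import Phrases

# One tokenised sentence per line; real pre-processing (lowercasing,
# punctuation stripping, etc.) is omitted for brevity.
with open("legal_corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

# 'npmi' scores lie in [-1, 1], so the minimum-score threshold below
# is set accordingly for this sketch.
bigrams = Phrases(sentences, min_count=5, threshold=0.5, scoring="npmi")

# Print all detected bigrams with their scores, strongest first.
for phrase, score in sorted(bigrams.export_phrases().items(),
                            key=lambda kv: -kv[1])[:20]:
    print(phrase.replace("_", " "), round(score, 3))
```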
Figure 2 shows an example of the type of common phrases extracted by our collocation model along with their frequencies.

[Figure 2: Example of Common Collocations Extracted by the Phrase Extraction Model]

As can be seen from the figure, the extracted phrases include frequent legal terms (highlighted in blue in the figure) as well as named entities, such as names of institutions and persons (highlighted in yellow), which are specific to the Supreme Court cases included in the training corpus.

The second method we employ to create the custom vocabulary list is to identify named entities in our dataset. For this purpose, we use Blackstone (https://research.iclr.co.uk/blackstone), an NLP library for processing long-form and unstructured legal text that is capable of identifying legal entities. The list of legal entities includes: Case Name, Court Name, Provision (i.e. a clause in a legal instrument), Instrument (i.e. a legal term of art) and Judge. We concatenated this Blackstone entity list with the spaCy v3.4 library's list of non-legal entities, such as Cardinals, Persons and Dates, as sketched below.
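The sketch below illustrates this entity-based extraction step. It assumes Blackstone's en_blackstone_proto model and a general-purpose spaCy pipeline (en_core_web_sm here, as a stand-in for whichever v3.4 model is used) are installed; note that Blackstone was released against spaCy 2.x, so in practice the two pipelines may need to run in separate environments.

```python
# Sketch of entity-based vocabulary extraction with Blackstone + spaCy.
# Model names and the example text are illustrative assumptions.
import spacy

legal_nlp = spacy.load("en_blackstone_proto")  # legal entities
general_nlp = spacy.load("en_core_web_sm")     # non-legal entities

LEGAL_LABELS = {"CASENAME", "COURT", "PROVISION", "INSTRUMENT", "JUDGE"}
GENERAL_LABELS = {"PERSON", "DATE", "CARDINAL"}

def extract_vocabulary(text: str) -> set:
    """Collect surface forms of legal and general named entities."""
    vocab = set()
    for ent in legal_nlp(text).ents:
        if ent.label_ in LEGAL_LABELS:
            vocab.add(ent.text)
    for ent in general_nlp(text).ents:
        if ent.label_ in GENERAL_LABELS:
            vocab.add(ent.text)
    return vocab

print(extract_vocabulary(
    "As Lady Hale observed, section 25(2)(a) of the Matrimonial "
    "Causes Act 1973 governs the financial order."))
```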
The results of applying our domain-adaptation methods to the transcription of 2 Supreme Court case hearings, consisting of 12 hours of audio, are presented in the next section.

4. Results

Table 2 shows the WER scores and the average WER for the 2 transcribed cases with different CLM system settings, as well as for the two baseline systems: AWS Transcribe (AWS base) and Whisper. The different CLM settings are as follows:

1. CLM1 is trained only on the texts of the Supreme Court judgements.
2. CLM2 is trained on both the judgements and the gold-standard transcripts.
3. CLM2+Vocab uses CLM2 for transcription plus the global vocabulary list extracted by our phrase detection model.
4. CLM2+Vocab2 uses CLM2 for transcription plus the legal entities vocabulary list extracted by the Blackstone and spaCy v3.4 libraries.

Table 2: Average WER and transcription time.

Model         WER Case1   WER Case2   WER Average   Transcription Time
AWS base      8.7         16.2        12.3          85 mins
CLM1          8.5         16.5        12.4          77 mins
CLM2          7.9         15.5        11.6          77 mins
CLM2+Vocab    7.9         15.6        11.6          132 mins
CLM2+Vocab2   8.0         15.6        11.7          112 mins
Whisper       9.6         15.3        12.4          191 mins

As can be seen in Table 2, the ASR performance is consistently better with the CLM models than with the generic ASR systems for the two transcribed cases. The CLM2 model, trained on textual data (i.e. the written judgements) and gold-standard court hearing transcriptions, outperforms AWS base and Whisper with a 9% and 8% relative WER improvement, respectively. Moreover, we observe around 9% improvement in the average WER score over the two generic models when concatenating the list of legal phrases extracted by our phrase detection model with the CLM2 system. While the ASR error correction indicates improved transcription quality with our proposed domain adaptation methods, we also evaluated the ASR systems' performance on specific errors such as legal entities and terms.

Table 3 shows the average ratio of correctly transcribed legal entities in the two studied court room hearings. We compare the performance of CLM2 infused with the legal terms list (CLM2+Vocab) to the two generic ASR systems.

Table 3: Ratio of correctly captured legal entities by the ASR systems.

Entity      AWS base   Whisper   CLM2+Vocab
Judge       0.66       0.77      0.84
Case Name   0.69       0.85      0.71
Court       0.98       1         0.93
Provision   0.88       0.95      0.97
Cardinal    1          0.97      1

The ratios in Table 3 indicate that CLM2+Vocab is generally more capable of transcribing legal-specific terms than the other two models. It is also better at transcribing critical legal entities such as Provisions (a Provision, a statement within an agreement or a law, typically consists of alphanumeric utterances in British court hearings, e.g. 'section 25(2)(a)-(h)' or 'rule 3.17'). Such legal terminology needs to be accurately transcribed, and our CLM2 model with legal vocabulary demonstrates better reliability in transcribing these terms.

A similar trend is evident for the legal entity Judge, which refers to the forms of address used in British court rooms (e.g. 'Lord Phillips', 'Lady Hale'). This entity is typically repeated in court hearings whenever a barrister or solicitor addresses the court. We see that both generic ASR systems perform badly on this category, with ratios of 0.66 and 0.69, respectively. On the other hand, we observe a significant improvement in correctly transcribing this type of entity with CLM2+Vocab, which achieves a ratio of 0.84 correct transcriptions. Appendix A shows an example of the output of the AWS base ASR model without our domain-adaptation methods compared to the output of the CLM correcting the mistakes. The transcription errors (highlighted in yellow) in the base output include legal jargon, legal terms and named entities. The errors are corrected by our CLM model (corrections are highlighted in blue).

In addition to evaluating the output of the ASR engines, we also recorded the time required to produce the transcription. The models based on AWS were run in the cloud using the Amazon infrastructure. Whisper was run on a Linux desktop with an NVIDIA GeForce RTX 2070 GPU with 8GB VRAM; for all the experiments, the medium English-only model was used. As expected, the fastest running time is obtained using the AWS base model. Running the best performing model increases the time by 55% (from 85 to 132 minutes), whilst Whisper more than doubles it. The trade-off between running time and the level of domain-specific accuracy is a variable parameter that can be determined based on the transcription purpose and the end-user needs defined by our project's commercial partner.

5. Conclusion

In this paper, we present a study which shows the effect of domain adaptation methods on improving off-the-shelf ASR system performance in transcribing a specialised domain such as British court hearings. We optimised the performance of the ASR system by training an ASR custom language model on gold-standard legal transcripts and textual data from the legal domain. We also trained a phrase detection model to incorporate an extracted list of data-specific bigram collocations at transcription time. We evaluated the ASR quality improvements both in terms of average WER and the ratio of correctly transcribed legal-specific terms. We observe significant gains in ASR transcription quality from our domain adaptation techniques. For commercial use of ASR technologies, improving the error rate in general, and the transcription quality of critical legal terms in particular, would minimise manual post-editing effort and hence save both time and money. We plan to evaluate the impact of the different configurations proposed in this paper on the editors' post-editing effort.

In the future, we will expand our recordings to cover a variety of accents, addressing another axis of degradation in British court proceedings beyond the Supreme Court hearings, whose speakers are a largely homogeneous group. We will also explore using NLP topic modelling techniques to connect the legal entities that were crucial to a court's decision in a case.

References

[1] E. Elwany, D. Moore, G. Oberoi, BERT goes to law school: Quantifying the competitive advantage of access to large legal corpora in contract understanding, arXiv preprint arXiv:1911.00473 (2019).
[2] J. J. Nay, Natural Language Processing for Legal Texts, Cambridge University Press, 2021, pp. 99–113. DOI: 10.1017/9781316529683.011.
[3] E. Mumcuoğlu, C. E. Öztürk, H. M. Ozaktas, A. Koç, Natural language processing in law: Prediction of outcomes in the higher courts of Turkey, Information Processing & Management 58 (2021) 102684.
[4] J. Frankenreiter, J. Nyarko, Natural language processing in legal tech, Legal Tech and the Future of Civil Justice (David Engstrom ed.) (2022).
[5] G. Sturge, Court statistics for England and Wales, Technical Report, House of Commons Library, 2021. URL: https://commonslibrary.parliament.uk/research-briefings/cbp-8372/.
[6] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust speech recognition via large-scale weak supervision, OpenAI (2022).
[7] S. Watanabe, M. Mandel, J. Barker, E. Vincent, A. Arora, X. Chang, S. Khudanpur, V. Manohar, D. Povey, D. Raj, et al., CHiME-6 Challenge: Tackling multispeaker speech recognition for unsegmented recordings, 2020.
[8] S. Feng, O. Kudina, B. M. Halpern, O. Scharenborg, Quantifying bias in automatic speech recognition, arXiv preprint arXiv:2103.15122 (2021).
[9] Y. Zhang, Mitigating bias against non-native accents, Delft University of Technology (2022).
[10] L. Mai, J. Carson-Berndsen, Unsupervised domain adaptation for speech recognition with unsupervised error correction, Proc. Interspeech 2022 (2022) 5120–5124.
[11] Z. Huo, D. Hwang, K. C. Sim, S. Garg, A. Misra, N. Siddhartha, T. Strohman, F. Beaufays, Incremental layer-wise self-supervised learning for efficient speech domain adaptation on device, arXiv preprint arXiv:2110.00155 (2021).
[12] H. Sato, T. Komori, T. Mishima, Y. Kawai, T. Mochizuki, S. Sato, T. Ogawa, Text-only domain adaptation based on intermediate CTC, Proc. Interspeech 2022 (2022) 2208–2212.
[13] S. Dingliwal, A. Shenoy, S. Bodapati, A. Gandhe, R. T. Gadde, K. Kirchhoff, Domain prompts: Towards memory and compute efficient domain adaptation of ASR systems, https://tinyurl.com/2a9jp88t, 2022.
[14] A. Mani, S. Palaskar, N. V. Meripo, S. Konam, F. Metze, ASR error correction and domain adaptation using machine translation, in: ICASSP 2020 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 6344–6348.
[15] M. Del Rio, N. Delworth, R. Westerman, M. Huang, N. Bhandari, J. Palakapilly, Q. McNamara, J. Dong, P. Zelasko, M. Jetté, Earnings-21: A practical benchmark for ASR in the wild, arXiv preprint arXiv:2104.11348 (2021).
[16] H. Wang, S. Dong, Y. Liu, J. Logan, A. K. Agrawal, Y. Liu, ASR error correction with augmented transformer for entity retrieval, in: Interspeech, 2020, pp. 1550–1554.
[17] N. Das, D. H. Chau, M. Sunkara, S. Bodapati, D. Bekal, K. Kirchhoff, Listen, know and spell: Knowledge-infused subword modeling for improving ASR performance of OOV named entities, in: ICASSP 2022 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2022, pp. 7887–7891.
[18] A. Graves, Sequence transduction with recurrent neural networks, arXiv preprint arXiv:1211.3711 (2012).
[19] J. Guo, G. Tiwari, J. Droppo, M. Van Segbroeck, C.-W. Huang, A. Stolcke, R. Maas, Efficient minimum word error rate training of RNN-transducer for end-to-end speech recognition, arXiv preprint arXiv:2007.13802 (2020).
[20] G. Bouma, Normalized (pointwise) mutual information in collocation extraction, Proceedings of GSCL 30 (2009) 31–40.
[21] R. Řehůřek, P. Sojka, Software framework for topic modelling with large corpora, in: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Valletta, Malta, 2010, pp. 45–50. URL: http://is.muni.cz/publication/884893/en.

A. Appendix: Examples of ASR output with and without domain-adaptation