<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>June</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Better Transcription of UK Supreme Court Hearings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hadeel Saadany</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Catherine Breslin</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Constantin Orăsan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sophie Walker</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Translation Studies, University of Surrey</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Just Access</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Kingfisher Labs Ltd</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>19</volume>
      <issue>2023</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Transcription of legal proceedings is very important for enabling access to justice. However, manual speech transcription is an expensive and slow process. In this paper we describe part of a combined research and industrial project for building an automated transcription tool designed specifically for the justice sector in the UK. We explain the challenges involved in transcribing court room hearings and the Natural Language Processing (NLP) techniques we employ to tackle these challenges. We show that fine-tuning a generic off-the-shelf pre-trained Automatic Speech Recognition (ASR) system with an in-domain language model, as well as infusing common phrases extracted with a collocation detection model, can not only improve the Word Error Rate (WER) of the transcribed hearings but also avoid critical errors specific to the legal jargon and terminology commonly used in British courts.</p>
      </abstract>
      <kwd-group>
        <kwd>Legal Transcription</kwd>
        <kwd>UK Supreme Court</kwd>
        <kwd>Automatic Speech Recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        There has been a recent interest in employing NLP
techniques to aid the textual processing of the legal domain
[
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ]. In contrast, processing spoken court hearings
has not received the same attention as understanding
legal text documents. In the UK legal system, court
hearing sessions have a unique tradition of verbal argument.
Moreover, these hearings crucially aid in new case
preparation, provide guidance for court appeals, help in
legal training and even guide future policy. However,
the audio material for a case typically spans several
hours, which makes it both time- and effort-consuming
for legal professionals to extract important information
relevant to their needs. Currently, the existing need for
legal transcriptions (covering 449K cases p.a. in the UK
across all court tribunals [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) is largely met by human transcribers.
Table 1. Examples of generic ASR errors on UK court hearings (Reference vs. AWS ASR output):
Reference: “So my lady um it is difficult to...” | AWS ASR: “So melody um it is difficult to...”
Reference: “All rise ...” | AWS ASR: “All right ...”
Reference: “it makes further financial order” | AWS ASR: “it makes further five natural”
      </p>
      <p>Although there are several current speech-to-text
(STT) technology providers which could be used to transcribe
this data automatically, most of these systems
are trained on general domain data, which may result
in domain-specific transcription errors if applied to a specialised
domain. One way to address this problem is for
end-users to train their own ASR engines using their
in-domain data. However, in most of the cases the amount
of data available is too low to enable them to train a system
which can compete with well-known cloud-based
ASR systems which are trained on much larger datasets.
At the same time, in commercial scenarios, using generic
cloud-based ASR systems to transcribe a specialised
domain may result in sub-optimal quality transcriptions
for clients who require this service.</p>
      <p>This holds particularly true for British court room
audio procedures. When applying a generic cloud-based
ASR system (in our case Amazon Transcribe) on British
court rooms, the Word Error Rate (WER) remains
relatively high due to the hearings’ length, multiplicity of
speakers, complex speech patterns and, more crucially,
unique pronunciations and domain-specific vocabulary.
Examples in Table 1 show some common
problems we faced when transcribing UK court hearings
with off-the-shelf ASR systems such as Amazon Web
Services (AWS) Transcribe1. The references are taken from
human-generated ground-truth transcripts of real UK
Supreme Court Hearings2 created by the legal editors
in our project’s team.</p>
      <p>The first error is due to a special
pronunciation of the phrase ‘my lady’ in British court
rooms, as it is pronounced like ‘mee-lady’ when barristers
address a female judge. Similarly, in the second
example, the error relates to the linguistic etiquette of
UK court hearings which the ASR system consistently
fails to recognise. The error in the third example, on the
other hand, is related to legal terminology critical to the
specific transcribed case. Errors similar to the third example
are numerous in our dataset and also affect named
entities such as numbers and names that are vital in
understanding the legal argument in the transcribed cases.
These errors can lead to serious information loss and
cause confusion.</p>
      <p>
        In this paper, we describe a joint research and commercial
effort to perform domain adaptation of a generic
ASR system to mitigate the errors in automated UK
court transcription services. We propose to minimise
legal-specific errors by fine-tuning off-the-shelf ASR systems
with a custom language model (CLM) trained on
legal documents as well as 139 hours of human-edited
transcriptions of UK Supreme Court hearings. We also
employ NLP techniques to automatically build a custom
vocabulary of common multi-word expressions and word
n-gram collocations that are critical in court hearings.
We infuse our custom vocabulary into the CLM at transcription
time. In this research, we evaluate the benefits
of our proposed domain adaptation methods by comparing
the WER of the CLM output with two off-the-shelf
ASR systems: AWS Transcribe (commercial) and the OpenAI
Whisper model (open-source) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We also compare
the general improvement in the ASR system’s ability
to correctly transcribe legal entities with and without
adopting our proposed methods. In addition, we discuss
the transcription time with different ASR settings, since
transcription time is critical for the commercial pipeline
implemented by the industrial partner of the project.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Automatic speech recognition (ASR) models convert
audio input to text, and they have optimal performance
when used to transcribe data which is similar to the one
they were trained on. However, performance degrades
when there is a mismatch between the data used for
training and the one that is being transcribed. Additionally,
some types of audio material are intrinsically harder for
speech recognition systems to transcribe. In practice,
this means that speech recognition system performance
degrades when, for example, there is background noise
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], non-native accents [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], young or elderly speakers
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], or a shift in domain [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Performance degradation is typically mitigated by
adapting or fine-tuning ASR models towards the domain
of the targeted data by using a domain-specific dataset
[
        <xref ref-type="bibr" rid="ref11 ref12 ref13">11, 12, 13</xref>
        ]. Some methods for domain adaptation adopt
NLP techniques such as using machine translation models
to learn a mapping from out-of-domain ASR errors to
in-domain terms [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. An alternative approach is to build
a large ASR model with a substantially varied training
set, so that the model is more robust to data shifts. An
example of this latter approach is the recently released
OpenAI Whisper model, which is trained on 680k hours
of diverse domain data to generalise well on a range of
unseen datasets without the need for explicit adaptation
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Moreover, ASR models are evaluated using Word Error
Rate (WER), which treats each incorrect word equally.
However, ASR models do not perform equally on different
categories of words. Performance is worse for categories
like names of people and organisations as compared to
categories like numbers or dates [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. ASR research has targeted
improving specific errors such as different named
entities using NLP techniques [16, 17].
      </p>
      <p>In this paper, we propose simple techniques to mitigate
the effect of the domain mismatch between a generic
ASR model and the specialised domain of British court
room hearings. Our proposed method improves both
the system’s WER and its ability to capture
case-specific terms and entities. In the next section, we
present the setup of our experiments and the evaluation
results.
3 https://www.supremecourt.uk/decided-cases/
4 https://research.iclr.co.uk/blackstone</p>
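      <p>The WER metric discussed above is the word-level edit distance (substitutions, insertions and deletions) between a reference transcript and an ASR hypothesis, normalised by the reference length. A minimal from-scratch sketch in Python (illustrative only, not the evaluation code used in the project; the example strings echo Table 1):</p>

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Table 1 example: "financial order" misheard as "five natural"
print(wer("it makes further financial order",
          "it makes further five natural"))  # 2 substitutions / 5 words = 0.4
```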
      <p>The second method we employ to create a list of
custom vocabulary is to identify named entities in our
dataset. For this purpose, we use Blackstone4, an NLP
library for processing long-form and unstructured legal
text capable of identifying legal entities. The list of legal
entities includes: Case Name, Court Name, Provision (i.e.
a clause in a legal instrument), Instrument (i.e. a legal
term of art) and Judge. We concatenated this Blackstone
entity list with the spaCy v3.4 library list of non-legal
entities such as: Cardinals, Persons and Dates. The results
of applying our domain-adaptation methods to the
transcription of 2 Supreme Court case hearings consisting of
12 hours are explained in the next section.</p>
    </sec>
    <sec id="sec-results">
      <title>4. Results</title>
      <p>Table 2 shows the WER scores and the average WER score
for the 2 transcribed cases with different CLM system
settings, as well as for the two baseline systems:
AWS Transcribe (AWS base) and Whisper. The different
CLM settings are as follows:
1. CLM1 is trained on only the texts of the Supreme Court judgements.
2. CLM2 is trained on both the judgements and the gold-standard transcripts.
3. CLM2+Vocab uses CLM2 for transcription plus the global vocabulary list extracted by our phrase detection model.
4. CLM2+Vocab2 uses CLM2 for transcription plus the legal entities vocabulary list extracted by the Blackstone and spaCy v3.4 libraries.</p>
      <p>As can be seen in Table 2, the ASR performance is
consistently better with the CLM models than with the
generic ASR systems for the two transcribed cases. The CLM2
model, trained on textual data (i.e. the written judgements)
and gold-standard court hearing transcriptions,
outperforms AWS base and Whisper with a 9% and 8%
WER improvement, respectively. Moreover, we observe
around 9% improvement in average WER score over the
two generic models when concatenating the list of legal
phrases extracted by our phrase detection model
with the CLM2 system. While ASR error correction
indicates an improved transcription quality with our
proposed domain adaptation methods, we also evaluated the
ASR systems’ performance on specific errors such as
legal entities and terms.</p>
      <p>Table 3 shows the average ratio of correctly transcribed
legal entities in the two studied court room hearings.
We compare the performance of CLM2 infused with the
legal terms list (CLM2+Vocab) to the two generic ASR
systems. The ratios in Table 3 indicate that CLM2+Vocab
is generally more capable of transcribing legal-specific
terms than the other two models. It is also better at
transcribing critical legal entities such as Provisions.5
Such legal terminology needs to be accurately transcribed,
and our CLM2 model with legal vocabulary demonstrates
better reliability in transcribing these terms.
A similar trend is evident with the legal entity Judge,
which refers to the forms of address used in British court
rooms (e.g. ‘Lord Phillips’, ‘Lady Hale’). This entity is
typically repeated in court hearings whenever a
barrister or solicitor addresses the court. We see that both
generic ASR systems perform badly on this category, with
ratios of 0.66 and 0.69, respectively. On the other hand,
we observe a significant improvement in correctly
transcribing this type of entities by CLM2+Vocab, with a
ratio of 0.84 correct transcriptions. Appendix A shows
an example of the output of the AWS base ASR model
without our domain-adaptation methods compared to
the output of the CLM correcting the mistakes. The
transcription errors (highlighted yellow) in the base output
include legal jargon, legal terms and named entities. The
errors are corrected by our CLM model (corrections are
highlighted in blue).
5 A Provision, a statement within an agreement or a law, typically
consists of alphanumeric utterances in British court hearings (e.g.
‘section 25(2)(a)-(h)’ or ‘rule 3.17’).</p>
      <p>In addition to evaluating the output of the ASR
engines, we also recorded the time required to produce the
transcription. The models based on AWS were run in the
cloud using the Amazon infrastructure. Whisper was run
on a Linux desktop with an NVIDIA GeForce RTX 2070
GPU with 8GB VRAM. For all the experiments, the medium
English-only model was used. As expected, the fastest
running time is obtained using the AWS base model. Running
the best performing model increases the time by 155%,
whilst Whisper more than doubles it. The trade-off between
running time and the level of domain-specific accuracy
is a variable parameter that can be determined based on
the transcription purpose and the end-user needs defined
by our project’s commercial partner.</p>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion</title>
      <p>In this paper, we present a study which shows the effect of
domain adaptation methods on improving off-the-shelf
ASR system performance in transcribing a specialised
domain such as British court hearings. We optimised the
performance of the ASR system by training an ASR
custom language model on gold-standard legal transcripts
and textual data from the legal domain. We also trained
a phrase detection model to incorporate an extracted list of
data-specific bigram collocations at transcription time.
We evaluated the ASR quality improvements both in
terms of average WER and the ratio of correctly transcribed
legal-specific terms. We observe significant gains in
ASR transcription quality from our domain adaptation
techniques. For commercial use of ASR technologies,
improving the error rate in general, and the transcription quality of
critical legal terms in particular, would minimise manual
post-editing effort and hence save both time and money.
We plan to evaluate the impact of the different configurations
proposed in this paper on the editors’ post-editing effort.</p>
      <p>In the future, we will expand to record data from a
variety of accents to address another axis of degradation
in British audio procedures different from the Supreme
Court hearings, which feature a mostly homogeneous group
of speakers. We will also explore the ability to use NLP
topic modelling techniques to connect legal entities that
were crucial in a court’s case decision.</p>
    </sec>
    <sec id="sec-4">
      <title>A. Appendix: Examples of ASR output with and without domain-adaptation</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Elwany</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moore</surname>
          </string-name>
          , G. Oberoi,
          <article-title>BERT goes to law school: Quantifying the competitive advantage of access to large legal corpora in contract understanding</article-title>
          , arXiv preprint arXiv:
          <year>1911</year>
          .
          <volume>00473</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Nay</surname>
          </string-name>
          ,
          <source>Natural Language Processing for Legal Texts, DOI=10.1017/9781316529683</source>
          .011, Cambridge University Press,
          <year>2021</year>
          , p.
          <fpage>99</fpage>
          -
          <lpage>113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Mumcuoğlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Öztürk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Ozaktas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koç</surname>
          </string-name>
          ,
          <article-title>Natural language processing in law: Prediction of outcomes in the higher courts of turkey</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>58</volume>
          (
          <year>2021</year>
          )
          <fpage>102684</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Frankenreiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nyarko</surname>
          </string-name>
          ,
          <article-title>Natural language processing in legal tech, Legal Tech and the Future of Civil Justice (David Engstrom ed</article-title>
          .) (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Sturge</surname>
          </string-name>
          , Court statistics for England and Wales,
          <source>Technical Report, House of Commons Library</source>
          ,
          <year>2021</year>
          . URL: https://commonslibrary.parliament.uk/research-briefings/cbp-8372/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          , T. Xu,
          <string-name>
            <given-names>G.</given-names>
            <surname>Brockman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>McLeavey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <article-title>Robust Speech Recognition via Large-Scale Weak Supervision</article-title>
          ,
          <source>OpenAI</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Watanabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mandel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Barker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Vincent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khudanpur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Manohar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Raj</surname>
          </string-name>
          , et al.,
          <article-title>CHiME-6 Challenge: Tackling multispeaker speech recognition for unsegmented recordings</article-title>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kudina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. M.</given-names>
            <surname>Halpern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Scharenborg</surname>
          </string-name>
          ,
          <article-title>Quantifying bias in automatic speech recognition</article-title>
          ,
          <source>arXiv preprint arXiv:2103.15122</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Mitigating bias against non-native accents</article-title>
          , Delft University of Technology (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Mai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carson-Berndsen</surname>
          </string-name>
          ,
          <article-title>Unsupervised domain adaptation for speech recognition with unsupervised error correction</article-title>
          ,
          <source>Proc. Interspeech</source>
          <year>2022</year>
          (
          <year>2022</year>
          )
          <fpage>5120</fpage>
          -
          <lpage>5124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. C.</given-names>
            <surname>Sim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Garg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Siddhartha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Strohman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Beaufays</surname>
          </string-name>
          ,
          <article-title>Incremental layer-wise self-supervised learning for efficient speech domain adaptation on device</article-title>
          ,
          <source>arXiv preprint arXiv:2110.00155</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>Sato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Komori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mishima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kawai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mochizuki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ogawa</surname>
          </string-name>
          ,
          <article-title>Text-Only Domain Adaptation Based on Intermediate CTC</article-title>
          ,
          <source>Proc. Interspeech</source>
          <year>2022</year>
          (
          <year>2022</year>
          )
          <fpage>2208</fpage>
          -
          <lpage>2212</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dingliwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shenoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bodapati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gandhe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Gadde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kirchhoff</surname>
          </string-name>
          , Domain prompts:
          <article-title>Towards memory and compute efficient domain adaptation of ASR systems</article-title>
          , https://tinyurl.com/2a9jp88t,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Palaskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. V.</given-names>
            <surname>Meripo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Konam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
          <article-title>Asr error correction and domain adaptation using machine translation</article-title>
          ,
          <source>in: ICASSP 2020- 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>6344</fpage>
          -
          <lpage>6348</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Del Rio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Delworth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Westerman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bhandari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Palakapilly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>McNamara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dong</surname>
          </string-name>
          , P. Zelasko, M. Jetté,
          <article-title>Earnings-21: a practical benchmark for ASR in the wild</article-title>
          ,
          <source>arXiv preprint arXiv:2104.11348</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] H. Wang, S. Dong, Y. Liu, J. Logan, A. K. Agrawal, Y. Liu, ASR Error Correction with Augmented Transformer for Entity Retrieval, in: Interspeech, 2020, pp. 1550-1554.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] N. Das, D. H. Chau, M. Sunkara, S. Bodapati, D. Bekal, K. Kirchhoff, Listen, Know and Spell: Knowledge-Infused Subword Modeling for Improving ASR Performance of OOV Named Entities, in: ICASSP 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2022, pp. 7887-7891.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] A. Graves, Sequence transduction with recurrent neural networks, arXiv preprint arXiv:1211.3711 (2012).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] J. Guo, G. Tiwari, J. Droppo, M. Van Segbroeck, C.-W. Huang, A. Stolcke, R. Maas, Efficient minimum word error rate training of RNN-transducer for end-to-end speech recognition, arXiv preprint arXiv:2007.13802 (2020).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] G. Bouma, Normalized (pointwise) mutual information in collocation extraction, Proceedings of GSCL 30 (2009) 31-40.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] R. Řehůřek, P. Sojka, Software Framework for Topic Modelling with Large Corpora, in: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Valletta, Malta, 2010, pp. 45-50. http://is.muni.cz/publication/884893/en.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>