Better Transcription of UK Supreme Court Hearings

Hadeel Saadany (1), Catherine Breslin (2), Constantin Orăsan (3) and Sophie Walker (4)
(1) Centre for Translation Studies, University of Surrey, United Kingdom
(2) Kingfisher Labs Ltd, United Kingdom
(3) Centre for Translation Studies, University of Surrey, United Kingdom
(4) Just Access, United Kingdom

Workshop on Artificial Intelligence for Access to Justice (AI4AJ 2023), June 19, 2023, Braga, Portugal

Abstract
Transcription of legal proceedings is very important for enabling access to justice. However, manual speech transcription is an expensive and slow process. In this paper we describe part of a combined research and industrial project for building an automated transcription tool designed specifically for the justice sector in the UK. We explain the challenges involved in transcribing court room hearings and the Natural Language Processing (NLP) techniques we employ to tackle these challenges. We show that fine-tuning a generic off-the-shelf pre-trained Automatic Speech Recognition (ASR) system with an in-domain language model, as well as infusing common phrases extracted with a collocation detection model, can not only improve the Word Error Rate (WER) of the transcribed hearings but also avoid critical errors specific to the legal jargon and terminology commonly used in British courts.

Keywords: Legal Transcription, UK Supreme Court, Automatic Speech Recognition

1. Introduction

There has been a recent interest in employing NLP techniques to aid the textual processing of the legal domain [1, 2, 3, 4]. In contrast, processing spoken court hearings has not received the same attention as understanding legal text documents. In the UK legal system, court hearing sessions have a unique tradition of verbal argument. Moreover, these hearings crucially aid in new case preparation, provide guidance for court appeals, help in legal training and even guide future policy. However, the audio material for a case typically spans several hours, which makes it both time- and effort-consuming for legal professionals to extract the information relevant to their needs. Currently, the existing need for legal transcriptions (covering 449K cases p.a. in the UK across all court tribunals [5]) is largely met by human transcribers.

Although there are several current speech-to-text (STT) technology providers which could be used to transcribe this data automatically, most of these systems are trained on general domain data, which may result in domain-specific transcription errors when they are applied to a specialised domain. One way to address this problem is for end-users to train their own ASR engines using their in-domain data. However, in most cases the amount of data available is too low to enable them to train a system which can compete with well-known cloud-based ASR systems trained on much larger datasets. At the same time, in commercial scenarios, using generic cloud-based ASR systems to transcribe a specialised domain may result in sub-optimal transcription quality for clients who require this service.

This holds particularly true for British court room audio. When applying a generic cloud-based ASR system (in our case Amazon Transcribe) to British court room recordings, the Word Error Rate (WER) remains relatively high due to the hearings' length, the multiplicity of speakers, complex speech patterns and, more crucially, unique pronunciations and domain-specific vocabulary. The examples in Table 1 show some common problems we faced when transcribing UK court hearings with off-the-shelf ASR systems such as Amazon Web Services (AWS) Transcribe (https://aws.amazon.com/transcribe/). The references are taken from human-generated ground-truth transcripts of real UK Supreme Court hearings (https://www.supremecourt.uk/decided-cases/index.html) created by the legal editors in our project's team.

Table 1: Examples of errors produced by Amazon Transcribe for legal hearings (reference vs. ASR output).

Model       Transcript
Reference   So my lady um it is difficult to...
AWS ASR     So melody um it is difficult to...
Reference   All rise ...
AWS ASR     All right ...
Reference   it makes further financial order
AWS ASR     it makes further five natural

The first error is due to a special pronunciation of the phrase 'my lady' in British court rooms: it is pronounced like 'mee-lady' when barristers address a female judge. Similarly, in the second example, the error relates to the linguistic etiquette of UK court hearings, which the ASR system consistently fails to recognise. The error in the third example, on the other hand, is related to legal terminology critical to the specific transcribed case. Errors similar to the third example are numerous in our dataset and also affect named entities such as numbers and names that are vital to understanding the legal argument in the transcribed cases. These errors can lead to serious information loss and cause confusion.

In this paper, we describe a joint research and commercial effort to perform domain adaptation of a generic ASR system to mitigate the errors in automated UK court transcription services. We propose to minimise legal-specific errors by fine-tuning off-the-shelf ASR systems with a custom language model (CLM) trained on legal documents as well as 139 hours of human-edited transcriptions of UK Supreme Court hearings. We also employ NLP techniques to automatically build a custom vocabulary of common multi-word expressions and word n-gram collocations that are critical in court hearings. We infuse this custom vocabulary into the CLM at transcription time. In this research, we evaluate the benefits of our proposed domain adaptation methods by comparing the WER of the CLM output with two off-the-shelf ASR systems: AWS Transcribe (commercial) and the OpenAI Whisper model (open-source) [6]. We also compare the general improvement in the ASR system's ability to correctly transcribe legal entities with and without our proposed methods. In addition, we discuss the transcription time with different ASR settings, since transcription time is critical for the commercial pipeline implemented by the industrial partner of the project.
2. Related Work

Automatic speech recognition (ASR) models convert audio input to text, and they perform best when used to transcribe data similar to the data they were trained on. However, performance degrades when there is a mismatch between the data used for training and the data being transcribed. Additionally, some types of audio material are intrinsically harder for speech recognition systems to transcribe. In practice, this means that speech recognition performance degrades when, for example, there is background noise [7], non-native accents [8, 9], young or elderly speakers [8], or a shift in domain [10].

Performance degradation is typically mitigated by adapting or fine-tuning ASR models towards the domain of the targeted data using a domain-specific dataset [11, 12, 13]. Some methods for domain adaptation adopt NLP techniques, such as using machine translation models to learn a mapping from out-of-domain ASR errors to in-domain terms [14]. An alternative approach is to build a large ASR model with a substantially varied training set, so that the model is more robust to data shifts. An example of this latter approach is the recently released OpenAI Whisper model, which is trained on 680k hours of diverse domain data to generalise well on a range of unseen datasets without the need for explicit adaptation [6].

Moreover, ASR models are evaluated using Word Error Rate (WER), which treats each incorrect word equally. However, ASR models do not perform equally on different categories of words. Performance is worse for categories like names of people and organisations than for categories like numbers or dates [15]. ASR research has therefore targeted improving specific errors, such as those affecting named entities, using NLP techniques [16, 17].
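For reference, WER is the word-level edit distance between hypothesis and reference (substitutions + deletions + insertions) divided by the number of reference words. A minimal illustration, using the first example from Table 1 and the open-source jiwer package (chosen here purely for illustration; it is not part of the paper's pipeline):

```python
# Minimal WER illustration with the open-source jiwer package.
import jiwer

reference = "so my lady um it is difficult to proceed"
hypothesis = "so melody um it is difficult to proceed"

# "my lady" -> "melody" aligns as one substitution plus one deletion,
# so WER = 2 errors / 9 reference words ~= 0.22
print(jiwer.wer(reference, hypothesis))
```

Because every word counts equally, the modest 0.22 score here hides the fact that both errors fall exactly on the legally meaningful form of address.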
In this paper, we propose simple techniques to mitigate the effect of the domain mismatch between a generic ASR model and the specialised domain of British court room hearings. Our proposed method improves both the system's WER and its ability to capture case-specific terms and entities. In the next section, we present the setup of our experiments and the evaluation results.

3. Experiment Setup

Figure 1 illustrates our proposed pipeline to improve ASR system performance with legal domain-adaptation techniques. First, we build a custom language model (CLM) by fine-tuning the base AWS ASR system, using two types of training data: 1) textual data from the legal domain, and 2) a corpus of human-generated legal transcriptions. Second, we use NLP techniques to extract domain-specific phrases and legal entities from the in-domain data to create a vocabulary list. We use both the CLM and the vocabulary list for transcribing legal proceedings. The following sections explain the details of our experiment, where we implemented this pipeline on the AWS Transcribe base model. We compare the performance of our CLM model with different settings to the AWS Transcribe base ASR system and the OpenAI Whisper open-source ASR system when transcribing ≈ 12 hours of UK Supreme Court hearings.

[Figure 1: Pipeline for Improving ASR Output for Legal-Specific Errors]

3.1. Fine-tuning the ASR system

AWS Transcribe improves the quality of speech recognisers by employing an architecture known as the recurrent neural network-transducer (RNN-T) [18]. It is an end-to-end model for automatic speech recognition which has gained popularity in recent years as a way to fold the separate components of a conventional ASR system (i.e., acoustic, pronunciation and language models) into a single neural network [19]. The AWS Transcribe platform allows fine-tuning of this ASR architecture by building custom language models to improve transcription accuracy for domain-specific speech. Creating a robust custom language model requires a significant amount of text data, which must contain spoken domain-specific vocabulary.

For training our CLM, we use two datasets from the legal domain. The first consists of the Supreme Court written judgements of 43 cases, comprising 3.26M tokens, scraped from the official site of the UK Supreme Court (https://www.supremecourt.uk/decided-cases/). The second dataset consists of ≈ 81 hours of gold-standard transcripts of 10 Supreme Court hearings. The gold-standard transcripts are created by a team of legal professionals post-editing the AWS Transcribe output of the court hearings using a specially designed interface. We use both datasets to train a CLM that fine-tunes the base AWS ASR architecture to the UK legal domain.
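The CLM training and adapted transcription steps run against the AWS Transcribe API. The sketch below shows the general shape of these calls with boto3, assuming in-domain text staged in S3; every resource name, S3 location and IAM role ARN is a hypothetical placeholder, not a dump of our production configuration.

```python
# Sketch of CLM training and adapted transcription via AWS Transcribe.
# All names, buckets and ARNs below are hypothetical placeholders.
import boto3

transcribe = boto3.client("transcribe", region_name="eu-west-2")

# 1) Train a custom language model on the in-domain text
#    (judgements + gold-standard transcripts) staged in S3.
transcribe.create_language_model(
    LanguageCode="en-GB",
    BaseModelName="WideBand",  # for audio sampled at 16 kHz or higher
    ModelName="uk-supreme-court-clm",
    InputDataConfig={
        "S3Uri": "s3://example-bucket/legal-training-text/",
        "DataAccessRoleArn": "arn:aws:iam::123456789012:role/TranscribeAccess",
    },
)

# 2) Register a custom vocabulary (extracted by the phrase and entity
#    models of Section 3.2); AWS hyphenates multi-word phrases.
transcribe.create_vocabulary(
    VocabularyName="uk-legal-vocab",
    LanguageCode="en-GB",
    Phrases=["my-lady", "Lord-Phillips", "financial-order"],
)

# 3) Transcribe a hearing with both the CLM and the vocabulary infused.
transcribe.start_transcription_job(
    TranscriptionJobName="hearing-case1",
    LanguageCode="en-GB",
    Media={"MediaFileUri": "s3://example-bucket/audio/case1.mp3"},
    ModelSettings={"LanguageModelName": "uk-supreme-court-clm"},
    Settings={"VocabularyName": "uk-legal-vocab"},
)
```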
3.2. Phrase Extraction Model

For the vocabulary list, we use a dataset of ≈ 139 hours of gold-standard transcriptions of Supreme Court hearings along with the Supreme Court judgements used for training the CLM. To extract the vocabulary from this dataset, we implement two methods. First, we use this dataset to train a phrase detection model that collocates bigrams based on Pointwise Mutual Information (PMI) scoring of the words in context [20]. PMI is a measure of association between words; it compares the probability of two words occurring together with the probability that would be expected if the two words were independent.
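Formally, for a word pair (x, y) with marginal probabilities p(x), p(y) and joint probability p(x, y):

```latex
\mathrm{PMI}(x,y) = \log \frac{p(x,y)}{p(x)\,p(y)}, \qquad
\mathrm{NPMI}(x,y) = \frac{\mathrm{PMI}(x,y)}{-\log p(x,y)}
```

NPMI is the normalised variant introduced in [20]; it is bounded in [-1, 1], with 1 indicating words that only ever occur together, which makes a fixed score threshold easier to interpret.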
We train the collocation model using the Gensim Python library [21], with PMI as the probability scoring method and the minimum score threshold for a bigram to be taken into account set to 1. The collocation model is trained on the textual data of the Supreme Court transcriptions and the Supreme Court judgements; it is then used to extract a list of the most common bigrams in this dataset.
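A minimal sketch of this step is shown below, assuming the gensim 4.x API; the corpus file, pre-processing and threshold are illustrative placeholders ('npmi' is Gensim's normalised-PMI scorer, whose scores lie in [-1, 1]).

```python
# Sketch of the bigram collocation model (gensim 4.x assumed).
from gensim.models.phrases import Phrases

# One tokenised sentence per line; real pre-processing (lowercasing,
# punctuation stripping, etc.) is omitted for brevity.
with open("legal_corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

# 'npmi' scores lie in [-1, 1], so the minimum-score threshold below
# is set accordingly for this sketch.
bigrams = Phrases(sentences, min_count=5, threshold=0.5, scoring="npmi")

# Print all detected bigrams with their scores, strongest first.
for phrase, score in sorted(bigrams.export_phrases().items(),
                            key=lambda kv: -kv[1])[:20]:
    print(phrase.replace("_", " "), round(score, 3))
```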
Figure 2 shows an example of the type of common phrases extracted by our collocation model along with their frequencies.

[Figure 2: Example of Common Collocations Extracted by the Phrase Extraction Model]

As can be seen from the figure, the extracted phrases include frequent legal terms (highlighted in blue in the figure) as well as named entities, such as names of institutions and persons (highlighted in yellow), which are specific to the Supreme Court cases included in the training corpus.

The second method we employ to create the custom vocabulary list is to identify named entities in our dataset. For this purpose, we use Blackstone (https://research.iclr.co.uk/blackstone), an NLP library for processing long-form and unstructured legal text that is capable of identifying legal entities. The list of legal entities includes: Case Name, Court Name, Provision (i.e. a clause in a legal instrument), Instrument (i.e. a legal term of art) and Judge. We concatenated this Blackstone entity list with the spaCy v3.4 library's list of non-legal entities, such as Cardinals, Persons and Dates, as sketched below.
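The sketch below illustrates this entity-based extraction step. It assumes Blackstone's en_blackstone_proto model and a general-purpose spaCy pipeline (en_core_web_sm here, as a stand-in for whichever v3.4 model is used) are installed; note that Blackstone was released against spaCy 2.x, so in practice the two pipelines may need to run in separate environments.

```python
# Sketch of entity-based vocabulary extraction with Blackstone + spaCy.
# Model names and the example text are illustrative assumptions.
import spacy

legal_nlp = spacy.load("en_blackstone_proto")  # legal entities
general_nlp = spacy.load("en_core_web_sm")     # non-legal entities

LEGAL_LABELS = {"CASENAME", "COURT", "PROVISION", "INSTRUMENT", "JUDGE"}
GENERAL_LABELS = {"PERSON", "DATE", "CARDINAL"}

def extract_vocabulary(text: str) -> set:
    """Collect surface forms of legal and general named entities."""
    vocab = set()
    for ent in legal_nlp(text).ents:
        if ent.label_ in LEGAL_LABELS:
            vocab.add(ent.text)
    for ent in general_nlp(text).ents:
        if ent.label_ in GENERAL_LABELS:
            vocab.add(ent.text)
    return vocab

print(extract_vocabulary(
    "As Lady Hale observed, section 25(2)(a) of the Matrimonial "
    "Causes Act 1973 governs the financial order."))
```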
The results of applying our domain-adaptation methods to the transcription of 2 Supreme Court case hearings, consisting of 12 hours of audio, are presented in the next section.

4. Results

Table 2 shows the WER scores and the average WER for the 2 transcribed cases with different CLM system settings, as well as for the two baseline systems: AWS Transcribe (AWS base) and Whisper. The different CLM settings are as follows:

1. CLM1 is trained only on the texts of the Supreme Court judgements.
2. CLM2 is trained on both the judgements and the gold-standard transcripts.
3. CLM2+Vocab uses CLM2 for transcription plus the global vocabulary list extracted by our phrase detection model.
4. CLM2+Vocab2 uses CLM2 for transcription plus the legal entities vocabulary list extracted by the Blackstone and spaCy v3.4 libraries.

Table 2: Average WER and transcription time.

Model         WER Case1   WER Case2   WER Average   Transcription Time
AWS base      8.7         16.2        12.3          85 mins
CLM1          8.5         16.5        12.4          77 mins
CLM2          7.9         15.5        11.6          77 mins
CLM2+Vocab    7.9         15.6        11.6          132 mins
CLM2+Vocab2   8.0         15.6        11.7          112 mins
Whisper       9.6         15.3        12.4          191 mins

As can be seen in Table 2, the ASR performance is consistently better with the CLM models than with the generic ASR systems for the two transcribed cases. The CLM2 model, trained on textual data (i.e. the written judgements) and gold-standard court hearing transcriptions, outperforms AWS base and Whisper with a 9% and 8% relative WER improvement, respectively. Moreover, we observe around 9% improvement in the average WER score over the two generic models when concatenating the list of legal phrases extracted by our phrase detection model with the CLM2 system. While the ASR error correction indicates improved transcription quality with our proposed domain adaptation methods, we also evaluated the ASR systems' performance on specific errors such as legal entities and terms.

Table 3 shows the average ratio of correctly transcribed legal entities in the two studied court room hearings. We compare the performance of CLM2 infused with the legal terms list (CLM2+Vocab) to the two generic ASR systems.

Table 3: Ratio of correctly captured legal entities by the ASR systems.

Entity      AWS base   Whisper   CLM2+Vocab
Judge       0.66       0.77      0.84
Case Name   0.69       0.85      0.71
Court       0.98       1         0.93
Provision   0.88       0.95      0.97
Cardinal    1          0.97      1

The ratios in Table 3 indicate that CLM2+Vocab is generally more capable of transcribing legal-specific terms than the other two models. It is also better at transcribing critical legal entities such as Provisions (a Provision, a statement within an agreement or a law, typically consists of alphanumeric utterances in British court hearings, e.g. 'section 25(2)(a)-(h)' or 'rule 3.17'). Such legal terminology needs to be accurately transcribed, and our CLM2 model with legal vocabulary demonstrates better reliability in transcribing these terms.

A similar trend is evident for the legal entity Judge, which refers to the forms of address used in British court rooms (e.g. 'Lord Phillips', 'Lady Hale'). This entity is typically repeated in court hearings whenever a barrister or solicitor addresses the court. We see that both generic ASR systems perform badly on this category, with ratios of 0.66 and 0.69, respectively. On the other hand, we observe a significant improvement in correctly transcribing this type of entity with CLM2+Vocab, which achieves a ratio of 0.84 correct transcriptions. Appendix A shows an example of the output of the AWS base ASR model without our domain-adaptation methods compared to the output of the CLM correcting the mistakes. The transcription errors (highlighted in yellow) in the base output include legal jargon, legal terms and named entities. The errors are corrected by our CLM model (corrections are highlighted in blue).

In addition to evaluating the output of the ASR engines, we also recorded the time required to produce the transcription. The models based on AWS were run in the cloud using the Amazon infrastructure. Whisper was run on a Linux desktop with an NVIDIA GeForce RTX 2070 GPU with 8GB VRAM; for all the experiments, the medium English-only model was used. As expected, the fastest running time is obtained using the AWS base model. Running the best performing model increases the time by 55% (from 85 to 132 minutes), whilst Whisper more than doubles it. The trade-off between running time and the level of domain-specific accuracy is a variable parameter that can be determined based on the transcription purpose and the end-user needs defined by our project's commercial partner.

5. Conclusion

In this paper, we present a study which shows the effect of domain adaptation methods on improving off-the-shelf ASR system performance in transcribing a specialised domain such as British court hearings. We optimised the performance of the ASR system by training an ASR custom language model on gold-standard legal transcripts and textual data from the legal domain. We also trained a phrase detection model to incorporate an extracted list of data-specific bigram collocations at transcription time. We evaluated the ASR quality improvements both in terms of average WER and the ratio of correctly transcribed legal-specific terms. We observe significant gains in ASR transcription quality from our domain adaptation techniques. For commercial use of ASR technologies, improving the error rate in general, and the transcription quality of critical legal terms in particular, would minimise manual post-editing effort and hence save both time and money. We plan to evaluate the impact of the different configurations proposed in this paper on the editors' post-editing effort.

In the future, we will expand our recordings to cover a variety of accents, addressing another axis of degradation in British court proceedings beyond the Supreme Court hearings, whose speakers are a largely homogeneous group. We will also explore using NLP topic modelling techniques to connect the legal entities that were crucial to a court's decision in a case.

References

[1] E. Elwany, D. Moore, G. Oberoi, BERT goes to law school: Quantifying the competitive advantage of access to large legal corpora in contract understanding, arXiv preprint arXiv:1911.00473 (2019).
[2] J. J. Nay, Natural Language Processing for Legal Texts, Cambridge University Press, 2021, pp. 99–113. DOI: 10.1017/9781316529683.011.
[3] E. Mumcuoğlu, C. E. Öztürk, H. M. Ozaktas, A. Koç, Natural language processing in law: Prediction of outcomes in the higher courts of Turkey, Information Processing & Management 58 (2021) 102684.
[4] J. Frankenreiter, J. Nyarko, Natural language processing in legal tech, Legal Tech and the Future of Civil Justice (David Engstrom ed.) (2022).
[5] G. Sturge, Court statistics for England and Wales, Technical Report, House of Commons Library, 2021. URL: https://commonslibrary.parliament.uk/research-briefings/cbp-8372/.
[6] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust speech recognition via large-scale weak supervision, OpenAI (2022).
[7] S. Watanabe, M. Mandel, J. Barker, E. Vincent, A. Arora, X. Chang, S. Khudanpur, V. Manohar, D. Povey, D. Raj, et al., CHiME-6 Challenge: Tackling multispeaker speech recognition for unsegmented recordings, 2020.
[8] S. Feng, O. Kudina, B. M. Halpern, O. Scharenborg, Quantifying bias in automatic speech recognition, arXiv preprint arXiv:2103.15122 (2021).
[9] Y. Zhang, Mitigating bias against non-native accents, Delft University of Technology (2022).
[10] L. Mai, J. Carson-Berndsen, Unsupervised domain adaptation for speech recognition with unsupervised error correction, Proc. Interspeech 2022 (2022) 5120–5124.
[11] Z. Huo, D. Hwang, K. C. Sim, S. Garg, A. Misra, N. Siddhartha, T. Strohman, F. Beaufays, Incremental layer-wise self-supervised learning for efficient speech domain adaptation on device, arXiv preprint arXiv:2110.00155 (2021).
[12] H. Sato, T. Komori, T. Mishima, Y. Kawai, T. Mochizuki, S. Sato, T. Ogawa, Text-only domain adaptation based on intermediate CTC, Proc. Interspeech 2022 (2022) 2208–2212.
[13] S. Dingliwal, A. Shenoy, S. Bodapati, A. Gandhe, R. T. Gadde, K. Kirchhoff, Domain prompts: Towards memory and compute efficient domain adaptation of ASR systems, https://tinyurl.com/2a9jp88t, 2022.
[14] A. Mani, S. Palaskar, N. V. Meripo, S. Konam, F. Metze, ASR error correction and domain adaptation using machine translation, in: ICASSP 2020 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 6344–6348.
[15] M. Del Rio, N. Delworth, R. Westerman, M. Huang, N. Bhandari, J. Palakapilly, Q. McNamara, J. Dong, P. Zelasko, M. Jetté, Earnings-21: A practical benchmark for ASR in the wild, arXiv preprint arXiv:2104.11348 (2021).
[16] H. Wang, S. Dong, Y. Liu, J. Logan, A. K. Agrawal, Y. Liu, ASR error correction with augmented transformer for entity retrieval, in: Interspeech, 2020, pp. 1550–1554.
[17] N. Das, D. H. Chau, M. Sunkara, S. Bodapati, D. Bekal, K. Kirchhoff, Listen, know and spell: Knowledge-infused subword modeling for improving ASR performance of OOV named entities, in: ICASSP 2022 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2022, pp. 7887–7891.
[18] A. Graves, Sequence transduction with recurrent neural networks, arXiv preprint arXiv:1211.3711 (2012).
[19] J. Guo, G. Tiwari, J. Droppo, M. Van Segbroeck, C.-W. Huang, A. Stolcke, R. Maas, Efficient minimum word error rate training of RNN-transducer for end-to-end speech recognition, arXiv preprint arXiv:2007.13802 (2020).
[20] G. Bouma, Normalized (pointwise) mutual information in collocation extraction, Proceedings of GSCL 30 (2009) 31–40.
[21] R. Řehůřek, P. Sojka, Software framework for topic modelling with large corpora, in: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Valletta, Malta, 2010, pp. 45–50. URL: http://is.muni.cz/publication/884893/en.

A. Appendix: Examples of ASR output with and without domain-adaptation