Detection of Typical Sentence Errors in Speech Recognition Output

Bohan Wang1,†, Ke Wang1,†, Siran Li1,† and Mark Cieliebak2,∗

1 Section of Electrical and Electronic Engineering, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
2 Centre for Artificial Intelligence, Zurich University of Applied Sciences (ZHAW), Winterthur, Switzerland

SwissText 2022: Swiss Text Analytics Conference, June 08–10, 2022, Lugano, Switzerland
∗ Corresponding author.
† These authors contributed equally.
bohan.wang@epfl.ch (B. Wang); k.wang@epfl.ch (K. Wang); siran.li@epfl.ch (S. Li); ciel@zhaw.ch (M. Cieliebak)

Abstract
This paper presents a deep-learning-based model to detect the completeness and correctness of a sentence. The model is designed specifically for detecting errors in speech recognition output and takes several typical recognition errors into account: false sentence boundaries, missing words, repeated words and false word recognition. It can be applied to evaluate the quality of recognized transcripts, and the best model reaches over 90.5% accuracy in detecting whether the system has completely and correctly recognized a sentence.

1. Introduction

Automatic Speech Recognition (ASR) systems recognize and translate spoken language into text [1]. Sentence error detection on ASR output is important for two reasons: a) it helps to set proper punctuation marks; b) with multiple speakers, speaker recognition often fails at the change between two speakers, so that single words at the beginning or end of an utterance are assigned to the wrong person. A practical application of our work is to detect complete and correct sentences in ASR output in order to mitigate these problems.

Prior work has focused mainly on grammatical error detection [2, 3]. In this paper, we focus on the specific errors that emerge in speech recognition, such as missing words or incorrect sentence boundaries (detailed in Sec. 3.3). Previous work on enriching speech recognition concentrates on finding correct sentence boundaries in whole transcripts [4, 5]. In real-time speech recognition, however, we only have access to individual sentences rather than full transcripts, and these approaches do not take other typical speech recognition errors (apart from incorrect sentence boundaries) into account [6].

Recently, transformer models have shown state-of-the-art performance in generating word embeddings and extracting intrinsic features of word sequences. In particular, Bidirectional Encoder Representations from Transformers (BERT) [7], the Generative Pre-trained Transformer (GPT) [8] and BIG-BIRD [9] learn high-quality language representations from large amounts of raw text. The token representations produced by these transformers, pre-trained on unsupervised tasks, also improve the performance of supervised downstream tasks.

In this paper, we fine-tune the pre-trained transformers BERT, GPT2 and BIG-BIRD on the speech recognition error detection task to build binary classifiers that detect speech recognition errors. We also study the performance of sequentially linking BERT embeddings with a downstream text classification network. We compare and analyze the performance of these classification models and ensemble them with a Random Forest to further improve the results. Finally, we analyze the performance of a BERT-based classifier on a multi-label dataset.

The paper is structured as follows: In Sec. 2, we explain the models and the experimental design. In Sec. 3, we describe how the dataset is generated. We discuss the experimental results in Sec. 4.

2. Methods

2.1. Models

Three state-of-the-art transformer models are considered: BERT [7], GPT2 [8] and BIG-BIRD [9]. In addition, we test the performance of combining BERT embeddings with a downstream text classification network, using either a bidirectional LSTM or a TextCNN as the classifier. The TextCNN is a one-layer network with kernel sizes 2, 3 and 4. The LSTM classifier is a one-layer bidirectional LSTM [10] with 256 hidden states, followed by an attention layer and a fully connected layer. The attention layer turned out to be essential.
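The paper does not include an implementation of this downstream classifier. The following PyTorch sketch shows one way the combination described above (BERT embeddings feeding a one-layer bidirectional LSTM with 256 hidden states, an attention layer and a fully connected layer) could be assembled; the checkpoint name bert-base-uncased, the frozen BERT encoder and the simple additive attention are our assumptions, not details given in the paper.

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class BertBiLSTMAttention(nn.Module):
        """BERT embeddings -> one-layer BiLSTM -> attention -> fully connected classifier."""

        def __init__(self, bert_name="bert-base-uncased", hidden=256, num_classes=2):
            super().__init__()
            self.bert = AutoModel.from_pretrained(bert_name)
            for p in self.bert.parameters():      # assumption: BERT is used as a frozen embedding provider
                p.requires_grad = False
            self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                                num_layers=1, batch_first=True, bidirectional=True)
            self.attn = nn.Linear(2 * hidden, 1)  # one attention score per token (assumed attention form)
            self.fc = nn.Linear(2 * hidden, num_classes)

        def forward(self, input_ids, attention_mask):
            emb = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
            out, _ = self.lstm(emb)                                   # (B, T, 2*hidden)
            scores = self.attn(out).squeeze(-1)                       # (B, T)
            scores = scores.masked_fill(attention_mask == 0, -1e9)    # ignore padding positions
            weights = torch.softmax(scores, dim=-1).unsqueeze(-1)     # (B, T, 1)
            context = (weights * out).sum(dim=1)                      # attention-pooled sentence vector
            return self.fc(context)                                   # (B, num_classes) logits

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = BertBiLSTMAttention()
    batch = tokenizer(["i like you because you are beautiful", "i like you because you"],
                      padding=True, truncation=True, return_tensors="pt")
    logits = model(batch["input_ids"], batch["attention_mask"])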
2.2. Ensemble learning

We ensemble the five trained classifiers with a random forest. The configuration and the final classification performance are reported in Sec. 4.2.

3. Data preparation

3.1. Dataset sources

For the model to generalize well, the training set must come from diverse sources covering diverse topics and occasions. The following corpora are included in our dataset:

News reports [11]: 143,000 articles from 15 American publications
TED 2020 Parallel Sentences Corpus [12]: around 4,000 TED Talk transcripts from July 2020
Wikipedia corpus [13]: over 10 million topics
Topical-Chat [14]: nearly 10 thousand human dialog conversations spanning 8 broad topics

3.2. Dataset Creation

To make the selected datasets suitable for our speech recognition model, we remove non-English tokens, sentence-ending symbols ('.', '!', '?'), duplicated sentences, and short sentences (5 words or fewer), in order to avoid some recognition errors. After pre-processing the data from these sources, we create the following two datasets:

Standard Dataset: 0.3 million sentences from News reports, 0.3 million from the TED corpus, 0.3 million from the Wikipedia corpus and 0.2 million from Topical-Chat, 1.1 million sentences in total. We split the Standard Dataset randomly over all data sources into a train set, an ablation set and a test set with a proportion of 8:1:1.

Large Dataset: 2.3 million sentences from News reports, 0.4 million from the TED corpus, 2 million from the Wikipedia corpus and 0.2 million from Topical-Chat, 5 million sentences in total. We split it into a train set and a test set with a proportion of 19:1.

We train and compare the various models on the Standard Dataset. As a comparison, we evaluate BERT trained on the Large Dataset to see how an enlarged training set affects the generalization ability on this task.
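The exact cleaning pipeline is not specified beyond the description above. The sketch below illustrates one plausible filtering pass over a list of raw sentences; the isascii() test standing in for the non-English token filter is our assumption.

    import re

    SENTENCE_END = re.compile(r"[.!?]+$")

    def clean_corpus(sentences, min_words=6):
        """Drop non-English tokens, trailing '.', '!', '?', duplicates and sentences of 5 words or fewer."""
        seen, cleaned = set(), []
        for s in sentences:
            s = " ".join(t for t in s.split() if t.isascii())   # crude non-English filter (assumption)
            s = SENTENCE_END.sub("", s).strip()                  # remove sentence-ending symbols
            if len(s.split()) < min_words or s.lower() in seen:  # too short or duplicate
                continue
            seen.add(s.lower())
            cleaned.append(s)
        return cleaned

    print(clean_corpus(["Hello there!",
                        "The quick brown fox jumps over the lazy dog.",
                        "The quick brown fox jumps over the lazy dog."]))
    # -> ['The quick brown fox jumps over the lazy dog']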
3.3. Generating positive and negative samples

For positive samples, punctuation is removed (except in abbreviations and contractions such as it's, Mr., I've, etc.) and words are converted to lower case.

For negative samples, we mimic the typical errors of a speech recognition system, detailed below, and propose a corresponding generation method for each error type (a code sketch follows at the end of this section).

False sentence boundary: When a speech recognition system fails to correctly separate two sentences, the first sentence is cut off in the middle and part of it is assigned to the next sentence (illustrated in Fig. 1 (a)). For such negative samples, we group the sentences in threes and randomly re-split the three sentences into 2-4 sentences, so that on average the negative samples created this way have the same length as the positive samples. When choosing the random splitting points, the genuine sentence separation points, punctuation, and typical words for starting sub-sentences (e.g. that, which, because, etc.) are avoided, which reduces the probability that a generated sample is still a complete sentence by chance (e.g. 'I like you because you are beautiful' reduced to 'I like you').

Missing words: A speech recognition system can fail to recognize one or several words of a sentence, so that words are missing from the produced transcript (Fig. 1 (b)). For such negative samples, we randomly remove 1 word from sentences of up to 3 words and 2-4 words from longer sentences.

Repeating words: The system can record a speaker's unintended repeated words (Fig. 1 (c)). For such negative samples, we randomly repeat 1 word in sentences of up to 3 words and 1-3 words in longer sentences.

False word recognition: The system can mistakenly recognize one word as another (Fig. 1 (d)). For such negative samples, we randomly replace 1 word in sentences of up to 3 words and 1-3 words in longer sentences with random words from another sentence.

Finally, punctuation is removed and words are converted to lower case.

Figure 1: Typical errors in speech recognition system

After creating the positive and negative samples, sentences longer than 100 words are removed, since they are too long to appear in speech recognition output. We create the same number of negative samples as positive samples, so that the dataset is balanced. The ratio between the different types of negative samples is 2:1:1:1: False Sentence Boundary accounts for twice the number of the other negative sample types, since it yields two kinds of false sentences, those which are cut off and those which receive extra words.
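The corruption procedures are described only in prose. The sketch below shows, under our own simplifying assumptions (word counts drawn uniformly from the stated ranges; sentence-boundary errors are omitted because they operate on groups of three sentences), how the last three error types could be generated.

    import random

    def missing_words(words):
        """Drop 1 word from sentences of up to 3 words, otherwise 2-4 words (Sec. 3.3)."""
        k = 1 if len(words) <= 3 else random.randint(2, 4)
        keep = set(random.sample(range(len(words)), len(words) - k))
        return [w for i, w in enumerate(words) if i in keep]

    def repeating_words(words):
        """Duplicate 1 word in sentences of up to 3 words, otherwise 1-3 words."""
        k = 1 if len(words) <= 3 else random.randint(1, 3)
        out = list(words)
        for i in sorted(random.sample(range(len(words)), k), reverse=True):
            out.insert(i, out[i])          # repeat the chosen word in place
        return out

    def false_word_recognition(words, other_sentence):
        """Replace 1 word (short sentences) or 1-3 words with random words from another sentence."""
        k = 1 if len(words) <= 3 else random.randint(1, 3)
        out = list(words)
        for i in random.sample(range(len(words)), k):
            out[i] = random.choice(other_sentence)
        return out

    sent = "i like you because you are beautiful".split()
    print(missing_words(sent), repeating_words(sent), sep="\n")
    print(false_word_recognition(sent, "the weather is nice today".split()))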
4. Experiments and Discussion

In this section, we report the results of our experiments. We first describe the setup and then evaluate the different models in Sec. 4.1. In Sec. 4.2, we train a Random Forest classifier on top of the models to aggregate them and improve the performance. In Sec. 4.3, we compare the performance of BERT trained on the Standard and on the Large Dataset. Finally, we show the results of BERT trained on a multi-label dataset in Sec. 4.4.

Training details: We train each model for 5 epochs with batch size 64 using the Adam optimizer. The initial learning rate is 3e-5 for fine-tuning the transformer models and 1e-3 for the downstream classification networks. To prevent overfitting, we only save the model with the best performance on the test set after each epoch.
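The training code is not published with the paper; the following sketch shows a plausible fine-tuning loop using the hyperparameters stated above (5 epochs, batch size 64, Adam, learning rate 3e-5). The checkpoint name bert-base-uncased, the toy data and the collate function are our assumptions.

    import torch
    from torch.utils.data import DataLoader
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)   # 3e-5 for transformer fine-tuning

    # Toy stand-in for the Standard Dataset: (sentence, label) with 1 = proper sentence.
    train_pairs = [("i like you because you are beautiful", 1),
                   ("i like you because you", 0)]

    def collate(batch):
        texts, labels = zip(*batch)
        enc = tokenizer(list(texts), padding=True, truncation=True,
                        max_length=128, return_tensors="pt")
        enc["labels"] = torch.tensor(labels)
        return enc

    loader = DataLoader(train_pairs, batch_size=64, shuffle=True, collate_fn=collate)

    model.train()
    for epoch in range(5):                       # 5 epochs, as stated in the training details
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss           # cross-entropy over the two classes
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()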
4.1. Results on Standard Dataset

As explained in Sec. 2, we train five models on the Standard Dataset, which contains 1 million proper and 1 million non-proper sentences, and evaluate their performance. The results are presented in Table 1.

Table 1
Test accuracy of five models on the Standard Dataset

Model                        Test Accuracy
BERT                         89.27%
GPT-2                        88.67%
BIG-BIRD                     90.26%
BERT embedding + Bi-LSTM     86.33%
BERT embedding + TextCNN     81.40%

The fine-tuned transformers give clearly better results than the models that sequentially link BERT embeddings with either a BiLSTM or a TextCNN. BIG-BIRD performs best, with 90.26% test accuracy; BERT and GPT2 reach similar accuracies of 89.27% and 88.67%, respectively.

4.2. Ensemble learning with Random Forest

In this section, we combine the five trained models from Table 1 with a random forest to produce one predictive model. The idea is to train a random forest classifier on the combination of the classes predicted by the individual models; the random forest then produces the final classification through a majority-vote mechanism.

To prevent the random forest from overfitting the train set, it is fit on a separate ablation set instead of the train set on which the models were trained. The best parameters after 10-fold cross-validation are 100 decision trees with a maximum depth of 3. The test accuracy of the random forest reaches 90.51%, higher than the best individual model (90.26%), but not by a large margin. This is probably because the transformers (and the models built on their embeddings) share similar structures and rarely diverge in their decisions.
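A minimal version of this ensembling step, assuming the five models' predicted classes on the ablation and test sets have already been collected into arrays, could look as follows. The cross-validation grid is our own choice; only the selected values (100 trees, maximum depth 3) come from the paper, and the random data here is a placeholder.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Rows = samples, columns = predicted class (0/1) of the five classifiers
    # (BERT, GPT-2, BIG-BIRD, BERT+BiLSTM, BERT+TextCNN); random placeholder data.
    rng = np.random.default_rng(0)
    ablation_preds, ablation_labels = rng.integers(0, 2, (1000, 5)), rng.integers(0, 2, 1000)
    test_preds, test_labels = rng.integers(0, 2, (500, 5)), rng.integers(0, 2, 500)

    grid = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [50, 100, 200], "max_depth": [2, 3, 5]},
        cv=10,                                   # 10-fold cross-validation as in Sec. 4.2
    )
    grid.fit(ablation_preds, ablation_labels)    # fit on the ablation set, not the train set

    print(grid.best_params_)                     # the paper reports 100 trees, maximum depth 3
    print("test accuracy:", grid.score(test_preds, test_labels))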
4.3. Results on Large Dataset

In this section, we train BERT on the Large Dataset (5 times the size of the Standard Dataset) for fewer epochs (1 instead of 5), so that overall the model is trained for the same number of iterations as on the Standard Dataset. With the same training details as described before (but only one epoch), training on the Large Dataset yields a higher test accuracy (90.36%) than training on the Standard Dataset (89.27%).

This suggests that, given enough computational capacity, we can further improve the model's generalization ability by training on a larger dataset.

4.4. Results on multi-label dataset

We further create a Multi-Label Dataset, which contains the same samples as the Standard Dataset, but with the negative samples labeled by error type (false sentence boundary, false word recognition, missing words, and repeating words) instead of uniformly labeled as negative.

A BERT model trained on this dataset reaches 85.01% classification test accuracy. The precision, recall and F1-score of each class are given in Table 2.

Table 2
Precision, Recall and F1-Score of each sample class

Sample Class              Precision   Recall   F1 Score   Support
Complete Sentence         0.87        0.94     0.90       109857
False Sentence Boundary   0.83        0.81     0.82       42677
False Word Recognition    0.84        0.70     0.77       21897
Missing Words             0.64        0.50     0.56       21711
Repeating Words           0.96        0.99     0.98       21781

The results show that the easiest task is identifying repeated words (F1-score near 0.98). Identifying complete sentences is also relatively easy, with an F1-score of 0.90. The hardest task is detecting whether words are missing from a sentence: the model achieves only 64% precision and 50% recall on this class.

The confusion matrix is shown in Fig. 2. It further shows that the classifier has difficulty distinguishing complete sentences from sentences with missing words, even though in most cases more than one word is missing from the erroneous sentences. This is understandable: not every word is indispensable, and even if some words are lost, the sentence often still makes sense grammatically, even if its meaning is not exactly the same.

Figure 2: Confusion matrix for BERT trained on the Multi-Label Dataset

4.5. Results on real-world ASR outputs

Finally, we test our trained multi-label BERT model on real-world ASR outputs from the CEASR corpus [15]. The predictions are shown in Fig. 3: the model captures real-world ASR errors correctly, and we also provide an example where it fails.

Figure 3: Predictions on real-world ASR outputs

5. Conclusion

In this paper, a dataset for detecting speech recognition errors was created, taking four typical types of speech recognition errors into account. Experimental results show that transformer models perform well on classifying the constructed dataset, with approximately 90% accuracy for BERT, GPT2 and BIG-BIRD. A Random Forest trained on top of the five models further improved the test accuracy to 90.51%. Overall, the results suggest that state-of-the-art transformer models can reliably detect errors in speech recognition systems and provide feedback for further improving such systems. In future work, special adjustments might be needed to better cope with identifying missing words in recognized sentences.

References

[1] D. Yu, L. Deng, Automatic speech recognition, volume 1, Springer, 2016.
[2] N. Agarwal, M. A. Wani, P. Bours, Lex-pos feature-based grammar error detection system for the English language, Electronics 9 (2020) 1686.
[3] Z. He, English grammar error detection using recurrent neural networks, Scientific Programming 2021 (2021).
[4] Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, M. Harper, Enriching speech recognition with automatic detection of sentence boundaries and disfluencies, IEEE Transactions on Audio, Speech, and Language Processing 14 (2006) 1526–1540. doi:10.1109/TASL.2006.878255.
[5] Y. Liu, A. Stolcke, E. Shriberg, M. Harper, Using conditional random fields for sentence boundary detection in speech, in: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), 2005, pp. 451–458.
[6] D. Tuggener, A. Aghaebrahimian, The Sentence End and Punctuation Prediction in NLG text (SEPP-NLG) shared task 2021, in: Swiss Text Analytics Conference – SwissText 2021, Online, 14–16 June 2021, CEUR Workshop Proceedings, 2021.
[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[8] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[9] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, et al., Big Bird: Transformers for longer sequences, in: NeurIPS, 2020.
[10] F. A. Gers, J. Schmidhuber, F. Cummins, Learning to forget: Continual prediction with LSTM, Neural Computation 12 (2000) 2451–2471.
[11] A. Thompson, All the news: 143,000 articles from 15 American publications, https://www.kaggle.com/snapcrack/all-the-news, 2017.
[12] N. Reimers, I. Gurevych, Making monolingual sentence embeddings multilingual using knowledge distillation, arXiv preprint arXiv:2004.09813 (2020).
[13] Wikimedia Foundation, Wikimedia downloads, https://dumps.wikimedia.org.
[14] K. Gopalakrishnan, B. Hedayatnia, Q. Chen, A. Gottardi, S. Kwatra, A. Venkatesh, R. Gabriel, D. Hakkani-Tür, Topical-Chat: Towards knowledge-grounded open-domain conversations, in: INTERSPEECH, 2019, pp. 1891–1895.
[15] M. A. Ulasik, M. Hürlimann, F. Germann, E. Gedik, F. Benites de Azevedo e Souza, M. Cieliebak, CEASR: A corpus for evaluating automatic speech recognition, in: 12th Language Resources and Evaluation Conference (LREC) 2020, European Language Resources Association, 2020, pp. 6477–6485.