Detection of Typical Sentence Errors in Speech Recognition Output

Bohan Wang1,†, Ke Wang1,†, Siran Li1,† and Mark Cieliebak2,∗

1 Section of Electrical and Electronic Engineering, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
2 Centre for Artificial Intelligence, Zurich University of Applied Sciences (ZHAW), Winterthur, Switzerland

SwissText 2022: Swiss Text Analytics Conference, June 08–10, 2022, Lugano, Switzerland
∗ Corresponding author.
† These authors contributed equally.
bohan.wang@epfl.ch (B. Wang); k.wang@epfl.ch (K. Wang); siran.li@epfl.ch (S. Li); ciel@zhaw.ch (M. Cieliebak)

Abstract
This paper presents a deep-learning-based model to detect the completeness and correctness of a sentence. The model is designed specifically for detecting errors in speech recognition output and takes several typical recognition errors into account: false sentence boundaries, missing words, repeated words and false word recognition. It can be applied to evaluate the quality of recognized transcripts, and the best model reaches over 90.5% accuracy in detecting whether the system has completely and correctly recognized a sentence.

1. Introduction

Automatic Speech Recognition (ASR) systems recognize and translate spoken language into text [1]. Sentence error detection on ASR output is important for two reasons: a) it helps to set proper punctuation marks; b) with multiple speakers, speaker recognition often fails at the change between two speakers, so that single words at the beginning or end of an utterance are assigned to the wrong person. A practical application of our work is to detect complete and correct sentences in ASR output in order to mitigate these problems.

Prior work has focused mainly on grammatical error detection [2, 3]. In this paper, we focus on the specific errors that emerge in speech recognition, such as missing words or incorrect sentence boundaries (detailed in Sec. 3.3). Previous work on enriching speech recognition concentrates on finding correct sentence boundaries in whole transcripts [4, 5]. In real-time speech recognition, however, we only have access to individual sentences rather than full transcripts, and these approaches do not take other typical speech recognition errors (apart from incorrect sentence boundaries) into account [6].

Recently, transformer models have shown state-of-the-art performance in generating word embeddings and extracting intrinsic features of word sequences. In particular, Bidirectional Encoder Representations from Transformers (BERT) [7], the Generative Pre-trained Transformer (GPT) [8] and BIG-BIRD [9] learn high-quality language representations from large amounts of raw text. The token representations produced by these transformers, pre-trained on unsupervised tasks, also improve the performance of supervised downstream tasks.

In this paper, we fine-tune the pre-trained transformers BERT, GPT2 and BIG-BIRD on the speech recognition error detection task to build binary classifiers that detect speech recognition errors. We also study the performance of sequentially linking BERT embeddings with a downstream text classification network. We compare and analyze the performance of these classification models and ensemble them with a Random Forest to further improve the results. Finally, we analyze the performance of a BERT-based classifier on a multi-label dataset.

The paper is structured as follows: In Sec. 2, we explain the models and the experimental design. In Sec. 3, we describe how the dataset is generated. We discuss the experimental results in Sec. 4.

2. Methods

2.1. Models

Three state-of-the-art transformer models are considered: BERT [7], GPT2 [8] and BIG-BIRD [9]. In addition, we test the performance of combining BERT embeddings with a downstream text classification network, using either a bidirectional LSTM or a TextCNN as the classifier. The TextCNN is a one-layer network with kernel sizes 2, 3 and 4. The LSTM classifier is a one-layer bidirectional LSTM [10] with 256 hidden states, followed by an attention layer and a fully connected layer. The attention layer turned out to be essential.
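The paper does not include an implementation of this downstream classifier. The following PyTorch sketch shows one way the combination described above (BERT embeddings feeding a one-layer bidirectional LSTM with 256 hidden states, an attention layer and a fully connected layer) could be assembled; the checkpoint name bert-base-uncased, the frozen BERT encoder and the simple additive attention are our assumptions, not details given in the paper.

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class BertBiLSTMAttention(nn.Module):
        """BERT embeddings -> one-layer BiLSTM -> attention -> fully connected classifier."""

        def __init__(self, bert_name="bert-base-uncased", hidden=256, num_classes=2):
            super().__init__()
            self.bert = AutoModel.from_pretrained(bert_name)
            for p in self.bert.parameters():      # assumption: BERT is used as a frozen embedding provider
                p.requires_grad = False
            self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                                num_layers=1, batch_first=True, bidirectional=True)
            self.attn = nn.Linear(2 * hidden, 1)  # one attention score per token (assumed attention form)
            self.fc = nn.Linear(2 * hidden, num_classes)

        def forward(self, input_ids, attention_mask):
            emb = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
            out, _ = self.lstm(emb)                                   # (B, T, 2*hidden)
            scores = self.attn(out).squeeze(-1)                       # (B, T)
            scores = scores.masked_fill(attention_mask == 0, -1e9)    # ignore padding positions
            weights = torch.softmax(scores, dim=-1).unsqueeze(-1)     # (B, T, 1)
            context = (weights * out).sum(dim=1)                      # attention-pooled sentence vector
            return self.fc(context)                                   # (B, num_classes) logits

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = BertBiLSTMAttention()
    batch = tokenizer(["i like you because you are beautiful", "i like you because you"],
                      padding=True, truncation=True, return_tensors="pt")
    logits = model(batch["input_ids"], batch["attention_mask"])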
2.2. Ensemble learning

We ensemble the five trained classifiers with a random forest. The configuration and the final classification performance are reported in Sec. 4.2.

3. Data preparation

3.1. Dataset sources

For the model to generalize well, the training set must come from diverse sources covering diverse topics and occasions. The following corpora are included in our dataset:

News reports [11]: 143,000 articles from 15 American publications
TED 2020 Parallel Sentences Corpus [12]: around 4,000 TED Talk transcripts from July 2020
Wikipedia corpus [13]: over 10 million topics
Topical-Chat [14]: nearly 10 thousand human dialog conversations spanning 8 broad topics

3.2. Dataset Creation

To make the selected datasets suitable for our speech recognition model, we remove non-English tokens, sentence-ending symbols ('.', '!', '?'), duplicated sentences, and short sentences (5 words or fewer), in order to avoid some recognition errors. After pre-processing the data from these sources, we create the following two datasets:

Standard Dataset: 0.3 million sentences from News reports, 0.3 million from the TED corpus, 0.3 million from the Wikipedia corpus and 0.2 million from Topical-Chat, 1.1 million sentences in total. We split the Standard Dataset randomly over all data sources into a train set, an ablation set and a test set with a proportion of 8:1:1.

Large Dataset: 2.3 million sentences from News reports, 0.4 million from the TED corpus, 2 million from the Wikipedia corpus and 0.2 million from Topical-Chat, 5 million sentences in total. We split it into a train set and a test set with a proportion of 19:1.

We train and compare the various models on the Standard Dataset. As a comparison, we evaluate BERT trained on the Large Dataset to see how an enlarged training set affects the generalization ability on this task.
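The exact cleaning pipeline is not specified beyond the description above. The sketch below illustrates one plausible filtering pass over a list of raw sentences; the isascii() test standing in for the non-English token filter is our assumption.

    import re

    SENTENCE_END = re.compile(r"[.!?]+$")

    def clean_corpus(sentences, min_words=6):
        """Drop non-English tokens, trailing '.', '!', '?', duplicates and sentences of 5 words or fewer."""
        seen, cleaned = set(), []
        for s in sentences:
            s = " ".join(t for t in s.split() if t.isascii())   # crude non-English filter (assumption)
            s = SENTENCE_END.sub("", s).strip()                  # remove sentence-ending symbols
            if len(s.split()) < min_words or s.lower() in seen:  # too short or duplicate
                continue
            seen.add(s.lower())
            cleaned.append(s)
        return cleaned

    print(clean_corpus(["Hello there!",
                        "The quick brown fox jumps over the lazy dog.",
                        "The quick brown fox jumps over the lazy dog."]))
    # -> ['The quick brown fox jumps over the lazy dog']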
3.3. Generating positive and negative samples

For positive samples, punctuation is removed (except in abbreviations and contractions such as it's, Mr., I've, etc.) and words are converted to lower case.

For negative samples, we mimic the typical errors of a speech recognition system, detailed below, and propose a corresponding generation method for each error type (a code sketch follows at the end of this section).

False sentence boundary: When a speech recognition system fails to correctly separate two sentences, the first sentence is cut off in the middle and part of it is assigned to the next sentence (illustrated in Fig. 1 (a)). For such negative samples, we group the sentences in threes and randomly re-split the three sentences into 2-4 sentences, so that on average the negative samples created this way have the same length as the positive samples. When choosing the random splitting points, the genuine sentence separation points, punctuation, and typical words for starting sub-sentences (e.g. that, which, because, etc.) are avoided, which reduces the probability that a generated sample is still a complete sentence by chance (e.g. 'I like you because you are beautiful' reduced to 'I like you').

Missing words: A speech recognition system can fail to recognize one or several words of a sentence, so that words are missing from the produced transcript (Fig. 1 (b)). For such negative samples, we randomly remove 1 word from sentences of up to 3 words and 2-4 words from longer sentences.

Repeating words: The system can record a speaker's unintended repeated words (Fig. 1 (c)). For such negative samples, we randomly repeat 1 word in sentences of up to 3 words and 1-3 words in longer sentences.

False word recognition: The system can mistakenly recognize one word as another (Fig. 1 (d)). For such negative samples, we randomly replace 1 word in sentences of up to 3 words and 1-3 words in longer sentences with random words from another sentence.

Finally, punctuation is removed and words are converted to lower case.

Figure 1: Typical errors in speech recognition system

After creating the positive and negative samples, sentences longer than 100 words are removed, since they are too long to appear in speech recognition output. We create the same number of negative samples as positive samples, so that the dataset is balanced. The ratio between the different types of negative samples is 2:1:1:1: False Sentence Boundary accounts for twice the number of the other negative sample types, since it yields two kinds of false sentences, those which are cut off and those which receive extra words.
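The corruption procedures are described only in prose. The sketch below shows, under our own simplifying assumptions (word counts drawn uniformly from the stated ranges; sentence-boundary errors are omitted because they operate on groups of three sentences), how the last three error types could be generated.

    import random

    def missing_words(words):
        """Drop 1 word from sentences of up to 3 words, otherwise 2-4 words (Sec. 3.3)."""
        k = 1 if len(words) <= 3 else random.randint(2, 4)
        keep = set(random.sample(range(len(words)), len(words) - k))
        return [w for i, w in enumerate(words) if i in keep]

    def repeating_words(words):
        """Duplicate 1 word in sentences of up to 3 words, otherwise 1-3 words."""
        k = 1 if len(words) <= 3 else random.randint(1, 3)
        out = list(words)
        for i in sorted(random.sample(range(len(words)), k), reverse=True):
            out.insert(i, out[i])          # repeat the chosen word in place
        return out

    def false_word_recognition(words, other_sentence):
        """Replace 1 word (short sentences) or 1-3 words with random words from another sentence."""
        k = 1 if len(words) <= 3 else random.randint(1, 3)
        out = list(words)
        for i in random.sample(range(len(words)), k):
            out[i] = random.choice(other_sentence)
        return out

    sent = "i like you because you are beautiful".split()
    print(missing_words(sent), repeating_words(sent), sep="\n")
    print(false_word_recognition(sent, "the weather is nice today".split()))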
4. Experiments and Discussion

In this section, we report the results of our experiments. We first describe the setup and then evaluate the different models in Sec. 4.1. In Sec. 4.2, we train a Random Forest classifier on top of the models to aggregate them and improve the performance. In Sec. 4.3, we compare the performance of BERT trained on the Standard and on the Large Dataset. Finally, we show the results of BERT trained on a multi-label dataset in Sec. 4.4.

Training details: We train each model for 5 epochs with batch size 64 using the Adam optimizer. The initial learning rate is 3e-5 for fine-tuning the transformer models and 1e-3 for the downstream classification networks. To prevent overfitting, we only save the model with the best performance on the test set after each epoch.
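The training code is not published with the paper; the following sketch shows a plausible fine-tuning loop using the hyperparameters stated above (5 epochs, batch size 64, Adam, learning rate 3e-5). The checkpoint name bert-base-uncased, the toy data and the collate function are our assumptions.

    import torch
    from torch.utils.data import DataLoader
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)   # 3e-5 for transformer fine-tuning

    # Toy stand-in for the Standard Dataset: (sentence, label) with 1 = proper sentence.
    train_pairs = [("i like you because you are beautiful", 1),
                   ("i like you because you", 0)]

    def collate(batch):
        texts, labels = zip(*batch)
        enc = tokenizer(list(texts), padding=True, truncation=True,
                        max_length=128, return_tensors="pt")
        enc["labels"] = torch.tensor(labels)
        return enc

    loader = DataLoader(train_pairs, batch_size=64, shuffle=True, collate_fn=collate)

    model.train()
    for epoch in range(5):                       # 5 epochs, as stated in the training details
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss           # cross-entropy over the two classes
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()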
4.1. Results on Standard Dataset

As explained in Sec. 2, we train five models on the Standard Dataset, which contains 1 million proper and 1 million non-proper sentences, and evaluate their performance. The results are presented in Table 1.

Table 1
Test accuracy of five models on the Standard Dataset

Model                        Test Accuracy
BERT                         89.27%
GPT-2                        88.67%
BIG-BIRD                     90.26%
BERT embedding + Bi-LSTM     86.33%
BERT embedding + TextCNN     81.40%

The fine-tuned transformers give clearly better results than the models that sequentially link BERT embeddings with either a BiLSTM or a TextCNN. BIG-BIRD performs best, with 90.26% test accuracy; BERT and GPT2 reach similar accuracies of 89.27% and 88.67%, respectively.

4.2. Ensemble learning with Random Forest

In this section, we combine the five trained models from Table 1 with a random forest to produce one predictive model. The idea is to train a random forest classifier on the combination of the classes predicted by the individual models; the random forest then produces the final classification through a majority-vote mechanism.

To prevent the random forest from overfitting the train set, it is fit on a separate ablation set instead of the train set on which the models were trained. The best parameters after 10-fold cross-validation are 100 decision trees with a maximum depth of 3. The test accuracy of the random forest reaches 90.51%, higher than the best individual model (90.26%), but not by a large margin. This is probably because the transformers (and the models built on their embeddings) share similar structures and rarely diverge in their decisions.
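A minimal version of this ensembling step, assuming the five models' predicted classes on the ablation and test sets have already been collected into arrays, could look as follows. The cross-validation grid is our own choice; only the selected values (100 trees, maximum depth 3) come from the paper, and the random data here is a placeholder.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Rows = samples, columns = predicted class (0/1) of the five classifiers
    # (BERT, GPT-2, BIG-BIRD, BERT+BiLSTM, BERT+TextCNN); random placeholder data.
    rng = np.random.default_rng(0)
    ablation_preds, ablation_labels = rng.integers(0, 2, (1000, 5)), rng.integers(0, 2, 1000)
    test_preds, test_labels = rng.integers(0, 2, (500, 5)), rng.integers(0, 2, 500)

    grid = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [50, 100, 200], "max_depth": [2, 3, 5]},
        cv=10,                                   # 10-fold cross-validation as in Sec. 4.2
    )
    grid.fit(ablation_preds, ablation_labels)    # fit on the ablation set, not the train set

    print(grid.best_params_)                     # the paper reports 100 trees, maximum depth 3
    print("test accuracy:", grid.score(test_preds, test_labels))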
4.3. Results on Large Dataset

In this section, we train BERT on the Large Dataset (5 times the size of the Standard Dataset) for fewer epochs (1 instead of 5), so that overall the model is trained for the same number of iterations as on the Standard Dataset. With the same training details as described before (but only one epoch), training on the Large Dataset yields a higher test accuracy (90.36%) than training on the Standard Dataset (89.27%).

This suggests that, given enough computational capacity, we can further improve the model's generalization ability by training on a larger dataset.

4.4. Results on multi-label dataset

We further create a Multi-Label Dataset, which contains the same samples as the Standard Dataset, but with the negative samples labeled by error type (false sentence boundary, false word recognition, missing words, and repeating words) instead of uniformly labeled as negative.

A BERT model trained on this dataset reaches 85.01% classification test accuracy. The precision, recall and F1-score of each class are given in Table 2.

Table 2
Precision, Recall and F1-Score of each sample class

Sample Class              Precision   Recall   F1 Score   Support
Complete Sentence         0.87        0.94     0.90       109857
False Sentence Boundary   0.83        0.81     0.82       42677
False Word Recognition    0.84        0.70     0.77       21897
Missing Words             0.64        0.50     0.56       21711
Repeating Words           0.96        0.99     0.98       21781

The results show that the easiest task is identifying repeated words (F1-score near 0.98). Identifying complete sentences is also relatively easy, with an F1-score of 0.90. The hardest task is detecting whether words are missing from a sentence: the model achieves only 64% precision and 50% recall on this class.

The confusion matrix is shown in Fig. 2. It further shows that the classifier has difficulty distinguishing complete sentences from sentences with missing words, even though in most cases more than one word is missing from the erroneous sentences. This is understandable: not every word is indispensable, and even if some words are lost, the sentence often still makes sense grammatically, even if its meaning is not exactly the same.

Figure 2: Confusion matrix for BERT trained on the Multi-Label Dataset

4.5. Results on real-world ASR outputs

Finally, we test our trained multi-label BERT model on real-world ASR outputs from the CEASR corpus [15]. The predictions are shown in Fig. 3: the model captures real-world ASR errors correctly, and we also provide an example where it fails.

Figure 3: Predictions on real-world ASR outputs

5. Conclusion

In this paper, a dataset for detecting speech recognition errors was created, taking four typical types of speech recognition errors into account. Experimental results show that transformer models perform well on classifying the constructed dataset, with approximately 90% accuracy for BERT, GPT2 and BIG-BIRD. A Random Forest trained on top of the five models further improved the test accuracy to 90.51%. Overall, the results suggest that state-of-the-art transformer models can reliably detect errors in speech recognition systems and provide feedback for further improving such systems. In future work, special adjustments might be needed to better cope with identifying missing words in recognized sentences.

References

[1] D. Yu, L. Deng, Automatic speech recognition, volume 1, Springer, 2016.
[2] N. Agarwal, M. A. Wani, P. Bours, Lex-pos feature-based grammar error detection system for the English language, Electronics 9 (2020) 1686.
[3] Z. He, English grammar error detection using recurrent neural networks, Scientific Programming 2021 (2021).
[4] Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, M. Harper, Enriching speech recognition with automatic detection of sentence boundaries and disfluencies, IEEE Transactions on Audio, Speech, and Language Processing 14 (2006) 1526–1540. doi:10.1109/TASL.2006.878255.
[5] Y. Liu, A. Stolcke, E. Shriberg, M. Harper, Using conditional random fields for sentence boundary detection in speech, in: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), 2005, pp. 451–458.
[6] D. Tuggener, A. Aghaebrahimian, The Sentence End and Punctuation Prediction in NLG text (SEPP-NLG) shared task 2021, in: Swiss Text Analytics Conference – SwissText 2021, Online, 14–16 June 2021, CEUR Workshop Proceedings, 2021.
[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[8] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[9] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, et al., Big Bird: Transformers for longer sequences, in: NeurIPS, 2020.
[10] F. A. Gers, J. Schmidhuber, F. Cummins, Learning to forget: Continual prediction with LSTM, Neural Computation 12 (2000) 2451–2471.
[11] A. Thompson, All the news: 143,000 articles from 15 American publications, https://www.kaggle.com/snapcrack/all-the-news, 2017.
[12] N. Reimers, I. Gurevych, Making monolingual sentence embeddings multilingual using knowledge distillation, arXiv preprint arXiv:2004.09813 (2020).
[13] Wikimedia Foundation, Wikimedia downloads, https://dumps.wikimedia.org.
[14] K. Gopalakrishnan, B. Hedayatnia, Q. Chen, A. Gottardi, S. Kwatra, A. Venkatesh, R. Gabriel, D. Hakkani-Tür, Topical-Chat: Towards knowledge-grounded open-domain conversations, in: INTERSPEECH, 2019, pp. 1891–1895.
[15] M. A. Ulasik, M. Hürlimann, F. Germann, E. Gedik, F. Benites de Azevedo e Souza, M. Cieliebak, CEASR: A corpus for evaluating automatic speech recognition, in: 12th Language Resources and Evaluation Conference (LREC) 2020, European Language Resources Association, 2020, pp. 6477–6485.