Proceedings of the VIII International Conference "Distributed Computing and Grid-technologies in Science and Education" (GRID 2018), Dubna, Moscow region, Russia, September 10-14, 2018

RUSSIAN-LANGUAGE SPEECH RECOGNITION SYSTEM BASED ON DEEPSPEECH

O.O. Iakushkin^a, G.A. Fedoseev, A.S. Shaleva, A.B. Degtyarev, O.S. Sedova

Saint Petersburg State University, 7/9 Universitetskaya nab., St. Petersburg, 199034, Russia

E-mail: ^a o.yakushkin@spbu.ru

The paper examines the practical issues in developing a speech-to-text system using deep neural networks. The development of a Russian-language speech recognition system based on the DeepSpeech architecture is described. The Mozilla company's open-source implementation of DeepSpeech for the English language was used as a starting point. The system was trained in a containerized environment using Docker technology. This allowed us to describe the entire process of component assembly from source code, including a number of CPU and GPU optimization techniques, and makes it easy to reproduce the computation optimization tests on alternative infrastructures. We examined the use of the TensorFlow XLA technology that optimizes linear algebra computations in the course of neural network training. The number of nodes in the internal layers of the neural network was optimized based on the word error rate (WER) obtained on a test data set, with regard to GPU memory limitations. We studied the use of probabilistic language models with various maximum lengths of word sequences and selected the model that showed the best WER. Our study resulted in a Russian-language acoustic model trained on a data set comprising audio and subtitles from YouTube video clips. The language model was built from the texts of the subtitles and a publicly available Russian-language corpus of popular Wikipedia articles. The resulting system was tested on a data set consisting of audio recordings of Russian literature available on voxforge.org; the best WER demonstrated by the system was 18%.

Keywords: Deep Neural Network, Speech, Neural Network Training

© 2018 Oleg O. Iakushkin, George A. Fedoseev, Anna S. Shaleva, Alexander B. Degtyarev, Olga S. Sedova

1. Introduction

Currently, there are few open-source speech recognition systems with close-to-human performance quality. This is especially true for the Russian language. This paper describes the use of deep neural networks to build an open-source speech recognition system. The system is designed to recognize Russian-language speech segments with an average duration of 5-15 seconds in a cleaned dataset and is intended to deliver an average WER (Word Error Rate) under 30%.

Mozilla DeepSpeech, an open-source end-to-end DNN architecture, was chosen as the starting point for our system. Mozilla DeepSpeech is based on the papers of the Baidu Research team [1]. The project was launched in May 2016 and, in November 2017, reached its lowest WER of 6.5% on the English-language LibriSpeech 'test-clean' dataset. The architecture of Mozilla DeepSpeech is not tied to any particular language and employs the TensorFlow machine learning framework to describe the structure of the neural network and to optimize computations.

The Word Error Rate (WER) metric is used to evaluate the model's quality at the word level. The WER is a normalized Levenshtein distance between two word sequences, averaged over all samples. It is defined as follows: WER = (I + D + S) / N, where I is the number of insertions, D is the number of deletions, S is the number of substitutions, and N is the number of words in the reference.
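To make the metric concrete, the listing below is a minimal Python sketch of the word-level WER computation (the wer function is our own illustration, not part of Mozilla DeepSpeech): insertions, deletions and substitutions are counted with the standard dynamic-programming alignment and normalized by the reference length.

```python
# Minimal illustrative sketch of word-level WER; not taken from DeepSpeech.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimal number of edits (I + D + S) turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)        # (I + D + S) / N

# One substitution and one deletion over a four-word reference -> WER = 0.5
print(wer("мама мыла раму утром", "мама мыла рамп"))
```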
2. Approach

Many contemporary end-to-end speech recognition systems are based on Connectionist Temporal Classification (CTC), an approach that became a major breakthrough after it was introduced by Alex Graves in [2]. It describes the output layer of the neural network and the loss function computed from its values. CTC does not restrict the choice of network architecture and eliminates the need for meticulous data annotation in sequence recognition tasks [3]. In speech recognition, the data annotation problem lies in the need to align sequences of audio frames with sequences of letters in conditions where one character may correspond to many audio frames. The CTC method also provides a differentiable function called the CTC loss function, which makes it applicable to training by the stochastic gradient descent method.

2.1. Acoustic model

Let us introduce a set of characters A (the alphabet) of size |A| that consists of the lowercase letters of the language in question. Let us also consider the alphabet A' = A ∪ {blank} of size |A'| = n; the blank character is required by the CTC method. Let us then consider an input sequence x of T feature vectors. Each feature vector in the sequence is the result of mel-frequency cepstral coefficient (MFCC) extraction from a short audio signal that usually has a duration of 20 milliseconds. The extracted MFCC features provide a quantitative description of the audio and are further used for character prediction. The MFCC extraction algorithm allows us to define the number of features (let it be m). Let us also define a recurrent neural network with m inputs, n outputs, and a weight vector ω. The weight vector determines the mapping N_ω that transforms the input sequence x into a sequence of vectors of length n:

N_ω : (R^m)^T → (R^n)^T.

2.2. Neural network structure

We use a neural network that consists of five hidden layers. The input layer has m inputs that correspond to the m MFCC features derived from a short audio signal. The second and third layers are similar to the first one. In the first three layers, the nodes of neighbouring layers are fully connected and use the ReLU (Rectified Linear Unit) activation function: ReLU(x) = max(0, x). The fourth layer is a bidirectional recurrent layer with LSTM units that uses the hyperbolic tangent as the activation function. Its output goes to the fifth layer, which again uses ReLU as the activation function. Finally, there is an output layer of size n = |A'|, where the output value of each node is proportional to the probability of the respective character of the alphabet. Dropout with a probability of 0.3 is applied to all five hidden layers to prevent the neural network from overfitting.
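For illustration only, this layer layout can be sketched with TensorFlow's Keras API roughly as follows. This is a simplified sketch and not Mozilla DeepSpeech's actual implementation; the function name build_acoustic_model and the default values of n_mfcc and alphabet_size are our own placeholder assumptions, while n_hidden corresponds to the hyperparameter discussed in Section 4.1.

```python
# Illustrative sketch of the five-hidden-layer network from Section 2.2
# (not Mozilla DeepSpeech's actual code).
import tensorflow as tf

def build_acoustic_model(n_mfcc=26, n_hidden=2048, alphabet_size=33):
    # Input: a sequence of MFCC feature vectors, one per short audio frame.
    inputs = tf.keras.Input(shape=(None, n_mfcc))
    x = inputs
    # Layers 1-3: fully connected, ReLU activation, dropout 0.3.
    for _ in range(3):
        x = tf.keras.layers.Dense(n_hidden, activation="relu")(x)
        x = tf.keras.layers.Dropout(0.3)(x)
    # Layer 4: bidirectional recurrent layer with LSTM units (tanh activation).
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(n_hidden, return_sequences=True))(x)
    x = tf.keras.layers.Dropout(0.3)(x)
    # Layer 5: fully connected, ReLU activation.
    x = tf.keras.layers.Dense(n_hidden, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.3)(x)
    # Output layer of size |A'| = alphabet + CTC blank (blank assumed to be
    # the last class); softmax gives per-frame character probabilities.
    outputs = tf.keras.layers.Dense(alphabet_size + 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```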
3. Training Details

The neural network is trained on aligned audio-transcript pairs. The training dataset is divided into three subsets (train, dev and test sets) in the proportion 60:20:20. The following algorithm is used for training:

1. The MFCC feature vectors extracted from an audio signal are fed to the input layer of the neural network.
2. The data are transformed by the weight values, which results in a matrix of symbol predictions. Each column of the matrix is a probability distribution over the characters of the alphabet A' at time t.
3. The CTC loss function is computed for the sample from the prediction matrix and the expected (reference) transcript. The loss value is used to adjust the weights ω of the neural network N. The adjustment employs the Adam algorithm, a modification of the stochastic gradient descent method [4].
4. The described process is reiterated as long as the error value obtained on the validation set continues to decrease; that is, the acoustic model is enhanced iteratively.

The neural network's weights are updated based on the mean value of the CTC loss function over a group (batch) of samples.
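One such update step can be sketched schematically as follows, assuming the acoustic model of Section 2.2 returns per-frame character probabilities with the CTC blank as the last class. This is our own illustration built on TensorFlow's ctc_loss and Adam optimizer, not Mozilla DeepSpeech's training code, and train_step is a hypothetical helper.

```python
# Illustrative single training step (CTC loss + Adam); not DeepSpeech's code.
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)  # cf. learning_rate in Section 4.1

def train_step(model, mfcc_batch, labels, frame_lengths, label_lengths):
    with tf.GradientTape() as tape:
        probs = model(mfcc_batch, training=True)  # shape (batch, time, |A'|)
        # Softmax outputs are re-normalized inside ctc_loss, so their logs
        # can be passed in place of raw logits.
        logits = tf.math.log(probs + 1e-8)
        loss = tf.reduce_mean(tf.nn.ctc_loss(
            labels=labels,                 # character indices per sample
            logits=logits,
            label_length=label_lengths,    # transcript lengths in characters
            logit_length=frame_lengths,    # number of MFCC frames per sample
            logits_time_major=False,
            blank_index=-1))               # the last class is the CTC blank
    # Adjust the weights with Adam based on the mean CTC loss of the batch.
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```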
The DeepSpeech architecture employs a probabilistic language model to improve the accuracy of speech recognition. This model is essentially a dataset that contains the estimated probabilities of word sequences in a language. Each sequence ranges in length from 1 to N words and has a number assigned to it that corresponds to the probability of the sequence. Word sequences consisting of n words are called n-grams. The maximum length N of a word sequence defines the order of the model. If the length of a sequence exceeds the model's order, or the sequence is not found in the dataset, its probability is estimated from the probabilities of the shorter sequences (down to unigrams) it consists of. Mozilla DeepSpeech uses the KenLM toolkit [5] to query the language model; the principles of building a probabilistic language model are described in [5, 6]. The language model is used in the beam search algorithm to increase the probability of beams that contain more frequently used word sequences, which in turn reduces the probability of beams that contain mistakes or unrealistic sequences.

4. Experiments

We used a computer provided by the SPbU Computing Center for the computations required by our task. The computer's technical parameters are as follows:
- CPU: 2 x Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60 GHz;
- RAM: 256 GB;
- GPUs: 2 x Nvidia Tesla P100, 16 GB each.

4.1. Training hyperparameters

The following hyperparameters allow fine-tuning of the training process:

1. n_hidden – the number of nodes in the hidden layers of the neural network. It is recommended that this parameter be defined in accordance with the amount and variety of training data [7]. In Mozilla DeepSpeech, the default value for the English-language model is 2048. The maximum value is limited by the available video memory.
2. learning_rate – the learning rate; the default value in Mozilla DeepSpeech is 0.0001.
3. epoch – the number of epochs, where an epoch is one complete forward pass and one complete backward pass over all the training samples in a dataset.
4. train_batch_size, dev_batch_size, test_batch_size – the number of samples in each of the batches used for training, validation and testing respectively. These should be set to the maximum value allowed by the video memory (usually from 8 to 64).
5. dropout_rate – the probability of a node being dropped during one training step. This parameter is believed to prevent the network from overfitting [8]. The default value is 0.3.

4.2. Data

Deep neural networks require 500-2000 hours of recorded speech with corresponding transcripts for speech recognition training. We studied several sources of Russian-language speech and various methods of automatic dataset building. The resulting training dataset includes:
- 'yt-vad-1k', a corpus containing about 1000 hours of audio extracted from YouTube videos created by over 1000 people in a variety of recording conditions and with varying degrees of background noise. The corpus also contains both user-provided and auto-generated subtitles (automatic captions) for the extracted audio;
- 'voxforge-ru-clean', a corpus containing 11.5 hours of audio from voxforge.org with transcripts;
- 'yt-vad-650-clean', a corpus of cleaned audio samples that contains 650 hours of audio.

The audio files are in the 'wav' format and have one audio channel (mono), a sample rate of 16,000 Hz, and a depth of 16 bits per value. The ratio between the audio length and the number of symbols in the transcript is subject to a restriction that follows from the CTC matrix decoding algorithm: the number of steps in the CTC matrix must exceed the number of symbols in the transcript.
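The format and length requirements above can be verified per sample roughly as follows; check_sample is our own illustrative helper (not part of DeepSpeech), and the assumed 20 ms MFCC window with a 10 ms step is a placeholder that may differ from the actual feature-extraction settings.

```python
# Illustrative per-sample check of the corpus format and the CTC length
# restriction from Section 4.2; helper and frame settings are assumptions.
import wave

FRAME_MS, STEP_MS = 20, 10  # assumed MFCC window and step

def check_sample(wav_path: str, transcript: str) -> bool:
    with wave.open(wav_path, "rb") as w:
        assert w.getnchannels() == 1, "expected mono audio"
        assert w.getframerate() == 16000, "expected a 16,000 Hz sample rate"
        assert w.getsampwidth() == 2, "expected 16-bit samples"
        duration_ms = 1000.0 * w.getnframes() / w.getframerate()
    # The number of CTC steps roughly equals the number of MFCC frames.
    n_ctc_steps = int((duration_ms - FRAME_MS) // STEP_MS) + 1
    # CTC decoding requires more time steps than transcript symbols.
    return n_ctc_steps > len(transcript)
```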
4.3. Training

We obtained a minimum WER of 27% as a result of training on the 'yt-vad-1k-train' dataset. The use of the language model reduced the WER on the 'voxforge-ru-clean-test' dataset by 10 percentage points (from 37% to 27%). The model trained on the 'yt-vad-650-clean' corpus, which was cleaner than 'yt-vad-1k', showed a slight increase in the minimum WER. This can be explained by the smaller amount of data: 650 hours compared to 1000 hours in the first experiment. However, testing with the language model delivers a better WER in both experiments. We managed to reduce the minimum WER from 27% to 22% by further training the model already pre-trained on 'yt-vad-1k'.

Table 1. Training evaluation (WER, %)

                                            Testing without LM                 Testing with LM
Train dataset                               yt-vad-650-    voxforge-ru-        yt-vad-650-    voxforge-ru-
                                            clean-test     clean-test          clean-test     clean-test
yt-vad-1k-train                             35.17          37                  35.2           27
yt-vad-650-clean-train                      37.5           39.8                35             28
yt-vad-1k-train + yt-vad-650-clean-train    35.3           33.2                33             22

5. Acknowledgement

This research was supported by the SPbU (Saint Petersburg State University) grant no. AAAA-A18-118071790047-9 (id: 28612502). The authors would like to thank the reviewers for the valuable recommendations that helped improve this paper.

6. Conclusion

We have developed a Russian-language speech recognition system that delivers a WER of 22% on the 'voxforge-ru-clean-test' dataset. We have employed it to perform speech-based text search in a large collection of videos. This system can be useful in a variety of tasks involving speech-based search, though tasks such as speech recognition by voice assistants require higher recognition accuracy. Improving the speech recognition accuracy will be the goal of further research.

References

[1] Hannun A. et al. Deep Speech: Scaling up end-to-end speech recognition // arXiv preprint arXiv:1412.5567, 2014.
[2] Graves A. et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks // Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369-376.
[3] Graves A. Supervised sequence labelling // Supervised Sequence Labelling with Recurrent Neural Networks. Springer, 2012, pp. 52-73.
[4] Kingma D.P., Ba J. Adam: A method for stochastic optimization // arXiv preprint arXiv:1412.6980, 2014.
[5] Heafield K. KenLM: Faster and smaller language model queries // Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2011, pp. 187-197.
[6] Heafield K. et al. Scalable modified Kneser-Ney language model estimation // Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vol. 2, 2013, pp. 690-696.
[7] Weigend A. On overfitting and the effective number of hidden units // Proceedings of the 1993 Connectionist Models Summer School, Vol. 1, 1994, pp. 335-342.
[8] Srivastava N. et al. Dropout: A simple way to prevent neural networks from overfitting // The Journal of Machine Learning Research, 2014, Vol. 15, pp. 1929-1958.