Natural Language Technology to Ensure the Safety of Speech Information

Ievgen Iosifov 1,2, Olena Iosifova 1, Volodymyr Sokolov 2, Pavlo Skladannyi 2, and Igor Sukaylo 2

1 Ender Turing OÜ, ½ Padriku str., Tallinn, 11912, Estonia
2 Borys Grinchenko Kyiv University, 18/2 Bulvarno-Kudriavska str., Kyiv, 04053, Ukraine

Abstract
This paper focuses on the Natural Language Processing (NLP) and speech area: it describes the most prominent approaches and techniques, formulates requirements for the datasets used to train text and speech models, compares the major toolkits and techniques, and outlines trends in the NLP and speech domain.

Keywords
Neural network, natural language technology, natural language processing, automatic speech recognition, deep learning, encoder, decoder, word embedding, hidden Markov model.

CPITS-II-2021: Cybersecurity Providing in Information and Telecommunication Systems, October 26, 2021, Kyiv, Ukraine
EMAIL: ei@enderturing.com (I. Iosifov); oi@enderturing.com (O. Iosifova); v.sokolov@kubg.edu.ua (V. Sokolov); p.skladannyi@kubg.edu.ua (P. Skladannyi); i.sukailo.asp@kubg.edu.ua (I. Sukaylo)
ORCID: 0000-0001-6203-9945 (I. Iosifov); 0000-0001-6507-0761 (O. Iosifova); 0000-0002-9349-7946 (V. Sokolov); 0000-0002-7775-6039 (P. Skladannyi); 0000-0003-1608-3149 (I. Sukaylo)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

1. Introduction

Significant advances in Deep Learning (DL) over the last decade have uncovered new possibilities and demands for businesses, governments, and citizens. Such advances in Natural Language Technology (NLT) allow businesses to automate most routine and tedious tasks in communication with customers and redirect people's minds to more exciting and creative tasks [001]. To fully leverage NLT, two central technology stacks have to be combined:
• Speech technologies—to convert speech into text and vice versa.
• NLP—to understand, interpret, and generate information in text form.

This work reviews existing knowledge, directions, and avenues for future research in the increasingly important NLT/NLP domain. NLP is a field of artificial intelligence that helps computers understand and generate text. NLP is broadly used in many tasks: dialogue systems, sentiment analysis, machine translation, information retrieval, summarization, question answering, etc. During the last decade, there were several breakthroughs in DL, first in image recognition and later in natural language, which attracted colossal interest from researchers and businesses. We review the most prominent and fundamental techniques that significantly improved machine skills in natural language: Recurrent Neural Networks (RNNs), the embedding concept, the encoder-decoder concept, and, briefly, attention and transformers. Without these techniques, it is hard to imagine the current interest in the NLP field.

Automatic Speech Recognition (ASR) and speech generation are techniques for converting human speech to text and back. Ever since a structured communication system called language evolved, speech has been the main instrument of communication between human beings. For machines, such a language is digits, and it took many iterations to map human speech into a machine-understandable representation. Sect. 2 reviews the latest and most promising techniques, such as hybrid Hidden Markov Model (HMM) systems and end-to-end systems, both combined with Deep Neural Networks (DNNs). Additionally, data requirements are reviewed in Sect. 3, since with currently available frameworks the quality and relevance of the input data contribute a major, if not overwhelming, part of the resulting model quality. In Sect. 4, a comparative analysis of approaches and frameworks is presented. Evaluation metrics for NLP and ASR systems are presented in Sect. 5, end-to-end training approaches in Sect. 6, and trends in Sect. 7.
2. High-Level Overview of Natural Language Technology

2.1. Natural Language Processing Techniques Review

Below we review the main breakthroughs in the NLP area over the last decade. We start with RNNs as the central concept in NLP (recurrence, i.e., combining information from previous iterations), then present more advanced feature-engineering techniques (not just one-hot encoding of the words in a dataset, but complex vector representations that capture context and additional information related to a word). Since machine translation was the main starting point of the NLP area, it is natural that the encoder-decoder and sequence-to-sequence concepts evolved, and they are covered as well. Attention and the transformer are reviewed as the latest achievements in the NLP area.

2.1.1. Recurrent Neural Networks

RNNs were the primary building block for NLP tasks for an extended period. The main difference between RNNs and other DL architectures is the ability to remember data across a sequence, not only for the last cell (word/token). The network takes X as the input vector (usually encoded word representations) and produces Y as the output vector. Each RNN cell takes the current input x_t and the previous hidden state (activation) h_{t-1}, which stores information extracted during previous iterations. The network learns the weights (parameters) W_h, W_x and the bias b_a through the training process. At each step of forward propagation, a non-linear activation function g such as tanh (or similar) is applied to calculate the output hidden state (activation):

h_t = g(W_h h_{t-1} + W_x x_t + b_a). (1)

Additionally, a softmax activation g may be applied at the end if the task requires output predictions:

y_t = g(W_y h_t + b_y). (2)

Most important for NLP tasks is that the output includes information from all previous steps, not only the last one. This is crucial mainly because of the nature of language: the last word (token) alone is not enough to understand the context of a sentence. Such a connection is called a recurrent connection (see Fig. 1).

Figure 1: RNN design

This concept of recurrent connections and context significantly shaped the current state of the NLP area. RNNs have many disadvantages, though, such as unidirectionality and problems with capturing mid- and long-term dependencies inside a sequence. Today it is rare to find a plain RNN as the underlying architecture. More complicated architectures based on RNNs arrived, like Gated Recurrent Units (GRU) and Long Short-Term Memory (LSTM), and some architectures came to NLP from image recognition, like Convolutional Neural Networks (CNNs) [1].
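As an illustration of Eqs. (1)–(2), below is a minimal NumPy sketch of the forward pass of a single RNN layer unrolled over a sequence. The dimensions, weight initialization, and variable names are illustrative assumptions, and training (backpropagation through time) is not shown.

```python
import numpy as np

def rnn_forward(x_seq, Wh, Wx, Wy, ba, by):
    """Run a single-layer RNN over a sequence, following Eqs. (1)-(2).

    x_seq: array of shape (T, input_dim), one (embedded) token per step.
    Returns hidden states (T, hidden_dim) and output distributions (T, vocab_dim).
    """
    h = np.zeros(Wh.shape[0])                 # h_0: initial hidden state
    hs, ys = [], []
    for x_t in x_seq:
        # Eq. (1): combine previous hidden state and current input
        h = np.tanh(Wh @ h + Wx @ x_t + ba)
        # Eq. (2): project to the vocabulary and normalize with softmax
        logits = Wy @ h + by
        y = np.exp(logits - logits.max())
        y /= y.sum()
        hs.append(h)
        ys.append(y)
    return np.stack(hs), np.stack(ys)

# Toy dimensions and random weights (illustration only)
rng = np.random.default_rng(0)
input_dim, hidden_dim, vocab_dim, T = 8, 16, 10, 5
Wh = 0.1 * rng.normal(size=(hidden_dim, hidden_dim))
Wx = 0.1 * rng.normal(size=(hidden_dim, input_dim))
Wy = 0.1 * rng.normal(size=(vocab_dim, hidden_dim))
ba, by = np.zeros(hidden_dim), np.zeros(vocab_dim)
hs, ys = rnn_forward(rng.normal(size=(T, input_dim)), Wh, Wx, Wy, ba, by)
print(hs.shape, ys.shape)  # (5, 16) (5, 10)
```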
2.1.2. Word and Contextual Embedding Approaches

The primary purpose of embedding is to represent tokens (documents, phrases, contexts, pieces of words, or characters) as numerical vectors, so that neural networks can calculate and use probability distributions or likelihoods to separate semantically similar categories. Different tokens with similar meanings then have closer vectors, while groups of tokens with different meanings can be separated in the vector space. [4] made famous the idea that "a word is characterized by the company it keeps." Lately, new approaches have appeared. A contextual embedding [5] is a representation of a token within its context: during the embedding process, information about a token's occurrence in different contexts is taken into account [6].

2.1.3. High-Level Encoder-Decoder Architecture

The encoder-decoder approach became a breakthrough and led to a significant increase in the performance of language models. The input sequence ["What's" | "up" | "?"] gets a numerical representation at the embedding layer. This numerical representation is then fed sequentially to the RNN. After all inputs have passed through, the RNN produces an output. This part is called the encoder—it encodes the input sequence (see Fig. 2). The result of the encoder is transferred to the decoder. The decoder generates predictions of the resulting sequence until it reaches the end-of-sentence token.

Figure 2: An example of an encoder-decoder architecture

The most outstanding achievements of the encoder-decoder are the possibility to build proper end-to-end models and the ability to handle input and output sequences of different lengths. The problem of mismatched input and output lengths is especially relevant in neural machine translation. The encoder-decoder architecture is usually based on two RNNs or LSTMs: the encoder encodes the whole input sequence and stores the information in the encoder vector, and the decoder produces the resulting predictions.

2.1.4. Attention

The main limitation of RNNs is tracking dependencies in long sentences. Long sentences (more than 20 words) simply cannot be stored effectively in the output vector of an RNN. That is why researchers came up with the attention mechanism. The idea of attention is the same as attention in the human reading process. For humans, a few words from a sentence are enough to understand it well; during translation, a human needs just a few main words, and all other words simply stay out of attention. The same holds for the attention mechanism: the decoder focuses on a particular part of the source at each step (increased saturation represents more attention), not on the whole input sequence. The attention mechanism gives the decoder this possibility through attention weights and the context vector [14].

2.1.5. Transformers

The transformer is one of the latest breakthroughs that accelerated NLP significantly. The transformer is an architecture built on top of the encoder-decoder concept and based heavily on the attention concept. The main breakthrough was parallelization, achieved by replacing sequential computation (RNNs or CNNs) with an attention-based network. The main components and concepts of this architecture are described below. The encoder consists of multiple stacked self-attention and feed-forward layers with residual connections and a positional encoder. An embedding layer is usually applied at the bottom to convert the input sequence into a numerical representation. The feed-forward network has no sequential dependencies and thus can be parallelized. That is the essential concept behind the transformer's ability to learn on amounts of data that LSTMs and GRUs cannot afford. The decoder also consists of multiple (equal in number to the encoder) stacked self-attention and feed-forward layers with residual connections, plus an encoder-decoder attention layer in the middle. Compared to the encoder, the decoder's self-attention layer differs: the main idea is masking future positions. In the encoder, each position can attend to all positions, while in the decoder, to prevent leftward information flow and preserve the auto-regressive property, each position can attend only to earlier positions in the output sequence [13].
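To make attention weights and the decoder's masking concrete, below is a minimal NumPy sketch of scaled dot-product attention with an optional causal mask. The learned query/key/value projections and the multi-head machinery of the full transformer are omitted, and all names and dimensions are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Scaled dot-product attention; with causal=True, future positions are
    masked out, as in the transformer decoder's self-attention."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (T_q, T_k) similarities
    if causal:
        # Each position may attend only to itself and earlier positions
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # attention weights per query
    return weights @ V, weights                       # context vectors, weights

# Toy self-attention over a 4-token sequence of 8-dimensional states
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
context, w = scaled_dot_product_attention(X, X, X, causal=True)
print(w.round(2))  # entries above the diagonal are zero: no peeking ahead
```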
2.2. Speech Techniques Review

The main goal of speech systems is to convert an input audio wave sequence into a text representation in the case of an ASR system, and vice versa in the case of speech generation from text (text-to-speech, TTS) [18]. There are two main approaches here:
• Hybrid models based on HMM-DNN ASR systems.
• End-to-end ASR systems.

2.2.1. Feature Representation

To represent the input audio sequence in a format a machine understands, we have to apply some transformations. Research shows that it is not good enough to convert input waves into digits of the corresponding amplitudes by simply sampling the audio signal: such features are very uninformative for the training process, which needs to squeeze out as much information as possible to remember and generalize the audio signal. A spectrogram obtained with the Fast Fourier Transform (FFT), or a similar non-linear transformation, represents time, frequency, and energy at each point in time; the transformation produces features in the form of acoustic frames (20–40 ms). Mel-Frequency Cepstral Coefficients (MFCCs) or perceptual linear prediction are the common choices of non-linear transformation techniques for feature extraction from ASR data [19]. The classical FFT spectrogram is less suitable for discovering patterns (see Fig. 3).

Figure 3: Spectrogram of (a) FFT and (b) MFCC

2.2.2. Hidden Markov Model

An HMM was used to reconstruct the utterance that has just been said by putting the correct phonemes one after another. This is done using statistical probabilities that one phoneme follows another. In simplified terms, an HMM-based recognizer consists of three layers (see Fig. 4):
1. The heart of the HMM model is the acoustic model, which checks, on the acoustic level, the probability that a recognized phoneme really is that phoneme.
2. Then the lexicon (pronunciation) model is applied, checking the probability that the recognized phonemes can stand next to each other.
3. Finally, the language model (usually n-gram based) checks, on the word level, whether the words standing next to each other make sense. For example, such a model will choose "cat paws" instead of "cat pause" [21].

Figure 4: Hybrid three-level ASR

2.2.3. End-to-End Model

As can be seen from the above, one of the significant limitations of HMM models is phoneme-to-grapheme mapping. This problem is especially acute for low-resource languages where no one has prepared such a mapping, and preparing a dataset for it can be very time-consuming. That is one reason why simplified end-to-end models evolved. The main inspiration was to train the model with as few labeling and intermediate steps as possible: the model should learn by itself to map phonemes to graphemes, directly or indirectly, using the same input data used for training. Another motivation is to move into the area of unsupervised training, to exploit the vast amount of unlabeled audio data stored on the Internet. There are several varieties of end-to-end ASR architectures, but all of them are built on two types: Connectionist Temporal Classification (CTC) and sequence-to-sequence (encoder-decoder based). Before CTC, the main limitation of end-to-end ASR systems was that the model needed the whole sentence to start the recognition, which means no streaming decoding was possible. CTC maps an input sequence X (MFCC features) to an output sequence Y (letters). One of the CTC breakthroughs is the introduction of a kind of local attention, which splits the continuous speech, and the current modeling unit then uses attention on each split segment. By doing this, the whole utterance is split into small segments, and local attention is used to predict the output units (letters) [26].
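To make the CTC idea more concrete, the following minimal PyTorch sketch computes a CTC loss over random per-frame scores that stand in for a real acoustic model's output, followed by greedy decoding (collapse repeats, then drop blanks). The tensor shapes and the 28-symbol alphabet are illustrative assumptions, not part of any specific toolkit.

```python
import torch
import torch.nn as nn

# Per-frame scores over a small character set (28 symbols: blank + 26 letters + space)
T, N, C, S = 50, 2, 28, 10                     # frames, batch, classes, max target length
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)
targets = torch.randint(1, C, (N, S), dtype=torch.long)   # label indices; 0 is the blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([10, 7], dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                # gradients flow back into the acoustic model
print(loss.item())

# Greedy CTC decoding for the first utterance: collapse repeats, then drop blanks
best = log_probs[:, 0, :].argmax(dim=1)
decoded = [int(p) for i, p in enumerate(best)
           if p != 0 and (i == 0 or p != best[i - 1])]
```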
2.3. Text-to-Speech

Standard TTS systems consist of a few parts:
• RNN modifications (LSTM, GRU, etc.) for a recurrent sequence-to-sequence feature prediction network that maps character embeddings to MFCC spectrograms (these components are very similar to the ones described above).
• A vocoder system that synthesizes waveforms from those spectrograms.

The relationship between linguistic features and the vocoder parameters that represent the vocal cord and vocal tract characteristics is learned by acoustic models. At the synthesis stage, vocoder parameters are generated from the trained acoustic models, and a speech waveform is synthesized using high-quality vocoder systems [31].

3. Data Requirements

Data preparation is a significant step in any DL area, and NLP is no exception. As DL techniques amount to remembering and generalizing the training dataset, it is hard to overestimate the impact of a good or bad quality dataset. As in any other DL task, data is presented in the form of features and the corresponding labels.

3.1. Natural Language Processing Data Requirements

The requirements for input data are the same as for other tasks in the DL area: the input data should be as close as possible to the domain the model will work in. In simple words, you cannot train a model for predicting medical anamneses using a financial dataset. If the model did not see examples of a token/word in the dataset during training, it will simply not react to it. Over the past years, huge progress has been made to overcome such limitations, and embedding techniques with pre-trained token embeddings help a lot. Still, it is hard to overemphasize how much better the trained model will be if you use relevant training data. The NLP area has some specific requirements: data should be split into sentences, which is a big problem for ASR output. For specific NLP tasks such as punctuation restoration, the typical approach is to generate a synthetic dataset and label it in the way most suitable for the task [35]. Nowadays, there are many labeled and even more unlabeled datasets, which is a good starting point for most tasks, so software engineers do not need to collect and label datasets on their own.

3.2. Speech Data Requirements

Datasets for ASR consist of audio files (usually 3–20 s long) and the related text transcripts (labels). All digits should be denormalized into a text representation. The model has to learn the acoustics: "four" and "format" sound almost the same at the beginning, and if the training dataset contains the digit "4" next to the spelled-out word "format," it will be much harder for the model to generalize acoustically. Such a denormalization task can be very complicated. Consider the number "3," which may represent "three," "third," etc.; the task is even more complicated for non-English languages. The above gives some idea of how hard it is to prepare a good dataset for ASR, especially for low-resource languages.
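As a small illustration of such denormalization, the sketch below expands a digit token with the third-party num2words package (an assumption: the package is installed and supports the target language). Choosing between the cardinal and ordinal reading from the surrounding context is the hard part and is left to the caller here.

```python
from num2words import num2words  # third-party package, assumed to be installed

def expand_number(token: str, ordinal: bool = False, lang: str = "en") -> str:
    """Expand a digit token into words; the caller decides from context
    whether a cardinal or an ordinal reading is appropriate."""
    return num2words(int(token), to="ordinal" if ordinal else "cardinal", lang=lang)

print(expand_number("3"))                 # "three"
print(expand_number("3", ordinal=True))   # "third"
```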
There are a few directions to overcome such limitations, for example: (1) use non-labeled audio data (unsupervised learning), which raises new limitations on computational resources; (2) generate such a dataset iteratively, using a smaller dataset to generate a bigger one: this way we have generated a 2,500-hour dataset using just 100 hours as a starting point. As an example of a language to test such an automated ASR dataset generation pipeline, we took the low-resource Ukrainian language, which is very limited in terms of available ASR datasets (see Table 1).

Table 1
Datasets and sources for the Ukrainian speech corpus

Dataset name | Class | Duration, hours | Quality
Ukrainian Corpus for Broadcast Speech [37] | ASR dataset | 366 | —
Multi-speaker Corpus "UkReco" [39] | ASR dataset | — | —
M-AILABS Ukrainian Corpus [40] | Books | 87 | Excellent
Ministry of Education, Culture and Science Lessons [41] | YouTube | 29 | Good
Deutsche Welle in Ukrainian [42] | YouTube | 70 | Ordinary
Telebachennya Toronto [43] | YouTube | 60 | Ordinary
Mozilla Common Voice [44] | Mozilla | 22 | Excellent
TEDx Talks [45] | TEDx | <50 | Good

One of the main difficulties in data preparation for ASR is that many audios are stored in mono format, with mixed channels and hence mixed speakers. There are a few techniques to separate such audios using voice activity detection and speaker identification. ASR systems are quite demanding in terms of data; approximately the following amounts are needed to train a good model:
• Five thousand hours for a hybrid approach.
• Ten thousand hours for an end-to-end approach.
• Thirty thousand hours for an unsupervised learning approach.

4. Models and Frameworks

4.1. Natural Language Processing Models

There are a few main divisions of approaches used to solve NLP tasks. The main choice is either to use pre-trained models (BERT [46], RoBERTa [47]) or to train a model from scratch based on a Bidirectional RNN (BRNN), LSTM, or CNN architecture (see Fig. 5). Pre-trained models can be divided into encoder-based and decoder-based. We do not review non-transformer-based pre-trained models, since with distilled pre-trained transformer-based models you can achieve almost the same model efficiency as with a BRNN, but with significantly higher accuracy.

Figure 5: Model classification

If the task or domain is specific and you cannot exploit a pre-trained model, it is better to train a BRNN or LSTM model from scratch (see Fig. 6).

Figure 6: Model training process

4.2. Speech Processing Toolkits

A comparative analysis of conventional Speech Processing Toolkits (SPTs) and tools to build an ASR model is presented in Table 2. We compared them along a few dimensions; the most important ones are the amount of data needed to train or fine-tune a model and the learning curve required to start working with the toolkit.

Table 2
Comparative analysis of existing software

Product | Type | Training type | Data amount | Learning curve
Kaldi [56] | Hybrid SPT | Supervised | Medium-low | Hard
Julius [57] | Hybrid SPT | Supervised | Medium-low | Hard
DeepSpeech [58] | End-to-end SPT | Supervised | High-medium | Easy
ESPnet [59] | End-to-end SPT | Supervised | High-medium | Easy
FairSeq [60] | End-to-end SPT | Unsupervised | Medium-low | Normal

As can be seen from the comparative analysis, no single toolkit can handle all steps of ASR data preprocessing (including collection, splitting, labeling, and preparing the ASR-expected format) together with model training. That is why we believe more teams will contribute to creating frameworks and toolkits that lower the learning curve and the demand for training data.
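As a hedged illustration of how low the entry barrier has become, the sketch below transcribes a mono 16 kHz waveform with a publicly available pre-trained wav2vec 2.0 checkpoint through the HuggingFace transformers package. The package, torch, and the facebook/wav2vec2-base-960h checkpoint are assumptions on our part; this is an illustration rather than one of the toolkits compared in Table 2.

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Public English checkpoint; a fine-tuned model would be substituted
# for a domain- or language-specific system (e.g., Ukrainian).
NAME = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(NAME)
model = Wav2Vec2ForCTC.from_pretrained(NAME)

def transcribe(waveform, sampling_rate=16_000):
    """Greedy CTC transcription of a mono 16 kHz waveform (1-D float array)."""
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits      # (1, frames, vocab)
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]
```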
The results of speech recognition make it possible to automatically track illegal activity of information system users using keywords [002].

5. Evaluation Metrics

5.1. Natural Language Processing Metrics

Because the NLP area is quite broad and the number of NLP tasks is vast, there are no standard metrics for all tasks. We can group metrics by clusters of tasks and highlight the following:
• Machine translation models: the bilingual evaluation understudy (BLEU) is a performance metric for machine translation models. It evaluates how well a model translates from one language to another.
• Language understanding evaluation: the general language understanding evaluation (GLUE) is a benchmark based on different types of tasks rather than evaluating a single task. The three major categories of tasks are single-sentence tasks, similarity and paraphrase tasks, and inference tasks.

At the same time, there are many fine-tuning tasks, like part-of-speech tagging, named entity recognition, etc. In such tasks, the most common approach is to measure accuracy through precision

P = N_tp / (N_tp + N_fp), (3)

where N_tp is the number of true positives and N_fp is the number of false positives, and through recall

R = N_tp / (N_tp + N_fn), (4)

where N_fn is the number of false negatives. The F1 score is calculated from (3) and (4) [62]:

F1 = 2 · P · R / (P + R). (5)

5.2. Speech Processing Measurement Criteria

It is much easier to come up with a unified metric for ASR than for NLP tasks, as we only have to measure whether a word is recognized correctly or not. Hence, the Word Error Rate (WER) is the most common ASR accuracy metric: the lower the WER, the better the ASR system. WER is computed as

WER = (N_S + N_D + N_I) / (N_S + N_D + N_C), (6)

where N_S is the number of substitutions, N_D the number of deletions, N_I the number of insertions, and N_C the number of correctly recognized words [63]. It is worth mentioning that WER is very sensitive to the domain and acoustics. For example, a low (good) WER of 5% for a literature domain (a model trained on books) can turn into a 20–30% WER on call center calls [64].
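A minimal Python sketch of Eq. (6) is given below: the Levenshtein distance computed over words yields the minimal total of substitutions, deletions, and insertions, which is then divided by the number of words in the reference. Real evaluation pipelines usually also normalize casing, punctuation, and numbers before scoring.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over words, as in Eq. (6)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: minimum edits to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
```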
6. End-to-End Training Approaches

With the mentioned breakthroughs in the NLP and speech areas, more and more businesses see great opportunities for implementing NLT-based systems. That drives demand for NLP and speech engineers. Such demand cannot be satisfied by the existing supply, so the learning curve should be lowered. One of the significant steps towards increasing this supply is end-to-end systems, which are much simpler from the end user's (engineer's) perspective. We see a huge trend and opportunity in end-to-end training approaches: yes, they still require more data, are less efficient, etc., but we believe they are the future of NLP and speech systems.

Data labeling is the most laborious, long, and expensive process, so the trend of recent years is to use unlabeled data. It is much easier to collect data than to collect data labeled specifically for your domain or low-resource language. However, even unlabeled data have to comply with requirements. Today, unsupervised learning approaches for NLP have become standard, and with distillation and pruning on top of the base training, they become effective and practical. We are still looking forward to more effective unsupervised learning in the speech area that will not require 50,000 hours of speech and enormous resources to train a model, but something more practical. Teaching a model to generalize from unlabeled data demands a much more significant amount of data and is thus very demanding in computational resources (about 50,000 hours of unlabeled data to train a state-of-the-art ASR model).

That is why the third trend is pre-trained models. For NLP, it is common to train a model on a significant amount of data once and then fine-tune it for downstream tasks, for example, as mentioned before, using a pre-trained BERT model to restore punctuation after ASR for further NLP processing. It is hard to overstate how much this lowers the learning curve and the time an engineer needs to prepare a production-ready model: you do not need to collect vast datasets only to train, for example, a tokenizer. For speech, pre-training is common for hybrid models and still has to evolve for end-to-end approaches (especially for unsupervised learning), as it is clear that a researcher or software engineer will not be able to spend hundreds of thousands of USD to train a model on unsupervised data.

7. Conclusions

As humans learn many languages to understand other humans (even within one country), we expect to see more advanced multilingual models that can accurately understand multilingual dialogues. We believe all these trends are possible because great specialized frameworks for NLP and speech have appeared, to name a few: HuggingFace [65] for NLP; Kaldi [56], ESPnet [59], and FairSeq [61] for speech-to-text recognition; and Tacotron [31] for TTS synthesis. We expect to see further advances in this area of frameworks and toolkits, incorporating the latest achievements and providing interfaces that make them accessible to more and more engineers.

8. References

[1] S. Gnatyuk, et al., Method of Cybersecurity Level Determining for the Critical Information Infrastructure of the State, in: 2nd International Workshop on Control, Optimisation and Analytical Processing of Social Networks (2020) 332–341.
[2] O. Iosifova, et al., Techniques Comparison for Natural Language Processing, in: Proceedings of the Modern Machine Learning Technologies and Data Science Workshop 2631 (2020) 57–67.
[3] J. R. Firth, A Synopsis of Linguistic Theory, 1930–1955 (1957).
[4] M. Peters, et al., Deep Contextualized Word Representations, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, Louisiana 1 (2018) 2227–2237.
[5] Q. Liu, M. J. Kusner, P. Blunsom, A Survey on Contextual Embeddings (2020) 1–13. arXiv:2003.07278.
[6] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press (2016) 462–480.
[7] D. Bahdanau, K. Cho, Y. Bengio, Neural Machine Translation by Jointly Learning to Align and Translate (2016) 1–15. arXiv:1409.0473.
[8] K. Cho, et al., Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar (2014) 1724–1734.
[9] G. Hinton, et al., Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups, IEEE Signal Process. Mag. 29 (2012) 82–97. doi:10.1109/MSP.2012.2205597.
[10] L. R. Rabiner, B. H. Juang, Fundamentals of Speech Recognition, PTR Prentice Hall (1993) 342–368.
[11] L. E. Baum, T. Petrie, Statistical Inference for Probabilistic Functions of Finite State Markov Chains, Ann. Math. Stat. 37 (1966) 1554–1563. doi:10.1214/aoms/1177699147.
[12] Y. Wang, et al., Tacotron: Towards End-to-End Speech Synthesis (2017) 1–10. arXiv:1703.10135.
[13] I. Iosifov, O. Iosifova, V. Sokolov, Sentence Segmentation from Unformatted Text Using Language Modeling and Sequence Labeling Approaches, in: Proceedings of the 2020 IEEE International Scientific and Practical Conference Problems of Infocommunications. Science and Technology, Kharkiv, Ukraine (2020) 335–337. doi:10.1109/PICST51311.2020.9468084.
[14] O. Romanovskyi, et al., Automated Pipeline for Training Dataset Creation from Unlabeled Audios for Automatic Speech Recognition, Advances in Computer Science for Engineering and Education IV 83 (2021) 25–36. doi:10.1007/978-3-030-80472-5_3.
[15] T. Lyudovyk, V. Pylypenko, Code-Switching Speech Recognition for Closely Related Languages, in: Proceedings of the Workshop on Spoken Language Technologies for Under-Resourced Languages (2014) 1–6.
[16] N. B. Vasileva, et al., Corpus of Ukrainian On-Air Speech, Speech Technol. 2 (2012) 12–21.
[17] J. Meyer, JRMeyer/Open-Speech-Corpora, 2021. URL: https://github.com/JRMeyer/open-speech-corpora.
[18] MON Ukraine—YouTube, 2021. URL: https://www.youtube.com/channel/UCQR9sMWcZshAwYX-EYH0qiA.
[19] Deutsche Welle in Ukrainian—YouTube, 2021. URL: https://www.youtube.com/channel/UCQwVj4PyS5leCgEJY4I2t1Q.
[20] Toronto TV—YouTube, 2021. URL: https://www.youtube.com/channel/UCF_ZiWz2Vcq1o5u5i1TT3Kw.
[21] Common Voice by Mozilla, 2021. URL: https://commonvoice.mozilla.org/.
[22] TED Talks, 2021. URL: https://www.ted.com/talks.
[23] J. Devlin, et al., BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding (2019) 1–16. arXiv:1810.04805.
[24] Y. Liu, et al., RoBERTa: A Robustly Optimized BERT Pretraining Approach (2019) 1–13. arXiv:1907.11692.
[25] D. Povey, et al., The Kaldi Speech Recognition Toolkit, in: Proceedings of the ASRU (2011) 1–4.
[26] A. Lee, T. Kawahara, K. Shikano, Julius—An Open Source Real-Time Large Vocabulary Recognition Engine, in: Proceedings of Eurospeech (2001) 1–4.
[27] A. Hannun, et al., Deep Speech: Scaling up End-to-End Speech Recognition (2014) 1–12. arXiv:1412.5567.
[28] S. Watanabe, et al., ESPnet: End-to-End Speech Processing Toolkit, in: Proceedings of Interspeech (2018) 2207–2211.
[29] S. Schneider, et al., Wav2vec: Unsupervised Pre-Training for Speech Recognition, in: Proceedings of Interspeech (2019) 3465–3469.
[30] S. Gnatyuk, et al., Modern Method and Software Tool for Guaranteed Data Deletion in Advanced Big Data Systems, Advances in Intelligent Systems and Computing (2019) 581–590. doi:10.1007/978-3-030-12082-5_53.
[31] L. Derczynski, Complementarity, F-Score, and NLP Evaluation, in: Proceedings of the 10th International Conference on Language Resources and Evaluation (2016) 261–266.
[32] D. Klakow, J. Peters, Testing the Correlation of Word Error Rate and Perplexity, Speech Commun. 38 (2002) 19–28.
[33] O. Iosifova, et al., Analysis of Automatic Speech Recognition Methods, in: Proceedings of the Workshop on Cybersecurity Providing in Information and Telecommunication Systems 2923 (2021) 252–257.
[34] T. Wolf, et al., HuggingFace's Transformers: State-of-the-Art Natural Language Processing (2020) 1–8. arXiv:1910.03771.
[35] A. Baevski, et al., Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (2020) 1–19. arXiv:2006.11477.