Prototyping Methodology of End-to-End Speech Analytics Software

Oleh Romanovskyi 1, Ievgen Iosifov 1,2, Olena Iosifova 1, Volodymyr Sokolov 2, Pavlo Skladannyi 2 and Igor Sukaylo 2

1 Ender Turing OÜ, ½ Padriku str., Tallinn, 11912, Estonia
2 Borys Grinchenko Kyiv University, 18/2 Bulvarno-Kudriavska str., Kyiv, 04053, Ukraine

Abstract
This paper presents the prototype of end-to-end speech recognition, storage, and postprocessing tasks to build speech analytics, real-time agent augmentation, and other speech-related products. Moving ASR models from the dev environment into production requires both research and architectural knowledge, which slows down and limits the possibility of companies benefiting from speech recognition and NLP advances for fundamental business operations. This paper proposes a fast and flexible prototype that can be easily implemented and used to serve ASR/NLP-trained models to solve business problems. Various software solutions' compatibility problems were solved during the experimental setup assembly, and a working prototype was built and tested. An architectural diagram of the solution is also shown. Performance, limitations, and challenges of implementation are also described.

Keywords
Natural Language Processing, NLP, Automatic Speech Recognition, ASR, speech analytics.

1. Introduction

The rise of speech technologies created a demand for the newest software architectures applicable to real business solutions at scale. According to Gartner, by 2025, 40% of all world call centers will use speech-to-text technology to handle incoming communication. Automotive giants like Porsche Group, Volkswagen AG, Mercedes-Benz Group, and others invest hundreds of millions of USD in creating new experiences involving voice communication between a driver and their car. Many more industries jump onto natural language voice communication between a human and a machine: gaming, medical devices, industrial machines, and others.

This shift in utilizing science in the real world created a demand for higher-level software frameworks with pre-built architectures rather than low-level ASR and TTS frameworks. The new level of frameworks should include parts that ensure easy integration into existing IT infrastructures, fault tolerance, data security, and others. We have built the prototype for a system that handles both recorded voice processing and real-time voice stream handling [1–3]. When it comes to business cases, an overwhelming number of building blocks is not enough: that may suffice for a proof of value, but the result should be an end-to-end, research-friendly solution that is flexible and reliable.

MoMLeT+DS 2022: 4th International Workshop on Modern Machine Learning Technologies and Data Science, November 25–26, 2022, Leiden-Lviv, The Netherlands-Ukraine
EMAIL: or@enderturing.com (O. Romanovskyi); ei@enderturing.com (I. Iosifov); oi@enderturing.com (O. Iosifova); v.sokolov@kubg.edu.ua (V. Sokolov); p.skladannyi@kubg.edu.ua (P. Skladannyi); i.sukailo.asp@kubg.edu.ua (I. Sukaylo)
ORCID: 0000-0003-3420-5621 (O. Romanovskyi); 0000-0001-6507-0761 (I. Iosifov); 0000-0001-6203-9945 (O. Iosifova); 0000-0002-9349-7946 (V. Sokolov); 0000-0002-7775-6039 (P. Skladannyi); 0000-0003-1608-3149 (I. Sukaylo)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
2. Related works

There are many ASR toolkits on the market, like NeMo: a toolkit for building AI applications using Neural Modules [4], ESPnet [5], and Kaldi [6]. The limitation of these frameworks is that they are very focused on ASR-related tasks and are only one building block for business applicability. NVIDIA put a lot more effort than others and developed the NVIDIA TAO Toolkit [7] and NVIDIA Triton [8], which also contain pre-/post-processing components for speech-related tasks. NVIDIA Triton Inference Server is an open-source inference serving software that simplifies inference serving for an organization by addressing the above complexities. Triton provides a single standardized inference platform that can support running inference on multi-framework models, on both CPU and GPU, and in different deployment environments such as datacenter, cloud, embedded devices, and virtualized environments. However, Triton requires an NVIDIA GPU and is proprietary (cannot be modified), although it offers nice features like pipelines and dynamic batching [9].

Still, there is a lack of high-quality prototype implementations to build software on. Hence, we decided to put effort into building a modular prototype that can be used to develop any business software. At the ASR level, we put our models trained with an automated pipeline [10], which can be trained and served with any of the toolkits [4–6]. As postprocessing, we added punctuation restoration [11] and emotion recognition [12].

3. Prototyping

The prototyping stage involves major uncertainties in both the technical implementation and the client's business requirements. At the same time, we need to evolve the prototype into a reliable and scalable solution that recognizes, stores, and post-processes speech, without rewriting it from scratch. Thus we decided to focus on three main areas:
- Loose coupling. We should be able to replace each functional block either without changes to the rest of the system, or with minimal changes. For example, using a different tool or a different service for speech recognition should not require backend or UI changes.
- Pipeline flexibility. During prototyping we don't know which processing steps the customer needs (like punctuation or emotion detection). So it should be easy to add or remove processing steps or to change the execution graph.
- Interoperability: the solution should be easy to integrate with existing enterprise software (like CRM, Customer Support systems, etc.).

3.1. Defining loosely-coupled components

We've defined the following requirements to achieve loose coupling:
- Data formats and protocols between components are independent of the data structures of the underlying frameworks.
- Asynchronous processing for non-interactive tasks.
- Being able to deploy and update components independently.

We also needed to select protocols and frameworks to receive and process audio for the prototyping. After evaluation, we decided that to satisfy the pipeline scalability requirements, we would implement the solution in a microservice architecture [13–15] so that new services can be added easily. We also decided to separate ASR and postprocessing tasks for real-time processing: most postprocessing is not needed in real time, and any additional time consumption is critical for real-time tasks [16].
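To make the loose-coupling requirements concrete, the sketch below shows one possible framework-independent message format for passing recognition results between components. It is only an illustration under our own assumptions: the class and field names are hypothetical and are not the prototype's actual schema. Because the contract is plain JSON, the ASR service can be swapped without changes to its consumers.

from dataclasses import dataclass, field, asdict
from typing import List
import json


@dataclass
class SegmentResult:
    """One recognized segment, independent of the ASR toolkit's own types."""
    start_ms: int
    end_ms: int
    speaker: str
    text: str


@dataclass
class RecognitionMessage:
    """Message exchanged between ASR, postprocessing, and storage services."""
    session_id: str
    language: str
    segments: List[SegmentResult] = field(default_factory=list)

    def to_json(self) -> str:
        # Plain JSON keeps the contract independent of any framework.
        return json.dumps(asdict(self))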
3.2. Protocol selection

One of the decisions for prototyping was protocol selection. The criteria for protocols were: easy to implement and use from different programming languages; generic, so that adapters for other protocols can later be implemented as separate services to accept the audio signal and provide results. We evaluated and compared real-time and non-real-time approaches and protocols to build prototypes in Table 1.

Table 1
Real-time and non-real-time comparison

Connection type   Complexity for prototyping   Applicable cases range   Computational requirements
Real-time         Hard                         Low                      High
Non-real-time     Medium                       High                     Medium

After evaluation, we decided to approach prototyping in two steps, starting from non-real-time because it is easier to prototype and has broad applicability, and then extending the prototype to Version 2 with support for real-time audio processing.

3.3. Non-real-time protocol selection

For non-real-time, we decided to implement the asynchronous method using a REST API. This is the most flexible approach to building complex integrations with enterprise systems like Customer Relationship Management, Business Intelligence, email and chat systems, etc. REpresentational State Transfer (REST) is a software architectural style that describes a uniform interface in a client-server architecture [17]. An Application Programming Interface (API) that complies with some or all of the six guiding constraints of REST is considered to be RESTful [18]. An API establishes a connection between programs so they can transfer data. A program with an API implies that some parts of its data are exposed for the client to use. The client could be the front end of the same program or an external program.

3.4. Real-time protocol selection

The decision was more complicated for real-time processing because various methods exist, like WebSocket, MQTT over WebSocket, gRPC, etc. We decided to go with the most widely used and simplest one, WebSocket, for the real-time prototype [16, 17]. WebSocket is a computer communications protocol providing full-duplex communication channels over a single TCP connection. The main advantage of WebSocket is its simplicity and prevalence [18]. Real-time alternatives are gRPC and MQTT.

3.5. Server-side frameworks selection

Since the main idea is to build a flexible, integration-ready prototype that is as natural for researchers as possible, we selected Python as the main language because of its popularity in the research community. After evaluating REST API Python frameworks (Django, Flask, FastAPI) [19], we decided to go with FastAPI running under Uvicorn, as it is one of the fastest Python frameworks available. FastAPI is a modern, fast (high-performance) web framework for building APIs [20]. Its API-first approach will make it easier to integrate future solutions with any enterprise system.
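As an illustration of how the selected protocol and framework could fit together, below is a minimal sketch of a real-time recognition endpoint using FastAPI's WebSocket support. This is not the prototype's code: the path and transcribe_chunk() are assumptions, the latter standing in for the actual ASR model call.

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()


def transcribe_chunk(chunk: bytes) -> str:
    # Placeholder standing in for the actual ASR model call.
    return ""


@app.websocket("/recognizer/stream")
async def recognize_stream(websocket: WebSocket) -> None:
    await websocket.accept()
    try:
        while True:
            chunk = await websocket.receive_bytes()   # next audio chunk from the client
            partial_text = transcribe_chunk(chunk)    # run recognition on the chunk
            await websocket.send_text(partial_text)   # return the partial transcript
    except WebSocketDisconnect:
        pass  # client closed the stream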
3.6. Speech recognition serving framework

The server-side framework decision not only affects how we serve speech recognition tasks but also imposes the limitation that speech recognition models should be trained using the same framework. Hence, the decision on the server-side speech recognition framework was the most important one during prototyping. We chose between two commonly used frameworks: Kaldi and NeMo.

3.6.1. NeMo framework

NVIDIA NeMo [4] is a conversational AI toolkit built for researchers working on automatic speech recognition (ASR) [2, 3, 21–23], natural language processing (NLP), and text-to-speech synthesis (TTS) [24], freely available under the Apache License v2.0. The main advantages:
● Comparatively easy model fine-tuning.
● No need for grapheme-to-phoneme models.
● Ready-to-go server-side framework to serve models.

3.6.2. Kaldi framework

Kaldi [6] is an open-source speech recognition toolkit written in C++ for speech recognition and signal processing, freely available under the Apache License v2.0. The main advantages:
● The Acoustic Model (AM) is not biased toward the Language Model (LM) task (in end-to-end frameworks, the AM absorbs some part of the LM).
● Less input audio is needed to train a model of the same accuracy.

After careful consideration, we have decided to build the prototype with the NeMo framework, as it already has a server side.

4. Implementation

After consideration, we split the prototype into two versions. The first version should be simple and fast, supporting only non-real-time audio processing, because it is more stable: tasks can be repeated, postponed, and use fewer resources. The second version of the prototype is dedicated to real-time processing, indexing, and a more comprehensive architecture to be more production-ready.

To get production-ready results, we need to guarantee stability and provide end-to-end results, which means that we need not only to recognize speech but also to separate speakers for mono audio, where both speakers are in one channel, and to apply punctuation and inverse text normalization to convert words into easily readable numbers and signs (e.g., @, #, etc.). Almost all postprocessing tasks should be done after speech recognition results are obtained. Hence, we implemented a pipeline (a sketch of chaining these steps is given after the list):
1. Speech recognition.
2. Diarization (mono-to-stereo).
3. Normalization (word-to-number conversion).
4. Punctuation.
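A minimal, self-contained sketch of how these steps can be chained is shown below. The step functions are toy placeholders for the actual models and services, and the data layout (a plain dict) is an assumption for illustration only.

from typing import Callable, Dict, List


def recognize_speech(result: Dict) -> Dict:
    result["text"] = "raw asr output"          # placeholder for the ASR model
    return result


def apply_diarization(result: Dict) -> Dict:
    result["speakers"] = ["spk0", "spk1"]      # placeholder speaker labels
    return result


def apply_normalization(result: Dict) -> Dict:
    result["text"] = result["text"].replace("twenty two", "22")  # toy example
    return result


def apply_punctuation(result: Dict) -> Dict:
    result["text"] = result["text"].capitalize() + "."           # toy example
    return result


def run_pipeline(audio_path: str, steps: List[Callable[[Dict], Dict]]) -> Dict:
    """Apply the processing steps one after another, in the listed order."""
    result: Dict = {"audio": audio_path}
    for step in steps:
        result = step(result)
    return result


if __name__ == "__main__":
    final = run_pipeline(
        "call_0001.wav",
        steps=[recognize_speech, apply_diarization,
               apply_normalization, apply_punctuation],
    )
    print(final)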
4.1. First version of prototype

For the first version of the prototype, we decided to go with minimal functionality to check end-to-end operation. We designed a straightforward algorithm that takes the file into processing and immediately returns the ID under which the future result will be available. Hence, after some time, the customer can get the results using the GET method. The main limitation of this algorithm is that users do not know when exactly the results will be ready and may need to check a few times and get just the 'IN PROGRESS' status. The algorithm for prototype Version 1 is shown in Fig. 1.

Figure 1: Prototype algorithm (Version 1)

To implement the proposed algorithm, we developed the most straightforward architecture, shown in Fig. 2.

Figure 2: Prototype structure (Version 1)

The main building blocks of the Version 1 architecture are:
1. ASR level: to accept an audio file/stream and return raw text results.
2. Post-processing level: to implement additional tasks based on the recognized text, e.g., punctuation, classification, diarization, etc.
3. Storage level: where a database stores all results to serve them back to the user on request.
4. API level: serving as the main gateway interface for customers to send audio files to and get results back. It is also a logic center that decides which steps to take and stores results in the database at the end of file processing.

4.2. Second version of prototype

After the evaluation of Version 1, which is described in detail in the results part, it was clear that we needed to implement queue logic for the Version 2 prototype: without overcoming the blocking of the back end/API, we cannot serve real-time requests. As a result, we developed the prototype Version 2 algorithm shown in Fig. 3.

Figure 3: Prototype algorithm (Version 2)

To implement the indicated algorithm, we designed the architecture shown in Fig. 4.

Figure 4: Prototype structure (Version 2)

The main building blocks of the Version 2 architecture are:
1. All features from Version 1.
2. Queue level: where we can put non-real-time tasks for future processing when the CPU is ready and be sure the task will wait until processing.
3. Indexing level: for search functionality.

A code example of creating a basic API endpoint to accept audio files or streams, and the resulting endpoint, can be found below:

# Standard-library and FastAPI imports; router, schemas, TMP_DATA_DIR,
# get_asr_by_language, recognize_file, and queue_pipeline are defined
# elsewhere in the project.
import tempfile
import uuid
from typing import Any

from fastapi import Depends, File, UploadFile


@router.post("/audio-file")
def upload_audio_file(
    *,
    file: UploadFile = File(...),
    params: schemas.RecognizerEndpointParams = Depends(endpoint_parameters),
) -> Any:
    session_id = uuid.uuid4()
    # Pick the ASR instances for the requested language.
    asr_instances = get_asr_by_language(params.language)
    # Store the received audio into a temporary file so it can be queued.
    with tempfile.NamedTemporaryFile(
        prefix="uploaded_", delete=False, dir=TMP_DATA_DIR
    ) as fp:
        fp.write(file.file.read())
        fp.seek(0)
        tempfile_name = fp.name
    if params.realtime:
        return recognize_file(
            filepath=tempfile_name,
            asr_instances=asr_instances,
        )
    else:
        return queue_pipeline(
            session_id=session_id,
            audio_location=tempfile_name,
            asr_instances=asr_instances,
        )


def endpoint_parameters(
    language: str = None,
    realtime: bool = False,
) -> schemas.RecognizerEndpointParams:
    return schemas.RecognizerEndpointParams(
        language=language,
        realtime=realtime,
    )

In the above code snippet, we declare the URI path to the server code at http(s)://{server_ip_address}/recognizer/audio-file with @router.post("/audio-file") (the /recognizer prefix comes from how the router is mounted), and declare the parameters to be accepted: the audio file and dictionary parameters to parse as metadata. During the processing of a request, we store the received audio stream into a temporary file with tempfile.NamedTemporaryFile. We do this to be able to work with the queueing mechanism: we cannot keep everything in memory and need to use storage. In the end, we examine the metadata parameters to decide whether we should return results in real time (return recognize_file()) or queue the task and just return an identifier (return queue_pipeline()), so the results can be obtained in the future via the GET method. The presented code will result in generating the API endpoint shown in Fig. 5.

Figure 5: API endpoint example for audio file uploading
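The complementary GET endpoint for fetching queued results is not shown here; the sketch below illustrates how it could look. All names (the path, get_session_result(), and the response fields) are hypothetical and only mirror the 'IN PROGRESS' polling flow described in Section 4.1.

from typing import Any, Optional

from fastapi import APIRouter, HTTPException

router = APIRouter()


def get_session_result(session_id: str) -> Optional[dict]:
    # Placeholder for the actual storage lookup (a database query in the prototype).
    return None


@router.get("/result/{session_id}")
def get_recognition_result(session_id: str) -> Any:
    result = get_session_result(session_id)
    if result is None:
        raise HTTPException(status_code=404, detail="Unknown session id")
    if not result.get("finished"):
        # Still waiting in the queue or being processed.
        return {"session_id": session_id, "status": "IN PROGRESS"}
    return {"session_id": session_id, "status": "DONE", "segments": result.get("segments")}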
5. Comparison results

5.1. Prototype performance

Using Version 1 of the prototype on a server with 4×A100 GPUs, we were able to process 8000 hours of audio in 24 hours, or 333 hours in 1 hour, which represents a 0.003 real-time factor and is far behind state-of-the-art models [25, 26]. We will not go deeply into the limitations of Version 1 (described below) because the main goal is to develop a prototype that aligns with the state-of-the-art models.

Using Version 2 of the prototype on a server with 4×A100 GPUs, we were able to process 45,000 hours of audio in 24 hours, or 1875 hours in 1 hour, which represents a 0.0006 real-time factor and is in line with the state-of-the-art models [27, 28]. The main contributors to such end-to-end performance are the ASR and web frameworks [4, 19].

The main performance limitation is ASR prediction. The performance of ASR models served on CPU was minimal: with a big model with a WER of 4–6%, it is almost one real-time factor per vCPU. Hence, it makes sense to use CPU-based deployments only for a limited number of concurrent sessions (approximately up to 100 on the largest AWS instance) [29–31].

5.2. Difficulties with the first version implementation

The main limitation was the API (back end) acting as a logic center that decided which steps to take next and waited for the results of the current task. Hence, the API/back end was blocked from accepting any new files while the current audio file was in the pipeline. And since each file was 5–10 minutes long, the API was blocked for approximately this period.

5.3. Difficulties with the second version implementation

With Version 2, the prototype became much more stable and predictable, but there were still a lot of limitations and difficulties during prototype testing.
1. Prioritization of real-time: the current prototype does not prioritize real-time tasks over queued ones, which means there is no confidence that real-time tasks will be accepted if the queue and recognizers have a lot to do.
2. Stability: to support both streaming and batch file recognition with one ASR server, the current prototype version performs recognition by streaming audio files from the worker directly to the ASR, which is not recommended for a production-ready solution and should be refined.
3. Scalability: there is no load-balancing option in the current prototype (neither for the API block nor for the ASR and post-processing blocks). This should be refined for the production readiness of the prototype.
4. Feedback and visibility: no feedback information is sent back to the user other than the ID of the future results. This can be tricky if services are busy, as users will not know when exactly the task will be finished and will not get any estimates or progress info. Hence, for production readiness, the prototype should be refined with added visibility and webhooks that report when results are ready and when any issue occurs.

6. Acknowledgements

The research team is grateful to Ender Turing OÜ for defining the business problem, comments, corrections, inspiration, and computational resources.

7. Conclusion and future works

The article presents the prototype of an end-to-end speech recognition framework with a storage database and result indexing. For this framework, protocols for real-time and non-real-time processing were selected so as to be ready to scale to a production-ready solution. Various software solutions' compatibility problems were solved during the experimental setup assembly, and a working prototype was built. An architectural diagram of the solution was also shown.

Testing showed that the prototype remains stable if the amount of processed audio is less than one hour of audio per vCPU. Hence, managing and load balancing connections and audio to be processed is a task to be solved in the following versions of the prototype, as well as prioritization of real-time processing over non-real-time tasks. Future research will focus on optimization issues, such as the scaling of speech recognizers, parallelization of the pipeline, fault tolerance, pipeline progress visibility, security, and the implementation of webhooks to report result readiness and make the system easier to deploy.

8. References

[1] I. Iosifov, et al., Natural Language Technology to Ensure the Safety of Speech Information, in: Proceedings of the Workshop on Cybersecurity Providing in Information and Telecommunication Systems 3187(1) (2022) 216–226.
[2] O. Iosifova, et al., Analysis of Automatic Speech Recognition Methods, in: Proceedings of the Workshop on Cybersecurity Providing in Information and Telecommunication Systems 2923 (2021) 252–257.
[3] O. Iosifova, et al., Techniques Comparison for Natural Language Processing, in: Proceedings of the Modern Machine Learning Technologies and Data Science Workshop 2631 (2020) 57–67.
[4] O. Kuchaiev, et al., NeMo: A Toolkit for Building AI Applications using Neural Modules, arXiv (2019) 36–44. doi:10.48550/arXiv.1909.09577.
[5] S. Watanabe, et al., ESPnet: End-to-End Speech Processing Toolkit, arXiv (2018) 1–5. doi:10.48550/arXiv.1804.00015.
[6] D. Povey, et al., The Kaldi Speech Recognition Toolkit, in: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (2011) 1–4.
[7] NVIDIA Developer, Endless Ways to Adapt and Supercharge Your AI Workflows with Transfer Learning (2022). URL: https://developer.nvidia.com/tao-toolkit-usecases-whitepaper/1-introduction
[8] S. Chandrasekaran and M. Salehi, Simplifying AI Inference in Production with NVIDIA Triton (Apr 12, 2021). URL: https://developer.nvidia.com/blog/simplifying-ai-inference-in-production-with-triton/
[9] NVIDIA Developer, NVIDIA Triton Inference Server (2022). URL: https://developer.nvidia.com/nvidia-triton-inference-server
[10] O. Romanovskyi, et al., Automated Pipeline for Training Dataset Creation from Unlabeled Audios for Automatic Speech Recognition, Advances in Computer Science for Engineering and Education IV 83 (2021) 25–36. doi:10.1007/978-3-030-80472-5_3.
[11] I. Iosifov, O. Iosifova, V. Sokolov, Sentence Segmentation from Unformatted Text Using Language Modeling and Sequence Labeling Approaches, in: Proceedings of the 2020 IEEE International Scientific and Practical Conference Problems of Infocommunications. Science and Technology, IEEE, Kharkiv, Ukraine (2020) 335–337. doi:10.1109/PICST51311.2020.9468084.
[12] I. Iosifov, et al., Transferability Evaluation of Speech Emotion Recognition Between Different Languages, Advances in Computer Science for Engineering and Education 134 (2022) 413–426. doi:10.1007/978-3-031-04812-8_35.
[13] J. Thönes, Microservices, IEEE Software 32(1) (2015) 113–115. doi:10.1109/MS.2015.11.
[14] N. Alshuqayran, N. Ali, and R. Evans, A Systematic Mapping Study in Microservice Architecture, in: 2016 IEEE 9th International Conference on Service-Oriented Computing and Applications (SOCA) (2016) 44–51. doi:10.1109/SOCA.2016.15.
[15] N. Dragoni, et al., Microservices: Yesterday, Today, and Tomorrow, Present and Ulterior Software Engineering (2017) 195–216. doi:10.1007/978-3-319-67425-4_12.
[16] H. Vural, M. Koyuncu, and S. Guney, A Systematic Literature Review on Microservices, Computational Science and Its Applications (ICCSA) (2017) 203–217. doi:10.1007/978-3-319-62407-5_14.
[17] A. Neumann, N. Laranjeiro, and J. Bernardino, An Analysis of Public REST Web Service APIs, IEEE Transactions on Services Computing 14(4) (2021) 957–970. doi:10.1109/tsc.2018.2847344.
[18] A. Ehsan, et al., RESTful API Testing Methodologies: Rationale, Challenges, and Solution Directions, Applied Sciences 12(9) (2022) 4369. doi:10.3390/app12094369.
[19] TechEmpower, Web Framework Benchmarks (2022). URL: https://www.techempower.com/benchmarks/#section=test&runid=7464e520-0dc2-473d-bd34-dbdfd7e85911&hw=ph&test=query&l=zijzen-7
[20] V. Pimentel and B. G. Nickerson, Communicating and Displaying Real-Time Data with WebSocket, IEEE Internet Computing 16(4) (2012) 45–53. doi:10.1109/MIC.2012.64.
[21] J. Li, et al., Jasper: An End-to-End Convolutional Neural Acoustic Model, arXiv (2019) 1–5. doi:10.48550/arXiv.1904.03288.
[22] S. Kriman, et al., QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions, arXiv (2019) 1–5. doi:10.48550/arXiv.1910.10261.
[23] O. Hrinchuk, M. Popova, and B. Ginsburg, Correction of Automatic Speech Recognition with Transformer Sequence-To-Sequence Model, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2020) 7074–7078. doi:10.1109/icassp40776.2020.9053051.
[24] S. Beliaev and B. Ginsburg, TalkNet: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis, Proc. Interspeech (2021) 3760–3764. doi:10.21437/Interspeech.2021-1770.
[25] A. Georgescu, et al., Performance vs. Hardware Requirements in State-of-the-Art Automatic Speech Recognition, J. Audio Speech Music 28 (2021). doi:10.1186/s13636-021-00217-4.
[26] A. Dutta, G. Ashishkumar, and Ch. V. R. Rao, Improving the Performance of ASR System by Building Acoustic Models using Spectro-Temporal and Phase-Based Features, Circuits, Systems, and Signal Processing 41(3) (2021) 1609–1632. doi:10.1007/s00034-021-01848-w.
[27] S. Gondi and V. Pratap, Performance and Efficiency Evaluation of ASR Inference on the Edge, Sustainability 13(22) (2021) 12392. doi:10.3390/su132212392.
[28] S. Li, et al., Improving Transformer-Based Speech Recognition Systems with Compressed Structure and Speech Attributes Augmentation, Interspeech (2019). doi:10.21437/interspeech.2019-2112.
[29] A. Rahmatulloh, I. Darmawan, and R. Gunawan, Performance Analysis of Data Transmission on Websocket for Real-Time Communication, in: 16th International Conference on Quality in Research (QIR) (2019) 1–5. doi:10.1109/QIR.2019.8898135.
[30] P. Murley, et al., Websocket Adoption and the Landscape of the Real-Time Web, in: World Wide Web Conference (WWW) (2021) 1192–1203. doi:10.1145/3442381.3450063.
[31] P. Bansal and A. Ouda, Study on Integration of FastAPI and Machine Learning for Continuous Authentication of Behavioral Biometrics, in: International Symposium on Networks, Computers and Communications (ISNCC) (2022) 1–6. doi:10.1109/ISNCC55209.2022.9851790.