
iASSIST: Low-cost, portable and embedded assistants for on-premise automated transcription and translation services

Aitor Álvarez

Víctor Ruiz

Iván G. Torre

Thierry Etchegoyhen

Harritxu Gete

Joaquín Arellano

Fundación Vicomtech, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, 20009, Spain


We present iASSIST, a low-cost, portable and embedded solution for on-premise automated neural transcription and translation services, currently for the English, Spanish and Basque languages. The system is fully operational, embedded in Jetson boards, and accessible via a user-friendly interface to perform real-time transcription and translation with high-quality neural models.

Keywords: edge computing, embedded AI, neural transcription, neural translation

1. Introduction

Recent advances in deep neural networks (DNNs) have led to significant improvements in both Automatic Speech Recognition (ASR) and Neural Machine Translation (NMT) [1, 2]. However, these advances are mainly achieved with large neural architectures, trained on massive volumes of data and typically deployed on expensive high-end servers in the cloud to provide efficient services, which raises a number of critical issues.

First, privacy is an important concern, since sending personal or confidential data over the Internet makes the information vulnerable to attacks and breaches. The General Data Protection Regulation (GDPR) and similar policies set to protect sensitive data also need to be taken into account.

Secondly, high-quality AI models typically require servers with significant computational capacity and GPU acceleration cards for both training and inference. Acquiring this type of hardware for local computing, or renting appropriate infrastructure in the cloud, can represent a significant cost that many companies cannot cover.

Thirdly, deep AI models are significantly impacting energy consumption worldwide, with serious consequences for the growing climate crisis. Reducing the ecological footprint of current AI technology is a critical part of the current research agenda.

Finally, latency issues and information loss can impact cloud computing services, making it difficult at times to deploy responsive and robust AI solutions.

Edge computing aims to move computational power and data processing closer to the originating data [3], with AI algorithms running on local networks or embedded devices to guarantee data privacy and reduce latency, energy consumption and network load. However, integrating high-performance AI models into embedded systems with low computational capabilities requires system and model optimization.

Within this context, we present iASSIST, a low-cost, portable and embedded solution for on-premise automated neural transcription and translation services for the English, Spanish and Basque languages. This solution has been developed within the applied research project iASSIST, partially supported by the Department of Economic Development of the Basque Government. The project started in September 2019 and ended in December 2021, and was carried out by the following consortium: SPC¹ (project coordinator), MondragonLingua², Serikat³, Natural Vox⁴, Haresi⁵ and Vicomtech⁶.

¹ https://www.spc.es/
² https://www.mondragonlingua.com/en
³ https://www.serikat.es/
⁴ https://www.naturalvox.eu/en/home/
⁵ https://haresi.es/
⁶ https://www.vicomtech.org/en

2. iASSIST

The core architecture of iASSIST is shown in Figure 1. It consists of the following main components:

• A front-end, composed of a web-based graphical user interface (GUI).

• A REST API, which exposes the functionalities of the back-end.

• A back-end, which orchestrates all the functionalities of the solution, including automatic transcription and translation, client request management, model loading and unloading, and operational modes (batch and streaming).

Among the different options for embedded systems offered by the market (e.g. Raspberry Pi, NVIDIA Jetson, Google Coral or Intel Movidius), we selected the NVIDIA Jetson embedded computing boards for the project. Specifically, we focused on two devices with different capabilities: Jetson TX2 and Jetson AGX Xavier. Although these two boards were similarly priced at the time, the AGX Xavier (32 TOPS, 512-core GPU, 8-core CPU, 32 GB of shared memory) offered significantly more computational power than the TX2 system (1.3 TOPS, 256-core GPU, dual-core CPU, 8 GB of shared memory), while also being more energy efficient. During the project, we explored the capacities of both boards and evaluated the integration of different AI models depending on their architecture, size, number of parameters and performance in each embedded system.

In the following subsections, each of the main components of the iASSIST solution is presented in more detail.

2.1. Front-end

The iASSIST GUI aims to facilitate the communication between the user and the back-end. It was designed from a usability and user experience perspective, prioritizing simplicity. The GUI provides users with different input options, from text to audio file (batch mode) and audio source (streaming mode), and allows them to select different transcription and translation models to perform the corresponding tasks. Additionally, it integrates two main text boxes to present the transcription and translation results and a graphical interface to manage model loading and unloading in memory. It is worth noting that the transcription results can be downloaded in different formats (txt, rtf, xml, srt, vtt) that can be used for different applications such as subtitling, keyword spotting and rich transcription. The GUI was developed using the Angular framework⁷ and deployed via an Nginx web server⁸.

⁷ https://angular.io/
⁸ https://www.nginx.com/

2.2. REST API

The REST API serves as the main interface between the GUI and the back-end. In addition, it provides an alternative way for the user to directly access all the features of the solution via HTTP requests, allowing third-party systems to be built on top of iASSIST and thus extend its functionality.

2.3. Back-end

The iASSIST back-end is composed of several modules which encompass the features of the solution. The main modules are described in turn in the next subsections.

2.3.1. Orchestrator

This module encompasses the automated configuration, management, and coordination of the main components and services of the back-end. At its core, it manages user requests, communication between modules and I/O interaction. The module also implements the logic and interfaces for the batch and streaming applications, manages automatic language identification for translation with bilingual ASR models, and controls the input sources, devices and audio streams. The iASSIST solution is able to process audio files, texts or streaming audio coming from any microphone connected to the board or machine where the GUI is launched.

2.3.2. Model management

Running applications composed of several AI models on embedded systems requires dynamically controlling model activation and memory usage, given the common limitations of the supporting boards. The model management module ensures proper model loading and unloading in memory, allowing users to enable or disable the relevant functionality depending on the AI task at hand.

2.3.3. Automatic transcription

The Automatic Transcription module is managed by the Triton Inference Server⁹, which is in charge of handling workloads and integrating the three main modules of the transcription pipeline. The first module processes the raw audio input by extracting features as spectrogram chunks, which are sent to an acoustic model for probabilistic classification in a second stage. The final module, composed of the decoder, determines the most likely transcription for that audio using the likelihoods produced by the previous classification with the help of a language model.

⁹ https://developer.nvidia.com/nvidia-triton-inference-server

For iASSIST, we developed acoustic models based on NVIDIA's Quartznet E2E architecture [4], which was designed to reduce the size and complexity of recognition models, making them lighter, faster and more easily deployed on embedded systems. This architecture is composed of multiple blocks with residual connections in between. Each unique block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalisation, and ReLU layers.

For each of the selected Jetson embedded systems, we experimented with different versions of the Quartznet architecture. After evaluating their performance in terms of latency and quality, we decided to deploy the Quartznet Q15×5 based model on the Jetson AGX Xavier, and the Q10×5 based model on the Jetson TX2 board. The main difference lies in the number of times the Quartznet models repeat the five unique blocks, which modifies the total number of parameters from 18.9M (Q15×5) to 12.8M (Q10×5). To further optimize the performance of the Quartznet acoustic models, quantization and layer fusion techniques were also applied via the TensorRT library [5].

Finally, the raw transcriptions are enriched with capitalisation and punctuation marks generated by the BERT-based AutoPunct engine [6]. In addition to enhancing readability, splitting the raw text into correctly punctuated sentences increases the quality of machine translation results.
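To illustrate why the 1D time-channel separable convolutions used in the Quartznet blocks help keep the acoustic models small, the following minimal Python sketch compares the weight count of a standard 1D convolution with that of its separable counterpart (a depthwise convolution over time followed by a pointwise 1×1 convolution across channels). The kernel size and channel width are illustrative assumptions, not values reported for the iASSIST models, and the function names are hypothetical. Bias terms and batch normalisation parameters are ignored.

```python
def standard_conv1d_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights of a regular 1D convolution (bias ignored)."""
    return kernel_size * in_channels * out_channels


def separable_conv1d_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights of a time-channel separable 1D convolution (bias ignored)."""
    depthwise = kernel_size * in_channels    # one temporal filter per input channel
    pointwise = in_channels * out_channels   # 1x1 convolution mixing the channels
    return depthwise + pointwise


# Illustrative layer size (assumed for this sketch, not taken from the iASSIST models).
KERNEL_SIZE, CHANNELS = 33, 256

standard = standard_conv1d_params(KERNEL_SIZE, CHANNELS, CHANNELS)
separable = separable_conv1d_params(KERNEL_SIZE, CHANNELS, CHANNELS)

print(f"standard:  {standard:,} weights")
print(f"separable: {separable:,} weights")
print(f"reduction: {standard / separable:.1f}x")
```

Under these assumed sizes, the separable layer needs roughly 29 times fewer weights than the standard one. This kind of saving is what makes it feasible to stack many such blocks on memory-constrained boards.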

Acknowledgments

iASSIST is partially funded by the Basque Business Development Agency, SPRI, under grant agreement ZL-2021/00103.

2.4. Machine Translation