<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>iASSIST: Low-cost, portable and embedded assistants for on-premise automated transcription and translation services</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aitor Álvarez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Víctor Ruiz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iván G. Torre</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thierry Etchegoyhen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harritxu Gete</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joaquín Arellano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fundación Vicomtech, Basque Research and Technology Alliance (BRTA)</institution>
          ,
          <addr-line>Donostia-San Sebastián, 20009</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <fpage>75</fpage>
      <lpage>78</lpage>
      <abstract>
        <p>We present iASSIST, a low-cost, portable and embedded solution for on-premise automated neural transcription and translation services, currently for the English, Spanish and Basque languages. The system is fully operational, embedded in Jetson boards, and accessible via a user-friendly interface to perform real-time transcription and translation with high-quality neural models.</p>
        <p>Recent advances in deep neural networks (DNNs) have led to significant improvements in both Automatic Speech Recognition (ASR) and Neural Machine Translation (NMT) [1, 2]. However, these advances are mainly achieved with large neural architectures, trained on massive volumes of data and typically deployed on high-end expensive servers in the cloud to provide efficient services, which raises a number of critical issues.</p>
        <p>First, privacy is an important concern, since sending personal or confidential data over the Internet makes the information vulnerable to attacks and breaches. The General Data Protection Regulation (GDPR) and similar policies set to protect sensitive data also need to be taken into account.</p>
        <p>Secondly, high-quality AI models typically require servers with significant computational capacity and GPU acceleration cards for both training and inference. Acquiring this type of hardware resources for local computing, or renting appropriate infrastructure in the cloud, can represent a significant budget that many companies cannot cover.</p>
        <p>Thirdly, deep AI models are significantly impacting energy consumption worldwide, with serious consequences on the increasing climate crisis. Reducing the ecological footprint of current AI technology is a critical part of the current research agenda.</p>
        <p>Finally, latency issues and information loss can impact cloud computing services, making it difficult at times to deploy responsive and robust AI solutions.</p>
        <p>Edge computing aims to move computational power and data processing closer to the originating data [3], with AI algorithms running on local networks or embedded devices to guarantee data privacy and reduce latency, energy consumption and network load. However, integrating high-performance AI models into embedded systems with low computational capabilities requires system and model optimization.</p>
        <p>Within this context, we present iASSIST, a low-cost, portable and embedded solution for on-premise automated neural transcription and translation services for the English, Spanish and Basque languages. This solution has been developed within the applied research project iASSIST, partially supported by the Department of Economic Development of the Basque Government. The project started in September 2019 and finalised in December 2021, and</p>
      </abstract>
      <kwd-group>
        <kwd>edge computing</kwd>
        <kwd>embedded AI</kwd>
        <kwd>neural transcription</kwd>
        <kwd>neural translation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>was carried out by the following consortium: SPC<xref ref-type="fn" rid="fn1">1</xref> (project coordinator), MondragonLingua<xref ref-type="fn" rid="fn2">2</xref>, Serikat<xref ref-type="fn" rid="fn3">3</xref>, Natural Vox<xref ref-type="fn" rid="fn4">4</xref>, Haresi<xref ref-type="fn" rid="fn5">5</xref> and Vicomtech<xref ref-type="fn" rid="fn6">6</xref>.</p>
      <fn-group>
        <fn id="fn1"><label>1</label><p>https://www.spc.es/</p></fn>
        <fn id="fn2"><label>2</label><p>https://www.mondragonlingua.com/en</p></fn>
        <fn id="fn3"><label>3</label><p>https://www.serikat.es/</p></fn>
        <fn id="fn4"><label>4</label><p>https://www.naturalvox.eu/en/home/</p></fn>
        <fn id="fn5"><label>5</label><p>https://haresi.es/</p></fn>
        <fn id="fn6"><label>6</label><p>https://www.vicomtech.org/en</p></fn>
      </fn-group>
    </sec>
    <sec id="sec-1-1">
      <title>2. iASSIST</title>
      <p>The core architecture of iASSIST is shown in Figure 1. It consists of the following main components:</p>
      <list list-type="bullet">
        <list-item><p>A front-end, composed of a web-based graphical user interface (GUI).</p></list-item>
        <list-item><p>A REST API, which exposes the functionalities of the back-end.</p></list-item>
        <list-item><p>A back-end, which orchestrates all the functionalities of the solution, including automatic transcription and translation, client request management, model loading and unloading, and operational modes (batch and streaming).</p></list-item>
      </list>
      <p>Among the different options for embedded systems offered by the market (e.g. Raspberry Pi, NVIDIA Jetson, Google Coral or Intel Movidius, among others), we selected the NVIDIA Jetson embedded computing boards for the project. Specifically, we focused on two devices with different capabilities: Jetson TX2 and Jetson AGX Xavier. Although these two boards were similarly priced at the time, the AGX Xavier (32 TOPS, 512-core GPU, 8-core CPU, 32 GB of shared memory) offered significantly more computational power than the TX2 system (1.3 TOPS, 256-core GPU, dual-core CPU, 8 GB of shared memory), while also being more energy efficient. During the project, we explored the capacities of both boards and evaluated the integration of different AI models depending on their architecture, size, number of parameters and performance in each embedded system.</p>
      <p>In the following subsections, each of the main components of the iASSIST solution is presented in more detail.</p>
      <sec id="sec-1-1-1">
        <title>2.1. Front-end</title>
        <p>The iASSIST GUI aims to facilitate the communication between the user and the back-end. It was designed from a usability and user experience perspective, prioritizing simplicity. The GUI provides users with different input options, from text to audio file (batch mode) and audio source (streaming mode), and allows them to select different transcription and translation models to perform the corresponding tasks. Additionally, it integrates two main text boxes to present the transcription and translation results and a graphical interface to manage model loading and unloading in memory. It is worth noting that the transcription results can be downloaded in different formats (txt, rtf, xml, srt, vtt) that can be used for different applications such as subtitling, keyword spotting and rich transcription. The GUI was developed using the Angular framework<xref ref-type="fn" rid="fn7">7</xref> and deployed via an Nginx web server<xref ref-type="fn" rid="fn8">8</xref>.</p>
        <fn-group>
          <fn id="fn7"><label>7</label><p>https://angular.io/</p></fn>
          <fn id="fn8"><label>8</label><p>https://www.nginx.com/</p></fn>
        </fn-group>
      </sec>
      <sec id="sec-1-1-2">
        <title>2.2. REST API</title>
        <p>The REST API serves as the main interface between the GUI and the back-end. In addition, it provides an alternative way for the user to directly access all the features of the solution via HTTP requests, allowing third-party systems to be built on top of iASSIST and thus extend its functionality.</p>
      </sec>
      <sec id="sec-1-2">
        <title>2.3. Back-end</title>
        <p>The iASSIST back-end is composed of several modules which encompass the features of the solution. The main modules are described in turn in the next subsections.</p>
        <sec id="sec-1-2-1">
          <title>2.3.1. Orchestrator</title>
          <p>This module encompasses the automated configuration, management, and coordination of the main components and services of the back-end. At its core, it manages user requests, communication between modules and I/O interaction. The module also implements the logic and interfaces for the batch and streaming applications, manages automatic language identification for translation with bilingual ASR models, and controls the input sources, devices and audio streams. The iASSIST solution is able to process audio files, texts or streaming audio coming from any microphone connected to the board or machine where the GUI is launched.</p>
        </sec>
        <sec id="sec-1-2-2">
          <title>2.3.2. Model management</title>
          <p>Running applications composed of several AI models on embedded systems requires dynamically controlling model activation and memory usage, given common limitations of the supporting boards. The model management module ensures proper model loading and unloading in memory, allowing users to enable or disable the relevant functionality depending on the AI task at hand.</p>
        </sec>
        <sec id="sec-1-2-3">
          <title>2.3.3. Automatic transcription</title>
          <p>The Automatic Transcription module is managed by the Triton Inference Server<xref ref-type="fn" rid="fn9">9</xref>, which is in charge of handling workloads and integrating the three main modules of the transcription pipeline. The first module processes the raw audio input by extracting features as spectrogram chunks, which are sent to an acoustic model for probabilistic classification in a second stage. The final module, composed of the decoder, determines the most likely transcription for that audio using the likelihoods produced by the previous classification with the help of a language model.</p>
          <p>For iASSIST, we developed acoustic models based on NVIDIA's Quartznet E2E architecture [4], motivated by the need to reduce the size and complexity of the recognition models, making them lighter, faster and more easily deployed on embedded systems. This architecture is composed of multiple blocks with residual connections in between. Each unique block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalisation, and ReLU layers.</p>
          <p>For each of the selected Jetson embedded systems, we experimented with different versions of the Quartznet architecture. After evaluating their performance in terms of latency and quality, we decided to deploy the Quartznet Q15×5 based model on the Jetson AGX Xavier, and the Q10×5 based model on the Jetson TX2 board. The main difference lies in the number of times the Quartznet models repeat the five unique blocks, which modifies the total number of parameters from 18.9M (Q15×5) to 12.8M (Q10×5). To further optimize the performance of the Quartznet acoustic models, quantization and layer fusion techniques were also applied via the TensorRT library [5].</p>
          <p>Finally, the raw transcriptions are enriched with capitalisation and punctuation marks generated by the BERT-based AutoPunct engine [6]. In addition to enhancing readability, splitting the raw text into correctly punctuated sentences increases the quality of machine translation results.</p>
          <fn-group>
            <fn id="fn9"><label>9</label><p>https://developer.nvidia.com/nvidia-triton-inference-server</p></fn>
          </fn-group>
        </sec>
      </sec>
    </sec>
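    <sec>
      <p>To illustrate why the 1D time-channel separable convolutions mentioned in Section 2.3.3 help reduce model size, the following minimal sketch compares the number of weights in a standard 1D convolution with that of a separable one (a depthwise convolution over time followed by a pointwise 1x1 convolution over channels). It is not taken from the iASSIST code base: the function names are hypothetical, biases are ignored, and the kernel size (33) and channel width (256) are illustrative assumptions rather than values reported in this paper.</p>
      <preformat preformat-type="code">
```python
def conv1d_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights of a standard 1D convolution (bias omitted)."""
    return kernel_size * in_channels * out_channels


def separable_conv1d_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights of a time-channel separable 1D convolution (bias omitted).

    Depthwise step: one kernel of length kernel_size per input channel.
    Pointwise step: a 1x1 convolution that mixes the channels.
    """
    depthwise = kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Illustrative values only (assumption, not reported in this paper).
    kernel_size, channels = 33, 256
    standard = conv1d_params(kernel_size, channels, channels)
    separable = separable_conv1d_params(kernel_size, channels, channels)
    print("standard:", standard)
    print("separable:", separable)
    print("reduction factor:", round(standard / separable, 1))
```
      </preformat>
      <p>For these illustrative values, the standard layer has 2,162,688 weights and the separable layer 73,984, roughly 29 times fewer. Savings of this kind are what make such architectures attractive for boards with limited shared memory.</p>
    </sec>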
    <sec id="sec-2-4">
      <title>2.4. Machine Translation</title>
    </sec>
    <sec id="sec-2">
      <title>Acknowledgments</title>
      <p>iASSIST is partially funded by the Basque Business Development Agency, SPRI, under grant agreement ZL-2021/00103.</p>
    </sec>
  </body>
  <back>
  </back>
</article>