A Novel User-Friendly Pipeline for Enhanced Natural Language Understanding in Human-Robot Interaction

Dorin Clisu1,†, Iulia Farcas1,∗,†, Andrei Rusu1,∗,† and Mihai Hulea2,∗,†
1 NTT DATA Romania
2 Technical University of Cluj-Napoca, Romania

Abstract
This paper presents an innovative Natural Language Understanding (NLU) pipeline for human-robot interaction (HRI), optimized for on-premises deployment in industrial settings. The proposed system integrates an end-to-end Automated Speech Recognition (ASR) system, a transformer-based model for intent and entity recognition, and a dynamic dialogue management system. These components operate on commodity hardware, ensuring real-time responsiveness without cloud dependency. The pipeline is uniquely extensible via an automated, offline training module that uses large language models such as ChatGPT to generate datasets, reducing the need for specialized machine learning expertise.

Keywords
speech-to-text, text-to-speech, transformers, human-machine collaboration

1. Introduction

Human-robot interaction is becoming increasingly critical in industrial settings where efficiency, accuracy, and adaptability are paramount [1]. As industries shift towards greater automation, the demand for more intuitive and natural communication methods between humans and robots grows. NLU serves as a vital tool in bridging this gap, allowing robots to interpret and respond to human language [2]. However, many existing NLU systems depend on cloud-based services [3], which can introduce unwanted latency and security risks, issues that are particularly problematic in industrial environments.

This paper introduces a specialized NLU pipeline designed for human-robot interaction within industrial settings. It integrates an end-to-end ASR system [4], a transformer-based model for intent and entity recognition, and a dynamic dialogue management system. These components are optimized to function on commodity hardware, forming a robust and scalable solution for real-time HRI.

RuleML+RR’24: Companion Proceedings of the 8th International Joint Conference on Rules and Reasoning, September 16–22, 2024, Bucharest, Romania
∗ Corresponding author.
† These authors contributed equally.
dorin.clisu@nttdata.com (D. Clisu); iulia.farcas@nttdata.com (I. Farcas); rusu.andrei@nttdata.com (A. Rusu); mihai.hulea@aut.utcluj.ro (M. Hulea)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Solution architecture

To develop a robust and efficient NLU pipeline for human-robot interaction, we have employed a state-of-the-art technology stack: Python with the deep learning libraries PyTorch and Transformers for working with BERT [5], OpenAI Whisper [6] for ASR, Mozilla TTS [7] for Text-to-Speech, Pydantic for data validation, and FastAPI and Streamlit for creating user-friendly interfaces. These components are used to create the training and inference pipelines and are detailed in Figure 1 below.

Figure 1 – High-level architecture of the proposed system.

2.1. Inference pipeline

The Inference Pipeline is designed for on-edge deployment, enabling real-time processing of vocal commands with a target maximum end-to-end delay of two seconds. Key components include:
- Audio I/O: captures and outputs audio, serving as the interface for human-robot communication;
- Speech to Text (ASR): converts spoken language into text, optimized for minimal latency;
- NLU: extracts intents and entities from the transcribed text to understand and execute commands;
- Dialog Manager: maintains conversational context and directs the flow of interactions;
- Text to Speech: converts text responses back into speech, allowing for seamless communication with operators;
- Robot Output: executes the understood commands, directly affecting robot actions. This component is out of scope for this paper.
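For illustration, a minimal sketch of how these components could be chained into a listening loop is given below. The component interfaces (next_utterance, transcribe, parse, respond, synthesize, play, execute) are hypothetical placeholders, not the actual APIs of our implementation.

# Minimal sketch of the inference loop; all component interfaces
# shown here are hypothetical placeholders, not the system's actual APIs.
def inference_loop(audio_io, asr, nlu, dialog_manager, tts, robot_output):
    """Continuously listen, understand, and respond (target < 2 s end-to-end)."""
    while True:
        clip = audio_io.next_utterance()          # VAD-segmented audio clip
        text = asr.transcribe(clip)               # Speech to Text
        intent, slots = nlu.parse(text)           # intent and entity extraction
        reply, command = dialog_manager.respond(intent, slots)
        if reply is not None:
            audio_io.play(tts.synthesize(reply))  # Text to Speech
        if command is not None:
            robot_output.execute(command)         # Robot Output (out of scope)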
2.2. Training pipeline

The Training Pipeline effectively trains the ML models, enables new voice commands, and is configurable to operate either via a cloud-based LLM API or via an open-source model such as Llama-3-8b on a powerful standalone computing station. This flexibility ensures an optimum balance between performance, reliability, and security. The main components are:
- Config Manager: provides a technical user interface for updating commands, and creates data models with validation and an LLM prompt according to the command structure;
- LLM: generates annotated data in the form of natural utterances expressing the commands, according to the prompt;
- NLU Training: trains the NLU model using the generated data. The hyperparameters are fixed during the research phase, so that users get a usable model without any intervention from an ML engineer.

2.3. Speech to Text

After studying the state of the art (as of 2024) in ASR models, we chose Whisper from OpenAI, as it has very good accuracy with a real-time factor > 1 on commodity hardware. Additionally, it comes in multiple model sizes, allowing an optimum compromise between accuracy and resource consumption depending on the application. The fundamental limitation of Whisper is that it can only transcribe a pre-recorded audio clip. Because a robot needs to listen for commands continuously, we tried several workarounds and found the Silero VAD (voice activity detection) model to be satisfactory: it segments the continuous audio stream into utterances that Whisper can then transcribe.

2.4. NLU

Transformer-based language models have revolutionized NLP in the last few years [8], from machine translation, sentiment analysis, and knowledge extraction to complex text generation. Despite being relatively old (2018), the BERT language model remains a solid workhorse for many NLP tasks because it has a low inference cost. BERT is used as a pre-trained text encoder that captures an abstract understanding of the language in its hidden-state vectors. Then, according to the task, one trains a relatively small neural network on top of the BERT encodings, requiring a similarly small amount of task-specific data. In our case of intent and slot recognition, we can use one sentence-level classifier head for the intent and a token-level classifier head for slot identification. For each slot that must match a pre-defined list of values, the extracted tokens can then be run through a zero-shot classifier leveraging vector similarity.
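A minimal sketch of such a joint intent and slot model, using the Transformers and PyTorch libraries from our stack, is shown below. The choice of bert-base-uncased and the label counts are illustrative assumptions, not our exact training configuration.

# Sketch of a joint intent/slot model on top of a BERT encoder.
# The checkpoint name and label counts are illustrative assumptions.
import torch
from torch import nn
from transformers import BertModel, BertTokenizerFast

class JointIntentSlotModel(nn.Module):
    def __init__(self, num_intents: int, num_slot_tags: int):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.encoder.config.hidden_size
        self.intent_head = nn.Linear(hidden, num_intents)  # sentence-level classifier
        self.slot_head = nn.Linear(hidden, num_slot_tags)  # token-level classifier

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        intent_logits = self.intent_head(out.pooler_output)     # [batch, num_intents]
        slot_logits = self.slot_head(out.last_hidden_state)     # [batch, seq_len, num_slot_tags]
        return intent_logits, slot_logits

# Illustrative usage with assumed label counts (10 intents, 21 slot tags).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = JointIntentSlotModel(num_intents=10, num_slot_tags=21)
batch = tokenizer(["move the arm to position B"], return_tensors="pt")
intent_logits, slot_logits = model(batch["input_ids"], batch["attention_mask"])

Both heads can be trained jointly with cross-entropy losses on the generated dataset, while extracted slot values are matched to their pre-defined options via embedding similarity at inference time.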
2.5. Text to Speech

After studying the state of the art (as of 2024) in TTS models, we found that cloud APIs (OpenAI, Google, AWS, etc.) offer more than sufficient speech quality, low latency, and a very low price for this application. Since most utterances can be generated ahead of time as part of the training flow, regular operation does not rely heavily on an internet connection. However, some cases, such as reading out the run-time slot options, require on-the-fly generation; we therefore investigated open-source models that can run locally, of which Mozilla TTS seems promising. If the latency of generating speech locally is higher than the cloud latency, then the cloud API will be used primarily, with an automatic fallback (when offline) to the local model.

2.6. Dialog Manager

To have a coherent system, a core module is needed to stitch together the AI functionalities of the individual modules. A rule-based implementation is developed, covering the following fixed scenarios with appropriate templates applicable to any command that is later added:
- Too much audio noise: when the recognized speech is unintelligible, ask the operator to repeat, or simply ignore it and keep listening;
- Unclear intent: when the speech is recognized but cannot be classified as one of the existing intents, ask the operator to revise and repeat (optionally informing them about the list of intents);
- Missing slot: when the intent is clear but one of the mandatory slots for this intent is missing, ask the operator to say the respective slot, or to repeat the entire utterance if there is more than one missing slot (to be able to separate which slot is which);
- Unclear slot option: when everything is clear except an invalid slot option (e.g., A, B, and C are the options, but the operator says D), ask the operator to revise and say the slot again (optionally informing them about the list of values).

The dialog manager uses a local database and memory variables to keep track of the conversation. It also logs all successes and failures (tied to the user input and context) so that the system can be analyzed, improved, and updated.
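As a minimal sketch, the four scenarios above can be expressed as a cascade of rules. The data structure (command_spec), the confidence threshold, and the response templates below are illustrative assumptions rather than our production rules.

# Illustrative rule cascade for the four fixed dialog scenarios.
# command_spec, noise_threshold, and the templates are assumed, not actual.
def handle_turn(asr_confidence, intent, slots, command_spec, noise_threshold=0.5):
    """Map one recognized utterance to a response template (None = execute command)."""
    # Too much audio noise: the recognized speech is unintelligible.
    if asr_confidence < noise_threshold:
        return "Sorry, I could not hear you clearly. Please repeat."
    # Unclear intent: speech recognized but no existing intent matched.
    if intent not in command_spec:
        return "I did not understand the command. Please rephrase."
    # Missing slot(s): ask for one slot, or a full repeat if several are missing.
    missing = [s for s in command_spec[intent]["required_slots"] if s not in slots]
    if len(missing) == 1:
        return f"Please tell me the {missing[0]}."
    if len(missing) > 1:
        return "Please repeat the entire command."
    # Unclear slot option: a provided value is outside the pre-defined list.
    for name, value in slots.items():
        allowed = command_spec[intent].get("slot_options", {}).get(name)
        if allowed and value not in allowed:
            return f"'{value}' is not a valid {name}. The options are: {', '.join(allowed)}."
    return None  # the command is complete and valid; forward it to the robot output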
3. Conclusions

In this paper, we have presented a novel Natural Language Understanding (NLU) pipeline designed for human-robot interaction in industrial settings. Future work will focus on further enhancing the system's capabilities, including improving the accuracy and latency of the ASR and TTS components, expanding the range of supported commands, and integrating more advanced dialogue management features. Additionally, we aim to conduct extensive field testing in various industrial settings to refine the system and ensure its reliability and effectiveness in real-world applications.

4. Acknowledgements

This work was supported by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101058589 (AI-PRISM).

References

[1] “Human-Robot Interaction Review: Challenges and Solutions for Modern Industrial Environments.” IEEE Xplore. Accessed: Jul. 15, 2024. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9493209
[2] R. Bamdale, S. Sahay, and V. Khandekar, “Natural Human Robot Interaction Using Artificial Intelligence: A Survey,” in 2019 9th Annual Information Technology, Electromechanical Engineering and Microelectronics Conference (IEMECON), Mar. 2019, pp. 297–302. doi: 10.1109/IEMECONX.2019.8877044.
[3] S. Pais, J. Cordeiro, and M. L. Jamil, “NLP-based platform as a service: a brief review,” J Big Data, vol. 9, no. 1, p. 54, Apr. 2022. doi: 10.1186/s40537-022-00603-5.
[4] D. Yu and L. Deng, Automatic Speech Recognition: A Deep Learning Approach, in Signals and Communication Technology. London: Springer, 2015. doi: 10.1007/978-1-4471-5779-3.
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv, May 24, 2019. doi: 10.48550/arXiv.1810.04805.
[6] “Introducing Whisper.” OpenAI. Accessed: Jul. 15, 2024. [Online]. Available: https://openai.com/index/whisper/
[7] “mozilla/TTS.” Mozilla. Accessed: Jul. 15, 2024. [Online]. Available: https://github.com/mozilla/TTS
[8] B. Min et al., “Recent Advances in Natural Language Processing via Large Pre-trained Language Models: A Survey,” ACM Comput. Surv., Sep. 2023. doi: 10.1145/3605943.