<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Novel User-Friendly Pipeline for Enhanced Natural Language Understanding in Human-Robot Interaction</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Dorin</forename><surname>Clisu</surname></persName>
							<email>clisu@nttdata.com</email>
							<affiliation key="aff0">
								<orgName type="institution">NTT DATA Romania</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Iulia</forename><surname>Farcas</surname></persName>
							<email>iulia.farcas@nttdata.com</email>
							<affiliation key="aff0">
								<orgName type="institution">NTT DATA Romania</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Andrei</forename><surname>Rusu</surname></persName>
							<email>rusu.andrei@nttdata.com</email>
							<affiliation key="aff0">
								<orgName type="institution">NTT DATA Romania</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mihai</forename><surname>Hulea</surname></persName>
							<email>mihai.hulea@aut.utcluj.ro</email>
							<affiliation key="aff1">
								<orgName type="institution">Technical University of Cluj-Napoca</orgName>
								<address>
									<country key="RO">Romania</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<address>
									<settlement>Bucharest</settlement>
									<country key="RO">Romania</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Novel User-Friendly Pipeline for Enhanced Natural Language Understanding in Human-Robot Interaction</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">3159036253895BE4C6BD4F0EB11A3CE8</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:27+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>speech-to-text</term>
					<term>text-to-speech</term>
					<term>transformers</term>
					<term>human-machine collaboration</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper presents an innovative Natural Language Understanding (NLU) pipeline for human-robot interaction (HRI), optimized for on-premises deployment in industrial settings. The proposed system integrates an end-to-end Automated Speech Recognition (ASR) system, a transformer-based model for intent and entity recognition, and a dynamic dialogue management system. These components operate on commodity hardware, ensuring real-time responsiveness without cloud dependency. The pipeline is uniquely extensible via an automated, offline training module that uses large language models like ChatGPT to generate datasets, reducing the need for specialized machine learning expertise.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Human-robot interaction is becoming increasingly critical in industrial settings where efficiency, accuracy, and adaptability are paramount <ref type="bibr" target="#b0">[1]</ref>. As industries shift towards greater automation, the demand for more intuitive and natural communication methods between humans and robots grows. NLU serves as a vital tool in bridging this gap, allowing robots to interpret and respond to human language <ref type="bibr" target="#b1">[2]</ref>. However, many existing NLU systems depend on cloud-based services <ref type="bibr" target="#b2">[3]</ref>, which can introduce unwanted latency and security risks, issues that are particularly problematic in industrial environments.</p><p>This paper introduces a specialized NLU pipeline designed for human-robot interaction within industrial settings. The proposed system integrates an end-to-end ASR system <ref type="bibr" target="#b3">[4]</ref>, a transformer-based model for intent and entity recognition, and a dynamic dialogue management system. These components are optimized to function on commodity hardware, forming a robust and scalable solution for real-time HRI.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Solution architecture</head><p>To develop a robust and efficient NLU pipeline for human-robot interaction, we have employed a state-of-the-art technology stack: Python with the deep learning libraries PyTorch and Transformers for working with BERT <ref type="bibr" target="#b4">[5]</ref>, OpenAI Whisper <ref type="bibr" target="#b5">[6]</ref> for ASR and Mozilla TTS <ref type="bibr" target="#b6">[7]</ref> for Text-to-Speech, Pydantic for data validation, and FastAPI and Streamlit for creating user-friendly interfaces. These components are used to create the training and inference pipelines and are detailed in Figure <ref type="figure" target="#fig_0">1 below</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Inference pipeline</head><p>The inference pipeline is designed for on-edge deployment, enabling real-time processing of vocal commands with a target maximum end-to-end delay of two seconds. Key components include: Audio I/O: Captures and outputs audio, serving as the interface for human-robot communication; Speech to Text (ASR): Converts spoken language into text, optimized for minimal latency; NLU: Extracts intents and entities from the transcribed text to understand and execute commands; Dialog Manager: Manages conversational context and directs the flow of interactions; Text to Speech: Converts text responses back into speech, allowing for seamless communication with operators; Robot Output: Executes the understood commands, affecting robot actions directly. This component is out of scope for this paper.</p></div>
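The stage sequence above can be summarized as a simple function chain. The sketch below is illustrative only: each stub stands in for a real module (Whisper for ASR, the BERT-based NLU, the dialog manager and the TTS engine), and all names are hypothetical.

```python
# Minimal sketch of the inference pipeline loop with stubbed stages.
# The real system plugs Whisper, the BERT-based NLU and a TTS engine
# into these hooks; the implementations here are placeholders.

def transcribe(audio: bytes) -> str:
    """Stub ASR: pretend the audio payload is already text."""
    return audio.decode("utf-8")

def understand(text: str) -> dict:
    """Stub NLU: treat the first word as intent, the rest as slots."""
    words = text.lower().split()
    return {"intent": words[0] if words else None, "slots": words[1:]}

def manage_dialog(parsed: dict) -> str:
    """Stub dialog manager: pick a response for the parsed command."""
    if parsed["intent"] is None:
        return "Please repeat the command."
    return f"Executing {parsed['intent']} with {parsed['slots']}"

def synthesize(text: str) -> bytes:
    """Stub TTS: the real pipeline uses a cloud API or Mozilla TTS."""
    return text.encode("utf-8")

def run_pipeline(audio: bytes) -> bytes:
    """Audio in, audio out: ASR, NLU, dialog management, TTS."""
    return synthesize(manage_dialog(understand(transcribe(audio))))
```

Each stage is a pure function of the previous stage's output, which keeps the two-second latency budget easy to measure per component.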
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Training pipeline</head><p>The Training Pipeline trains the ML models, enables the addition of new voice commands, and can be configured to operate either via a cloud-based LLM API or via an open-source model such as Llama-3-8b on a powerful standalone computing station. This flexibility ensures an optimum balance between performance, reliability and security. The main components are: Config Manager: Provides a technical user interface for updating commands, and creates data models with validation and the LLM prompt according to the command structure; LLM: Generates annotated data in the form of natural utterances expressing the commands, according to the prompt; NLU Training: Trains the NLU model using the generated data. The hyperparameters are fixed during the research phase, so that users get a usable model without any ML engineer intervention.</p></div>
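The Config Manager step, turning a declared command structure into a data-generation prompt for the LLM, could look roughly like the following. This is a sketch under assumptions: the field names, prompt wording and `CommandSpec` type are invented for illustration (the real system uses Pydantic models; plain dataclasses are used here for brevity).

```python
from dataclasses import dataclass, field

# Illustrative Config Manager step: build an LLM prompt for annotated
# utterance generation from a declared command structure.
# All names and prompt wording are hypothetical.

@dataclass
class CommandSpec:
    intent: str
    slots: dict = field(default_factory=dict)  # slot name -> allowed values

def build_prompt(commands: list, n_examples: int = 20) -> str:
    """Render the command specs into a data-generation prompt."""
    lines = [
        f"Generate {n_examples} natural utterances per command,",
        "annotated with intent and slot values, one JSON object per line.",
        "Commands:",
    ]
    for cmd in commands:
        slot_desc = ", ".join(
            f"{name} in {values}" for name, values in cmd.slots.items()
        )
        lines.append(f"- intent '{cmd.intent}' with slots: {slot_desc or 'none'}")
    return "\n".join(lines)
```

The LLM's line-per-example output can then be validated against the same data models before it is handed to NLU training, which is what keeps the loop usable without ML engineer intervention.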
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Speech to Text</head><p>After studying the state of the art (as of 2024) in ASR models, we chose Whisper from OpenAI, as it has very good accuracy with a real-time factor &gt; 1 on commodity hardware. Additionally, it comes in multiple model sizes, allowing an optimum compromise between accuracy and resource consumption depending on the application.</p><p>The fundamental limitation of Whisper is that it can only transcribe a pre-recorded audio clip. Because a robot needs to continuously listen for commands, we tried several workarounds and found the Silero VAD (voice activity detection) model to be satisfactory.</p></div>
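The VAD workaround amounts to buffering: accumulate audio frames while the voice activity detector reports speech, and hand each finished utterance to Whisper as a clip. A minimal sketch of that segmentation logic, with `is_speech` standing in for Silero VAD (the real model's API differs):

```python
# VAD-gated segmentation: buffer frames while the VAD predicate
# reports speech, flush a complete utterance when speech ends.
# is_speech is a stand-in for Silero VAD; each flushed utterance
# would be passed to Whisper for transcription.

def segment_utterances(frames, is_speech):
    """Group a stream of audio frames into utterance clips."""
    utterances, buffer = [], []
    for frame in frames:
        if is_speech(frame):
            buffer.append(frame)
        elif buffer:  # speech just ended: flush the buffered clip
            utterances.append(b"".join(buffer))
            buffer = []
    if buffer:  # flush a trailing utterance at end of stream
        utterances.append(b"".join(buffer))
    return utterances
```

In practice one would also keep a short pre-roll and hangover around the detected speech so word onsets and endings are not clipped.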
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">NLU</head><p>Transformer-based language models have revolutionized NLP in the last few years <ref type="bibr" target="#b7">[8]</ref>, from language translation, sentiment analysis and knowledge extraction to complex text generation. Despite being relatively old (2018), the BERT language model remains a solid workhorse for many NLP tasks because of its low inference cost. One uses BERT as a pre-trained text encoder that captures an abstract understanding of the language in its hidden state vectors. Then, according to the task, one trains a relatively small neural network on top of the BERT encodings, requiring a similarly small amount of task-specific data. In our case of intent and slot recognition, we use one sentence-level classifier head for the intent, and another token-level classifier head for slot identification. Then, for each slot that must match a pre-defined list of values, we run the extracted tokens through a zero-shot classifier leveraging vector similarity.</p></div>
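The final zero-shot step reduces to nearest-neighbour search in embedding space: encode the extracted slot token and each allowed option, then pick the option with the highest cosine similarity. The sketch below uses toy 3-dimensional vectors in place of real BERT encodings, purely to illustrate the matching rule.

```python
import math

# Zero-shot slot matching by vector similarity, as described above.
# Toy vectors stand in for BERT encodings of the token and options.

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def match_slot(token_vec, option_vecs):
    """Return the allowed option whose embedding is closest to the token's."""
    return max(option_vecs, key=lambda name: cosine(token_vec, option_vecs[name]))
```

A similarity threshold on the best match would additionally let the dialog manager reject out-of-vocabulary slot values instead of silently snapping them to the nearest option.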
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.5.">Text to Speech</head><p>After studying the state of the art (as of 2024) in TTS models, we found that cloud APIs (OpenAI, Google, AWS, etc.) offer more than enough speech quality, low latency and a very low price for this application. Since most utterances can be generated as part of the training flow, there is little reliance on the internet during regular operation. However, some cases, such as mentioning the run-time slot options, require on-the-fly generation, so we investigated open-source models that can run locally, of which Mozilla TTS seems promising. If the latency of generating locally is higher than the cloud latency, the cloud API will be used primarily, with an automatic fallback (when offline) to the local model.</p></div>
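The cloud-first-with-fallback behaviour can be sketched as a thin wrapper. `cloud_tts` and `local_tts` are hypothetical stand-ins for the real API client and the local Mozilla TTS model; catching `OSError` as the offline signal is likewise an assumption about the client's failure mode.

```python
# Cloud-first TTS with automatic local fallback, as described above.
# cloud_tts / local_tts are hypothetical hooks for the real engines.

def speak(text, cloud_tts, local_tts):
    """Try the cloud engine first; fall back to the local model when offline."""
    try:
        return cloud_tts(text)
    except OSError:  # network failure: assume we are offline
        return local_tts(text)
```

Since most prompts are pre-synthesized during the training flow, this wrapper only matters for the on-the-fly cases such as reading out run-time slot options.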
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.6.">Dialog Manager</head><p>To have a coherent system, a core module is needed to stitch together the AI functionalities of the individual modules. A rule-based implementation is developed, covering the following fixed scenarios with appropriate templates applicable to any command that is later added: Too much audio noise: When the recognized speech is unintelligible, ask the operator to repeat, or simply ignore it and keep listening; Unclear intent: When the speech is recognized but cannot be classified as one of the existing intents, ask the operator to revise and repeat (optionally informing them about the list of intents); Missing slot: When the intent is clear but one of the mandatory slots for this intent is missing, ask the operator to say the respective slot, or repeat the entire utterance if there is more than one missing slot (to be able to separate which slot is which); Unclear slot option: When everything is clear except an invalid slot option (e.g., A, B and C are the options, but the operator says D), ask the operator to revise and say the slot (optionally informing them about the list of values).</p><p>The dialog manager uses a local database and memory variables to keep track of the conversation. It also logs all successes and failures (tied to the user input and context) so that the system can be analyzed, improved and updated.</p></div>
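The four fixed scenarios form a natural rule cascade, checked in order from least to most understood. The sketch below illustrates that ordering; the `parse` dictionary fields and response templates are invented for illustration and do not reflect the real data model.

```python
# The four fixed dialog scenarios as a rule cascade (illustrative only).
# parse: {"text": ..., "intent": ..., "slots": {name: value}}
# slot_options: {intent: {slot name: [allowed values]}}

def respond(parse, known_intents, slot_options):
    """Pick a template response for the current parse, in scenario order."""
    if parse.get("text") is None:  # 1) too much audio noise
        return "Sorry, I could not hear you. Please repeat."
    intent = parse.get("intent")
    if intent not in known_intents:  # 2) unclear intent
        return "I did not understand the command. Please rephrase."
    missing = [s for s in slot_options[intent] if s not in parse["slots"]]
    if missing:  # 3) missing mandatory slot(s)
        if len(missing) == 1:
            return f"Please say the {missing[0]}."
        return "Please repeat the whole command."
    for name, value in parse["slots"].items():  # 4) unclear slot option
        if value not in slot_options[intent][name]:
            return f"'{value}' is not a valid {name}. Please say the {name} again."
    return "OK."
```

Because the templates are keyed only by scenario, any command added later through the training pipeline is covered without new dialog rules.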
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Conclusions</head><p>In this paper, we have presented a novel Natural Language Understanding (NLU) pipeline designed for human-robot interaction in industrial settings.</p><p>Future work will focus on further enhancing the system's capabilities, including improving the accuracy and latency of the ASR and TTS components, expanding the range of supported commands, and integrating more advanced dialogue management features. Additionally, we aim to conduct extensive field testing in various industrial settings to refine the system and ensure its reliability and effectiveness in real-world applications.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 -</head><label>1</label><figDesc>Figure 1 -High level architecture of the proposed system.</figDesc><graphic coords="2,156.50,212.15,281.97,178.30" type="bitmap" /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Acknowledgements</head><p>This work was supported by the European Union's Horizon Europe research and innovation programme under grant agreement No 101058589 (AI-PRISM).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Human-Robot Interaction Review: Challenges and Solutions for Modern Industrial Environments</title>
		<ptr target="https://ieeexplore.ieee.org/abstract/document/9493209" />
	</analytic>
	<monogr>
		<title level="j">IEEE Journals &amp; Magazine | IEEE Xplore</title>
		<imprint>
			<date type="published" when="2024-07-15">Jul. 15, 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Natural Human Robot Interaction Using Artificial Intelligence: A Survey</title>
		<author>
			<persName><forename type="first">R</forename><surname>Bamdale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sahay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Khandekar</surname></persName>
		</author>
		<idno type="DOI">10.1109/IEMECONX.2019.8877044</idno>
	</analytic>
	<monogr>
		<title level="m">2019 9th Annual Information Technology, Electromechanical Engineering and Microelectronics Conference (IEMECON)</title>
				<imprint>
			<date type="published" when="2019-03">Mar. 2019</date>
			<biblScope unit="page" from="297" to="302" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">NLP-based platform as a service: a brief review</title>
		<author>
			<persName><forename type="first">S</forename><surname>Pais</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cordeiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Jamil</surname></persName>
		</author>
		<idno type="DOI">10.1186/s40537-022-00603-5</idno>
	</analytic>
	<monogr>
		<title level="j">J Big Data</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="issue">1</biblScope>
			<date type="published" when="2022-04">Apr. 2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Automatic Speech Recognition: A Deep Learning Approach</title>
		<author>
			<persName><forename type="first">D</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Deng</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-1-4471-5779-3</idno>
	</analytic>
	<monogr>
		<title level="m">Signals and Communication Technology</title>
				<meeting><address><addrLine>London</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.1810.04805</idno>
	</analytic>
	<monogr>
		<title level="j">arXiv</title>
		<imprint>
			<date type="published" when="2019-05-24">May 24, 2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Introducing Whisper</title>
		<ptr target="https://openai.com/index/whisper/" />
		<imprint>
			<date type="published" when="2024-07-15">Jul. 15, 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">mozilla/TTS</title>
		<ptr target="https://github.com/mozilla/TTS" />
	</analytic>
	<monogr>
		<title level="j">Mozilla</title>
		<imprint>
			<date type="published" when="2024-07-15">Jul. 15, 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Recent Advances in Natural Language Processing via Large Pre-trained Language Models: A Survey</title>
		<author>
			<persName><forename type="first">B</forename><surname>Min</surname></persName>
		</author>
		<idno type="DOI">10.1145/3605943</idno>
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<date type="published" when="2023-09">Sep. 2023</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
