<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Novel User-Friendly Pipeline for Enhanced Natural Language Understanding in Human-Robot Interaction</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Dorin</forename><surname>Clisu</surname></persName>
							<email>clisu@nttdata.com</email>
							<affiliation key="aff0">
								<orgName type="institution">NTT DATA Romania</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Iulia</forename><surname>Farcas</surname></persName>
							<email>iulia.farcas@nttdata.com</email>
							<affiliation key="aff0">
								<orgName type="institution">NTT DATA Romania</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Andrei</forename><surname>Rusu</surname></persName>
							<email>rusu.andrei@nttdata.com</email>
							<affiliation key="aff0">
								<orgName type="institution">NTT DATA Romania</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mihai</forename><surname>Hulea</surname></persName>
							<email>mihai.hulea@aut.utcluj.ro</email>
							<affiliation key="aff1">
								<orgName type="institution">Technical University of Cluj-Napoca</orgName>
								<address>
									<country key="RO">Romania</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<address>
									<settlement>Bucharest</settlement>
									<country key="RO">Romania</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Novel User-Friendly Pipeline for Enhanced Natural Language Understanding in Human-Robot Interaction</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">3159036253895BE4C6BD4F0EB11A3CE8</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:27+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>speech-to-text</term>
					<term>text-to-speech</term>
					<term>transformers</term>
					<term>human-machine collaboration</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper presents an innovative Natural Language Understanding (NLU) pipeline for human-robot interaction (HRI), optimized for on-premises deployment in industrial settings. The proposed system integrates an end-to-end Automated Speech Recognition (ASR) system, a transformer-based model for intent and entity recognition, and a dynamic dialogue management system. These components operate on commodity hardware, ensuring real-time responsiveness without cloud dependency. The pipeline is uniquely extensible via an automated, offline training module that uses large language models like ChatGPT to generate datasets, reducing the need for specialized machine learning expertise.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Human-robot interaction is becoming increasingly critical in industrial settings where efficiency, accuracy, and adaptability are paramount <ref type="bibr" target="#b0">[1]</ref>. As industries shift towards greater automation, the demand for more intuitive and natural communication methods between humans and robots grows. NLU serves as a vital tool in bridging this gap, allowing robots to interpret and respond to human language <ref type="bibr" target="#b1">[2]</ref>. However, many existing NLU systems depend on cloud-based services <ref type="bibr" target="#b2">[3]</ref>, which can introduce unwanted latency and security risks, issues that are particularly problematic in industrial environments.</p><p>This paper introduces a specialized NLU pipeline designed for human-robot interaction within industrial settings. The proposed system integrates an end-to-end ASR system <ref type="bibr" target="#b3">[4]</ref>, a transformer-based model for intent and entity recognition, and a dynamic dialogue management system. These components are optimized to function on commodity hardware, forming a robust and scalable solution for real-time HRI.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Solution architecture</head><p>To develop a robust and efficient NLU pipeline for human-robot interaction, we have employed a state-of-the-art technology stack: Python with the deep learning libraries PyTorch and Transformers for working with BERT <ref type="bibr" target="#b4">[5]</ref>, OpenAI Whisper <ref type="bibr" target="#b5">[6]</ref> for ASR and Mozilla TTS <ref type="bibr" target="#b6">[7]</ref> for Text-to-Speech, Pydantic for data validation, and FastAPI and Streamlit for creating user-friendly interfaces. These components are used to create the training and inference pipelines and are detailed in Figure <ref type="figure" target="#fig_0">1 below</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Inference pipeline</head><p>The inference pipeline is designed for on-edge deployment, enabling real-time processing of vocal commands with a target maximum end-to-end delay of two seconds. Key components include: Audio I/O: Captures and outputs audio, serving as the interface for human-robot communication; Speech to Text (ASR): Converts spoken language into text, optimized for minimal latency; NLU: Extracts intents and entities from the transcribed text to understand and execute commands; Dialog Manager: Manages conversational context and directs the flow of interactions; Text to Speech: Converts text responses back into speech, allowing for seamless communication with operators; Robot Output: Executes the understood commands, affecting robot actions directly. This component is out of scope for this paper.</p></div>
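The stage sequence above can be summarized as a simple function chain. The sketch below is illustrative only: each stub stands in for a real module (Whisper for ASR, the BERT-based NLU, the dialog manager and the TTS engine), and all names are hypothetical.

```python
# Minimal sketch of the inference pipeline loop with stubbed stages.
# The real system plugs Whisper, the BERT-based NLU and a TTS engine
# into these hooks; the implementations here are placeholders.

def transcribe(audio: bytes) -> str:
    """Stub ASR: pretend the audio payload is already text."""
    return audio.decode("utf-8")

def understand(text: str) -> dict:
    """Stub NLU: treat the first word as intent, the rest as slots."""
    words = text.lower().split()
    return {"intent": words[0] if words else None, "slots": words[1:]}

def manage_dialog(parsed: dict) -> str:
    """Stub dialog manager: pick a response for the parsed command."""
    if parsed["intent"] is None:
        return "Please repeat the command."
    return f"Executing {parsed['intent']} with {parsed['slots']}"

def synthesize(text: str) -> bytes:
    """Stub TTS: the real pipeline uses a cloud API or Mozilla TTS."""
    return text.encode("utf-8")

def run_pipeline(audio: bytes) -> bytes:
    """Audio in, audio out: ASR, NLU, dialog management, TTS."""
    return synthesize(manage_dialog(understand(transcribe(audio))))
```

Each stage is a pure function of the previous stage's output, which keeps the two-second latency budget easy to measure per component.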
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Training pipeline</head><p>The Training Pipeline trains the ML models, enables the addition of new voice commands, and can be configured to operate either via a cloud-based LLM API or via an open-source model such as Llama-3-8b on a powerful standalone computing station. This flexibility ensures an optimum balance between performance, reliability and security. The main components are: Config Manager: Provides a technical user interface for updating commands, and creates data models with validation and the LLM prompt according to the command structure; LLM: Generates annotated data in the form of natural utterances expressing the commands, according to the prompt; NLU Training: Trains the NLU model using the generated data. The hyperparameters are fixed during the research phase, so that users get a usable model without any ML engineer intervention.</p></div>
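The Config Manager step, turning a declared command structure into a data-generation prompt for the LLM, could look roughly like the following. This is a sketch under assumptions: the field names, prompt wording and `CommandSpec` type are invented for illustration (the real system uses Pydantic models; plain dataclasses are used here for brevity).

```python
from dataclasses import dataclass, field

# Illustrative Config Manager step: build an LLM prompt for annotated
# utterance generation from a declared command structure.
# All names and prompt wording are hypothetical.

@dataclass
class CommandSpec:
    intent: str
    slots: dict = field(default_factory=dict)  # slot name -> allowed values

def build_prompt(commands: list, n_examples: int = 20) -> str:
    """Render the command specs into a data-generation prompt."""
    lines = [
        f"Generate {n_examples} natural utterances per command,",
        "annotated with intent and slot values, one JSON object per line.",
        "Commands:",
    ]
    for cmd in commands:
        slot_desc = ", ".join(
            f"{name} in {values}" for name, values in cmd.slots.items()
        )
        lines.append(f"- intent '{cmd.intent}' with slots: {slot_desc or 'none'}")
    return "\n".join(lines)
```

The LLM's line-per-example output can then be validated against the same data models before it is handed to NLU training, which is what keeps the loop usable without ML engineer intervention.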
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Speech to Text</head><p>After studying the state of the art (as of 2024) in ASR models, we chose Whisper from OpenAI, as it has very good accuracy with a real-time factor &gt; 1 on commodity hardware. Additionally, it comes in multiple model sizes, allowing an optimum compromise between accuracy and resource consumption depending on the application.</p><p>The fundamental limitation of Whisper is that it can only transcribe a pre-recorded audio clip. Because a robot needs to continuously listen for commands, we tried several workarounds and found the Silero VAD (voice activity detection) model to be satisfactory.</p></div>
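The VAD workaround amounts to buffering: accumulate audio frames while the voice activity detector reports speech, and hand each finished utterance to Whisper as a clip. A minimal sketch of that segmentation logic, with `is_speech` standing in for Silero VAD (the real model's API differs):

```python
# VAD-gated segmentation: buffer frames while the VAD predicate
# reports speech, flush a complete utterance when speech ends.
# is_speech is a stand-in for Silero VAD; each flushed utterance
# would be passed to Whisper for transcription.

def segment_utterances(frames, is_speech):
    """Group a stream of audio frames into utterance clips."""
    utterances, buffer = [], []
    for frame in frames:
        if is_speech(frame):
            buffer.append(frame)
        elif buffer:  # speech just ended: flush the buffered clip
            utterances.append(b"".join(buffer))
            buffer = []
    if buffer:  # flush a trailing utterance at end of stream
        utterances.append(b"".join(buffer))
    return utterances
```

In practice one would also keep a short pre-roll and hangover around the detected speech so word onsets and endings are not clipped.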
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">NLU</head><p>Transformer-based language models have revolutionized NLP in the last few years <ref type="bibr" target="#b7">[8]</ref>, from language translation, sentiment analysis and knowledge extraction to complex text generation. Despite being relatively old (2018), the BERT language model remains a solid workhorse for many NLP tasks because of its low inference cost. One uses BERT as a pre-trained text encoder that captures an abstract understanding of the language in its hidden state vectors. Then, according to the task, one trains a relatively small neural network on top of the BERT encodings, requiring a similarly small amount of task-specific data. In our case of intent and slot recognition, we use one sentence-level classifier head for the intent, and another token-level classifier head for slot identification. Then, for each slot that must match a pre-defined list of values, we run the extracted tokens through a zero-shot classifier leveraging vector similarity.</p></div>
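The final zero-shot step reduces to nearest-neighbour search in embedding space: encode the extracted slot token and each allowed option, then pick the option with the highest cosine similarity. The sketch below uses toy 3-dimensional vectors in place of real BERT encodings, purely to illustrate the matching rule.

```python
import math

# Zero-shot slot matching by vector similarity, as described above.
# Toy vectors stand in for BERT encodings of the token and options.

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def match_slot(token_vec, option_vecs):
    """Return the allowed option whose embedding is closest to the token's."""
    return max(option_vecs, key=lambda name: cosine(token_vec, option_vecs[name]))
```

A similarity threshold on the best match would additionally let the dialog manager reject out-of-vocabulary slot values instead of silently snapping them to the nearest option.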
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.5.">Text to Speech</head><p>After studying the state of the art (as of 2024) in TTS models, we found that cloud APIs (OpenAI, Google, AWS, etc.) offer more than enough speech quality, low latency and a very low price for this application. Since most utterances can be generated as part of the training flow, there is little reliance on the internet during regular operation. However, some cases, such as mentioning the run-time slot options, require on-the-fly generation, so we investigated open-source models that can run locally, of which Mozilla TTS seems promising. If the latency of generating locally is higher than the cloud latency, the cloud API will be used primarily, with an automatic fallback (when offline) to the local model.</p></div>
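The cloud-first-with-fallback behaviour can be sketched as a thin wrapper. `cloud_tts` and `local_tts` are hypothetical stand-ins for the real API client and the local Mozilla TTS model; catching `OSError` as the offline signal is likewise an assumption about the client's failure mode.

```python
# Cloud-first TTS with automatic local fallback, as described above.
# cloud_tts / local_tts are hypothetical hooks for the real engines.

def speak(text, cloud_tts, local_tts):
    """Try the cloud engine first; fall back to the local model when offline."""
    try:
        return cloud_tts(text)
    except OSError:  # network failure: assume we are offline
        return local_tts(text)
```

Since most prompts are pre-synthesized during the training flow, this wrapper only matters for the on-the-fly cases such as reading out run-time slot options.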
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.6.">Dialog Manager</head><p>To have a coherent system, a core module is needed to stitch together the AI functionalities of the individual modules. A rule-based implementation is developed, covering the following fixed scenarios with appropriate templates applicable to any command that is later added: Too much audio noise: When the recognized speech is unintelligible, ask the operator to repeat, or simply ignore it and keep listening; Unclear intent: When the speech is recognized but cannot be classified as one of the existing intents, ask the operator to revise and repeat (optionally informing them about the list of intents); Missing slot: When the intent is clear but one of the mandatory slots for this intent is missing, ask the operator to say the respective slot, or repeat the entire utterance if there is more than one missing slot (to be able to separate which slot is which); Unclear slot option: When everything is clear except an invalid slot option (e.g., A, B and C are the options, but the operator says D), ask the operator to revise and say the slot (optionally informing them about the list of values).</p><p>The dialog manager uses a local database and memory variables to keep track of the conversation. It also logs all successes and failures (tied to the user input and context) so that the system can be analyzed, improved and updated.</p></div>
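The four fixed scenarios form a natural rule cascade, checked in order from least to most understood. The sketch below illustrates that ordering; the `parse` dictionary fields and response templates are invented for illustration and do not reflect the real data model.

```python
# The four fixed dialog scenarios as a rule cascade (illustrative only).
# parse: {"text": ..., "intent": ..., "slots": {name: value}}
# slot_options: {intent: {slot name: [allowed values]}}

def respond(parse, known_intents, slot_options):
    """Pick a template response for the current parse, in scenario order."""
    if parse.get("text") is None:  # 1) too much audio noise
        return "Sorry, I could not hear you. Please repeat."
    intent = parse.get("intent")
    if intent not in known_intents:  # 2) unclear intent
        return "I did not understand the command. Please rephrase."
    missing = [s for s in slot_options[intent] if s not in parse["slots"]]
    if missing:  # 3) missing mandatory slot(s)
        if len(missing) == 1:
            return f"Please say the {missing[0]}."
        return "Please repeat the whole command."
    for name, value in parse["slots"].items():  # 4) unclear slot option
        if value not in slot_options[intent][name]:
            return f"'{value}' is not a valid {name}. Please say the {name} again."
    return "OK."
```

Because the templates are keyed only by scenario, any command added later through the training pipeline is covered without new dialog rules.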
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Conclusions</head><p>In this paper, we have presented a novel Natural Language Understanding (NLU) pipeline designed for human-robot interaction in industrial settings.</p><p>Future work will focus on further enhancing the system's capabilities, including improving the accuracy and latency of the ASR and TTS components, expanding the range of supported commands, and integrating more advanced dialogue management features. Additionally, we aim to conduct extensive field testing in various industrial settings to refine the system and ensure its reliability and effectiveness in real-world applications.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 -</head><label>1</label><figDesc>Figure 1 -High level architecture of the proposed system.</figDesc><graphic coords="2,156.50,212.15,281.97,178.30" type="bitmap" /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Acknowledgements</head><p>This work was supported by the European Union's Horizon Europe research and innovation programme under grant agreement No 101058589 (AI-PRISM).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Human-Robot Interaction Review: Challenges and Solutions for Modern Industrial Environments</title>
		<ptr target="https://ieeexplore.ieee.org/abstract/document/9493209" />
	</analytic>
	<monogr>
		<title level="j">IEEE Journals &amp; Magazine | IEEE Xplore</title>
		<imprint>
			<date type="published" when="2024-07-15">Jul. 15, 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Natural Human Robot Interaction Using Artificial Intelligence: A Survey</title>
		<author>
			<persName><forename type="first">R</forename><surname>Bamdale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sahay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Khandekar</surname></persName>
		</author>
		<idno type="DOI">10.1109/IEMECONX.2019.8877044</idno>
	</analytic>
	<monogr>
		<title level="m">2019 9th Annual Information Technology, Electromechanical Engineering and Microelectronics Conference (IEMECON)</title>
				<imprint>
			<date type="published" when="2019-03">Mar. 2019</date>
			<biblScope unit="page" from="297" to="302" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">NLP-based platform as a service: a brief review</title>
		<author>
			<persName><forename type="first">S</forename><surname>Pais</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cordeiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Jamil</surname></persName>
		</author>
		<idno type="DOI">10.1186/s40537-022-00603-5</idno>
	</analytic>
	<monogr>
		<title level="j">J Big Data</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="issue">1</biblScope>
			<date type="published" when="2022-04">Apr. 2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Automatic Speech Recognition: A Deep Learning Approach</title>
		<author>
			<persName><forename type="first">D</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Deng</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-1-4471-5779-3</idno>
	</analytic>
	<monogr>
		<title level="m">Signals and Communication Technology</title>
				<meeting><address><addrLine>London</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.1810.04805</idno>
	</analytic>
	<monogr>
		<title level="j">arXiv</title>
		<imprint>
			<date type="published" when="2019-05-24">May 24, 2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Introducing Whisper</title>
		<ptr target="https://openai.com/index/whisper/" />
		<imprint>
			<date type="published" when="2024-07-15">Jul. 15, 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">mozilla/TTS</title>
		<ptr target="https://github.com/mozilla/TTS" />
	</analytic>
	<monogr>
		<title level="j">Mozilla</title>
		<imprint>
			<date type="published" when="2024-07-15">Jul. 15, 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Recent Advances in Natural Language Processing via Large Pre-trained Language Models: A Survey</title>
		<author>
			<persName><forename type="first">B</forename><surname>Min</surname></persName>
		</author>
		<idno type="DOI">10.1145/3605943</idno>
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<date type="published" when="2023-09">Sep. 2023</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
