<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>AURALYS: Smart Glasses to Improve Audio Selection and Perception in Educational and Working Contexts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gianluca Filippini</string-name>
          <email>gianluca.filippini@unimore.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guido Borghi</string-name>
          <email>guido.borghi@unimore.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrico Giliberti</string-name>
          <email>enrico.giliberti@unimore.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paola Damiani</string-name>
          <email>paola.damiani@unimore.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Vezzani</string-name>
          <email>roberto.vezzani@unimore.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Educazione e Scienze Umane, University of Modena and Reggio Emilia</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dipartimento di Ingegneria "Enzo Ferrari", University of Modena and Reggio Emilia</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The ability to discern multiple sound sources in complex environments is an innate auditory skill that varies across individuals due to diverse personal and contextual factors. Conditions such as aging, disabilities, or neurodevelopmental disorders - now more widely recognized - highlight the need for inclusive approaches. Hearing impairment is commonly understood as deafness or hearing loss, but numerous conditions affect not the quantity (how much one hears) but the quality of auditory perception (how one hears). This calls for interdisciplinary research on how technological and AI tools can support diverse users, promoting inclusion and improving quality of life, particularly for those with vulnerabilities. Therefore, in this paper, we introduce and discuss the adoption of AURALYS, smart glasses expressly designed to improve audio capabilities in educational and working scenarios. In particular, this device is intended to enhance audio selection and perception in dynamic contexts, in which multiple competing voices and background noises are present. We also introduce the VERSE framework to create and collect synthetic audio data to train machine learning systems for audio selection and perception implemented on the smart glasses.</p>
      </abstract>
      <kwd-group>
        <kwd>Audio Capability</kwd>
        <kwd>Selective Hearing</kwd>
        <kwd>Smart Glasses</kwd>
        <kwd>Artificial Intelligence</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>The ability to perceive and distinguish multiple sound
sources in complex acoustic environments – recognized
as an innate auditory skill, akin to other cognitive
processes – manifests differently among individuals
depending on various endogenous and exogenous factors,
leading to diverse profiles of competence and functioning.</p>
        <p>Some of these conditions are universal, such as aging,
while others stem from specific individual circumstances,
such as sensory disabilities or neurodevelopmental
disorders. These conditions are increasingly present in
educational and work settings, due both to greater recognition
and the spread of an inclusive culture.</p>
        <p>It is therefore essential to foster interdisciplinary reflection closely tied to the purposes of using technological
devices and to the characteristics of individuals and contexts. This should be done through a multidisciplinary
approach, starting from a pedagogical perspective that values the emancipatory potential of technology
and AI in enhancing quality of life for all, particularly for those who experience vulnerability.</p>
        <p>Figure 1: The prototype of AURALYS glasses placed on a 3D printed head.</p>
        <p>
          For example, in schools, students spend many hours in noisy environments that can create
challenging acoustic conditions, with background noise and overlapping voices interfering with lesson
comprehension and the ability to distinguish individual speakers [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] (e.g., crowded classrooms, gyms,
hallways, cafeterias). Similar conditions can be found in workplace settings, such as offices. Industrial
environments – such as factories and production plants – pose additional challenges, where the use of
loud machinery requires appropriate hearing protection. However, such protective equipment may also
reduce auditory sensitivity or impair the ability to discriminate sounds and identify their sources.
        </p>
        <p>Considering these elements, we introduce and discuss a technological solution called AURALYS, i.e.,
smart glasses provided with embedded audio capturing and analysis capabilities (see Fig. 1). From
a hardware point of view, the proposed system is mainly made up of six microphones, placed on a
frame that, in future developments, could also integrate cameras and other sensors. The number and
positioning of the microphones are themselves an object of study and research, with the aim of maximizing the ability to localize
audio sources while limiting the computational resources required for the corresponding
processing. AURALYS also integrates several software components capable of processing the signals
coming from the microphones in real time. Most of them are realized through innovative machine
learning techniques, which exhibit excellent performance but, at the same time, require large amounts
of data to obtain valid training. With current technologies, the ability of the system to generalize to any
condition and situation is incompatible with a low-latency embedded system suitable for integration
in AURALYS. Therefore, it is necessary to create specific datasets that contain only the degrees of
freedom strictly necessary for the application in question. For this purpose, the VERSE framework was
introduced, a complete framework able to generate suitable datasets of synthetic but realistic recordings
of human voices together with all the annotations required to train machine learning algorithms. The
specific dataset can be configured to include, for example, specific languages, types of voice, mutual
positioning, and motion between audio sources and the listener.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Inclusive Perspectives of Audio Capability</title>
        <p>
          Recent studies have investigated the impact of hearing loss on other functions essential for living a quality
life, confirming that it can significantly impair learning processes, affecting language development [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]
and broader language skills [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. These impacts potentially affect a wide range of individuals across the
lifespan and in various settings, from schools to workplaces and elderly care environments.
        </p>
        <p>
          Hearing impairment is commonly understood as deafness or hearing loss, but numerous conditions
affect not the quantity (how much one hears) but the quality of auditory perception (how one hears).
As noted by Bérard [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], these include auditory slowness, painful hearing, lack of auditory selectivity,
auditory dis-laterality, auditory distortions, residual hearing effects, and tinnitus. All of these anomalies
can significantly impact attention and learning. During developmental age, beyond more evident
conditions such as deafness or hearing loss, the ability to focus selectively can also be compromised in
cases of neurodevelopmental disorders, such as Attention Deficit Hyperactivity Disorder (ADHD) and
Specific Learning Disabilities (SLDs) – including dyslexia, dysgraphia, and dyscalculia – and Autism
Spectrum Disorder. Research has highlighted the key role of Executive Functions [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and the difficulties
associated with deficits in attentional and perceptual abilities [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>
          Although auditory processes have been studied less frequently than visual ones, impairments in
auditory attention have been shown to be significant in students with ADHD, SLDs, and Disruptive
Behavior Disorders. Low scores in auditory attention are associated with reduced selective and sustained
attention [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], leading to notable consequences for both learning quality and active participation.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Selective Hearing</title>
        <p>
          Selective Hearing [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] becomes less effective with increasing complexity of the environment, that is,
with a larger number of competing sound sources and a higher background noise level. A key aspect
of this perceptual process involves localizing where sounds are coming from, which the human brain
achieves by using a combination of spatial, spectral, and temporal cues. Among the most critical spatial
cues are Interaural Time Differences (ITD) and Interaural Level Differences (ILD). ITD refers to the tiny
differences in the time it takes for a sound to reach each ear; for example, if a sound source is located
to the left of a listener, it will reach the left ear slightly earlier than the right. This time difference,
typically tens to hundreds of microseconds, is processed by the auditory system to infer the direction of the
sound in the horizontal plane. ILD, on the other hand, refers to differences in sound pressure level
(or loudness) between the ears, which occur because the head acts as a physical barrier, casting an
acoustic shadow that attenuates sounds arriving at the far ear. ITD is most effective
for low-frequency sounds and ILD for high-frequency sounds; the brain combines the two cues
to localize sound sources with remarkable precision.
        </p>
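        <p>To give a sense of the magnitudes involved, the following minimal sketch evaluates the classical Woodworth spherical-head approximation of the ITD for a far-field source. It is only an illustration and not part of the AURALYS software: the head radius, the speed of sound, and the function name are assumptions chosen for the example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math

SPEED_OF_SOUND = 343.0   # m/s, assumed value for air at about 20 degrees C
HEAD_RADIUS = 0.0875     # m, assumed average adult head radius


def woodworth_itd(azimuth_rad, head_radius=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Approximate ITD in seconds for a far-field source.

    The azimuth is measured from straight ahead (0) towards one ear (pi/2).
    The head is modelled as a rigid sphere (Woodworth approximation).
    """
    tolerance = 1e-12  # absorbs rounding when converting from degrees
    if not -tolerance <= azimuth_rad <= math.pi / 2 + tolerance:
        raise ValueError("azimuth must lie in [0, pi/2] radians")
    return (head_radius / c) * (azimuth_rad + math.sin(azimuth_rad))


if __name__ == "__main__":
    for deg in (0, 30, 60, 90):
        itd_us = woodworth_itd(math.radians(deg)) * 1e6
        print(f"{deg:3d} deg: {itd_us:6.1f} microseconds")
```
]]></preformat>
        <p>Under these assumptions the ITD grows from zero for a frontal source to several hundred microseconds for a fully lateral one, which is the range the auditory system has to resolve.</p>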
        <p>In near-field scenarios, where sound sources are located close to the listener, the auditory system
can also exploit additional spatial cues, such as variations in the shape and timing of reverberations
and subtle changes in binaural cues due to head movement. Moreover, proximity of the source often
increases the signal-to-noise ratio and preserves finer acoustic details, making it easier to distinguish
between individual voices. In contrast, far-field conditions introduce challenges such as increased
reverberation, reduced spatial separation, and signal degradation due to distance, all of which blur
spatial and spectral distinctions between sources. Auditory spectral differences become predominant
for subjects with hearing aid devices. Differences in spectrum perception between the two ears will
affect the acoustic signal arriving at the two eardrums.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Technical Background for Selective Hearing</title>
        <p>
          Humans are able to localize audio sources by combining multiple senses. Audio cues are the
fundamental part of this process, although it has been shown that the interaction with visual information
enhances the capability to distinguish sound sources [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. For this reason, multi-microphone
techniques have attracted the interest of researchers, who explore approaches such as beamforming to improve
source localization in combination with head orientation and gaze [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Arrays of microphones have
been used to collect data for hearing aid applications, opening new scenarios [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
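        <p>The simplest form of beamforming, delay-and-sum, can be sketched as follows. This is an illustrative example under strong simplifying assumptions (integer sample delays that are already known, equal-length channels) and does not describe the processing actually implemented in AURALYS; the function name is hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def delay_and_sum(channels, delays):
    """Average the channels after advancing each one by its delay in samples.

    channels: equal-length lists of samples, one per microphone.
    delays:   non-negative integer arrival delays (in samples) of the target
              source at each microphone, relative to the earliest microphone.
    """
    if len(channels) != len(delays):
        raise ValueError("one delay per channel is required")
    n = len(channels[0])
    usable = n - max(delays)  # samples available on every aligned channel
    out = []
    for t in range(usable):
        aligned = sum(ch[t + d] for ch, d in zip(channels, delays))
        out.append(aligned / len(channels))
    return out


if __name__ == "__main__":
    pulse = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
    # The same pulse reaches microphone 2 one sample later
    # and microphone 3 two samples later.
    mics = [pulse, [0.0] + pulse[:-1], [0.0, 0.0] + pulse[:-2]]
    print(delay_and_sum(mics, [0, 1, 2]))  # steered at the source
    print(delay_and_sum(mics, [0, 0, 0]))  # steered elsewhere
```
]]></preformat>
        <p>When the delays match the direction of the source, the channels add coherently and the pulse keeps its amplitude; with mismatched delays the same energy is spread over several samples, which is how the array attenuates sounds from other directions.</p>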
        <p>
          In recent years, deep learning-based models have significantly advanced the state-of-the-art in both
sound source separation and localization. This is also the case for more complex scenarios involving
hearing aid implants [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. However, the fields face persistent challenges that hinder systematic progress
and fair benchmarking, primarily related to open-source dataset availability and reproducibility of
results. Despite the availability of reference benchmarks like CHiME [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and DCASE [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], it is
still difficult to reproduce the same results for a specific product, because of differences in
microphones, geometry, and calibration. Reference datasets do not fully capture the complexities of
real-world acoustic environments, such as custom reverberation, dynamic source movement, background
noise, and overlapping speech from multiple directions, all applied to a real, specific receiver (human,
binaural or multi-microphone array) that differs from the one used in the reference dataset.
        </p>
        <p>To mitigate the effort required to acquire large sets of real recordings, with their complex
measurements and data processing, synthetic data can be used in combination with direct audio
recordings. This approach offers advantages in scalability and reproducibility, and helps to solve some of the challenges
presented by "fixed recording audio datasets". Datasets with accurate spatial annotations of all the
components of the audio chain (e.g., microphone array geometries, source coordinates in space, sound
levels with calibration) have become available only recently, which has limited the capability to train or evaluate
models that rely on spatial cues for localization.</p>
        <p>
          However, even when datasets are available, there is a lack of consistency in evaluation protocols,
metrics, and data splits. This leads to a reproducibility gap in which results reported in the studies
cannot be directly compared. Furthermore, some datasets used in high-profile publications are not
made publicly available due to licensing restrictions or privacy concerns, making it challenging for
researchers to validate or extend previous work. Finally, when considering the usage of synthetic data,
it is important to address the reverberation of the environment to properly simulate the audio signal as
close as possible to the real-life scenario. The shape and size of the environment surrounding sound
sources influence early reverberations, which are predominant in the source localization process [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Synthetic Audio Data Generation</title>
        <p>The AURALYS glasses and the developed software rely on the assumption that synthetic datasets can
be created and correctly used to train deep learning algorithms. In particular, one of the most important
steps is the generation of realistic assets to combine in the rendering framework, and the
intrinsic acoustic characteristics of the listener are one such asset.</p>
        <p>
          The scientific study of binaural hearing has its roots in psychoacoustics and auditory physiology,
dating back to the early 20th century [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. By the mid-20th century, advances in signal processing and
acoustics enabled more rigorous measurement of Head-Related Transfer Functions (HRTFs), a critical
component in binaural modeling. Indeed, HRTFs characterize how the listener’s head, torso, and pinnae
filter incoming sound before reaching the eardrums, and provide direction-dependent spectral and
temporal cues essential for precise localization [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. With improvements in audio recording equipment
and microphones, and thanks to digital signal processing applied to audio signals, it has been possible to
define multiple techniques to measure the HRTF function of a given subject [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. Different techniques
perform differently in terms of signal-to-noise ratio (SNR) and spatial accuracy [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], but the
Exponential Sine Sweep remains one of the most widely used techniques [20].
        </p>
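        <p>For reference, an exponential sine sweep can be generated as in the sketch below, which uses the standard closed-form expression of the sweep phase. The start and end frequencies, duration, and sample rate are free parameters of the example and are not taken from the AURALYS measurement setup; the function names are illustrative.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def exponential_sine_sweep(f_start, f_end, duration, sample_rate):
    """Return the samples of an exponential sine sweep from f_start to f_end (Hz)."""
    if not 0 < f_start < f_end:
        raise ValueError("require 0 < f_start < f_end")
    log_ratio = math.log(f_end / f_start)
    n_samples = int(round(duration * sample_rate))
    # Phase: 2*pi*f_start*T/ln(f_end/f_start) * (exp(t*ln(f_end/f_start)/T) - 1)
    scale = 2.0 * math.pi * f_start * duration / log_ratio
    return [
        math.sin(scale * (math.exp((n / sample_rate) * log_ratio / duration) - 1.0))
        for n in range(n_samples)
    ]


def instantaneous_frequency(t, f_start, f_end, duration):
    """Frequency (Hz) that the sweep is passing through at time t (seconds)."""
    return f_start * math.exp(t * math.log(f_end / f_start) / duration)


if __name__ == "__main__":
    sweep = exponential_sine_sweep(20.0, 20000.0, 1.0, 48000)
    print(len(sweep), "samples")
    print(instantaneous_frequency(0.5, 20.0, 20000.0, 1.0), "Hz at mid-sweep")
```
]]></preformat>
        <p>Because the frequency grows exponentially with time, the sweep spends the same amount of time in every octave, which is one reason the method is attractive for acoustic measurements.</p>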
        <p>The dependence of the HRTF on the characteristics of the human body has been studied through
measurements of the ear pinna, torso, and head, forming databases of morphological measurements. Yet
the collection of these measurements requires careful setup and calibration to properly retrieve the
transfer function. Measuring HRTFs is still a time-consuming and complex procedure, often requiring
anechoic chambers. New techniques have been proposed to relax the constraints on the recording
environment [21].</p>
        <p>HRTF measurements are even more important when we focus on human voice and speech
intelligibility. Using non-individualized HRTFs introduces a significant difference between the reference
data and the real-life scenario. This is critical for synthetic datasets, which are widely used in
neural network development [22].</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. AURALYS: smart glasses with AUdio captuRing and anALYSis capabilities</title>
      <p>AURALYS<sup>1</sup> is a cutting-edge research project that aims to design a tool to enhance
human auditory perception in dynamic, real-world environments. The idea of using glasses to improve
audio capabilities derives from the need to include an array of microphones in which each device is
slightly distanced from the others. In our opinion, the temples of glasses are a good place for this
microphone array, thanks to the physical space available on a rigid surface and the proximity to the ears.
In addition, wearing glasses is widely accepted in society, especially in contexts related to school and work.
We also note that, as future work, the functionalities of the glasses could easily be expanded with
vision systems, for instance two cameras placed on the lenses.</p>
      <p>Figure 2: The 3D rendering of the AURALYS glasses; highlighted in red, the position of the microphones.</p>
      <p><sup>1</sup> https://github.com/iot-unimore/Auralys</p>
      <p>AURALYS integrates a custom-designed microphone array mounted on 3D-printed frames (see Fig. 2),
precisely positioned to capture spatial audio cues around the wearer. Unlike standard binaural recording
setups, the glasses leverage six analog microphones – strategically placed to maximize directional
sensitivity – enabling advanced real-time audio processing, including source localization, selective
hearing, and speech enhancement. The six microphones are not only spatially distributed along
the transverse (horizontal) plane, but are also placed at different heights along the craniocaudal axis
(distance from the ground), thus enabling a more accurate localization of the audio sources in the entire
three-dimensional space.</p>
      <p>Thanks to the use of analog microphones and high-fidelity acquisition systems, the glasses achieve
tight synchronization between input channels, which is essential for computing accurate HRTFs
(see Sect. 2.4). These functions capture how sound interacts with the unique geometry of the user’s
head and ears, ensuring highly individualized and realistic spatial audio rendering. This makes the
AURALYS glasses particularly effective in acoustically challenging scenarios, such as environments
with reverberation, overlapping voices, or moving sound sources.</p>
      <p>By combining hardware precision with the VERSE software framework – which simulates and
processes dynamic scenes with high realism – AURALYS is not only a platform for research but also a
promising assistive device. Indeed, as mentioned, the glasses open up new possibilities for augmented hearing,
enhanced situational awareness, and robust speech understanding in settings like crowded streets,
classrooms, or public transportation. In this way, AURALYS stands at the intersection of wearable
technology, human-centered design, and advanced acoustic modeling.</p>
      <sec id="sec-3-1">
        <title>3.1. Hardware Prototype</title>
        <p>The 3D printed glasses prototype provides the base on which the six microphones are placed in specific positions (see Fig. 2).
Analog MEMS microphones are used to simplify the synchronization of the recorded signals with the source
stimuli. As shown in Figure 1, the glasses are placed on a 3D printed head in order to replicate a realistic
setting during sound acquisition. Standardized mannequins replicating the human torso and head are
commercially available, although only a few products are on the market, such as the well-known KEMAR
mannequin<sup>2</sup>. These apparatuses are essential in acoustics and physics research, but there are use
cases in which the subject is custom and not comparable to the standard specification.</p>
        <p>In our work, we use a 3D printed head from the open-source project OpenAural [23], available under
a Creative Commons license<sup>3</sup>. The OpenAural head was selected for its license and its low-cost
reproducibility, but with modern tools it is possible to 3D scan and print any subject, whether a robotic
device or a human and, in particular, a child’s head. For this project, the printed head is combined with a
commercial torso mannequin, similar to those used for store displays. The receivers are built using the
analog MEMS microphone KNOWLES SPM0687LR5H-1. The full schematic, released as part
of the repository for reproducibility, provides a small microphone with 48 V phantom power
capability.</p>
        <p>The technique used to compute the HRTF is based on the common sine sweep method,
in which the stimulus is produced by a calibrated (equalized) speaker, and the receiver and source signals are
recorded with a digital audio card on a computer. The AURALYS project uses a FAITAL PRO 4FE32 audio speaker
(8 ohms)<sup>4</sup>. The combination of speaker and audio amplifier has been equalized for a flat audio
response via an external Behringer UltraCurve DEQ2496<sup>5</sup> equalizer.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Software modules</title>
        <p>The software developed for the project, which processes the audio streams coming from the six
microphones of the glasses, must include the components shown in Figure 3. For each specific application,
a dataset will be created as described in Section 4; this dataset is essential for training or configuring the
parameters of each of the modules described here.</p>
        <p><sup>2</sup> https://www.grasacoustics.com/industries/audiology/kemar</p>
        <p><sup>3</sup> https://www.thingiverse.com/thing:4691843</p>
        <p><sup>4</sup> https://faitalpro.com/it/products/LF_Loudspeakers/product_details/index.php?id=401005100</p>
        <p><sup>5</sup> https://www.behringer.com/product.html?modelCode=0821-AAD</p>
        <p>In particular, the main software components are the following:
• Human Voice Detection: identifies the presence and the number of speakers. It is useful to
start the following steps only when required (also for energy saving) and to enable appropriate
algorithms depending on the number of audio sources.
• Source localization: estimates and tracks over time the positions of the speakers in the 3D
space.
• Selection: this module allows a (semi-)automatic selection of the speaker on which the
subsequent modules should focus.
• Isolation: outputs an audio stream where only the voice of the selected speaker is audible.
• Processing: depending on the application, the final step could be a speech-to-text translation, a
re-equalization, a frequency shift or any useful task.</p>
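        <p>The chain of modules listed above can be pictured as a simple pipeline in which detection gates the later, more expensive stages. The sketch below is purely illustrative: the class, the stage signatures, and the toy stages are assumptions made for this example and do not reflect the actual AURALYS implementation, where each stage would be a trained model.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Frame = Sequence[Sequence[float]]  # one list of samples per microphone


@dataclass
class Pipeline:
    """Chains the five stages; every callable is a placeholder for a real model."""
    detect: Callable[[Frame], int]                  # number of active speakers
    localize: Callable[[Frame, int], List[float]]   # one azimuth (degrees) per speaker
    select: Callable[[List[float]], int]            # index of the speaker of interest
    isolate: Callable[[Frame, float], List[float]]  # mono stream of the selected speaker
    process: Callable[[List[float]], List[float]]   # application-specific final step

    def run(self, frame: Frame) -> Optional[List[float]]:
        n_speakers = self.detect(frame)
        if n_speakers == 0:
            return None  # later stages are skipped, which also saves energy
        azimuths = self.localize(frame, n_speakers)
        target = self.select(azimuths)
        voice = self.isolate(frame, azimuths[target])
        return self.process(voice)


def energy_detector(frame, threshold=1e-3):
    """Toy detector: reports one speaker when the mean energy exceeds a threshold."""
    n = max(1, sum(len(ch) for ch in frame))
    energy = sum(s * s for ch in frame for s in ch) / n
    return 1 if energy > threshold else 0


# Toy stages, for illustration only.
toy = Pipeline(
    detect=energy_detector,
    localize=lambda frame, n: [0.0] * n,  # pretend every speaker is straight ahead
    select=lambda azimuths: min(range(len(azimuths)), key=lambda i: abs(azimuths[i])),
    isolate=lambda frame, azimuth: [sum(col) / len(col) for col in zip(*frame)],
    process=lambda voice: [2.0 * s for s in voice],  # stand-in for re-equalization
)

if __name__ == "__main__":
    print(toy.run([[0.0, 0.0], [0.0, 0.0]]))    # silence: pipeline stops early
    print(toy.run([[0.5, -0.5], [0.5, -0.5]]))  # signal: all stages run
```
]]></preformat>
        <p>Keeping each stage behind a narrow interface makes it possible to train or replace one module, for instance the localizer, without touching the others.</p>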
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Use Cases</title>
        <p>The proposed use of AURALYS focuses on two application domains: a general workplace setting and
a specific learning environment. Both contexts offer the possibility to detect sound features and to
develop an appropriate response model, enabling the anticipation of suitable scenarios.</p>
        <p>The first context is the general workplace, where an individual operates within a defined environment
with known characteristics. In such settings, relevant auditory stimuli that are difficult to discriminate
may occur sporadically or originate from directions that are challenging to localize.</p>
        <p>The second context is educational, where multiple individuals interact simultaneously, and activities
vary in nature. In this scenario, it may be particularly useful to isolate one voice from others, for
example, by filtering based on intensity, timbre, or direction of origin.</p>
        <p>In the following, we further discuss some real-world use-cases.</p>
        <p>Difficulties. Several everyday environments present continuous and uniform background noise,
intermittently punctuated by other sounds, only some of which are relevant. The relevance of these
sounds may depend on their type (e.g., equipment alarms) or their direction (e.g., voices in crowded
environments, such as communication among workers, or auditory signals perceived by drivers). In
educational settings, such as classrooms, group activities often involve multiple overlapping voices
and background noise, making it difficult for students to understand the teacher, or conversely, to
follow what peers are saying when the teacher is interacting with others. This issue also extends to
informal moments, such as recess or time spent in the schoolyard, where the ability to selectively
focus on a specific auditory stimulus becomes challenging. In university contexts, similar difficulties
can arise during collaborative activities, such as group discussions or study sessions, where multiple
simultaneous conversations may interfere with effective communication and participation.</p>
        <p>Disorders. Here, the use case focuses on ADHD, in which individuals are more prone to
distraction during tasks due to difficulty in inhibiting irrelevant auditory stimuli – such as background
voices or environmental sounds – and in shifting attention efficiently between stimuli. To support
attentional performance in individuals with ADHD, several strategies can be adopted. One approach
involves filtering or reducing the volume of non-relevant sounds or voices, based on characteristics
such as direction, intensity, or timbre. This can help the individual maintain focus on the primary
auditory stimulus, such as the teacher’s voice or the voices of peers seated nearby or directly in front.
Another strategy is to enhance or emphasize target voices within noisy environments – for example,
during group discussions held outdoors – so that relevant speech stands out from background noise.
A further option is to suppress or minimize all environmental sounds to create an artificially quieter
setting. This can support concentration on cognitively demanding tasks, such as reading or studying,
in otherwise noisy environments.</p>
        <p>Disabilities. The performance of existing hearing aids can be improved by integrating their current
software with the specific capabilities offered by the proposed software, without requiring the use
of AURALYS glasses. Additionally, an even greater enhancement can be achieved by combining
the software with AURALYS glasses, which provide directional sound capture from the surrounding
environment, thereby further refining the device’s ability to process relevant auditory stimuli.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. The VERSE framework</title>
      <p>The “Virtual Environment for Rendering of Speech Emissions” (VERSE) framework<sup>6</sup> provides a platform
to generate synthetic datasets of voice recordings and real environment characterization measurements.
Among its goals, the main one is to study the advantages of an array of microphones over binaural
audio signals, in the context of real embedded devices and including machine learning algorithms for
signal processing, with a particular focus on human voices.</p>
      <p>VERSE is based on the abstraction of main components for an audio scene: voice sound sources from
speakers, one listener and reverberation generated by the environment itself, meaning the room hosting
speakers and listener. Specific to the definition of a scene is the concept of motion: the scene defines
how sound sources are placed around the listener and how they move in space.</p>
      <p>The basic resources used in the VERSE framework are defined as follows.
• Speakers: these are audio recordings in a digital format capturing a single speaker recorded
on a single track (mono) in a non-reverberant room (or with as little reverberation as possible).
The absence of reverberation is important since the scene itself will define the type of room
reverberation that must be applied. Each source can be static (does not move in space during
playback) or dynamic (will move along a specific path in space), defining an audio scene.
• Room: the environment defines the early and late reflections of sounds arriving from walls and objects.
The sum of all sound reflections reaching the listener’s head (and microphones) defines the final audio
perceived by the subject. Multiple techniques have been developed to properly measure the
impulse response of a room, using energetic “impulsive” stimuli or sine sweeps [24].
• Listener: the current version provides for a single listener. The listening subject is assumed to
be at the centre of the three-dimensional reference system adopted to model the source paths. A
listener can be defined as the coupling between a head and a pair of AURALYS glasses. The most
important item contained in the listener definition is the HRTF of the microphone array (binaural
or multi-microphone or both). The HRTF is stored using the SOFA format (Spatially
Oriented Format for Acoustics) [25] defined by the Audio Engineering Society (AES) in a specific
standard (AES69-2022: AES standard for file exchange - Spatial acoustic data file format) [26].
• Dataset configuration: the dataset configuration contains all the information for the audio
scene definition, i.e., the mix of speakers, room and listener.</p>
      <p>Given a dataset configuration, the VERSE framework makes it possible to render a complete dataset of dynamic
scenes, moving the sources along a pre-defined path, thanks to the convolution engine by 3D TuneIn.</p>
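      <p>The core operation behind this kind of rendering is the convolution of a dry (non-reverberant) voice recording with the impulse response measured between the source position and each microphone. The following sketch shows that operation for a single static source in plain Python. It is a conceptual illustration only, not the convolution engine used by VERSE, and the function names are hypothetical; a dynamic source would additionally require switching or interpolating between the impulse responses of successive directions along the path.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two lists of samples."""
    if not signal or not impulse_response:
        return []
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def render_static_source(dry_voice, impulse_responses):
    """Render one static source at every microphone of the listener.

    dry_voice:         mono recording with as little reverberation as possible.
    impulse_responses: one impulse response per microphone for the direction
                       of the source (for example, read from a SOFA file).
    """
    return [convolve(dry_voice, ir) for ir in impulse_responses]


if __name__ == "__main__":
    voice = [1.0, 0.5]
    # Toy impulse responses: microphone 1 hears the source directly,
    # microphone 2 hears it one sample later and at half the level.
    channels = render_static_source(voice, [[1.0], [0.0, 0.5]])
    for index, channel in enumerate(channels, start=1):
        print("microphone", index, channel)
```
]]></preformat>
      <p>In this toy case the second channel is a delayed and attenuated copy of the first, which is exactly the kind of inter-channel difference that a localization model learns to exploit.</p>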
      <p>This flexibility in composing virtual audio scenes by swapping resources within a definition is the
essence of the VERSE dataset and framework: it is possible to generate a wide set of data with precise,
repeatable positioning and with reference files (ground truth), avoiding the cost and time
of laborious recording sessions with real-life audio setups.</p>
      <p>VERSE is also modular: resources are abstracted at the file level, using YAML files
to describe each resource with a common syntax and folder structure. This makes it possible to add
resources (sounds, heads or rooms) from other datasets, expanding the possibilities for final audio generation.</p>
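      <p>A minimal sketch of this modularity, assuming that each YAML resource file has already been
parsed into a dictionary. The required keys, the type names and the naming prefix are hypothetical
and only illustrate how a common syntax lets resources from another dataset be merged into an
existing catalogue.</p>
      <preformat>
```python
from typing import Dict

Resource = Dict[str, str]

REQUIRED_KEYS = {"type", "file"}         # assumed common syntax for every resource
KNOWN_TYPES = {"sound", "head", "room"}  # resource kinds named in the text


def validate(name: str, entry: Resource) -> None:
    """Reject entries that do not follow the assumed common syntax."""
    missing = REQUIRED_KEYS.difference(entry)
    if missing:
        raise ValueError(f"{name}: missing keys {sorted(missing)}")
    if entry["type"] not in KNOWN_TYPES:
        raise ValueError(f"{name}: unknown resource type {entry['type']!r}")


def merge_catalogues(
    base: Dict[str, Resource],
    extra: Dict[str, Resource],
    prefix: str,
) -> Dict[str, Resource]:
    """Return a new catalogue with the external resources added under a prefix."""
    merged = dict(base)
    for name, entry in extra.items():
        validate(name, entry)
        merged[f"{prefix}/{name}"] = entry
    return merged


base = {"verse/teacher": {"type": "sound", "file": "sounds/teacher.wav"}}
external = {"kemar": {"type": "head", "file": "heads/kemar.sofa"}}

catalogue = merge_catalogues(base, external, prefix="external")
print(sorted(catalogue))  # ['external/kemar', 'verse/teacher']
```
      </preformat>
      <p>Prefixing imported names keeps resources from different datasets from colliding, and validating
each entry on import keeps the catalogue uniform regardless of where a resource came from.</p>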
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In conclusion, perceiving and selectively attending to relevant sounds in complex acoustic
environments is a critical challenge, especially for individuals with age-related, sensory, or
neurodevelopmental vulnerabilities. This challenge is increasingly evident in everyday contexts such as
schools and workplaces, where noisy conditions can significantly impair communication and learning.
In these scenarios, the proposed AURALYS smart glasses represent a promising technological solution.
By enabling real-time directional sound capture and selective auditory
filtering, AURALYS has the potential to enhance auditory perception and attention in noisy settings. The
complementary VERSE framework supports this effort by providing targeted datasets optimized for
the device’s embedded processing constraints, facilitating efficient and context-specific sound source
localization.</p>
      <p>Future work will focus on refining hardware design, improving model generalization, and conducting
extensive user-centered evaluations to fully realize the system’s potential in promoting inclusive,
accessible environments that support diverse auditory needs. This interdisciplinary approach, grounded
in pedagogical principles and technological innovation, aims to improve quality of life for all individuals,
particularly those who experience auditory vulnerabilities.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <sec id="sec-6-1">
        <title>The author(s) have not employed any Generative AI tools.</title>
        <p>response measurement approaches, The Journal of the Acoustical Society of America 134 (2013)
EL223–EL229. doi:10.1121/1.4813592.
[20] A. Farina, Simultaneous measurement of impulse response and distortion with a swept-sine
technique (2000).
[21] HRTF measurements with recorded reference signal, in: 129th Audio Engineering Society
Convention 2010, 2010, pp. 533–540. Conference date: 04-11-2010 through 07-11-2010.
[22] M. Cuevas-Rodriguez, D. Gonzalez-Toledo, A. Reyes-Lecuona, L. Picinali, Impact of
non-individualised head related transfer functions on speech-in-noise performances within a
synthesised virtual environment, The Journal of the Acoustical Society of America 149 (2021) 2573–2586.
doi:10.1121/10.0004220.
[23] D. O’Connor, J. Kennedy, An evaluation of 3D printing for the manufacture of a binaural recording
device, Applied Acoustics 171 (2021) 107610. URL: http://dx.doi.org/10.1016/j.apacoust.2020.107610.
doi:10.1016/j.apacoust.2020.107610.
[24] R. San Martín, M. Arana, J. Machín, A. Arregui, Impulse source versus dodecahedral loudspeaker
for measuring parameters derived from the impulse response in room acoustics, The Journal of
the Acoustical Society of America 134 (2013) 275–284. URL: http://dx.doi.org/10.1121/1.4808332.
doi:10.1121/1.4808332.
[25] Audio Engineering Society (AES), SOFA (Spatially Oriented Format for Acoustics), 2022.
URL: https://www.sofaconventions.org/mediawiki/index.php/SOFA_(Spatially_Oriented_Format_for_Acoustics).
[26] Audio Engineering Society (AES), AES69-2022: AES standard for file exchange - Spatial acoustic data file format, 2022.
URL: https://www.aes.org/publications/standards/search.cfm?docID=99.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Vakili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vakili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Ajilian</given-names>
            <surname>Abbasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Masoudi</surname>
          </string-name>
          ,
          <article-title>Overcrowded classrooms: Challenges, consequences, and collaborative solutions for educators: A literature review</article-title>
          ,
          <source>Medical Education Bulletin</source>
          <volume>5</volume>
          (
          <year>2024</year>
          )
          <fpage>961</fpage>
          -
          <lpage>972</lpage>
          . doi:10.22034/meb.2024.492269.1103.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Savegnago</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Franz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gubernale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gallo</surname>
          </string-name>
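          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>de Filippis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Marioni</surname>
          </string-name>
          ,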
          <string-name>
            <given-names>E.</given-names>
            <surname>Genovese</surname>
          </string-name>
          ,
          <article-title>Learning disabilities in children with hearing loss: A systematic review</article-title>
          ,
          <source>American Journal of Otolaryngology</source>
          <volume>45</volume>
          (
          <year>2024</year>
          )
          104439. doi:10.1016/j.amjoto.2024.104439.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Carlucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sapone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cotroneo</surname>
          </string-name>
          ,
          <article-title>The sound of silence: quando il cervello non sente</article-title>
          ,
          <source>ACSA MAGAZINE</source>
          (
          <year>2024</year>
          ). URL: https://www.acsamedical.it/the-sound-of-silence-quando-il-cervello-non-sente/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bérard</surname>
          </string-name>
          , <source>Hearing Equals Behavior</source>, New Canaan, Conn.: Keats Pub,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Diamond</surname>
          </string-name>
          ,
          <article-title>Executive functions</article-title>
          ,
          <source>Annual Review of Psychology</source>
          <volume>64</volume>
          (
          <year>2013</year>
          )
          <fpage>135</fpage>
          -
          <lpage>168</lpage>
          . URL: https://www.annualreviews.org/content/journals/10.1146/annurev-psych-113011-143750. doi:10.1146/annurev-psych-113011-143750.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Conte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Marzocchi</surname>
          </string-name>
          ,
          <article-title>Specific executive function profile of children with adhd, learning disabilities or odd; [profili specifici di funzioni esecutive nei ragazzi con adhd, dsa o dop]</article-title>
          ,
          <source>Psicologia Clinica dello Sviluppo</source>
          <volume>24</volume>
          (
          <year>2020</year>
          )
          <fpage>401</fpage>
          -
          <lpage>436</lpage>
          . doi:10.1449/98293.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Marimpietri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Carmignani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graziani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sechi</surname>
          </string-name>
          ,
          <article-title>Profili neuropsicologici e funzioni esecutive nei bambini con disturbo da deficit di attenzione/iperattività (adhd) e disturbo specifico di apprendimento (dsa)</article-title>
          ,
          <volume>79</volume>
          (
          <year>2012</year>
          )
          <fpage>159</fpage>
          -
          <lpage>177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.</given-names>
            <surname>Cano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lukashevich</surname>
          </string-name>
          ,
          <article-title>Selective hearing: A machine listening perspective</article-title>
          ,
          <source>in: 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . doi:10.1109/MMSP.2019.8901720.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bent</surname>
          </string-name>
          ,
          <article-title>I can't hear you without my glasses</article-title>
          ,
          <source>J. Acoust. Soc. Am</source>
          .
          <volume>157</volume>
          (
          <year>2025</year>
          )
          <fpage>R5</fpage>
          -
          <lpage>R6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Culling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. F. C.</given-names>
            <surname>D'Olne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Davies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Powell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Naylor</surname>
          </string-name>
          ,
          <article-title>Practical utility of a head-mounted gaze-directed beamforming system</article-title>
          ,
          <source>The Journal of the Acoustical Society of America</source>
          <volume>154</volume>
          (
          <year>2023</year>
          )
          <fpage>3760</fpage>
          -
          <lpage>3768</lpage>
          . doi:10.1121/10.0023961.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Caversaccio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wimmer</surname>
          </string-name>
          ,
          <article-title>Multichannel acoustic source and image dataset for the cocktail party effect in hearing aid and implant users</article-title>
          ,
          <source>Scientific Data</source>
          <volume>7</volume>
          (
          <year>2020</year>
          )
          440. URL: https://doi.org/10.1038/s41597-020-00777-8. doi:10.1038/s41597-020-00777-8.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gaultier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Goehring</surname>
          </string-name>
          ,
          <article-title>Recovering speech intelligibility with deep learning and multiple microphones in noisy-reverberant situations for people using cochlear implants</article-title>
          ,
          <source>The Journal of the Acoustical Society of America</source>
          <volume>155</volume>
          (
          <year>2024</year>
          )
          <fpage>3833</fpage>
          -
          <lpage>3847</lpage>
          . doi:10.1121/10.0026218.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Barker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Vincent</surname>
          </string-name>
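          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Christensen</surname>
          </string-name>
          ,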
          <string-name>
            <given-names>P.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <article-title>The PASCAL CHiME speech separation and recognition challenge</article-title>
          ,
          <source>Computer Speech &amp; Language</source>
          <volume>27</volume>
          (
          <year>2013</year>
          )
          <fpage>621</fpage>
          -
          <lpage>633</lpage>
          . doi:10.1016/j.csl.2012.10.004, Special Issue on Speech Separation and Recognition in Multisource Environments.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Stowell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Giannoulis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Benetos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lagrange</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Plumbley</surname>
          </string-name>
          ,
          <article-title>Detection and classification of acoustic scenes and events</article-title>
          ,
          <source>IEEE Transactions on Multimedia</source>
          <volume>17</volume>
          (
          <year>2015</year>
          )
          <fpage>1733</fpage>
          -
          <lpage>1746</lpage>
          . doi:10.1109/TMM.2015.2428998.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Steffens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>van de Par</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Ewert</surname>
          </string-name>
          ,
          <article-title>The role of early and late reflections on perception of source orientation</article-title>
          ,
          <source>The Journal of the Acoustical Society of America</source>
          <volume>149</volume>
          (
          <year>2021</year>
          )
          <fpage>2255</fpage>
          -
          <lpage>2269</lpage>
          . doi:10.1121/10.0003823.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>E.</given-names>
            <surname>Meyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.-G.</given-names>
            <surname>Neumann</surname>
          </string-name>
          , Physical and Applied Acoustics: An Introduction, Academic Press, New York,
          <year>1972</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>V.</given-names>
            <surname>Pulkki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huopaniemi</surname>
          </string-name>
          ,
          <article-title>Analyzing virtual sound source attributes using a binaural auditory model</article-title>
          ,
          <source>Journal of the Audio Engineering Society</source>
          <volume>47</volume>
          (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kennedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Abhayapala</surname>
          </string-name>
          ,
          <article-title>HRTF measurement on KEMAR manikin</article-title>
          ,
          <source>Annual Conference of the Australian Acoustical Society 2009 - Acoustics 2009: Research to Consulting</source>
          (
          <year>2009</year>
          )
          <fpage>10</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rothbucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Veprek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Paukner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Habigt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Diepold</surname>
          </string-name>
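          ,
          <article-title>Comparison of head-related impulse response measurement approaches</article-title>
          ,
          <source>The Journal of the Acoustical Society of America</source>
          <volume>134</volume>
          (
          <year>2013</year>
          )
          <fpage>EL223</fpage>
          -
          <lpage>EL229</lpage>
          . doi:10.1121/1.4813592.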
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>