<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Selective 3D Audio Presentation System for a Moving Individual Tracking Using a Pair of Parametric Speakers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hiroyuki Minematsu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hyuma Auchi</string-name>
          <email>auchi.hyuma.24@aclab.esys.tsukuba.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ayuto Togashi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rina Masuda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yohei Shida</string-name>
          <email>shida@sk.tsukuba.ac.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Keiichi Zempo</string-name>
          <email>zempo@iit.tsukuba.ac.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Graduate School of Science and Technology, University of Tsukuba</institution>
          ,
          <addr-line>Tsukuba</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Systems and Information Engineering, University of Tsukuba</institution>
          ,
          <addr-line>Tsukuba</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Computing, Institute of Science Tokyo</institution>
          ,
          <addr-line>Yokohama</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>With the rise of multi-user AR/MR environments, there is an increasing demand for auditory interfaces that can provide individualized spatial audio without adding to environmental noise or cognitive load, not only in public spaces but also in interactive digital contexts. Conventional loudspeakers disperse sound broadly, disturbing non-target listeners, while headphones isolate users from their surroundings and conflict with the open and multimodal nature of AR/MR. Parametric array loudspeakers (PALs) offer extremely high directivity; however, previous research has primarily focused on static users, leaving unresolved the technical challenge of achieving both selective acoustic intervention and stable sound localization for moving individuals in multi-user scenarios. Here, we present a system that employs a pair of tracking PALs, guided by depth-camera-based motion capture, to deliver spatialized 3D audio exclusively to a walking target. Two experiments evaluated (i) selective acoustic intervention and (ii) localization accuracy while walking. Results showed that only the tracked target consistently received stable sound pressure, while non-target individuals experienced minimal exposure, and that localization accuracy during walking was more stable compared with fixed PALs. These findings demonstrate that tracking PALs can simultaneously achieve selectivity and stability in dynamic multi-user environments, paving the way for immersive and noise-conscious auditory interfaces in public guidance and AR/MR applications.</p>
      </abstract>
      <kwd-group>
        <kwd>spatial audio</kwd>
        <kwd>auditory perception</kwd>
        <kwd>human motion tracking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the advancement of Mixed Reality (MR) and Augmented Reality (AR) technologies, interactive
experiences that merge real-world and virtual information are expanding across diverse domains [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
Among these, auditory information, particularly spatial audio in combination with visual information,
has been shown to greatly enhance realism and immersion [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6 ref7">3, 4, 5, 6, 7</xref>
        ]. In the context of MR and
AR, there is a growing demand for real-time and high-precision audio presentation that adapts to
user movements and gaze shifts [8, 9, 10]. Furthermore, MR/AR systems designed for multiple users
have recently emerged [11, 12]. Consequently, in addition to highly accurate spatial audio rendering,
it is becoming increasingly important to deliver audio individually tailored to each user within the
same physical space. Applications of spatial audio extend beyond MR/AR to public environments such
as train stations, commercial facilities, exhibitions, and digital signage in urban spaces [13, 14, 15].
However, conventional loudspeaker-based methods indiscriminately diffuse sound to large audiences,
where unintended listeners may perceive it as unwanted noise, thereby increasing cognitive load and
stress [16, 17, 18]. Thus, there is a growing necessity for selective and high-precision spatial audio
presentation tailored to individual users, not only in MR/AR but also in public spaces.
      </p>
      <p>Conventional approaches to spatial audio reproduction with external loudspeakers have been
designed for fixed configurations, targeting sound localization within a so-called “sweet spot.” However,
in dynamic scenarios where users are walking or changing orientation, maintaining consistent
localization is difficult, often degrading the accuracy of audio presentation [19, 20]. In addition, MR/AR use
cases frequently involve multiple simultaneous users, necessitating techniques to provide
individualized audio information. To address this issue, the parametric array loudspeaker (PAL), which exhibits
extremely high directivity, has drawn attention. For example, Kuratomo et al. controlled an ultrasonic
directional loudspeaker toward both ears of a static user, presenting spatial audio exclusively to that
individual [21, 22]. However, these studies were limited to single, stationary users, and the effectiveness
of selective presentation in multi-user environments or the stability of localization during walking has
not been sufficiently verified.</p>
      <p>To address these challenges, this study develops a system that recognizes the positions and postures
of multiple users in real time and dynamically steers directional loudspeakers to follow a target user.
Even in scenarios where multiple users are walking within the same space, the system delivers audio
exclusively to the designated individual, while suppressing sound leakage to non-target users and
maintaining stable spatial audio presentation.</p>
      <p>The objectives of this study are to investigate the following research questions:
RQ1: In walking scenarios with multiple users, can the proposed system achieve selective audio
delivery to a specific individual?
RQ2: To what extent can the proposed system maintain sound localization accuracy for users while
walking?</p>
      <p>By addressing these research questions, we verify the effectiveness of the proposed system in
dynamic and multi-user environments. This work demonstrates the potential for a new form of spatial
audio presentation applicable to MR/AR contexts involving multiple users and mobile conditions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Selective Acoustic Intervention</title>
        <p>With the rise of AR and MR technologies, scenarios involving multiple users working within the same
space have become increasingly common, thereby requiring methods for individualized audio
presentation [11, 12]. In public spaces, conventional audio presentation has primarily relied on loudspeakers,
which can result in increased stress and cognitive load due to noise [16, 17, 18].</p>
        <p>To address this issue, extensive research has been conducted on selective acoustic presentation
methods that deliver sound only to a specific area or individual. A promising technology for achieving such
selectivity is the parametric array loudspeaker (PAL), which utilizes ultrasonic waves to generate
audible sound in midair, thereby forming highly directional acoustic beams [23]. Many studies have
focused on enhancing PAL directivity. For example, Fan et al. proposed a method that employs
phase-randomized arrays to suppress grating lobes, improving beam-steering accuracy toward the desired
direction [24]. Kinjo et al. developed a spot-delivery system capable of controlling the irradiation point
with an error within ±1° based on 3D position estimation using stereo cameras,
demonstrating that users could clearly perceive audio beams targeted at themselves [25]. Furthermore, Zhuang
et al. proposed a sound-zone control method using a minimal setup of a single PAL to generate
multiple audible zones, achieving performance comparable to conventional multi-loudspeaker systems [26].
In addition, simulation studies on sound-zone control using PAL arrays have shown superior
performance and robustness compared to electrodynamic loudspeakers under high-frequency and low-SNR
conditions [27].</p>
        <p>However, most of these studies have focused on static single users. Systematic verification of
whether it is possible to continuously track and selectively intervene with a specific user in a multi-user
environment, while suppressing sound leakage to surrounding individuals, has not been sufficiently
conducted. In this study, we quantitatively evaluate the feasibility of selective intervention in
environments where multiple users are in motion using the proposed system.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Spatial Audio</title>
        <p>
          Spatial audio is an indispensable technology for creating high levels of presence and immersion, and
its importance is widely recognized in MR/AR research [
          <xref ref-type="bibr" rid="ref7">7, 10</xref>
          ]. For example, Kern et al. demonstrated
that incorporating natural environmental sounds and footsteps synchronized with user actions into
VR environments significantly enhances presence and realism, proving that spatial audio complements
visual information and deepens immersion [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Similarly, Rumiński et al. reported that in an AR
hidden-object search task, spatialized sound presentation significantly improved task completion speed and
efficiency compared to non-spatial conditions, demonstrating the effectiveness of spatial audio for
navigation support in AR [10].
        </p>
        <p>The two primary approaches to spatial audio presentation in AR/MR environments are headphones
and loudspeakers. While headphones provide highly accurate localization, they block real-world
sounds and hinder the fusion with reality, which is central to AR/MR. Loudspeakers, on the other hand,
offer a more open auditory experience but suffer from the limitation that accurate sound perception
is confined to a narrow sweet spot [19]. Furthermore, cross-talk, in which sound from one loudspeaker
reaches the opposite ear, is known to degrade localization accuracy [20].</p>
        <p>One approach to addressing these issues is to leverage the high directivity of PALs. Kuratomo et
al. demonstrated that, by steering directional loudspeakers toward both ears of a static user based on
depth-camera position estimation, it is possible to present spatialized sound exclusively to a specific
individual while maintaining stable localization even during head rotation [21, 22]. Nakayama et al.
proposed a method that combines PALs with conventional loudspeakers, controlling the ratio of direct
sound to reverberation in order to manipulate perceived source distance and reduce cross-talk [28].</p>
        <p>However, these prior studies primarily focused on static users. The extent to which sound
localization accuracy and tracking performance are maintained for users while walking remains insufficiently
explored. Therefore, in this study, we use the proposed system to continuously present spatial audio
to users in motion and quantitatively evaluate localization accuracy under dynamic conditions.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Method</title>
      <sec id="sec-3-1">
        <title>3.1. Overview</title>
        <p>In this study, we developed a spatial audio system designed to track a specific individual among
multiple users in dynamic environments involving movement and rotation, and to present a clear sound
image exclusively to that person. The system estimates the positions of the user’s ears and head
orientation, and based on this information, it controls the direction of parametric array loudspeakers
(PALs) in real time, thereby realizing selective acoustic presentation that delivers sound precisely to
any arbitrary point in space.</p>
        <p>The system consists of two PALs, a depth camera (Azure Kinect), and FMOD Studio for audio
playback, all integrated under unified control in C++. The depth camera captures skeletal information of
users, enabling target user selection, ear coordinate computation, angle calculation, loudspeaker
orientation control, and physical sound playback to operate in real time. Furthermore, playback status is
controlled depending on the presence of a tracked user: when the target exits the field of view, audio is
immediately muted. This design enables precise sound image presentation to a single user even within
interactive spatial environments.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Tracking and Target Switching Logic</title>
        <p>Through skeletal tracking by the depth camera, multiple joint positions such as the user’s left and right
ears, head, and neck are obtained frame by frame. Target user selection is managed using the Body ID
assigned by Kinect; when tracking begins, the first detected person within the frame is registered as
the target.</p>
        <p>If the current Body ID is no longer detected in subsequent frames (e.g., when the user leaves the
camera’s field of view), the system reassigns the target to the first newly detected person. This
sequential switching ensures continuous audio presentation to one user even in dynamic public spaces where
people are frequently entering and exiting.</p>
        <p>If no individuals are detected at all, ongoing audio playback is paused. This prevents unintended
acoustic presentation to non-targets and minimizes sound leakage. When a user is detected again,
playback automatically resumes, providing autonomous responsiveness to user appearance and
disappearance.</p>
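        <p>The switching rule described in this section can be illustrated with a minimal Python sketch. It is not the authors' C++ implementation: the class name <monospace>TargetSelector</monospace>, the helper <monospace>playback_enabled</monospace>, and the assumption that Body IDs arrive as a list in detection order are all hypothetical choices made for illustration.</p>
        <preformat><![CDATA[
```python
from typing import Iterable, Optional


class TargetSelector:
    """Keeps one target Body ID and switches when that ID disappears."""

    def __init__(self) -> None:
        self.target_id: Optional[int] = None

    def update(self, detected_ids: Iterable[int]) -> Optional[int]:
        """Return the target ID for this frame, or None if nobody is visible.

        Assumption: detected_ids lists the Body IDs in detection order.
        """
        ids = list(detected_ids)
        if not ids:
            # Nobody in view: drop the target so that playback can be paused.
            self.target_id = None
        elif self.target_id not in ids:
            # Target lost (or never set): take the first detected person.
            self.target_id = ids[0]
        return self.target_id


def playback_enabled(target_id: Optional[int]) -> bool:
    """Audio plays only while a target is being tracked."""
    return target_id is not None
```
]]></preformat>
        <p>Calling <monospace>update</monospace> once per skeletal frame keeps the same person as the target for as long as that Body ID remains visible, and the returned value can gate playback directly.</p>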
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Speaker Angle Calculation and Control</title>
        <p>In this system, two PALs are fixed on the left and right sides of the Kinect camera. The left speaker is
controlled to direct sound toward the left ear, and the right speaker toward the right ear, respectively.
Each speaker is connected to an Arduino via an independent serial port, through which real-time
horizontal and vertical angles are transmitted.</p>
        <p>For control, the difference vector between each speaker position and the target ear coordinates is
computed. After applying rotational correction to transform into the local coordinate system
considering the speaker’s mounting angle, the angles are calculated as follows:
          <disp-formula>
            <label>(1)</label>
            <tex-math>\theta_{\mathrm{pan}} = \mathrm{clamp}\left(90 - \frac{180}{\pi} \cdot \arctan\left(\frac{x'}{z'}\right),\ 0,\ 180\right)</tex-math>
          </disp-formula>
          <disp-formula>
            <label>(2)</label>
            <tex-math>\theta_{\mathrm{tilt}} = \mathrm{clamp}\left(90 - \frac{180}{\pi} \cdot \arctan\left(\frac{y}{\sqrt{x'^2 + z'^2}}\right),\ 0,\ 180\right)</tex-math>
          </disp-formula>
        </p>
        <p>Here, (x′, z′) are the local coordinates after compensating for the physical tilt of the speaker, y is the vertical component of the difference vector, and
clamp(v, a, b) is a function restricting a value v to the range a–b. A rotation of +45° is applied for the left
speaker and −45° for the right speaker, so that the ear-directed vectors are recalculated in each
respective local coordinate system. In our implementation, the servo motors allowed the PALs to rotate
within a range of ±90° horizontally and ±45° vertically, which was sufficient to cover typical head and
body movements during walking.</p>
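        <p>As a worked illustration of Eqs. (1) and (2), the following Python sketch computes pan and tilt from a speaker position and an ear position. It reads the equations as pan = clamp(90 − arctan(x′/z′), 0, 180) and tilt = clamp(90 − arctan(y/√(x′² + z′²)), 0, 180), with angles in degrees. Several details are assumptions of the sketch rather than statements about the actual system: the axis convention (x to the right, y vertical, z forward), the sign convention of the mounting rotation, the function names, and the use of <monospace>atan2</monospace> in place of the plain arctangent of a ratio to avoid division by zero.</p>
        <preformat><![CDATA[
```python
import math
from typing import Sequence, Tuple


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def to_local(dx: float, dz: float, mount_deg: float) -> Tuple[float, float]:
    """Rotate the horizontal components (dx, dz) about the vertical axis.

    Assumption: the rotation sign convention is chosen for this sketch only.
    """
    a = math.radians(mount_deg)
    x_local = dx * math.cos(a) - dz * math.sin(a)
    z_local = dx * math.sin(a) + dz * math.cos(a)
    return x_local, z_local


def servo_angles(speaker: Sequence[float], ear: Sequence[float],
                 mount_deg: float) -> Tuple[float, float]:
    """Return (pan, tilt) in degrees, each clamped to [0, 180]."""
    # Difference vector from the speaker to the ear.
    dx, dy, dz = (e - s for e, s in zip(ear, speaker))
    x_l, z_l = to_local(dx, dz, mount_deg)
    # atan2 equals arctan(x'/z') when z' > 0 and stays defined at z' = 0.
    pan = clamp(90.0 - math.degrees(math.atan2(x_l, z_l)), 0.0, 180.0)
    tilt = clamp(90.0 - math.degrees(math.atan2(dy, math.hypot(x_l, z_l))),
                 0.0, 180.0)
    return pan, tilt
```
]]></preformat>
        <p>With this reading, an ear located straight ahead of the speaker at the same height yields pan = tilt = 90, which corresponds to the servo center position.</p>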
        <p>The computed results are converted to integer angles, serialized as strings, and transmitted through
the corresponding serial port to each speaker. The Arduino receives these values, generates PWM
signals, and drives the motors to control the physical speaker angles in real time.</p>
        <p>This computation is continuously performed in synchronization with skeletal frame updates from
Kinect (approximately 30 Hz), enabling the speakers to follow the ear positions even while the user is
moving.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. FMOD-based Sound Playback</title>
        <p>FMOD Studio is used for sound playback, where pre-prepared audio files are looped in spatial audio
mode. The virtual sound source is fixed at a distance of 1.0 m in front of the user. The user’s head
position detected by Kinect is converted into meters and set as the FMOD listener position. Additionally,
the head orientation (yaw angle) is estimated from the quaternion (w, x, y, z) of the neck joint using
the following equation:
          <disp-formula>
            <label>(3)</label>
            <tex-math>\theta_{\mathrm{yaw}} = \mathrm{arctan2}\left(2(wy + xz),\ w^2 - x^2 - y^2 + z^2\right)</tex-math>
          </disp-formula>
        </p>
        <p>This yaw angle is used to update the forward vector of the FMOD listener, ensuring that sound
images are perceived from the correct direction relative to the user’s orientation. Thus, even when the
user rotates, the perception of a frontal sound image is maintained.</p>
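        <p>The yaw estimate of Eq. (3) can be checked numerically with a short Python sketch. It reads the equation as yaw = arctan2(2(wy + xz), w² − x² − y² + z²) for a unit quaternion (w, x, y, z), which is the rotation angle about the vertical y axis. The function names and the convention that yaw = 0 faces the +z direction are assumptions of this sketch, and no FMOD call is shown.</p>
        <preformat><![CDATA[
```python
import math
from typing import Tuple


def yaw_from_quaternion(w: float, x: float, y: float, z: float) -> float:
    """Yaw in radians about the vertical (y) axis for a unit quaternion."""
    return math.atan2(2.0 * (w * y + x * z), w * w - x * x - y * y + z * z)


def forward_vector(yaw: float) -> Tuple[float, float, float]:
    """Horizontal unit vector for a given yaw.

    Assumption: yaw = 0 faces +z and positive yaw turns toward +x.
    """
    return (math.sin(yaw), 0.0, math.cos(yaw))
```
]]></preformat>
        <p>For a pure rotation by an angle a about the y axis, the quaternion is (cos(a/2), 0, sin(a/2), 0) and the function returns a, so the resulting forward vector follows the head as it turns.</p>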
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Temporal Update Loop and Synchronization</title>
        <p>The entire control algorithm is executed in synchronization with body frame updates from Kinect,
operating at approximately 30 Hz. The following processes are repeated for each frame:
          <list list-type="order">
            <list-item><p>Determine the presence of a tracked target</p></list-item>
            <list-item><p>Acquire ear coordinates and compute horizontal/vertical angles</p></list-item>
            <list-item><p>Send loudspeaker control angles via serial communication</p></list-item>
            <list-item><p>Toggle audio playback ON/OFF</p></list-item>
            <list-item><p>Update FMOD listener position and orientation</p></list-item>
          </list>
        </p>
        <p>Through this processing loop, smooth and accurate sound image presentation is achieved, enabling
continuous and precise auditory tracking in environments where users are constantly in motion.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Evaluation of Selective Acoustic Interventions</title>
        <p>4.1.1. Method</p>
        <p>To evaluate the feasibility of selective acoustic intervention, we conducted a sound pressure
measurement experiment. The experimental setup is shown in Fig. 2(a). The experiment was conducted in an
anechoic chamber with dimensions of approximately 3.0 m × 3.0 m. Three participants were positioned
in the space, labeled as person 1, person 2, and person 3 from front to back. They were spaced 1.5 m
apart, and each walked a distance of 3.0 m at a speed of approximately 0.5 m/s.</p>
        <p>The experimental conditions included three setups: two-channel loudspeakers, fixed PAL, and
tracking PAL. In the fixed PAL condition, sound was directed toward the center of the space, approximately
1.5 m along the walking line of person 2. In the tracking PAL condition, the sound was continuously
directed in real time toward the ear position of person 2.</p>
        <p>As the test sound, we used white noise, which is frequently employed in prior studies on noise
evaluation, and the sound pressure level was adjusted to approximately 55 dB SPL at the point of maximum
sound pressure, corresponding to typical voice-guidance levels [29, 30, 13]. An omnidirectional
microphone (Behringer ECM8000) was used for measurement, which each participant held in front of their
face. The recorded sound pressure levels were used to evaluate sound leakage to person 1 and person 3
when the sound image was directed at person 2.</p>
        <p>[Figure 2: Experimental setups. (a) Expt. 1: sound pressure. (b) Expt. 2: sound image localization. Conditions shown: tracking parametric speakers, fixed parametric speakers, and 2ch loudspeaker.]</p>
        <p>4.1.2. Results</p>
        <p>The results of this experiment are shown in Fig. 3. The graph illustrates the sound pressure distribution
while participants walked from the starting point (0 m) to the end point (3.0 m).</p>
        <p>In the two-channel loudspeaker condition (Fig. 3(a)), person 2 consistently exhibited a sound
pressure level of approximately −30 dB, while person 1 and person 3 started near −30 dB but gradually
experienced increasing sound pressure as they walked. In the fixed PAL condition (Fig. 3(b)), person 1
and person 3 consistently experienced sound pressure levels below −30 dB, while person 2 showed
sound pressure above −30 dB only around the region 1–2 m, where the PAL beam was directed.
Finally, in the tracking PAL condition (Fig. 3(c)), person 1 and person 3 were almost always below −30 dB,
while person 2, the designated target, consistently experienced sound pressure levels above −30 dB.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation of Sound Localization while Walking</title>
        <p>4.2.1. Method</p>
        <p>To evaluate sound localization accuracy, we conducted an experiment in the same anechoic chamber as
Experiment 1, where sound sources were presented from different directions, and participants indicated
the direction they perceived. The experimental setup is shown in Fig. 2(b). Taking the
camera’s frontal direction as φ = 0°, pink noise, which is commonly used in localization experiments [31], was
presented from five angles: 0°, ±45°, and ±90°. The sound pressure level was set to approximately
55 dB SPL, consistent with Experiment 1.</p>
        <p>While walking, participants indicated the perceived direction of the sound image in real time using
an evaluation application. The participants included three males and one female (N = 4). The same
three conditions as in Experiment 1 were compared: tracking PAL (proposed method), fixed PAL, and
two-channel loudspeakers.</p>
        <p>4.2.2. Results</p>
        <p>The results of the sound localization accuracy experiment are shown in Fig. 4 and Tab. 1. The graph
illustrates the localization error angles perceived by participants while walking from 0 m to 3.0 m for
each of the five presentation angles (φ = 0°, ±45°, ±90°). Table 1 presents the root mean square error
(RMSE) for each angle under each condition, as well as the overall average RMSE across all directions.</p>
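        <p>For readers who wish to reproduce this kind of metric, the following Python sketch shows one way to compute an RMSE over perceived directions for a single presentation angle. The function names are hypothetical, and wrapping the error to the interval [−180°, 180°) is an assumption of the sketch; the text does not state how errors were wrapped or aggregated across participants.</p>
        <preformat><![CDATA[
```python
import math
from typing import Sequence


def angular_error(perceived_deg: float, presented_deg: float) -> float:
    """Signed error in degrees, wrapped to the interval [-180, 180)."""
    return (perceived_deg - presented_deg + 180.0) % 360.0 - 180.0


def rmse(perceived_deg: Sequence[float], presented_deg: float) -> float:
    """Root mean square of the wrapped errors for one presentation angle."""
    errors = [angular_error(p, presented_deg) for p in perceived_deg]
    return math.sqrt(sum(e * e for e in errors) / len(errors))
```
]]></preformat>
        <p>An overall average of the kind reported in the table would then be the mean of the per-angle values, again under the assumption that each angle is weighted equally.</p>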
        <p>In the two-channel loudspeaker condition (Fig. 4(a)), the overall average RMSE was the smallest
among the three conditions, at 26.97. In particular, at 0∘, the RMSE was the lowest and most stable
compared to the other two conditions. For ±90∘ and ±45∘, the error decreased gradually as participants
approached the sound source from the initial position.</p>
        <p>In the fixed PAL condition (Fig. 4(b)), the overall average RMSE was the largest among the three
conditions, at 39.08. At 0∘, errors were relatively small as the user passed through the beam’s focal
region (1.5–2.0 m), but beyond that range, the error increased sharply, producing a distinctive pattern
in the graph.</p>
        <p>In the proposed tracking PAL condition (Fig. 4(c)), the overall average RMSE was 30.54. Although
this was larger than that of the two-channel loudspeakers, the errors at 90∘ and −45∘ were smaller.
Furthermore, the error variation remained relatively stable across all presentation angles.</p>
        <p>The violin plots in Fig. 5 further illustrate these results. In the fixed PAL condition (Fig. 5(b)), the
distribution exhibited large variability for all presentation angles. Additionally, for 0∘ in the tracking
PAL condition (Fig. 5(c)), the distribution was wider than that of the fixed PAL condition.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. Selective Acoustic Intervention (Answer to RQ1)</title>
        <p>RQ1: Can the proposed system achieve selective acoustic intervention for a specific user while multiple
individuals are walking?</p>
        <p>In this experiment, the feasibility of selective acoustic intervention was evaluated by analyzing the
sound pressure recorded using omnidirectional microphones held in front of participants’ faces as they
walked at equal intervals in the anechoic chamber. Three speaker conditions were compared:
two-channel loudspeakers, fixed PAL, and tracking PAL.</p>
        <p>As shown in Fig. 3(a), in the two-channel loudspeaker condition, the sound pressure for person 2
remained nearly constant at approximately −30 dB. This result can be attributed to the broad
directivity of loudspeakers, which distribute sound evenly throughout the space, leading to uniform sound
pressure regardless of distance. In contrast, for person 1 and person 3, the sound pressure increased as
they approached the speakers during walking, reaching nearly −30 dB at around 2.0–3.0 m. This was
likely due to their proximity to the speakers installed at both ends. These findings suggest that
two-channel loudspeakers disperse sound across the entire space, making selective acoustic intervention
for a single individual difficult.</p>
        <p>As shown in Fig. 3(b), in the fixed PAL condition where sound was directed at person 2 around
1.5 m, the sound pressure for person 1 and person 3 consistently remained below −30 dB. In contrast,
only person 2 exhibited sound pressure levels above −30 dB within the 1.0–2.0 m range, where the
PAL beam was directed. This demonstrates that the high directivity of PALs enables selective acoustic
intervention for person 2 within this range. However, for person 2, sound pressure fell below −30 dB in
the 0–1.0 m and 2.0–3.0 m ranges, indicating reduced audibility. Thus, selective presentation by fixed
PALs is limited to approximately ±0.5 m around the beam’s focal point.</p>
        <p>As shown in Fig. 3(c), in the tracking PAL condition, the sound pressure for person 1 and person 3
consistently remained below −30 dB, while the designated target, person 2, always exhibited sound
pressure above −30 dB. These results demonstrate that tracking PALs can provide continuous and
selective acoustic intervention during walking, regardless of the target’s distance.</p>
        <p>In summary, the proposed tracking PAL method enables selective acoustic intervention more
effectively than two-channel loudspeakers or fixed PALs. Importantly, even under multi-user walking
conditions, it allows continuous selective presentation while reducing noise for surrounding
individuals.</p>
        <p>RQ2: To what extent can the proposed system maintain sound localization accuracy when the target
user is walking?</p>
        <p>In this experiment, the maintenance of sound localization accuracy during walking was evaluated
by comparing three conditions: two-channel loudspeakers, fixed PAL, and tracking PAL. While the
overall average RMSE indicated that conventional two-channel loudspeakers achieved the best results,
differences in characteristics beyond simple accuracy rankings were revealed.</p>
        <p>As shown in Fig. 4 and Tab. 1(a), the two-channel loudspeaker condition achieved the lowest
average RMSE of 26.97 among the three conditions. This was primarily due to the exceptionally high
localization accuracy at 0∘. Additionally, because the experimental setup placed the loudspeakers at
the 3.0 m endpoints, participants experienced increased sound pressure as they approached the
speakers, thereby enhancing acoustic cues and contributing to improved accuracy. These results indicate
that conventional loudspeakers excel at frontal localization and can provide stable localization within
close proximity to the speakers (approximately 1.0 m in this experiment). However, the localization
error varied greatly with distance, making it difficult to consistently present sound from a fixed direction
to moving users.</p>
        <p>As shown in Fig. 4 and Tab. 1(b), the fixed PAL condition exhibited the largest average RMSE of
39.08. This large error was particularly evident for the 0∘ direction. In this condition, when users
passed through the sweet spot of the PAL beam (1.5–2.0 m), localization accuracy was high, but once
they moved beyond this region, the source physically shifted behind them, leading to a sharp increase
in error. For angles other than 0∘, the error decreased as users approached the physical loudspeaker
positions, similar to the two-channel condition. Thus, due to the highly restricted effective localization
range, fixed PALs are also unsuitable for presenting sound images to moving users.</p>
        <p>As shown in Fig. 4 and Tab. 1(c), the tracking PAL condition resulted in an average RMSE of 30.54,
which was higher than that of two-channel loudspeakers. However, the key feature of this method
was that the error variation remained relatively stable across the entire walking path, independent
of the user’s position. Unlike the other two conditions, where localization error fluctuated greatly
with distance, the proposed method continuously tracked the user and maintained consistent sound
pressure, thereby avoiding abrupt error changes. This explains why the overall RMSE was larger than
that of the two-channel loudspeakers, as the error remained constant rather than being reduced near
the speakers.</p>
        <p>The violin plots provide further insights. In Fig. 5(a), the 45∘ and −45∘ conditions showed relatively
small mean and median errors, yet the distributions were spread approximately ±45°, indicating that
some participants perceived the sound as frontal or lateral at different times. Moreover, Tab. 1(b)(c)
shows that at 0°, the tracking PAL achieved lower RMSE than the fixed PAL; however, Fig. 5(c) indicates
that the tracking PAL distribution was wider. While most errors clustered around 0°, a few outliers
degraded accuracy. These distribution issues are likely attributable to individual differences in HRTFs.
Given the small sample size of four participants, outliers had a greater impact on the results.</p>
        <p>Additionally, both Fig. 4 and Fig. 5 show that errors were particularly large for ±90∘ under all three
conditions. Although some participants reported errors closer to 0∘, suggesting the influence of outliers
due to the small sample size, the overall trend remained consistent. A major factor contributing to the
increased error is the small number of loudspeakers (two) used in this experiment. In contrast, Brungart
et al. [32] evaluated walking sound localization using 64 loudspeakers and reported average errors
below 9∘. This comparison suggests that the particularly low lateral localization accuracy observed
here was due to the limited number of loudspeakers, which made lateral localization more difficult
than in multi-speaker systems.</p>
        <p>From these findings, we conclude that the proposed tracking PAL system overcomes the
sweet-spot limitation of fixed PALs and achieves a lower overall localization error (average RMSE of 30.54
versus 39.08). Compared to conventional two-channel loudspeakers, it yielded a somewhat larger average
error (30.54 versus 26.97) while preventing sound diffusion to non-target users, an
advantage unique to parametric loudspeakers. Therefore, the proposed method represents a
promising approach for public spaces and multi-user AR/MR environments where both noise reduction and
accurate localization are desired. On the other hand, limitations such as reduced accuracy at ±90° and
inter-individual variability in HRTFs highlight areas for future improvement.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Limitations and Future Work</title>
      <p>This study demonstrated that the tracking PAL system is effective for selective acoustic intervention
toward a moving target user and for maintaining stable sound localization. However, several
limitations remain in both the evaluation and the system itself. Future work should address the following
points.</p>
      <p>First, there are challenges related to sound localization accuracy. In our experiments, the average
RMSE of the proposed method was larger than that of two-channel loudspeakers. One reason is that
this study represents a proof-of-concept stage, where the number of speakers and participants was
limited. Moreover, the tracking PAL maintained a consistent baseline error regardless of the user's
position, whereas the conventional loudspeakers showed extremely small errors near the speakers, which
lowered their average error relative to ours. Beyond these factors, however, the fundamental causes of this
baseline error remain unidentified. Possible contributing factors include system-wide latency from
skeletal estimation by Kinect to loudspeaker motor actuation, as well as hardware limitations such as
servo motor precision. As future work, these delays and hardware constraints should be quantitatively
measured to identify the primary sources of error. Based on these findings, improvements such as
faster and more precise tracking systems, or software-based compensation that predicts user motion
to reduce latency, could be implemented. Additionally, experiments with larger numbers of speakers
and participants will be necessary for more robust evaluation.</p>
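      <p>The software-based compensation suggested above can be illustrated with a minimal sketch. It assumes a
constant-velocity motion model, a 2-D floor-plane coordinate system, and a known end-to-end latency. None of
these are taken from the present implementation, and the function names (predict_position, pan_angle_deg)
and all numerical values are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def predict_position(prev, curr, dt, latency):
    """Extrapolate a 2-D position `latency` seconds ahead.

    Assumes constant velocity between the last two tracked samples,
    which are `dt` seconds apart (an illustrative model only).
    """
    vx = (curr[0] - prev[0]) / dt
    vy = (curr[1] - prev[1]) / dt
    return (curr[0] + vx * latency, curr[1] + vy * latency)


def pan_angle_deg(speaker, target):
    """Pan angle in degrees from the speaker's +y axis toward the target.

    Positive angles point toward +x. The axis convention is an assumption.
    """
    dx = target[0] - speaker[0]
    dy = target[1] - speaker[1]
    return math.degrees(math.atan2(dx, dy))


# Hypothetical scenario: the user walks along +x at 1 m/s, 2 m in front of
# the speaker, tracked at 30 Hz, with an assumed total latency of 100 ms.
prev, curr = (0.0, 2.0), (1.0 / 30.0, 2.0)
predicted = predict_position(prev, curr, dt=1.0 / 30.0, latency=0.1)
angle = pan_angle_deg((0.0, 0.0), predicted)
print(predicted, angle)
```
]]></preformat>
      <p>In this sketch the servo would be commanded toward the predicted position rather than the last measured
one. A real system would likely need filtering of the velocity estimate, since raw frame-to-frame differences
of skeletal data tend to be noisy.</p>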
      <p>Second, the simplicity of the experimental environment poses limitations. The experiments were
conducted in an anechoic chamber, free of acoustic reflections and external noise, under the simplified
condition of linear walking. However, MR/AR environments and public spaces, which are the intended
applications of this system, are acoustically complex, filled with noise and reverberation. Moreover,
user movements in these contexts may include turning, stopping, and changing directions, beyond
simple linear motion. In particular, when users move freely, the distance from the speakers can vary
greatly, and if they move too far away, the perceived loudness may decrease. A potential solution is to
install multiple PAL units at elevated positions such as the ceiling and dynamically switch or hand over
the active speaker based on the tracked user’s position. This multi-speaker handover approach would
enable the system to maintain audibility and scalability in larger spaces without relying solely on
distance compensation. Future research should therefore include evaluations in real-world environments
such as offices and commercial facilities, as well as assessments of the system’s tracking performance
and localization accuracy under more complex user behaviors.</p>
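      <p>The multi-speaker handover idea described above could be sketched as follows. This is only an
illustration under stated assumptions: units and the user are reduced to 2-D floor-plane positions, the
nearest unit is preferred, and a hysteresis margin (a hypothetical 0.3 m) is added to avoid rapid switching
near the midpoint. The function name select_unit and the layout are not part of the proposed system.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def select_unit(units, user, active=None, margin=0.3):
    """Return the ID of the PAL unit that should serve the user.

    units  : dict mapping unit ID to its (x, y) position in metres
    user   : tracked (x, y) position of the target user
    active : ID of the currently active unit, or None
    margin : hysteresis in metres (assumed value) to avoid rapid switching
    """
    dist = {uid: math.dist(pos, user) for uid, pos in units.items()}
    nearest = min(dist, key=dist.get)
    if active not in dist:
        return nearest
    # Hand over only when another unit is closer by more than the margin.
    if dist[active] - dist[nearest] > margin:
        return nearest
    return active


# Hypothetical layout: two units 4 m apart, user walking from A toward B
# in 0.5 m steps.
units = {"A": (0.0, 0.0), "B": (4.0, 0.0)}
active = None
trace = []
for step in range(9):
    user = (0.5 * step, 0.0)
    active = select_unit(units, user, active)
    trace.append(active)
print(trace)
```
]]></preformat>
      <p>With this layout the active unit stays at A up to the midpoint and switches to B once B is clearly
closer. Other criteria, such as the steering range of each unit or occlusion by other people, might be more
appropriate in practice.</p>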
      <p>Finally, a limitation lies in target selection and switching in multi-user environments. In this study,
target identification relied solely on Kinect’s skeletal tracking. This approach was chosen for its
robustness in detecting users even when faces were not visible, its efficiency in terms of low computational
load and real-time performance, and its anonymity in avoiding personal identification, thereby respecting
privacy. Based on this policy, the system reassigns the target to the oldest detected ID when the
current target leaves the detection area. However, this mechanism does not allow for intentional
dynamic target selection. In crowded environments where users frequently enter and exit, maintaining
selectivity becomes difficult. To address this issue, future work may explore intuitive interfaces such
as gesture-based or gaze-based target switching. Furthermore, attention should also be given to the
act of delivering sound itself. Potential directions include methods to reduce discomfort when sound
unintentionally reaches non-target users, and approaches to deliver notifications perceivable only by
the intended recipient. These aspects highlight opportunities for further exploration in the design of
acoustic presentation.</p>
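      <p>The reassignment policy described above (falling back to the oldest detected ID when the current target
leaves the detection area) can be expressed in a few lines. The sketch below is an interpretation, not the
authors' code: "oldest" is assumed to mean the earliest first-detection time, and the function name
update_target and the data layout are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def update_target(current, first_seen, visible):
    """Return the body ID to track in this frame.

    current    : ID of the current target, or None
    first_seen : dict mapping body ID to the time it was first detected
    visible    : set of body IDs detected in this frame
    """
    if current in visible:
        return current  # keep the target while it stays in view
    if not visible:
        return None  # nobody to track
    # "Oldest detected ID" is interpreted as the earliest first detection.
    return min(visible, key=lambda body_id: first_seen[body_id])


# Hypothetical first-detection times in seconds.
first_seen = {3: 10.0, 5: 12.5, 8: 11.0}
kept = update_target(3, first_seen, {3, 5, 8})
reassigned = update_target(3, first_seen, {5, 8})
empty = update_target(8, first_seen, set())
print(kept, reassigned, empty)
```
]]></preformat>
      <p>Because the rule depends only on detection order, it offers no way for a user or operator to choose the
target, which is the limitation that gesture-based or gaze-based switching would address.</p>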
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>This study addressed the challenge of delivering selective and stable spatial audio to specific moving
users in dynamic environments with multiple people, such as public spaces and commercial facilities.
To tackle this problem, we developed and evaluated a system that dynamically steers a pair of PALs
toward a user’s ears, based on real-time tracking with Kinect.</p>
      <p>The evaluation of the proposed system yielded two key findings. First, regarding selective acoustic
intervention (RQ1), the system successfully delivered sound exclusively to a specific walking user while
minimizing sound leakage to surrounding individuals. Second, with respect to maintaining sound
localization accuracy (RQ2), although the proposed method did not outperform conventional two-channel
loudspeakers in terms of average error, it provided stable and consistent localization performance that
was independent of user position, unlike existing methods.</p>
      <p>In summary, the contribution of this study lies in demonstrating the effectiveness of a method that
simultaneously fulfills two essential values in acoustic presentation for moving users: “selectivity” and
“stability.” Ensuring stability such that auditory information is not disrupted by user motion is
critically important for all forms of dynamic acoustic interaction. The proposed tracking PAL system is
expected to serve as a foundational technology for next-generation acoustic interfaces in dynamic and
multi-user environments. Potential applications include voice guidance in public spaces such as train
stations, providing individualized navigation instructions while reducing noise; personalized
advertisements in commercial facilities; exhibition and museum spaces offering visitor-specific audio
explanations; and AR/MR experiences such as collaborative design sessions, remote maintenance support, or
educational field trips in shared mixed reality environments, where selective auditory presentation
enhances immersion without disturbing others.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Gemini for grammar and
spelling checks, paraphrasing, and translation. After using these tools, the authors reviewed and
edited the content as needed and take full responsibility for the publication’s content.</p>
      <p>[7] I. Mavridou, E. Seiss, G. Ugazio, M. Harpster, P. Brown, S. Cox, F. Panchevski, C. Erie,
D. Lopez Jr, R. Copt, et al., “Did you hear that?”: Software-based spatial audio enhancements increase
self-reported and physiological indices on auditory presence and affect in virtual reality,
Frontiers in Virtual Reality 6 (2025) 1629908.
[8] N. Kuratomo, H. Uchida, T. Ebihara, N. Wakatsuki, K. Mizutani, K. Zempo, Spatialphonic360:
Accuracy of the arbitrary sound image presentation using surrounding parametric speakers, in:
Companion Proceedings of the 2022 Conference on Interactive Surfaces and Spaces, ISS
Companion ’22, Association for Computing Machinery, New York, NY, USA, 2022, p. 32–36. URL:
https://doi.org/10.1145/3532104.3571462. doi:10.1145/3532104.3571462.
[9] X. Su, J. E. Froehlich, E. Koh, C. Xiao, Sonifyar: Context-aware sound generation in augmented
reality, in: Proceedings of the 37th Annual ACM Symposium on User Interface Software and
Technology, 2024, pp. 1–13.
[10] D. Rumiński, An experimental study of spatial sound usefulness in searching and
navigating through ar environments, Virtual Real. 19 (2015) 223–233. URL:
https://doi.org/10.1007/s10055-015-0274-4. doi:10.1007/s10055-015-0274-4.
[11] S. Feng, W. He, X. Zhang, M. Billinghurst, S. Wang, A comprehensive survey on ar-enabled local
collaboration, Virtual Reality 27 (2023) 2941–2966.
[12] B. Sonkoly, B. G. Nagy, J. Dóka, Z. Kecskés-Solymosi, J. Czentye, B. Formanek, D. Jocha, B. P. Gerő,
An edge cloud based coordination platform for multi-user ar applications, Journal of Network
and Systems Management 32 (2024) 40.
[13] N. Kuratomo, H. Miyakawa, T. Ebihara, N. Wakatsuki, K. Mizutani, K. Zempo, Attracting effect
of pinpoint auditory glimpse on digital signage, IEEE Access 11 (2023) 42779–42794.
[14] M. Glaser, L. Hug, S. Werner, S. Schwan, Spatial versus normal audio guides in exhibitions:
Cognitive mechanisms and effects on learning, Educational technology research and development 73
(2025) 169–198.
[15] N. Kuratomo, H. Miyakawa, S. Masuko, T. Yamanaka, K. Zempo, Effects of acoustic comfort and
advertisement recallability on digital signage with on-demand pinpoint audio system, Applied
Acoustics 184 (2021) 108359.
[16] N. Kuratomo, B. Karic, C. Kray, Explicit vs. implicit auditory displays for managing people flow
in a pandemic: An exploratory study, Interacting with Computers (2025) iwaf008.
[17] M. Basner, W. Babisch, A. Davis, M. Brink, C. Clark, S. Janssen, S. Stansfeld, Auditory and
non-auditory effects of noise on health, The lancet 383 (2014) 1325–1332.
[18] R. Thompson, R. B. Smith, Y. B. Karim, C. Shen, K. Drummond, C. Teng, M. B. Toledano, Noise
pollution and human cognition: An updated systematic review and meta-analysis of recent evidence,
Environment international 158 (2022) 106905.
[19] G. Theile, On the naturalness of two-channel stereo sound, Journal of the Audio Engineering
Society 39 (1991) 761–767.
[20] M. Morimoto, Y. Ando, On the simulation of sound localization, Journal of the Acoustical Society
of Japan (e) 1 (1980) 167–174.
[21] N. Kuratomo, H. Uchida, T. Ebihara, N. Wakatsuki, K. Mizutani, K. Zempo, Spatialphonic360:
Accuracy of the arbitrary sound image presentation using surrounding parametric speakers, in:
Companion Proceedings of the 2022 Conference on Interactive Surfaces and Spaces, ISS
Companion ’22, Association for Computing Machinery, New York, NY, USA, 2022, p. 32–36. URL:
https://doi.org/10.1145/3532104.3571462. doi:10.1145/3532104.3571462.
[22] H. Uchida, N. Kuratomo, T. Ebihara, N. Wakatsuki, K. Zempo, Spatialphonic360: Acoustic space
for arbitrary sound image presentation based on both ears tracking, in: Adjunct Proceedings
of the 2022 ACM International Joint Conference on Pervasive and Ubiquitous Computing and
the 2022 ACM International Symposium on Wearable Computers, UbiComp/ISWC ’22 Adjunct,
Association for Computing Machinery, New York, NY, USA, 2023, p. 123–125. URL:
https://doi.org/10.1145/3544793.3560319. doi:10.1145/3544793.3560319.
[23] T. Nishiura, High-realistic acoustic sound field reproduction: Research trend with parametric
array loudspeaker, IEICE Fundamentals Review 10 (2016) 57–64.
[24] F. Fan, Y. Zhu, J. Yang, A grating lobe suppression method for a steerable parametric array
loudspeaker, in: Proceedings of Meetings on Acoustics, volume 52, Acoustical Society of America,
2023, p. 055002.
[25] S. Kinjo, S. Fujiwara, T. Fujioka, Y. Nagata, Parametric loudspeaker steering system using output
pointing interface, IEICE Technical Report; IEICE Tech. Rep. 120 (2020) 45–49.
[26] T. Zhuang, S. Li, F. Niu, J.-X. Zhong, J. Lu, Generating localized audible zones using a
single-channel parametric loudspeaker, arXiv preprint arXiv:2504.17440 (2025).
[27] T. Zhuang, J. Zhong, J. Lu, The feasibility of sound zone control using an array of parametric
array loudspeakers, in: 2024 IEEE 14th International Symposium on Chinese Spoken Language
Processing (ISCSLP), IEEE, 2024, pp. 66–70.
[28] M. Nakayama, T. Ekawa, T. Takahashi, T. Nishiura, Virtual sound source construction based
on direct-to-reverberant ratio control using multiple pairs of parametric-array loudspeakers and
conventional loudspeakers, Applied Sciences 15 (2025) 3744.
[29] Y. Deng, K. Chen, H. Li, J. Zhang, Matched standard samples method in laboratory listening tests
for annoyance perception, Applied Acoustics 224 (2024) 110103.
[30] D. Yunyun, L. Hao, D. Bo, L. Jianben, et al., The white noise standard sample method and
application for subjective noise evaluation, Xibei Gongye Daxue Xuebao/Journal of Northwestern
Polytechnical University 40 (2022) 746–754.
[31] S. Aoki, M. Toba, N. Tsujita, Sound localization of stereo reproduction with parametric
loudspeakers, Applied Acoustics 73 (2012) 1289–1295.
[32] D. S. Brungart, S. E. Kruger, T. Kwiatkowski, T. Heil, J. Cohen, The effect of walking on auditory
localization, visual discrimination, and aurally aided visual search, Human factors 61 (2019)
976–991.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Billinghurst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <article-title>Ar/mr remote collaboration on physical tasks: a review</article-title>
          ,
          <source>Robotics and Computer-Integrated Manufacturing</source>
          <volume>72</volume>
          (
          <year>2021</year>
          )
          <fpage>102071</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Mendoza-Ramírez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Tudon-Martinez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Félix-Herrán</surname>
          </string-name>
          , J. d. J.
          <string-name>
            <surname>Lozoya-Santos</surname>
          </string-name>
          , A. Vargas-Martínez,
          <article-title>Augmented reality: survey</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>13</volume>
          (
          <year>2023</year>
          )
          <fpage>10491</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kobayashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ueno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ise</surname>
          </string-name>
          ,
          <article-title>The effects of spatialized sounds on the sense of presence in auditory virtual environments: a psychological and physiological study</article-title>
          ,
          <source>Presence: Teleoperators and Virtual Environments</source>
          <volume>24</volume>
          (
          <year>2015</year>
          )
          <fpage>163</fpage>
          -
          <lpage>174</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Langiulli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Calbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sbravatti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Umiltà</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gallese</surname>
          </string-name>
          ,
          <article-title>The effect of surround sound on embodiment and sense of presence in cinematic experience: a behavioral and hd-eeg study</article-title>
          ,
          <source>Frontiers in Neuroscience</source>
          <volume>17</volume>
          (
          <year>2023</year>
          )
          <fpage>1222472</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hendrix</surname>
          </string-name>
          , W. Barfield,
          <article-title>The sense of presence within auditory virtual environments</article-title>
          ,
          <source>Presence: Teleoperators &amp; Virtual Environments</source>
          <volume>5</volume>
          (
          <year>1996</year>
          )
          <fpage>290</fpage>
          -
          <lpage>301</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Kern</surname>
          </string-name>
          , W. Ellermeier,
          <article-title>Audio in vr: Effects of a soundscape and movement-triggered step sounds on presence</article-title>
          ,
          <source>Frontiers in Robotics and AI</source>
          <volume>7</volume>
          (
          <year>2020</year>
          )
          <fpage>20</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Mavridou</surname>
          </string-name>
          , E. Seiss, G. Ugazio,
          <string-name>
            <given-names>M.</given-names>
            <surname>Harpster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Panchevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Erie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lopez</surname>
          </string-name>
          <string-name>
            <surname>Jr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Copt</surname>
          </string-name>
          , et al.,
          <article-title>“Did you hear that?”: Software-based spatial audio enhancements increase self-reported and physiological indices on auditory presence and affect in virtual reality</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>