=Paper=
{{Paper
|id=Vol-3297/short3
|storemode=property
|title=Auditory Indication System for Object-finding in Remote Collaborative Assistance
|pdfUrl=https://ceur-ws.org/Vol-3297/short3.pdf
|volume=Vol-3297
|authors=Takayuki Komoda,Fumina Utsumi,Takayoshi Yamada,Keiichi Zempo
|dblpUrl=https://dblp.org/rec/conf/apmar/KomodaUYZ22
}}
==Auditory Indication System for Object-finding in Remote Collaborative Assistance==
Takayuki Komoda¹, Fumina Utsumi², Takayoshi Yamada¹ and Keiichi Zempo³,*

¹ Graduate School of Science and Technology, University of Tsukuba, 1-1-1 Tennodai, 305-8573, Japan
² College of Engineering Systems, University of Tsukuba, 1-1-1 Tennodai, 305-8573, Japan
³ Faculty of Engineering, Information and Systems, University of Tsukuba, 1-1-1 Tennodai, 305-8573, Japan

APMAR'22: The 14th Asia-Pacific Workshop on Mixed and Augmented Reality, Dec. 02-03, 2022, Yokohama, Japan
* Corresponding author: zempo@iit.tsukuba.ac.jp (K. Zempo)
ORCID: 0000-0002-4012-4417 (T. Yamada); 0000-0003-2339-5298 (K. Zempo)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

¹ https://www.emro.who.int/control-and-preventions-of-blindness-and-deafness/announcements/global-estimates-on-visual-impairment.html
² https://www.bemyeyes.com/

Abstract

In this paper, we propose a remote collaboration system to assist people with visual impairments in object-finding. The system captures a 360-degree image centered on the person with a visual impairment and presents it as a panoramic image on the PC screen of a supporter in a remote location. By clicking on the PC screen, the supporter can present AR audio (an auditory indicator), superimposed on the real space that the person with a visual impairment perceives, using spatial sound. The auditory indicator enables the person with a visual impairment to understand the location of the object intuitively. We conducted experiments to clarify the effect of the proposed system on the performance time of an object-finding task and on the phrases used by supporters. The results showed that the auditory indicator enabled the supporter to guide a simulated person with a visual impairment using demonstrative pronouns such as "this" and "here".

Keywords: Augmented reality audio, Human Augmentations, Assistive technology, Remote collaboration

1. Introduction

According to a 2012 World Health Organization (WHO) report¹, there are approximately 285 million people with visual impairments (PVI) worldwide. PVI suffer many inconveniences in their daily lives because they cannot recognize visual information. Various assistive technologies have been studied, such as navigation aids [1, 2] and object-finding aids [3, 4, 5], and these technologies have made progress in assisting PVI. On the other hand, many problems remain before assistive technology can be widely used in daily life, such as system error rates and communication speed. Remote sighted assistance (RSA) [6] has received much attention in addressing these issues. RSA combines assistive technologies with remote assistance from sighted people, and systems such as VizWiz [5] and Be My Eyes² have been developed. However, these remote collaboration systems do not consider the differences in spatial perception between PVI and sighted people.
According to Tsai et al. [7], PVI and sighted people perceive space differently. For example, when describing a route between specific points in language, sighted people describe the route based on their own location, whereas PVI describe the route based on the locations of landmarks. Furthermore, for route descriptions between specific points, audio navigation created by PVI is subjectively more satisfactory for PVI than that created by sighted people [8]; the reason is that PVI felt more secure, oriented, and clear when the direction of travel was explained using landmarks. In addition, in conversation among sighted people, a speaker can point to an arbitrary area using demonstrative pronouns such as "there" and communicate without redundant expressions [9]. PVI, on the other hand, are less likely to use demonstrative pronouns in conversation because they cannot recognize visual information, and they tend to communicate more verbosely than sighted people [7]. Redundant communication has been suggested to be a stress factor in remote collaboration [10].

According to Kraut et al. [11], in remote collaboration among sighted people, sharing the local user's visual space with a supporter in a remote location caused the supporter to utter demonstrative pronouns. Gupta et al. [12] also showed that sharing both the local user's visual space and the remote supporter's gaze shortens the local user's task performance time in remote collaboration among sighted people, and that the supporter used more demonstrative pronouns than when only the visual space was shared.

In this paper, we propose an auditory indication system for remote collaboration with PVI in object-finding, which utilizes AR audio superimposed on the real space that the PVI user perceives. We conducted experiments to clarify the proposed system's effect on the performance time of an object-finding task and on the phrases used by supporters. The contributions of this paper are as follows:

• Proposal of an interface using AR audio.
• AR audio technology for remote collaboration.
• Suggestion of a new interaction between PVI and sighted people using AR audio.

2. Related work

This chapter describes studies on object-finding assistance for PVI. Kaul et al. [3] developed a mobile application that combines an object detection framework and spatial sound to assist PVI in navigation and object-finding. The application recognizes objects around the user with a smartphone camera and provides auditory feedback with spatial sound to explain the scene. The user evaluation of the auditory feedback was favorable; however, issues such as low object detection accuracy and a narrow detection range were mentioned.

In order to improve the reliability and usefulness of assistive technology, Bigham et al. developed VizWiz [5], which incorporates the assistance of sighted people. The PVI user takes pictures with a smartphone camera, asks questions of online supporters, and receives voice responses. The system does not make object detection errors, but the PVI user needs some prior idea of where the object is; if the object's location is unknown, the system is not easy to use.

Be My Eyes² is a mobile application that allows PVI to ask for assistance from a remote location using video chat on their smartphones. The PVI user does not need to be aware of the object's location in advance to receive assistance. However, the field of view that can be shared with the supporter is limited because a smartphone camera is used. Jones et al. [13] show that in remote collaboration using smartphones, users who receive video shared from a smartphone intentionally use information from the camera images when asking questions, but the narrow field of view and the inability to control the direction of the camera were found to be stress factors. In Be My Eyes, since the person being assisted is a PVI, the supporter cannot easily convey the information obtained from the visual images, which may cause redundant communication. Wang et al. [10] suggest that redundant communication is a stress factor in remote collaboration.

3. System design

3.1. System overview

The proposed system is shown in Figure 1. It consists of the PVI user, who receives support, and a supporter (a sighted person) who provides support from a remote location. A 360-degree camera is used to present a panoramic image of the environment around the PVI user on the screen of the supporter's PC. When the supporter clicks any point in the panoramic image, an auditory indicator is presented to the PVI user from the direction corresponding to the clicked point in the panoramic image. By presenting auditory indicators as spatial sound, the PVI user can perform object-finding intuitively.

3.2. Interface

In this section, we describe the configuration of the interface. The PVI user wears a 360-degree camera (RICOH THETA Z1) on the top of the head, which streams a panoramic image of the surrounding environment to the screen of the supporter's PC. The PVI user wears open-ear headphones (Sony LinkBuds) to listen to auditory indicators without them interfering with environmental sounds. The supporter finds the object (target) in the presented image of the PVI user's surroundings and clicks on the screen. The click places a sound source in the computer's three-dimensional space at the position corresponding to the target's position in real space. Based on the positional relationship between the PVI user and the sound source in the computer's three-dimensional space, spatial sounds are generated and presented to the PVI user.
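The paper does not specify how the clicked pixel is converted into a sound-source position, beyond stating that the source is placed in the computer's three-dimensional space. The following is a minimal sketch of one plausible mapping, assuming an equirectangular panorama and a fixed assumed source distance; the function names, image size, and the 1.5 m distance are illustrative, not taken from the paper.

```python
import math

def click_to_direction(u, v, width, height):
    """Map a click (pixel u, v) on an equirectangular panorama to a unit
    direction vector in the camera frame. Assumes the panorama spans
    360 degrees horizontally and 180 degrees vertically, with the image
    centre corresponding to straight ahead."""
    azimuth = (u / width - 0.5) * 2.0 * math.pi   # -pi..+pi, left to right
    elevation = (0.5 - v / height) * math.pi      # -pi/2..+pi/2, down to up
    x = math.cos(elevation) * math.sin(azimuth)   # right
    y = math.sin(elevation)                       # up
    z = math.cos(elevation) * math.cos(azimuth)   # forward
    return (x, y, z)

def place_sound_source(u, v, width, height, distance=1.5):
    """Place the auditory indicator's source along the clicked direction at
    an assumed fixed distance in metres, since a single panorama gives the
    direction of the target but not its depth."""
    x, y, z = click_to_direction(u, v, width, height)
    return (x * distance, y * distance, z * distance)

# Example: a click slightly to the right of centre on a 3840x1920 panorama.
print(place_sound_source(2100, 960, 3840, 1920))
```

In the proposed system, the resulting source position would then feed the spatial-sound rendering described in Section 3.3.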
3.3. Auditory indicator

The proposed system presents auditory indicators to the PVI user as spatial sound. Spatial sound means that meta-information such as distance, direction, and spatial extent is represented in the sound reproduction. The perception of such information, i.e. the apparent distance and direction of a sound, is called sound image localization. Sound image localization can be applied to a sound source by introducing volume, time, and frequency-response differences between the left and right channels. In this paper, to present spatial sound, the sound source is convolved for sound image localization using Unity and Steam Audio.
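The paper states only that the convolution is performed with Unity and Steam Audio. As a rough illustration of the underlying idea, that a sound can be lateralized by delaying and attenuating one ear's signal, here is a minimal sketch assuming a simple interaural time/level difference model; the head radius, gain law, and omission of any frequency-dependent filtering or HRTF are simplifying assumptions, not the actual implementation.

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
HEAD_RADIUS = 0.09       # m, rough head radius used for the delay model

def spatialize(mono, sample_rate, azimuth_rad):
    """Very rough interaural time/level difference (ITD/ILD) panning.
    azimuth_rad is the source direction: 0 = straight ahead,
    positive = to the listener's right."""
    # Woodworth-style approximation of the interaural time difference.
    itd = HEAD_RADIUS / SPEED_OF_SOUND * (azimuth_rad + np.sin(azimuth_rad))
    delay_samples = int(round(abs(itd) * sample_rate))

    # Simple level difference: the far ear is attenuated as the source moves sideways.
    near_gain = 1.0
    far_gain = 1.0 - 0.6 * abs(np.sin(azimuth_rad))

    delayed = np.concatenate([np.zeros(delay_samples), mono])[: len(mono)]
    if azimuth_rad >= 0:   # source on the right: left ear is delayed and quieter
        left, right = far_gain * delayed, near_gain * mono
    else:
        left, right = near_gain * mono, far_gain * delayed
    return np.stack([left, right], axis=1)   # (n_samples, 2) stereo buffer

# Example: a 1 kHz beep placed 60 degrees to the listener's right.
sr = 48000
t = np.arange(sr // 2) / sr
beep = 0.3 * np.sin(2 * np.pi * 1000 * t)
stereo = spatialize(beep, sr, np.deg2rad(60))
print(stereo.shape)
```

A full binaural renderer such as Steam Audio instead convolves the source with head-related transfer functions, which also capture the frequency-response differences mentioned above.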
[Figure 1: System configuration and overview of the laboratory.]

4. Experiment

In order to clarify the effect of the proposed system on the performance time of the object-finding task and on the phrases used by the supporter, we conducted an object-finding task experiment with a simulated person with visual impairment (SPVI) and a participant playing the role of a supporter in a remote location, based on the experiment by Kaul et al. [3].

Eight Japanese university students were randomly paired, with one participant acting as the SPVI and the other as the supporter. The SPVI wore an eye mask. Of the four pairs (eight participants), two pairs used the proposed system, which enables the sharing of 360-degree images, auditory indicators, and voice instructions (condition proposed). The remaining two pairs shared the 360-degree images and performed the task by voice instructions alone, without auditory indicators (condition control).

In the laboratory, desks are arranged in four directions around the SPVI, and seven objects (targets) are placed on the desks. Figure 1 shows an overview of the laboratory. In the experiment, the SPVI is instructed by the experiment supervisor on what to find (the target). Although there are seven possible targets, the SPVI is only informed of the target when the experiment supervisor instructs the SPVI to look for it. The SPVI cooperates with the supporter through the system to find the target and carry it to the designated position. Three trials were conducted per pair. Since the eye mask blocked the SPVI's vision, the experiment was conducted with the SPVI sitting on a swivel chair with wheels to avoid the risk of falling.

5. Results

5.1. Performance time

The task performance time was compared between the condition in which spatial sounds were presented and the condition in which they were not. The results are shown in Table 1. Participants in pairs #1 and #2 were presented with auditory indicators, while those in pairs #3 and #4 were not.

Table 1
Performance time. Pairs #1 and #2 used the auditory indicator; pairs #3 and #4 did not.

        Performance time [s]
Pair    1st     2nd     3rd     Ave
#1      56.3    41.0    -       48.7
#2      72.6    57.5    40.7    56.9
#3      52.0    74.3    -       63.2
#4      99.6    55.3    52.8    69.2

The remote collaboration could not be performed correctly in the third trial of pair #1 because of a problem with the output of the panoramic image, so that trial was excluded as an outlier. In addition, the third trial of pair #3 was also excluded in order to match the number of trials between conditions.

As a result of the experiment, the task performance time tends to be shortened when auditory indicators are presented.
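For reference (derived from Table 1, not stated explicitly in the paper): averaging the per-pair means gives roughly (48.7 + 56.9) / 2 ≈ 52.8 s for the auditory-indicator pairs (#1, #2) and (63.2 + 69.2) / 2 ≈ 66.2 s for the control pairs (#3, #4). With only two pairs per condition, this should be read as a tendency rather than a statistically tested difference.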
5.2. Effect on speech

It was confirmed that when the auditory indicator was not presented, the supporters used directional phrases such as "right/left/straight" to support the SPVI. When auditory indicators were presented, phrases such as "right/left/straight" were also observed; in addition, we could confirm the use of demonstrative pronouns such as "this" and "here". It was also confirmed that the SPVI responded correctly to the direction of the target when demonstrative pronouns were used. Table 2 shows the number of times demonstrative pronouns and the phrases "left", "right", and "straight" were used.

Table 2
Number of times demonstrative pronouns and the phrases "left", "right", and "straight" were used. Condition proposed: using the auditory indicator; condition control: not using the auditory indicator.

            Cond.: proposed         Cond.: control
            Total   Ave     Var     Total   Ave     Var
Pronoun     7       1.4     1.8     0       0       0
Left        3       0.6     0.8     7       1.4     0.8
Right       5       1.0     3.0     7       1.4     0.3
Straight    4       0.8     1.7     7       1.4     0.8

5.3. Discussion

The experiment results showed that the use of auditory indicators shortened the performance time of the object-finding task and enabled the use of demonstrative pronouns by the supporters.

We conclude that the reason for the shorter task performance time is that the auditory indicator enables an intuitive presentation of the target direction. In remote collaboration between sighted people, task efficiency was improved by sharing the supporter's gaze as a form of pointing [12]. We consider that a similar mechanism is responsible for the shorter task performance time here. We also note that demonstrative pronouns were used. Previous work [11, 12] confirmed that sharing the visual space enables supporters to use demonstrative pronouns, improving task efficiency; the results of this paper are consistent with those findings.

The use of demonstrative pronouns is considered to occur because the auditory indicators functioned in the same way as pointing does for sighted people, and a pseudo-visual space was shared between the SPVI and the supporters. This conclusion follows from the fact that, in the study of remote collaboration among sighted people by Kraut et al. [11], the supporter started to use demonstrative pronouns after the visual space was shared.
6. Conclusion and future works

In this paper, we proposed a system that uses auditory indicators to provide directions to assist PVI in object-finding through remote collaboration. In order to clarify the effectiveness of the proposed system, we conducted an object-finding task experiment and investigated the effect of the auditory indicator on the task performance time and on the phrases used by supporters. As a result of the experiment, it was confirmed that the task performance time tended to be shortened when auditory indicators were presented. It was also confirmed that the auditory indicator made the supporter use more demonstrative pronouns such as "this" and "here". We concluded that these results arise because the presentation of the target direction by auditory indicators functioned in the same way as gaze sharing in remote collaboration between sighted people. The results of this paper suggest that demonstrative pronouns can be used in remote collaboration with PVI, indicating a new interaction between PVI and sighted people using AR audio.

However, one limitation of this paper is that the participants of the experiment were SPVI. Since an SPVI is a blindfolded sighted person, they are not representative of PVI, and future studies should be conducted with PVI as participants. In the future, it is also necessary to investigate whether the presentation of auditory indicators plays the same role as gaze sharing and to clarify the mechanism by which demonstrative pronouns come to be used.

References

[1] M. H. A. Wahab, A. A. Talib, H. A. Kadir, A. Johari, A. Noraziah, R. M. Sidek, A. A. Mutalib, Smart cane: Assistive cane for visually-impaired people, arXiv preprint arXiv:1110.5156 (2011).
[2] A. Helal, S. E. Moore, B. Ramachandran, Drishti: An integrated navigation system for visually impaired and disabled, in: Proceedings of the Fifth International Symposium on Wearable Computers, IEEE, 2001, pp. 149–156.
[3] O. B. Kaul, K. Behrens, M. Rohs, Mobile recognition and tracking of objects in the environment through augmented reality and 3D audio cues for people with visual impairments, in: Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, 2021, pp. 1–7.
[4] M. Eckert, M. Blex, C. M. Friedrich, et al., Object detection featuring 3D audio localization for Microsoft HoloLens, in: Proceedings of the 11th International Joint Conference on Biomedical Engineering Systems and Technologies, volume 5, 2018, pp. 555–561.
[5] J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White, S. White, et al., VizWiz: nearly real-time answers to visual questions, in: Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, 2010, pp. 333–342.
[6] S. Lee, M. Reddie, C.-H. Tsai, J. Beck, M. B. Rosson, J. M. Carroll, The emerging professional practice of remote sighted assistance for people with visual impairments, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 2020, pp. 1–12.
[7] K.-Y. Tsai, Y.-H. Hung, R. Chen, E. Chang, Indoor spatial voice navigation for people with visual impairment and without visual impairment, in: Proceedings of the 2019 7th International Conference on Information and Education Technology, 2019, pp. 295–300.
[8] Y.-H. Hung, K.-Y. Tsai, E. Chang, R. Chen, Voice navigation created by VIP improves spatial performance in people with impaired vision, International Journal of Environmental Research and Public Health 19 (2022) 4138.
[9] Y. Hato, S. Satake, T. Kanda, M. Imai, N. Hagita, Pointing to space: modeling of deictic interaction referring to regions, in: 2010 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI), IEEE, 2010, pp. 301–308.
[10] B. Wang, Y. Liu, J. Qian, S. K. Parker, Achieving effective remote working during the COVID-19 pandemic: A work design perspective, Applied Psychology 70 (2021) 16–59.
[11] R. E. Kraut, D. Gergle, S. R. Fussell, The use of visual information in shared visual spaces: Informing the development of virtual co-presence, in: Proceedings of the 2002 ACM Conference on Computer Supported Cooperative Work, 2002, pp. 31–40.
[12] K. Gupta, G. A. Lee, M. Billinghurst, Do you see what I see? The effect of gaze tracking on task space remote collaboration, IEEE Transactions on Visualization and Computer Graphics 22 (2016) 2413–2422.
[13] B. Jones, A. Witcraft, S. Bateman, C. Neustaedter, A. Tang, Mechanics of camera work in mobile video collaboration, in: Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 2015, pp. 957–966.