=Paper=
{{Paper
|id=Vol-3297/short3
|storemode=property
|title=Auditory Indication System for Object-finding in Remote Collaborative Assistance
|pdfUrl=https://ceur-ws.org/Vol-3297/short3.pdf
|volume=Vol-3297
|authors=Takayuki Komoda,Fumina Utsumi,Takayoshi Yamada,Keiichi Zempo
|dblpUrl=https://dblp.org/rec/conf/apmar/KomodaUYZ22
}}
==Auditory Indication System for Object-finding in Remote Collaborative Assistance==
Takayuki Komoda¹, Fumina Utsumi², Takayoshi Yamada¹ and Keiichi Zempo³,*

¹ Graduate School of Science and Technology, University of Tsukuba, 1-1-1 Tennodai, 305-8573, Japan
² College of Engineering Systems, University of Tsukuba, 1-1-1 Tennodai, 305-8573, Japan
³ Faculty of Engineering, Information and Systems, University of Tsukuba, 1-1-1 Tennodai, 305-8573, Japan

APMAR'22: The 14th Asia-Pacific Workshop on Mixed and Augmented Reality, Dec. 02-03, 2022, Yokohama, Japan
* Corresponding author: zempo@iit.tsukuba.ac.jp (K. Zempo)
ORCID: 0000-0002-4012-4417 (T. Yamada); 0000-0003-2339-5298 (K. Zempo)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

¹ https://www.emro.who.int/control-and-preventions-of-blindness-and-deafness/announcements/global-estimates-on-visual-impairment.html
² https://www.bemyeyes.com/

Abstract

In this paper, we propose a remote collaboration system to assist people with visual impairments in object-finding. The system captures a 360-degree image centered on the person with a visual impairment and presents it as a panoramic image on the PC screen of a supporter in a remote location. By clicking on the PC screen, the supporter can present AR audio (an auditory indicator), superimposed on the real space that the person with a visual impairment perceives, using spatial sound. The auditory indicator enables the person with a visual impairment to understand the location of the object intuitively. We conducted experiments to clarify the effect of the proposed system on the performance time of an object-finding task and on the phrases used by supporters. The results showed that the auditory indicator enabled the supporter to guide a simulated person with a visual impairment using demonstrative pronouns such as "this" and "here".

Keywords: Augmented reality audio, Human Augmentations, Assistive technology, Remote collaboration

1. Introduction

According to a 2012 World Health Organization (WHO) report¹, there are approximately 285 million people with visual impairments (PVI) worldwide. PVI suffer many inconveniences in their daily lives because they cannot recognize visual information. Various assistive technologies have been studied, such as navigation aids [1, 2] and object-finding aids [3, 4, 5], and these technologies have made progress in assisting PVI. On the other hand, many problems remain before assistive technology can be widely used in daily life, such as system error rates and communication speed. Remote sighted assistance (RSA) [6] has received much attention in addressing these issues. RSA combines assistive technologies with remote assistance from sighted people, and systems such as VizWiz [5] and Be My Eyes² have been developed. However, these remote collaboration systems do not consider the differences in spatial perception between PVI and sighted people.
According to Tsai et al. [7], PVI and sighted people perceive space differently. For example, when describing a route between specific points in language, sighted people describe the route based on their own location, whereas PVI describe the route based on the locations of landmarks. Furthermore, for route descriptions between specific points, audio navigation created by PVI is subjectively more satisfactory for PVI than that created by sighted people [8]; the reason is that PVI felt more secure, oriented, and clear when the direction of travel was explained using landmarks. In addition, in conversation among sighted people, a speaker can point to an arbitrary area using demonstrative pronouns such as "there" and communicate without redundant expressions [9]. PVI, on the other hand, are less likely to use demonstrative pronouns in conversation because they cannot recognize visual information, and they tend to communicate more verbosely than sighted people [7]. Redundant communication has been suggested to be a stress factor in remote collaboration [10].

According to Kraut et al. [11], in remote collaboration among sighted people, sharing the local user's visual space with a supporter in a remote location caused the supporter to utter demonstrative pronouns. Gupta et al. [12] also showed that sharing both the local user's visual space and the remote supporter's gaze shortens the local user's task performance time in remote collaboration among sighted people, and that the supporter used more demonstrative pronouns than when only the visual space was shared.

In this paper, we propose an auditory indication system for remote collaboration with PVI in object-finding, which utilizes AR audio superimposed on the real space that the PVI user perceives. We conducted experiments to clarify the proposed system's effect on the performance time of an object-finding task and on the phrases used by supporters. The contributions of this paper are as follows:

• Proposal of an interface using AR audio.
• AR audio technology for remote collaboration.
• Suggestion of a new interaction between PVI and sighted people using AR audio.

2. Related work

This chapter describes studies on object-finding assistance for PVI. Kaul et al. [3] developed a mobile application that combines an object detection framework and spatial sound to assist PVI in navigation and object-finding. The application recognizes objects around the user with a smartphone camera and provides auditory feedback with spatial sound to explain the scene. The user evaluation of the auditory feedback was favorable; however, issues such as low object detection accuracy and a narrow detection range were mentioned.

In order to improve the reliability and usefulness of assistive technology, Bigham et al. developed VizWiz [5], which incorporates the assistance of sighted people. The PVI user takes pictures with a smartphone camera, asks questions of online supporters, and receives voice responses. The system does not make object detection errors, but the PVI user needs some prior idea of where the object is; if the object's location is unknown, the system is not easy to use.

Be My Eyes² is a mobile application that allows PVI to ask for assistance from a remote location using video chat on their smartphones. The PVI user does not need to be aware of the object's location in advance to receive assistance. However, the field of view that can be shared with the supporter is limited because a smartphone camera is used. Jones et al. [13] show that in remote collaboration using smartphones, users who receive video shared from a smartphone intentionally use information from the camera images when asking questions, but the narrow field of view and the inability to control the direction of the camera were found to be stress factors. In Be My Eyes, since the person being assisted is a PVI, the supporter cannot easily convey the information obtained from the visual images, which may cause redundant communication. Wang et al. [10] suggest that redundant communication is a stress factor in remote collaboration.

3. System design

3.1. System overview

The proposed system is shown in Figure 1. It consists of the PVI user, who receives support, and a supporter (a sighted person) who provides support from a remote location. A 360-degree camera is used to present a panoramic image of the environment around the PVI user on the screen of the supporter's PC. When the supporter clicks any point in the panoramic image, an auditory indicator is presented to the PVI user from the direction corresponding to the clicked point in the panoramic image. By presenting auditory indicators as spatial sound, the PVI user can perform object-finding intuitively.

3.2. Interface

In this section, we describe the configuration of the interface. The PVI user wears a 360-degree camera (RICOH THETA Z1) on the top of the head, which streams a panoramic image of the surrounding environment to the screen of the supporter's PC. The PVI user wears open-ear headphones (Sony LinkBuds) to listen to auditory indicators without them interfering with environmental sounds. The supporter finds the object (target) in the presented image of the PVI user's surroundings and clicks on the screen. The click places a sound source in the computer's three-dimensional space at the position corresponding to the target's position in real space. Based on the positional relationship between the PVI user and the sound source in the computer's three-dimensional space, spatial sounds are generated and presented to the PVI user.
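The paper does not specify how the clicked pixel is converted into a sound-source position, beyond stating that the source is placed in the computer's three-dimensional space. The following is a minimal sketch of one plausible mapping, assuming an equirectangular panorama and a fixed assumed source distance; the function names, image size, and the 1.5 m distance are illustrative, not taken from the paper.

```python
import math

def click_to_direction(u, v, width, height):
    """Map a click (pixel u, v) on an equirectangular panorama to a unit
    direction vector in the camera frame. Assumes the panorama spans
    360 degrees horizontally and 180 degrees vertically, with the image
    centre corresponding to straight ahead."""
    azimuth = (u / width - 0.5) * 2.0 * math.pi   # -pi..+pi, left to right
    elevation = (0.5 - v / height) * math.pi      # -pi/2..+pi/2, down to up
    x = math.cos(elevation) * math.sin(azimuth)   # right
    y = math.sin(elevation)                       # up
    z = math.cos(elevation) * math.cos(azimuth)   # forward
    return (x, y, z)

def place_sound_source(u, v, width, height, distance=1.5):
    """Place the auditory indicator's source along the clicked direction at
    an assumed fixed distance in metres, since a single panorama gives the
    direction of the target but not its depth."""
    x, y, z = click_to_direction(u, v, width, height)
    return (x * distance, y * distance, z * distance)

# Example: a click slightly to the right of centre on a 3840x1920 panorama.
print(place_sound_source(2100, 960, 3840, 1920))
```

In the proposed system, the resulting source position would then feed the spatial-sound rendering described in Section 3.3.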
3.3. Auditory indicator

The proposed system presents auditory indicators to the PVI user as spatial sound. Spatial sound means that meta-information such as distance, direction, and spatial extent is represented in the sound reproduction. The perception of such information, i.e. the apparent distance and direction of a sound, is called sound image localization. Sound image localization can be applied to a sound source by introducing volume, time, and frequency-response differences between the left and right channels. In this paper, to present spatial sound, the sound source is convolved for sound image localization using Unity and Steam Audio.
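The paper states only that the convolution is performed with Unity and Steam Audio. As a rough illustration of the underlying idea, that a sound can be lateralized by delaying and attenuating one ear's signal, here is a minimal sketch assuming a simple interaural time/level difference model; the head radius, gain law, and omission of any frequency-dependent filtering or HRTF are simplifying assumptions, not the actual implementation.

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
HEAD_RADIUS = 0.09       # m, rough head radius used for the delay model

def spatialize(mono, sample_rate, azimuth_rad):
    """Very rough interaural time/level difference (ITD/ILD) panning.
    azimuth_rad is the source direction: 0 = straight ahead,
    positive = to the listener's right."""
    # Woodworth-style approximation of the interaural time difference.
    itd = HEAD_RADIUS / SPEED_OF_SOUND * (azimuth_rad + np.sin(azimuth_rad))
    delay_samples = int(round(abs(itd) * sample_rate))

    # Simple level difference: the far ear is attenuated as the source moves sideways.
    near_gain = 1.0
    far_gain = 1.0 - 0.6 * abs(np.sin(azimuth_rad))

    delayed = np.concatenate([np.zeros(delay_samples), mono])[: len(mono)]
    if azimuth_rad >= 0:   # source on the right: left ear is delayed and quieter
        left, right = far_gain * delayed, near_gain * mono
    else:
        left, right = near_gain * mono, far_gain * delayed
    return np.stack([left, right], axis=1)   # (n_samples, 2) stereo buffer

# Example: a 1 kHz beep placed 60 degrees to the listener's right.
sr = 48000
t = np.arange(sr // 2) / sr
beep = 0.3 * np.sin(2 * np.pi * 1000 * t)
stereo = spatialize(beep, sr, np.deg2rad(60))
print(stereo.shape)
```

A full binaural renderer such as Steam Audio instead convolves the source with head-related transfer functions, which also capture the frequency-response differences mentioned above.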
[Figure 1: System configuration and overview of the laboratory.]

4. Experiment

In order to clarify the effect of the proposed system on the performance time of the object-finding task and on the phrases used by the supporter, we conducted an object-finding task experiment with a simulated person with visual impairment (SPVI) and a participant playing the role of a supporter in a remote location, based on the experiment by Kaul et al. [3].

Eight Japanese university students were randomly paired, with one participant acting as the SPVI and the other as the supporter. The SPVI wore an eye mask. Of the four pairs (eight participants), two pairs used the proposed system, which enables the sharing of 360-degree images, auditory indicators, and voice instructions (condition proposed). The remaining two pairs shared the 360-degree images and performed the task by voice instructions alone, without auditory indicators (condition control).

In the laboratory, desks are arranged in four directions around the SPVI, and seven objects (targets) are placed on the desks. Figure 1 shows an overview of the laboratory. In the experiment, the SPVI is instructed by the experiment supervisor on what to find (the target). Although there are seven possible targets, the SPVI is only informed of the target when the experiment supervisor instructs the SPVI to look for it. The SPVI cooperates with the supporter through the system to find the target and carry it to the designated position. Three trials were conducted per pair. Since the eye mask blocked the SPVI's vision, the experiment was conducted with the SPVI sitting on a swivel chair with wheels to avoid the risk of falling.

5. Results

5.1. Performance time

The task performance time was compared between the condition in which spatial sounds were presented and the condition in which they were not. The results are shown in Table 1. Participants in pairs #1 and #2 were presented with auditory indicators, while those in pairs #3 and #4 were not.

Table 1
Performance time. Pairs #1 and #2 used the auditory indicator; pairs #3 and #4 did not.

        Performance time [s]
Pair    1st     2nd     3rd     Ave
#1      56.3    41.0    -       48.7
#2      72.6    57.5    40.7    56.9
#3      52.0    74.3    -       63.2
#4      99.6    55.3    52.8    69.2

The remote collaboration could not be performed correctly in the third trial of pair #1 because of a problem with the output of the panoramic image, so that trial was excluded as an outlier. In addition, the third trial of pair #3 was also excluded in order to match the number of trials between conditions.

As a result of the experiment, the task performance time tends to be shortened when auditory indicators are presented.
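For reference (derived from Table 1, not stated explicitly in the paper): averaging the per-pair means gives roughly (48.7 + 56.9) / 2 ≈ 52.8 s for the auditory-indicator pairs (#1, #2) and (63.2 + 69.2) / 2 ≈ 66.2 s for the control pairs (#3, #4). With only two pairs per condition, this should be read as a tendency rather than a statistically tested difference.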
5.2. Effect on speech

It was confirmed that when the auditory indicator was not presented, the supporters used directional phrases such as "right/left/straight" to support the SPVI. When auditory indicators were presented, phrases such as "right/left/straight" were also observed; in addition, we could confirm the use of demonstrative pronouns such as "this" and "here". It was also confirmed that the SPVI responded correctly to the direction of the target when demonstrative pronouns were used. Table 2 shows the number of times demonstrative pronouns and the phrases "left", "right", and "straight" were used.

Table 2
Number of times demonstrative pronouns and the phrases "left", "right", and "straight" were used. Condition proposed: using the auditory indicator; condition control: not using the auditory indicator.

            Cond.: proposed         Cond.: control
            Total   Ave     Var     Total   Ave     Var
Pronoun     7       1.4     1.8     0       0       0
Left        3       0.6     0.8     7       1.4     0.8
Right       5       1.0     3.0     7       1.4     0.3
Straight    4       0.8     1.7     7       1.4     0.8

5.3. Discussion

The experiment results showed that the use of auditory indicators shortened the performance time of the object-finding task and enabled the use of demonstrative pronouns by the supporters.

We conclude that the reason for the shorter task performance time is that the auditory indicator enables an intuitive presentation of the target direction. In remote collaboration between sighted people, task efficiency was improved by sharing the supporter's gaze as a form of pointing [12]. We consider that a similar mechanism is responsible for the shorter task performance time here. We also note that demonstrative pronouns were used. Previous work [11, 12] confirmed that sharing the visual space enables supporters to use demonstrative pronouns, improving task efficiency; the results of this paper are consistent with those findings.

The use of demonstrative pronouns is considered to occur because the auditory indicators functioned in the same way as pointing does for sighted people, and a pseudo-visual space was shared between the SPVI and the supporters. This conclusion follows from the fact that, in the study of remote collaboration among sighted people by Kraut et al. [11], the supporter started to use demonstrative pronouns after the visual space was shared.
6. Conclusion and future works

In this paper, we proposed a system that uses auditory indicators to provide directions to assist PVI in object-finding through remote collaboration. In order to clarify the effectiveness of the proposed system, we conducted an object-finding task experiment and investigated the effect of the auditory indicator on the task performance time and on the phrases used by supporters. As a result of the experiment, it was confirmed that the task performance time tended to be shortened when auditory indicators were presented. It was also confirmed that the auditory indicator made the supporter use more demonstrative pronouns such as "this" and "here". We concluded that these results arise because the presentation of the target direction by auditory indicators functioned in the same way as gaze sharing in remote collaboration between sighted people. The results of this paper suggest that demonstrative pronouns can be used in remote collaboration with PVI, indicating a new interaction between PVI and sighted people using AR audio.

However, one limitation of this paper is that the participants of the experiment were SPVI. Since an SPVI is a blindfolded sighted person, they are not representative of PVI, and future studies should be conducted with PVI as participants. In the future, it is also necessary to investigate whether the presentation of auditory indicators plays the same role as gaze sharing and to clarify the mechanism by which demonstrative pronouns come to be used.

References

[1] M. H. A. Wahab, A. A. Talib, H. A. Kadir, A. Johari, A. Noraziah, R. M. Sidek, A. A. Mutalib, Smart cane: Assistive cane for visually-impaired people, arXiv preprint arXiv:1110.5156 (2011).
[2] A. Helal, S. E. Moore, B. Ramachandran, Drishti: An integrated navigation system for visually impaired and disabled, in: Proceedings of the Fifth International Symposium on Wearable Computers, IEEE, 2001, pp. 149–156.
[3] O. B. Kaul, K. Behrens, M. Rohs, Mobile recognition and tracking of objects in the environment through augmented reality and 3D audio cues for people with visual impairments, in: Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, 2021, pp. 1–7.
[4] M. Eckert, M. Blex, C. M. Friedrich, et al., Object detection featuring 3D audio localization for Microsoft HoloLens, in: Proceedings of the 11th International Joint Conference on Biomedical Engineering Systems and Technologies, volume 5, 2018, pp. 555–561.
[5] J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White, S. White, et al., VizWiz: nearly real-time answers to visual questions, in: Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, 2010, pp. 333–342.
[6] S. Lee, M. Reddie, C.-H. Tsai, J. Beck, M. B. Rosson, J. M. Carroll, The emerging professional practice of remote sighted assistance for people with visual impairments, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 2020, pp. 1–12.
[7] K.-Y. Tsai, Y.-H. Hung, R. Chen, E. Chang, Indoor spatial voice navigation for people with visual impairment and without visual impairment, in: Proceedings of the 2019 7th International Conference on Information and Education Technology, 2019, pp. 295–300.
[8] Y.-H. Hung, K.-Y. Tsai, E. Chang, R. Chen, Voice navigation created by VIP improves spatial performance in people with impaired vision, International Journal of Environmental Research and Public Health 19 (2022) 4138.
[9] Y. Hato, S. Satake, T. Kanda, M. Imai, N. Hagita, Pointing to space: modeling of deictic interaction referring to regions, in: 2010 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI), IEEE, 2010, pp. 301–308.
[10] B. Wang, Y. Liu, J. Qian, S. K. Parker, Achieving effective remote working during the COVID-19 pandemic: A work design perspective, Applied Psychology 70 (2021) 16–59.
[11] R. E. Kraut, D. Gergle, S. R. Fussell, The use of visual information in shared visual spaces: Informing the development of virtual co-presence, in: Proceedings of the 2002 ACM Conference on Computer Supported Cooperative Work, 2002, pp. 31–40.
[12] K. Gupta, G. A. Lee, M. Billinghurst, Do you see what I see? The effect of gaze tracking on task space remote collaboration, IEEE Transactions on Visualization and Computer Graphics 22 (2016) 2413–2422.
[13] B. Jones, A. Witcraft, S. Bateman, C. Neustaedter, A. Tang, Mechanics of camera work in mobile video collaboration, in: Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 2015, pp. 957–966.