=Paper=
{{Paper
|id=Vol-2960/paper9
|storemode=property
|title=Voicing Concerns: User-Specific Pitfalls of Favoring Voice over Text in Conversational Recommender Systems (Short paper)
|pdfUrl=https://ceur-ws.org/Vol-2960/paper9.pdf
|volume=Vol-2960
|authors=Alain D. Starke,Minha Lee
|dblpUrl=https://dblp.org/rec/conf/recsys/StarkeL21
}}
==Voicing Concerns: User-Specific Pitfalls of Favoring Voice over Text in Conversational Recommender Systems (Short paper)==
<pdf width="1500px">https://ceur-ws.org/Vol-2960/paper9.pdf</pdf>
<pre>
Voicing Concerns: User-Specific Pitfalls of Favoring Voice
over Text in Conversational Recommender Systems
Alain D. Starke (1, 2), Minha Lee (3)
(1) Wageningen University & Research, Droevendaalsesteeg 4, 6708 PB Wageningen, The Netherlands
(2) University of Bergen, P.O. Box 7800, 5020 Bergen, Norway
(3) Eindhoven University of Technology, Groene Loper 3, 5612 AE Eindhoven, The Netherlands


Abstract
In the context of Conversational Recommender Systems (CRSs) and Conversational User Interfaces
(CUIs; e.g., digital assistants, such as Siri), an increasing number of voice-based applications
are emerging, often at the expense of text-based applications. In this position paper, we argue
that the possible first-mover advantage of adopting voice-based technologies may put specific
groups of users at a profound disadvantage, as they are likely to run into accessibility issues.
For example, users who stammer or who are not fluent in the English language have a hard time
using voice-based conversational recommender systems. Along this line, we describe a number of
challenges and issues for current and future systems.

Keywords
Conversational User Interfaces, Recommender Systems, Accessibility, Inclusion, Voice-based Systems


3rd Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) & 5th Edition of
Recommendation in Complex Environments (ComplexRec) Joint Workshop @ RecSys 2021,
September 27 – October 1, 2021, Amsterdam, Netherlands
Email: alain.starke@wur.nl (A. D. Starke); m.lee@tue.nl (M. Lee)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

Voice-based technologies are diffusing through society at a fast pace. Reportedly, 4.2 billion
digital voice assistants were in use in 2020 [1], including well-known technologies such as
Amazon Alexa and Siri. Their role in the ‘Internet of Things’ system is becoming increasingly
important [2], in the sense that incumbent technologies, such as recommender systems, are often
made compatible with voice-based applications [3].

One aspect of voice-based or conversational user interfaces is to retrieve personalized content.
To date, however, most conversational recommender systems (CRSs) are text-based [4]. They focus
on mining textual user input, such as through fixed messages in clickable menus or by open-ended
text queries [5, 6]. In comparison, the number of voice-based conversational recommender systems
is still limited, but is likely to expand in the coming years [3].

The user-system dynamic differs greatly between text-based and voice-based interactions. Whereas
text-based CRSs can rely on either open-ended queries or fixed input (e.g., the user selects an
answer option), voice-based queries tend to be impromptu and are more complex to process.
Nonetheless, given the current share and expected growth of digital assistant use [1], the
emergence of digital assistants, such as Amazon Alexa and Google Home, suggests that designing
for voice-based interactions will be a bigger target for many commercial applications than
text-based systems. For one, a specific application of voice-based interactions for recommender
systems research is the users’ ability to retrieve personalized suggestions by voice, as a form
of hands-free interaction [3, 7].

Despite the possibilities of voice-based interactions, we see some challenges. Specifically, the
trend in the commercial landscape to prioritize voice first comes with disadvantages for specific
users who are either not equipped to work with the technology (e.g., Siri) or who are not the
targeted, ‘mainstream’ user.

We briefly give an overview of text and voice systems before we turn to the critique of
voice-based recommender systems. We bring up why text-based solutions may be more beneficial in
certain contexts and for specific users, making it important that text-based CRSs are not
discontinued. However, due to the growing trend of voice-based interactions, e.g., the rise of
Alexa, we believe that we cannot avoid designing for different types of voice-based interactions
in the near future. For the latter, we formulate a few suggestions.

2. Conversational Systems

2.1. Text-based systems

Text-based conversational systems, which are also known as chatbots, have been around for
decades, such as Weizenbaum’s ELIZA in the 1960s [8]. Chatbots now exist on many
business-to-consumer websites, for example as an automated customer service agent [9]. In terms
of technical implementation, two approaches are taken to build chatbots, which typically also
apply to
text-based conversational recommenders [6]. They are either built as command-based systems that
respond to user queries as if they are commands or as bots that use natural language
understanding. For example, on Slack, one can issue commands (e.g., unsubscribe) that chatbots
can easily understand rather than using natural language (e.g., “please get me out of this
channel”) that can be vague for systems to understand [10]. Also, people may easily misspell or
give incomplete input that chatbots cannot accurately interpret.

An important challenge of conversational systems is to mitigate misunderstandings or a
conversational breakdown [11]. Systems may give inadequate responses to user requests, such as
false positives [11], either because of an unclear query or a missing database category [12, 13].
In such cases, conversational repair strategies become important, such as how the system should
correct for misunderstood phrases or unclear user intentions [12]. An example of a repair
strategy is a system giving potential options that people can choose from, such as “I did not
understand that. Did you mean X or Y?”, to keep the conversation going.

In sum, the two strategies are to design 1) command-based systems that allow for minimal
flexibility in user input for the sake of efficiency, or 2) natural language-based systems that
allow for greater input flexibility, but that also require a larger number of possible repair
strategies, which are not always successful.

2.2. Voice-based systems

Voice-based conversational user interfaces (VUIs) are becoming more popular in everyday use. In
particular, smart home assistants such as Google Home and Amazon Alexa are used more frequently
to help their users find content that they are looking for [14], regardless of whether the query
is mundane and factual (e.g., ‘What weather is it today?’), or more exploratory (e.g., ‘Play me
a song for my dinner party’). The latter is more commonly explored in recommender systems
research, for it seeks to retrieve an item that a user does not explicitly know about.

Current voice-based systems face a number of technical issues that are often situational. For
example, considering the use of a voice-based assistant in a car [7], there may be environmental
noise, and people may not formulate queries clearly because they are multi-tasking between
driving and voice interaction, among other causes. Technical issues that come with VUIs are many,
and will also impact the design of voice-based conversational recommender systems. To list three,
they at times lack noise robustness, multimodal understanding, and addressee detection [7]. In
most contexts, users’ environmental conditions will feature a certain degree of background noise,
such as when people are away from home. Moreover, voice-based systems cannot understand
multimodal cues like gestures or gaze that accompany users’ speech [15], which is likely to
affect the interpretability of the voice-based query. Finally, when there is more than one person
talking, the system has to distinguish whose voice to focus on [7], which might lead to conflicts
of agency [16]. Each of these problems is challenging. Yet, even if these technical issues are
resolved, this will not by itself pave the way for a seamless experience for every individual;
inclusion is not about one-size-fits-all, but about ensuring that these technical issues do not
disproportionately affect specific users.

3. Critique of Voice-based systems

Voice-based queries tend to be ‘messier’ than text-based inputs [6]. The current state-of-the-art
in Natural Language Processing (NLP) methods opens up possibilities for more open-ended
conversational strategies in recommendation. However, even NLP-based systems are often still
limited to familiar input, i.e., they require an explicit understanding of users’ messages, which
are often misunderstood. Problems such as environmental noise distorting user input are common
[7]. Yet when it comes to user adoption, voice-based technologies have diffused rapidly in terms
of innovation adoption [2, 17]. This has both positive and negative side effects.

On the one hand, it seems that voice-based applications can be integrated in a modular way, as
they can work with recommendation libraries without designing an appropriate user interface. On
the other hand, it seems that innovation in technology may only benefit those who can work with
it. Even for people who are considered to be “regular users”, there is a lot of trial and error
when it comes to learning how to interact with voice-based agents [18] and recommender systems.
But there are people who have additional difficulties due to various differences in abilities;
technical solutions often are built around the normative assumption that users are fully
able-bodied, i.e., with sight, hearing, and other abilities intact [19].

In recommender domains that use more traditional interfaces, marginalized people are often
‘served’ by a simple fix. For example, a tourism recommender system for people with physical
disabilities would apply post-filtering to arrive at an appropriate set of recommendations [20].
The problem for conversational recommenders, however, applies not only to the appropriateness of
the suggested context, but also to the usability of the technology in the first place. For
example, people who stammer, being a small subset of the population, face difficulties at the
start of their interaction: a voice-based system often cannot understand what they say due to a
lack of training data and to design choices [21]. Their speech does not fit the normative
template of how people should “normally”
talk. Furthermore, many smart home assistants are only compatible with a few languages (e.g.,
they are ‘biased’ towards English [14]), and speech recognition may be distorted because of
fluctuations in human emotions [4, 22]. We realize that the issue of inclusion in the use of
voice-based assistants is nuanced [16]. This means that notions of who can easily use voice-based
agents depend on multiple factors, such as accents, speech patterns, and access to commercial
agents like Alexa, which introduces many ways in which efforts to include can also be exclusive,
e.g., by prioritizing one accent over others.

4. Suggestions for Conversational Recommender Systems

We have highlighted different challenges for both text-based and voice-based interactions. What
stands out is that some challenges are easier to resolve with user training or adaptation (e.g.,
lacking sufficient technical knowledge to use a text-based interface) than other challenges
(e.g., difficulties with spoken input among non-native speakers or because of stammering) [21].
What these challenges have in common is how people’s assumptions about conversational agents, be
they chatbots or Alexas, shape their interactions. People may have expectations that
conversational agents cannot meet, as the systems cannot yet do complex tasks such as email
management by voice [23]. Even text-based chatbots often do not meet people’s needs, as users
expect a higher level of understanding from bots than they were designed for [10]. Hence, going
beyond simple interactions and towards more complex exchanges is a problem that we all share due
to the state of the technology.

Some studies describe conversational recommender systems as distinct from the more traditional
chatbots and dialogue-based systems [6]. However, we argue that the retrieval of conversational
elements and the retrieval of ‘task-related items’ are two sides of the same coin. A task-based
conversation can be dialogue-based, by supporting a task at hand. Instead of focusing on a false
dichotomy between task-based and dialogue-based systems, a better way forward is being attentive
to how different users’ capacities get highlighted or ignored by systems. The problem to focus on
is inclusion vs. exclusion of user groups based on systems’ assumptions of different abilities
that people may or may not have.

How should we move forward with conversational recommender systems? Recommender systems are
traditionally applied in domains where one-shot recommendations are effective [24], such as
movies, e-commerce, and books. The use of conversations, however, makes for more complex
interactions, which introduce greater technical challenges. Above, we differentiated between
text-based chatbots and voice-based agents. In terms of users being “better understood”, the
decades-old text-based interactions may be better suited. Perhaps counterintuitively, due to the
limits of query and text-based conversational recommendation, the odds are smaller that a
text-based system ‘gets it wrong’ or, more technically, that it generates a negative adversarial
response [25] or has a conversational breakdown [12]. Although the usability is arguably lower in
the sense that one needs to “touch” an interface, this is a huge advantage for those who would
otherwise have to concentrate to interact with a voice-based application. However, the consumer
trend is shaping up to favor voice-based applications; IBM, Google, and Amazon have product lines
that promote voice-first interactions.

We offer two suggestions moving forward. To optimize accessibility for all users, a move towards
‘voice-enabled’ rather than ‘voice-first’ or voice-based recommender systems would be desirable,
akin to technology in which ‘voice’ is a feature rather than a key characteristic (e.g., Siri on
an Apple iPhone). Although this requires the deployment of two different retrieval and
recommendation pipelines, it maximizes accessibility by combining ‘the best of two worlds’. To
note, we did not consider multimodality, e.g., the combination of voice, gaze, body movements,
and more, which will become more important in the coming years [26].

We also suggest that diversity of data for retrieval and recommendation is essential to design
inclusive conversational recommender systems, or systems that cater to specific users. Efforts
are ongoing when it comes to making voice-based interactions more accessible; Google’s Project
Euphonia (https://sites.research.google/euphonia/about/) aims to collect more data on atypical
speech, e.g., from people with cerebral palsy. Similarly, more time should be spent in research
on collecting difficult voice data, in terms of responding to “unconventional voices”.

5. Conclusion

This paper has reflected on current practices in conversational recommender systems. In
particular, we have pitted text-based systems against voice-based systems, observing that while
voice-based recommender systems are becoming more common because of their integration with
digital assistants [3], this may put specific users at a disadvantage. We have identified a
number of challenges to make CRSs more inclusive, particularly for the emerging domain of
voice-based user interfaces. We emphasize lastly that inclusion for some may mean exclusion for
others. In order to recommend to all users, we need to understand all users. Specifically,
understanding users not only in terms of preferences, but also in terms of the
fundamental conversational elements, such as speech, should be a priority.

References

 [1] L. S. Vailshery, Number of digital voice assistants in use worldwide from 2019 to 2024
     (in billions), 2021. URL:
     https://www.statista.com/statistics/973815/worldwide-digital-voice-assistant-in-use/.
 [2] D. Pal, C. Arpnikanondt, S. Funilkul, W. Chutimaskul, The adoption analysis of voice-based
     smart IoT products, IEEE Internet of Things Journal 7 (2020) 10852–10867.
 [3] A. Iovine, F. Narducci, G. Semeraro, Conversational recommender systems and natural
     language: A study through the converse framework, Decision Support Systems 131 (2020)
     113250.
 [4] C. Gao, W. Lei, X. He, M. de Rijke, T.-S. Chua, Advances and challenges in conversational
     recommender systems: A survey, arXiv preprint arXiv:2101.09459 (2021).
 [5] D. C. Hernandez-Bocanegra, J. Ziegler, Conversational review-based explanations for
     recommender systems: Exploring users’ query behavior, in: CUI 2021-3rd Conference on
     Conversational User Interfaces, 2021, pp. 1–11.
 [6] D. Jannach, A. Manzoor, W. Cai, L. Chen, A survey on conversational recommender systems,
     ACM Computing Surveys (CSUR) 54 (2021) 1–36.
 [7] F. Weng, P. Angkititrakul, E. E. Shriberg, L. Heck, S. Peters, J. H. Hansen, Conversational
     in-vehicle dialog systems: The past, present, and future, IEEE Signal Processing Magazine
     33 (2016) 49–60.
 [8] J. Weizenbaum, ELIZA—a computer program for the study of natural language communication
     between man and machine, Communications of the ACM 9 (1966) 36–45.
 [9] R. Dale, The return of the chatbots, Natural Language Engineering 22 (2016) 811–817.
[10] M. Lee, L. Frank, W. IJsselsteijn, Brokerbot: A cryptocurrency chatbot in the
     social-technical gap of trust, Computer Supported Cooperative Work (CSCW) 30 (2021) 79–117.
[11] A. Følstad, C. Taylor, Conversational repair in chatbots for customer service: The effect
     of expressing uncertainty and suggesting alternatives, in: International Workshop on
     Chatbot Research and Design, Springer, 2019, pp. 201–214.
[12] Z. Ashktorab, M. Jain, Q. V. Liao, J. D. Weisz, Resilient chatbots: Repair strategy
     preferences for conversational breakdowns, in: Proceedings of the 2019 CHI Conference on
     Human Factors in Computing Systems, 2019, pp. 1–12.
[13] K. Corti, A. Gillespie, Co-constructing intersubjectivity with artificial conversational
     agents: People are more likely to initiate repairs of misunderstandings with agents
     represented as human, Computers in Human Behavior 58 (2016) 431–442.
[14] B. R. Cowan, P. Doyle, J. Edwards, D. Garaialde, A. Hayes-Brady, H. P. Branigan, J. Cabral,
     L. Clark, What’s in an accent? The impact of accented synthetic speech on lexical choice in
     human-machine dialogue, in: Proceedings of the 1st International Conference on
     Conversational User Interfaces, 2019, pp. 1–8.
[15] D. Heylen, Head gestures, gaze and the principles of conversational structure,
     International Journal of Humanoid Robotics 3 (2006) 241–267.
[16] M. Lee, R. Noortman, C. Zaga, A. Starke, G. Huisman, K. Andersen, Conversational futures:
     Emancipating conversational interactions for futures worth wanting, in: Proceedings of the
     2021 CHI Conference on Human Factors in Computing Systems, 2021, pp. 1–13.
[17] G. McLean, K. Osei-Frimpong, Hey Alexa… examine the variables influencing the use of
     artificial intelligent in-home voice assistants, Computers in Human Behavior 99 (2019)
     28–37.
[18] C. M. Myers, L. F. Laris Pardo, A. Acosta-Ruiz, A. Canossa, J. Zhu, “Try, try, try again:”
     Sequence analysis of user interaction data with a voice user interface, in: CUI 2021-3rd
     Conference on Conversational User Interfaces, 2021, pp. 1–8.
[19] S. Costanza-Chock, Design justice: Towards an intersectional feminist framework for design
     theory and practice, Proceedings of the Design Research Society (2018).
[20] R. Mahmoud, N. El-Bendary, H. M. Mokhtar, A. E. Hassanien, Similarity measures based
     recommender system for rehabilitation of people with disabilities, in: The 1st
     International Conference on Advanced Intelligent System and Informatics (AISI2015),
     November 28-30, 2015, Beni Suef, Egypt, Springer, 2016, pp. 523–533.
[21] L. Clark, B. R. Cowan, A. Roper, S. Lindsay, O. Sheers, Speech diversity and speech
     interfaces: Considering an inclusive future through stammering, in: Proceedings of the 2nd
     Conference on Conversational User Interfaces, 2020, pp. 1–3.
[22] J. Pittermann, A. Pittermann, W. Minker, Emotion recognition and adaptation in spoken
     dialogue systems, International Journal of Speech Technology 13 (2010) 49–60.
[23] E. Luger, A. Sellen, “Like having a really bad PA”: The gulf between user expectation and
     experience of conversational agents, in: Proceedings of the 2016 CHI Conference on Human
     Factors in Computing Systems, 2016, pp. 5286–5297.
[24] G. Adomavicius, A. Tuzhilin, Toward the next generation of recommender systems: A survey of
     the state-of-the-art and possible extensions, IEEE Transactions on Knowledge and Data
     Engineering 17 (2005) 734–749.
[25] G. Penha, C. Hauff, What does BERT know about books, movies and music? Probing BERT for
     conversational recommendation, in: Fourteenth ACM Conference on Recommender Systems, 2020,
     pp. 388–397.
[26] Y. Deldjoo, J. R. Trippas, H. Zamani, Towards multi-modal conversational information
     seeking, in: Proceedings of the ACM Conference on Research and Development in Information
     Retrieval, SIGIR, volume 21, 2021.

</pre>