Can an AI-driven VTuber engage People? The KawAIi Case Study Natale Amato1 , Berardina De Carolis1,* , Francesco de Gioia1 , Mario Nicola Venezia1 , Giuseppe Palestra1 and Corrado Loglisci1 1 Department of Computer Science, University of Bari "A. Moro", Bari, Italy Abstract Live streaming has become increasingly popular, with most streamers presenting their real-life appearance. However, Virtual YouTubers (VTubers), virtual 2D or 3D avatars that are voiced by humans, are emerging as live streamers and attracting a growing viewership. This paper presents the development of a conversational agent, named KawAIi, embodied in a 2D character that, while accurately and promptly responding to user requests, provides an entertaining experience in streaming chat platforms such as YouTube while providing adequate real-time support. The agent relies on the Vicuna 7B GPTQ 4-bit Large Language Model (LLM). In addition, KawAIi uses a BERT-based model for analyzing the sentence generated by the model in terms of conveyed emotion and shows self-emotion awareness through facial expressions. Tested with users, the system has demonstrated a good ability to handle the interaction with the user while maintaining a pleasant user experience. In particular, KawAIi has been evaluated positively in terms of engagement and competence on various topics. The results show the potential of this technology to enrich interactivity in streaming platforms and offer a promising model for future online assistance contexts. Keywords LLMs, virtual agent, live streaming 1. Introduction In today’s digital era, streaming platforms like YouTube Live, Vimeo Livestream, Twitch, LinkedIn live, Facebook live have gained relevance, becoming primary stages for content creation and sharing. A distinctive feature of these platforms is the direct and immediate interaction between content creators and their audience through live chats. These chats are not only communication tools but also places of entertainment and information seeking where viewers can actively participate in the broadcast by asking questions, commenting, and interacting with other users. Currently, YouTube may serve as users’ source of information, entertainment, and connection, as users can associate, inspire and motivate each other within this huge networking platform [1]. By their innovative and impressive creation, some YouTubers gained numerous views and subscriptions, which eventually turned them into micro-celebrities, influencers, or internet celebrities with their fan base [2, 3, 4]. Virtual YouTubers (VTubers) are online entertainers who are typically human YouTubers or live streamers who use a virtual avatar generated using computer graphics. The digital trend originated in Japan in the mid-2010s and has evolved into an international online phenomenon in the 2020s [5]. Before the coronavirus pandemic forced the world into internet isolation in 2020, VTubing was a niche medium, largely confined to Japan’s overactive subculture of fanboys and otaku. The pandemic’s disruptions to everyday lives, and the entertainment industry, have broadened VTubing’s appeal. A majority of VTubers are English and Japanese-speaking YouTubers or live streamers who use avatar design that is tied to Japanese popular culture and aesthetics, particularly those found in anime and manga. They frequently employ anthropomorphism, imbuing their avatars with a mix of human and Joint Proceedings of the ACM IUI Workshops 2024, March 18-21, 2024, Greenville, South Carolina, USA * Corresponding author. $ berardina.decarolis@uniba.it (B. D. Carolis)  0000-0002-2689-137X (B. D. Carolis) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings non-human attributes. This fusion of characteristics is a defining aspect of VTuber culture, enhancing its distinct allure and fostering creativity1 . Usually, behind a VTuber there is a human being who talks and animates his avatar using a webcam and software, which captures the streamer’s motions, movements and face expressions, and maps these characteristics into a two or three-dimensional model. However, real-time management of a high volume of interactions can pose a considerable challenge for human streamers. Recently, due to the availability of powerful Large Language Models (LLMs), the number of VTubers whose behavior is driven by artificial intelligence is growing. The emblematic case is represented by Neuro-sama [6], which is the first technology able to combine a chatbot with a female avatar. The speech and personality exhibited by Neuro-sama are generated by an AI system that utilizes an LLM, like in the current paper, enabling the character to communicate with the viewers in a live chat. In this paper, we describe the development and evaluation, from the user experience point of view, of an AI-based VTuber, named KaWAIi. KaWAIi is endowed with conversational features based on the use of an LLM aiming at providing users with a rich, entertaining, and engaging live-chat experience. With a careful blend of technologies like Vicuna 7B GPTQ 4-bit 128g and DistilRoBERTa-base, the agent not only provides engaging and relevant responses but enriches its behavior with facial expressions that are coherent with the emotion expressed in the sentence to pronounce. We tested the KaWAIi both in terms of accuracy and expressivity of the answer, usability and engagement, and enjoyment during the conversation. In addition, we asked also participants to evaluate their perceived trust in KawAIi and how much they felt influenced by the agent. For the first task, we asked participants to state their perceived accuracy and expressiveness of the answer provided by the agent on various thematic categories. This evaluation allowed us to understand better the model’s capabilities and effectiveness in different contexts. For the other aspects, users evaluated the interaction with the agent as usable and pleasant, and its answers as very expressive. Moreover, they trusted and felt influenced by what KawAIi was telling them. These results indicate that an AI-driven Vtuber is a good solution in entertainment contexts. 2. KawAIi’s VTuber The core of the proposed system is based on a combination of two models, Vicuna-13b-GPTQ-4bit-128g and DistilRoBERTa-base [7, 8]. The model Vicuna 7B GPTQ has been carefully crafted for KawAIi VTuber to create a conversational agent that behaves exactly as intended: polite, courteous, and friendly in its responses. KawAIi is designed to interact with users to reflect the friendliness and willingness typical of human interactions. To endow KawAIi with facial expressions coherent with the emotional content contained in her answers, we used the DistilRoBERTa-base fine-tuned to recognize the six basic emotions, along with a neutral category, from textual input: anger, disgust, fear, joy, neutrality, sadness, and surprise [9, 8]. In this way, the agent can show facial expressions that are consistent with the emotional content of what she is saying. The KawAIi VTuber is based on the architecture illustrated in Figure 1. The proposed system’s architecture consists of several software modules described below: • YouTube Message Server: it collects real-time user interactions from the YouTube chat. These messages are then written into a file, providing a persistent and accessible record of the chat interactions. • Seleniumgpt: a software module that reads the messages from the file and sends them to a WSL (Windows Subsystem for Linux) web interface using Firefox Selenium, a popular browser automation tool used for interacting with web pages. Once the message is sent, Seleniumgpt waits for an updated response from the server, which processes the message using the models. When the response is ready, Seleniumgpt downloads a corresponding audio file for the updated 1 https://en.wikipedia.org/wiki/VTuber Figure 1: The KawAIi VTuber Architecture. response, providing an audible format for the interaction. Subsequently, Seleniumgpt uses an emotion classification model based on DistilRoBERTa-base to predict the emotion conveyed in the updated response [8]. We consider the emotion with the highest prediction confidence above 0.5. In this way, we could avoid showing inconsistent expressions when the prediction confidence was low. Once the emotion is analyzed, Seleniumgpt routes the audio file through a virtual cable while simultaneously setting the avatar’s expression by pressing specific Windows keys. The updated response is then written to a file. • VseeFace: it is a facial tracking and expression generation software. Usually, it reads the keys pressed in Windows and utilizes this information to set the avatar’s expression, providing a visual representation of the identified emotion. In our case, we send the key combinations expected by VSeeFace by a module through PyAutoGUI, a Python library that enables automated control of the mouse and keyboard. With PyAutoGUI, it is possible to send key combinations without manual intervention. This is achieved by calling the appropriate functions from the library, and specifying the desired key combinations as parameters. PyAutoGUI then simulates the keyboard input by emulating key presses and releases on the operating system level. This allows us to control the avatar’s expression dynamically based on the identified emotions, providing a seamless and automated visual representation of the emotions being expressed. Additionally, an idle animation is set in VSeeFace as a default expression for the avatar before any specific emotion or key combination is received. It provides a starting point for the avatar’s expressions and ensures that there is always some form of visual expression even when specific emotions are not being actively triggered. By setting an idle animation at the beginning, the avatar appears alive and responsive from the moment in which the application is launched. This contributes to a more engaging and interactive user experience, as the avatar is not static when there is no specific input being provided. • XSplit: it is a live streaming software that captures the video output from VSeeFace, the audio output from the virtual cable, and the text of the updated response from the file. These input streams are then combined and managed to create the live broadcast on YouTube. This architecture allows answering dynamically and in real-time user interactions on the YouTube chat, providing textual and audio responses, as well as a virtual avatar that expresses emotions corresponding to the emotional content in the text of the answers. An example of interaction with the proposed system, Figure 2: The KawAIi VTuber running in a YouTube live streaming. the KawAIi VTuber, is depicted in Figure 2. 3. Evaluation To evaluate the KaWAIi VTuber’s ability to deliver an engaging and entertaining experience and, at the same time, the appropriateness of the answer, we conducted a study focusing on assessing the accuracy of the provided answer, the perceived expressivity, usability, engagement and positive experience of the interaction with KawAIi. In addition, trust and the perceived influence of the agent have been evaluated. 3.1. Participants After an advertising campaign within the campus, 40 people have requested to participate in the system evaluation. We selected 32 participants (16 females and 16 males) to have a gender balance. From a pre-test questionnaire, we could assess that they were aged from 18 to 45 years old. Most of them were students (70%), all of them used technologies every day and most of them (65%) had already experience with chatbots and Youtubers. 3.1.1. Questionnaires A pre-test questionnaire was prepared to collect information about participants’ demographic data and backgrounds (i.e. age, gender, level of instruction, current position, use of technology, experience with chatbots, experience with Youtubers, etc.). To evaluate KawAIi, we prepared three questionnaires, one for assessing the accuracy of the generated answers and evaluating its expressiveness, one for evaluating usability, and the last one aiming at assessing other qualities related to the user experience important in this application domain: engagement, enjoyment, perceived influence, and trust. • Accuracy and expressiveness of the answers: the goal was to understand the model’s ability to provide answers perceived as accurate and, at the same time expressively on a wide number of topics, as most Youtubers do. Then, we selected 12 topics (see Table3.1.1). Each participant had to make one question for each topic. The questionnaire asked to rate, on a scale from 1 to 5, two statements regarding i) how much, in the participant’s opinion, the answer to the question was correct and ii) how much the participant found expressive the style in which the answer was generated. Table 1 Topics and related questions. Expressiveness Topic Example of Questions Accuracy (avg) C1-Entertainment What is the last movie that made you cry? 5 3.92 C2-Travel What do you suggest to visit in Paris? 5 3.63 As a child, what did you think you would C3-Life_Story 5 4.17 become when you grew up? C4-Food Which is your go-to comfort food? 4 3.92 What is the name of the highest C5-General Knowledge 5 4.02 mountain in Africa? C6-Literature Who wrote Romeo and Juliet? 3.75 3.14 How do scientists study the effects of climate C7-Science 5 3.17 change on ecosystems and wildlife? C8-History Who was the first president of the United States? 4 3.75 What is the difference between C9-Politics 4.17 4.02 democracy and dictatorship? C10-Logics Knowledge What is a logical contradiction? 3.14 3.75 C11-Mathematics What does the Pythagorean theorem say? 2 3.14 Can you discuss the theme of friendship C12-Anime and Manga 5 4.26 and camaraderie in Naruto? Average 4.25 3.74 • Usability: the CUQ is a questionnaire specifically designed to measure the usability of chatbots and consists of sixteen balanced questions related to different aspects of the interaction with the chatbot. The questions pertain to aspects of Chatbot Personality, Onboarding, User Experience, and Error Handling. • User experience in interacting with KawAIi: to investigate this aspect we designed a custom questionnaire in which participants were asked to rate their experiences on a scale from 1 to 5 concerning specific areas. More specifically, we inquire about the participants’ feelings regarding the pleasantness and engagement of their interactions with KawAIi, as well as the extent to which they felt influenced and trusted the information provided by KawAIi 2 . 3.2. Procedure We instructed participants to fill out the pre-test questionnaire before the testing phase. Each participant was scheduled to visit our research lab at a specified time to interact with KawAIi. A study facilitator provided each participant with information about the project’s goals and the study’s objectives. Par- ticipants were encouraged to interact with KawAIi in a natural manner. Each participant then spent about 10 minutes interacting with the VTuber. Besides the free interaction with KawAIi, we asked the participants to ask a question for each of the previously mentioned topics by selecting it from a predefined set. In this way, we could assess the perceived accuracy of the provided answer by the VTuber. During the interaction, the chat was recorded for further analysis. After the interaction, participants were asked to complete the post-test questionnaires that were made available online. After the session, participants received a comprehensive debriefing. 3.3. Results We analyzed the questionnaire responses to assess the accuracy, expressiveness, usability, and user experience of the interaction with KawAIi, as well as the perceived influence and trust participants placed in the agent. 2 The custom questionnaire can be made available on request. Figure 3: Chatbot Usability Questionnaire (CUQ) results. Figure 4: Average value of the questionnaire items aiming at assessing pleasure, engagement, influence and trust. The findings derived from the questionnaire were notably positive. Overall, user feedback indicated strong approval of the agent’s usability. KawAIi’s capacity to deliver accurate and timely responses contributes to a consistent and smooth communication experience. In particular, the CUQ questionnaire offers a calculation tool that allows for automated scoring. Figure 3 displays the CUQ score achieved by KawAIi, which significantly exceeds 68, which is the threshold considered to be the minimum score for usability [10]. Then, we analyzed the custom questionnaire results (Figure 4). From the average results of each question, we can say that the interaction experience with KawAIi was judged positively and participants felt engaged enough. Participants felt influenced by the VTuber and trusted what it was saying them. 4. Conclusions In today’s digital age, online streaming platforms like YouTube are experiencing an unprecedented increase in popularity. Within these virtual realms, users are increasingly looking for more engaging experiences that transcend passive content consumption. We are witnessing the evolution of live chats from mere messaging tools to vibrant virtual spaces teeming with interaction and entertainment, with AI technology playing a pivotal role in elevating user experiences. Through the implementation of our AI-driven VTuber, we have taken the initial stride in using virtual agents in live-chat interactions, offering a dynamic, entertaining, and engaging experience. Leveraging a blend of cutting-edge technologies, including Vicuna 7B GPTQ 4-bit 128g and DistilRoBERTa-base, the agent delivers precise and contextually relevant responses and increases the streaming experience by seamlessly incorporating real-time entertainment content. Our system architecture has been designed to respond dynamically and instantly to user interactions, seamlessly blending textual and audio responses with a virtual avatar capable of expressing the corresponding emotions. To validate the accuracy of the agent answer and the user experience during the interaction with KawAIi, we performed a user study in which participants were asked to interact with KawAIi and evaluate both the chatbot from different points of view: perceived accuracy and expressiveness of the answer, usability, the experience and engagement with the agent and how much they felt influenced and trusted what the agent was saying. This evaluation provided insights into the potentiality of this type of technology. First of all, except for mathematics, literature and logic, the Vicuna-based model reaches an outstanding accuracy in providing correct answers, on the other side, KawAIi has demonstrated a good capability to engage users in positive interaction experiences in live chats, making the interaction pleasant and engaging. Most participants felt influenced and trusted what the Vtuber was saying them, showing the potential of this technology not only on streaming platforms but also on various other online assistance scenarios, injecting an element of entertainment that enriches interactions, making them more enjoyable and rewarding. Despite these favorable outcomes, we acknowledge that it is necessary to refine and improve our agent, this includes the possibility of further adapting our model to accommodate a broader spectrum of scenarios and user requests, as well as exploring the use of other LLMs to make the interactions with the agent even more natural and captivating. In this vein, we are developing a model for co-speech gesture generation, as outlined in [11], to replicate gestures characteristic of human-driven VTubers, adding an extra layer of authenticity to the agent’s behavior. Moreover, we foresee the potential to extend this system to other domains, including customer support, online education, and counseling services. The agent’s remarkable ability to furnish accurate and pertinent responses, coupled with its expressive and entertaining attributes, holds the promise of a significant breakthrough in these areas. References [1] S. Edosomwan, S. K. Prakasan, D. Kouame, J. Watson, T. Seymour, The history of social media and its impact on business, Journal of Applied Management and entrepreneurship 16 (2011) 79. [2] S. Khamis, L. Ang, R. Welling, Self-branding,‘micro-celebrity’and the rise of social media influencers, Celebrity studies 8 (2017) 191–208. [3] A. Jerslev, Media times| in the time of the microcelebrity: celebrification and the youtuber zoella, International journal of communication 10 (2016) 19. [4] C. Abidin, Internet celebrity: Understanding fame online, Emerald Publishing Limited, 2018. [5] Z. Lu, C. Shen, J. Li, H. Shen, D. Wigdor, More kawaii than a real-person live streamer: Under- standing how the otaku community engages with and perceives virtual youtubers, in: CHI ’21: CHI Conference on Human Factors in Computing Systems, Virtual Event / Yokohama, Japan, May 8-13, 2021, 2021, pp. 137:1–137:14. URL: https://doi.org/10.1145/3411764.3445660. doi:10.1145/3411764.3445660. [6] S. Narita, Ai vtuber neuro-sama is back from its twitch ban and acting as strange as ever, 2023. URL: https://www.automation.agm.com/ai-vtuber-neuro-sama-twitch-ban-strange-2023, retrieved February 11, 2023. [7] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter, ArXiv abs/1910.01108 (2019). [8] J. Hartmann, Emotion english distilroberta-base, https://huggingface.co/j-hartmann/ emotion-english-distilroberta-base/, 2022. [9] P. Ekman, et al., Basic emotions, Handbook of cognition and emotion 98 (1999) 16. [10] A. Bangor, P. T. Kortum, J. T. Miller, An empirical evaluation of the system usability scale, Intl. Journal of Human–Computer Interaction 24 (2008) 574–594. [11] S. Nyatsanga, T. Kucherenko, C. Ahuja, G. E. Henter, M. Neff, A comprehensive review of data- driven co-speech gesture generation, in: Computer Graphics Forum, volume 42, Wiley Online Library, 2023, pp. 569–596.