Towards Emotionally Aware AI: Challenges and Opportunities in the Evolution of Multimodal Generative Models

Matteo Spanio¹,∗
¹ Centro di Sonologia Computazionale (CSC), Department of Information Engineering, University of Padova, Via Giovanni Gradenigo, 6b, 35131 Padova (PD), Italy

Doctoral Consortium at the 23rd International Conference of the Italian Association for Artificial Intelligence, Bolzano, Italy, November 25-28, 2024.
∗ Corresponding author.
Email: spanio@dei.unipd.it (M. Spanio)
Website: https://matteospanio.github.io/ (M. Spanio)
ORCID: 0000-0002-2436-7208 (M. Spanio)

Abstract
The evolution of generative models in artificial intelligence (AI) has significantly expanded the capacity of machines to process and generate complex multimodal data such as text, images, audio, and video. Despite these advancements, the integration of emotional awareness remains an underexplored dimension. This paper examines the state of the art in multimodal generative AI, with a focus on existing models developed by major technology companies. It then proposes an approach to incorporate emotional awareness into AI models, which would enhance human-machine interaction by improving the interpretability and explainability of AI-generated decisions. The paper also addresses the challenges associated with building emotion-aware models, including the need for comprehensive multimodal datasets and the computational complexity of incorporating less-explored sensory modalities like olfaction and gustation. Finally, potential solutions are discussed, including the normalization of existing research data and the application of transfer learning to reduce resource demands. These steps are essential for advancing the field and unlocking the potential of emotion-aware multimodal AI in applications such as healthcare, robotics, and virtual assistants.

Keywords
Multimodal AI, Emotion-aware AI, Generative models, Human-computer interaction, Multisensory integration

1. Introduction

Generative Artificial Intelligence (AI) has experienced rapid advancements, fundamentally transforming how machines interact with data and create new content. Generative AI models, particularly those based on deep neural networks, have revolutionized traditional data processing by autonomously learning and generating complex patterns from raw data. This shift is especially significant in unsupervised learning, where machines produce coherent and meaningful outputs without explicit guidance. These models can now generate text, images, audio, and video, opening vast possibilities across industries such as creative design, healthcare, and robotics [1].

Among these advancements, the rise of multimodal generative AI has been particularly impactful. Multimodality refers to AI systems’ ability to process and integrate various types of data, such as text, images, audio, and video, to perform tasks involving cross-modal generation or understanding. By bridging different sensory inputs and outputs, multimodal AI models can generate content spanning multiple domains, mimicking a more human-like understanding of the world.

The significance of multimodal AI lies in its capacity to address the limitations of traditional AI models confined to single modalities. These systems enhance machine perception and understanding, thereby increasing their applicability in real-world scenarios. However, the rise of multimodal systems introduces unique challenges. As AI models expand to include more diverse data forms and sensory inputs, the need for scalable and interpretable models becomes more pressing. Integrating emotional understanding into generative AI could enrich human-computer interaction and create systems that better grasp the subtleties of human experience.
Despite significant progress in multimodal AI, the emotional dimension remains largely underexplored [2]. The author argues that developing emotion-aware models is crucial. Emotions represent a fundamental intersection of perceptual modalities in the human brain. Key regions such as the amygdala and the hypothalamus play pivotal roles in sensory perception and emotional regulation, underscoring the importance of emotions and sensations in our cognitive architecture. Consequently, an emotionally aware AI system should model data in a manner more aligned with human cognition. The author firmly believes that emotion-aware models could revolutionize human-machine interaction by improving the interpretability and explainability of model decisions, addressing a key limitation of current AI systems [3, 4].

The subsequent sections will explore various aspects of this domain in the following order:

1. Background and Related Work: Outlining the current state of multimodal models, both with and without emotional awareness, highlighting widely used models from major tech companies like Google, Meta, and OpenAI.
2. Benefits of Emotionally Aware Models: Examining existing research on emotion-aware models, focusing on how emotions mediate other perceptual modalities and the potential advantages of integrating emotional understanding into AI systems.
3. Challenges and Limitations: Providing a critical analysis of the challenges and potential objections to emotion-aware models, including the complexities in creating multimodal datasets and the computational demands of representing olfactory and gustatory modalities.
4. Conclusions: Summarizing the key points, reiterating the importance of developing emotion-aware multimodal generative models to enhance human-computer interaction, and outlining some proposals on how to achieve such results.

2. Background and Related Work

The landscape of multimodal generative artificial intelligence (AI) is currently shaped by significant investments from major technology companies such as Google, Meta, OpenAI, and Microsoft. These companies are at the forefront of developing cutting-edge models capable of processing and generating diverse types of data, including images, audio, and video. Their efforts encompass not only the creation of expansive datasets but also the release of models that are either open source or proprietary. This trend underscores the immense resources being allocated globally to develop autonomous systems proficient in generating rich multimedia content.

2.1. Key Models and Developments

Several notable multimodal generative models have been introduced, showcasing the field’s breadth of capabilities. Text-to-image generation models, such as DALL·E [5] and Stable Diffusion [6], create detailed images from textual descriptions. Text-to-audio models, like MusicLM [7], generate music or soundscapes from text prompts, with promising applications in entertainment and virtual environments.
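For illustration, the short sketch below shows how a text-to-image model of this kind is typically invoked through the open-source diffusers library; the checkpoint identifier, prompt, and sampling settings are placeholders rather than details taken from the cited works.

```python
# Minimal text-to-image sketch using the open-source `diffusers` library.
# The checkpoint name, prompt, and settings are illustrative placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # assumed publicly available checkpoint
    torch_dtype=torch.float16,          # assumes a CUDA GPU; drop for CPU inference
)
pipe = pipe.to("cuda")

prompt = "a quiet seaside town at dusk, oil painting"
image = pipe(prompt, num_inference_steps=30).images[0]   # latent diffusion sampling
image.save("seaside.png")
```

Text-to-audio and text-to-video systems expose a conceptually similar prompt-in, media-out interface, with the image decoder replaced by an audio or video generator.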
Although in its early stages, text-to-video generation shows potential in media production and simulation environments [8]. In the other direction, models such as image-to-text systems [9, 10, 11] translate visual inputs into descriptive narratives, providing enhanced capabilities for tasks like automated captioning and assisting individuals with visual impairments. Audio-to-text models, commonly seen in speech-to-text systems, have long been applied in areas such as transcription and virtual assistants, but recent advances in generative models enable more nuanced and context-aware interpretations of spoken language. However, even the simplest models involving two non-textual modalities, like the one discussed in [12], are essentially concatenations of multiple models exchanging textual information.

Recently, multimodal models such as Mirasol, Chameleon, and others (including GPT-4o) [13, 14, 15] have adopted a different approach called early fusion [16, 17], where the modalities converge into a single latent space that mixes tokens from different domains. Although this approach has yielded better results than the previously described models, it remains difficult to interpret.

2.2. Integration of Emotional Awareness

A relatively less explored but emerging area within multimodal AI is the integration of emotional awareness. While extensive efforts have been dedicated to recognizing emotions within a single modality [18], there has been an increasing interest in fusing information from multiple modalities [18, 19]. This multimodal approach is advantageous because the combined information from different modalities provides a complementary capability for emotion recognition. However, relatively few efforts have been made to understand the emotion-centric correlation between different modalities. Recent approaches, such as those presented in [2], have shown concrete possibilities for connecting images and sounds through an emotional valence-arousal latent space, leveraging supervised contrastive learning techniques. This contribution allows for a more nuanced and dynamic representation of emotional states compared to older theories of discrete emotional states. By capturing the subtleties and complexities of human emotions, these newer models offer a more sophisticated understanding of how emotions interplay across different sensory inputs.

3. Benefits of Emotionally Aware Models

While the experiments conducted in [2] yielded promising results, the potential of contrastive learning in emotional contexts remains largely underexplored. Utilizing supervised contrastive learning allows for the alignment of various encoders corresponding to different modalities within a shared emotional latent space. This methodology supports the development of models that align with established psychological research linking modalities to emotions, as evidenced by studies examining the relationships between audio and emotions [20], odors and emotions [21], and temperature and emotions [22]. By leveraging these insights, it becomes feasible to include often-overlooked modalities such as touch, taste, and smell, which also have emotional correlations. Integrating this knowledge could propel advancements toward Artificial General Intelligence (AGI). Current research in these areas is still nascent, primarily focusing on textual descriptions, as seen in [23], which utilizes transformer-based models.
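The following sketch illustrates one way such an alignment could be implemented; the frozen per-modality encoders, projection-head sizes, and discrete emotion labels are illustrative assumptions rather than the exact setup of [2].

```python
# Sketch: aligning two modality encoders in a shared emotional embedding space
# with a supervised contrastive loss (in the spirit of [2]). Encoder dimensions,
# the label scheme, and the batch construction are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Small trainable head mapping a frozen encoder's features to the shared space."""
    def __init__(self, in_dim: int, out_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)        # unit-norm embeddings

def supervised_contrastive_loss(z, labels, tau=0.1):
    """z: (N, D) embeddings from all modalities stacked; labels: (N,) emotion classes."""
    sim = z @ z.T / tau                                # pairwise cosine similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))    # exclude self-comparisons
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    positives = (labels[:, None] == labels[None, :]) & ~self_mask
    # average log-probability over each anchor's positive pairs (same emotion label)
    loss = -(log_prob.masked_fill(~positives, 0.0)).sum(1) / positives.sum(1).clamp(min=1)
    return loss.mean()

# Usage sketch: image_feats / audio_feats would come from frozen pretrained encoders.
image_head, audio_head = ProjectionHead(768), ProjectionHead(512)
image_feats, audio_feats = torch.randn(16, 768), torch.randn(16, 512)
labels = torch.randint(0, 4, (16,))                    # one emotion label per paired sample
z = torch.cat([image_head(image_feats), audio_head(audio_feats)])
loss = supervised_contrastive_loss(z, torch.cat([labels, labels]))
loss.backward()                                        # gradients reach only the two heads
```

Because only the small projection heads are optimized, the emotional latent space is learned on top of existing encoders rather than from scratch.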
Emotion-aware models represent a significant advancement in AI research, as mapping human emotions to the latent space of generative models can foster more natural interactions in applications like virtual assistants, therapeutic tools, and social robots. By embedding emotional understanding, these systems can engage users in a more human-like manner, leading to smoother and more relatable interactions. For instance, an emotion-aware virtual assistant could adapt its tone and suggestions based on the user’s emotional state, enhancing the user experience. Additionally, imposing constraints on the latent space allows for the application of known psychological models to clarify AI behavior and decision-making, enhancing transparency and trustworthiness.

Emotions play a crucial role in human perception, influencing how we interpret sensory inputs. Emotion-aware models can bridge multiple modalities—vision, hearing, smell, and taste—creating a richer understanding of the environment and leading to more immersive applications in virtual reality, gaming, and interactive media.

To realize the benefits of emotionally aware models, we propose a framework utilizing pretrained encoder/decoder architectures tailored for each modality. This approach efficiently encodes emotional information from sensory data while minimizing computational demands. A pretrained encoder first processes the input data, such as a computational description of food, transforming it into a high-dimensional embedding. This embedding is then input into a specialized middle encoder model designed to capture the emotional essence, producing an emotional embedding represented as a vector of valence and arousal values. Following this, a pretrained decoder model converts the emotional vector into an audio token, which is processed by the pretrained audio model to generate the corresponding output. This architecture effectively manages the computational load through pretrained models, requiring only the middle encoder/decoder model to be trained. This design streamlines the training process and enhances the model’s capacity to translate emotional information across modalities, fostering a more integrated and emotionally aware AI system.
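The following sketch makes this pipeline concrete; the module names, dimensions, and the stand-in pretrained components are illustrative assumptions rather than a reference implementation, and only the two middle modules carry trainable parameters.

```python
# Sketch of the proposed pipeline: a frozen pretrained input encoder, small trainable
# "middle" modules operating on a (valence, arousal) vector, and a frozen pretrained
# audio generator. All names, dimensions, and interfaces are illustrative.
import torch
import torch.nn as nn

class MiddleEncoder(nn.Module):
    """Maps a pretrained embedding to a 2-D emotional vector (valence, arousal)."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, x):
        return torch.tanh(self.net(x))            # valence and arousal in [-1, 1]

class MiddleDecoder(nn.Module):
    """Maps a (valence, arousal) vector to the conditioning space of the audio model."""
    def __init__(self, audio_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, audio_dim))

    def forward(self, va):
        return self.net(va)

def generate(input_data, input_encoder, middle_enc, middle_dec, audio_generator):
    """Cross-modal generation, e.g. from a food description to an emotionally matched sound."""
    with torch.no_grad():                          # frozen pretrained front-end
        h = input_encoder(input_data)              # high-dimensional input embedding
    va = middle_enc(h)                             # emotional embedding (valence, arousal)
    audio_cond = middle_dec(va)                    # audio token / conditioning vector
    with torch.no_grad():                          # frozen pretrained back-end
        return audio_generator(audio_cond)         # generated audio output

# During training, only middle_enc and middle_dec parameters would be optimized,
# e.g. against valence-arousal annotations or a contrastive objective as sketched above.
```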
4. Challenges and Limitations

The development of multimodal AI models faces several significant challenges and limitations, particularly concerning the availability and quality of datasets. One of the primary obstacles is finding comprehensive datasets that integrate multiple modalities. While substantial progress has been made in creating large-scale datasets for individual modalities, the integration of diverse sensory data, such as combining visual, auditory, and textual information, remains a challenge. This paucity of integrated datasets hampers the ability to train and evaluate multimodal models effectively. Moreover, the collection of multimodal data is both costly and labor-intensive, requiring expert evaluations to ensure data quality and alignment across modalities. Web scraping, a common method for gathering large amounts of data, proves insufficient for creating high-quality multimodal datasets. For instance, aligning different types of data (e.g., synchronizing audio with visual inputs) necessitates precise and controlled conditions, often achievable only in well-equipped laboratories. Existing datasets predominantly rely on massive, web-scraped data, which, while voluminous, often lack the quality needed for advanced deep learning applications. The release of models like Microsoft’s Phi [24] has underscored the importance of data quality, demonstrating how high-quality datasets can enhance model efficiency and reduce computational resource requirements.

In addition to these general challenges, specific modalities such as olfaction and gustation present unique difficulties. In these research communities, there is no established practice of sharing data in formats suitable for use as training datasets. Furthermore, there are no widely adopted computational representations for olfactory and gustatory information, making it challenging to integrate these modalities into multimodal models. Before end-to-end models that encompass these senses can be developed, significant research is needed to establish standardized computational frameworks and methodologies for these less-explored sensory domains.

5. Conclusions

The integration of emotional awareness into multimodal generative AI represents a pivotal next step for advancing human-computer interaction. Emotion-aware AI models can interpret and respond more contextually to human emotions, which is particularly relevant in applications like virtual assistants and healthcare, where nuanced emotional understanding is critical. Furthermore, these models enhance interpretability and trustworthiness in AI decisions. However, key challenges remain, especially in dataset availability and computational complexity. Comprehensive multimodal datasets that incorporate emotions are scarce, and the difficulty in representing sensory modalities such as olfaction and gustation hinders progress. Addressing these issues is crucial to fully realize the potential of emotion-aware AI.

5.1. Future Research Directions

To advance research in emotion-aware multimodal AI, two key strategies are proposed:

1. Dataset aggregation and normalization: a considerable body of research, particularly in psychology and neuroscience, has already explored the correlation between emotions and various sensory modalities. Although these data are currently dispersed and non-standardized, they often come from high-quality studies. A concerted effort to systematically aggregate and normalize these existing datasets could form the basis for a comprehensive multimodal dataset (a minimal sketch of such a normalization step is given at the end of this section). Such a resource would support AI models in learning emotional correlations across multiple sensory inputs, creating a foundation for more robust emotion-aware systems.

2. Leveraging transfer learning to reduce computational complexity: the computational demands of building models from scratch, as seen in large tech companies, are a significant barrier for many research initiatives. However, existing deep learning-based encoders have reached a high level of performance. By adopting a contrastive learning framework and leveraging transfer learning or fine-tuning techniques, researchers can utilize pre-existing models to enhance emotion-aware capabilities without requiring massive computational resources. This approach not only shortens the time needed to achieve results but also reduces energy consumption, making research more sustainable and accessible.

In summary, advancing emotion-aware multimodal AI requires addressing the current challenges of dataset availability and computational demands. By capitalizing on existing research data and leveraging transfer learning, these obstacles can be overcome, enabling the development of AI systems that are more aligned with human emotions. Such systems will significantly enhance human-computer interaction and broaden the scope of AI applications in various fields.
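As a concrete illustration of the normalization step in proposal 1, the toy sketch below rescales heterogeneous affective annotations, such as 1-9 self-assessment ratings or discrete emotion labels, onto a common valence-arousal range in [-1, 1]; the source studies, scales, and label-to-coordinate mapping are hypothetical.

```python
# Toy sketch of aggregating heterogeneous emotion annotations into a common
# valence-arousal convention in [-1, 1]. The source "studies", rating scales, and
# the discrete-label mapping are hypothetical examples, not real datasets.
from dataclasses import dataclass

@dataclass
class AffectRecord:
    stimulus_id: str      # e.g. a sound file, an odorant, a food description
    modality: str         # "audio", "odor", "taste", ...
    valence: float        # normalized to [-1, 1]
    arousal: float        # normalized to [-1, 1]
    source: str           # which study the annotation comes from

def from_sam_1_to_9(stimulus_id, modality, valence_1_9, arousal_1_9, source):
    """Rescale 1-9 Self-Assessment Manikin style ratings to [-1, 1]."""
    rescale = lambda x: (x - 5.0) / 4.0
    return AffectRecord(stimulus_id, modality, rescale(valence_1_9), rescale(arousal_1_9), source)

# Hypothetical coordinates for a few discrete labels in valence-arousal space.
DISCRETE_TO_VA = {"joy": (0.8, 0.6), "sadness": (-0.7, -0.4), "fear": (-0.6, 0.7)}

def from_discrete(stimulus_id, modality, label, source):
    v, a = DISCRETE_TO_VA[label]
    return AffectRecord(stimulus_id, modality, v, a, source)

# Records from differently annotated studies end up in one comparable format.
dataset = [
    from_sam_1_to_9("rain_loop.wav", "audio", valence_1_9=6.2, arousal_1_9=3.1, source="study_A"),
    from_discrete("citrus_odorant", "odor", "joy", source="study_B"),
]
```

Records normalized in this way could then serve as shared supervision for the contrastive alignment sketched in Section 3.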
Acknowledgments

I would like to express my sincere gratitude to Professor Antonio Rodà and Professor Massimiliano Zampini for their invaluable discussions and advice throughout the development of this paper. Professor Rodà, from the Centro di Sonologia Computazionale (CSC), Department of Information Engineering, University of Padova, provided essential guidance and support. Professor Zampini, from the Center for Mind/Brain Sciences (CIMeC), University of Trento, offered key insights that significantly shaped this research. Their expertise and encouragement have been crucial to this work.

References

[1] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language Models are Few-Shot Learners, 2020. URL: http://arxiv.org/abs/2005.14165. doi:10.48550/arXiv.2005.14165, arXiv:2005.14165 [cs].
[2] S. Zhao, Y. Li, X. Yao, W. Nie, P. Xu, J. Yang, K. Keutzer, Emotion-Based End-to-End Matching Between Image and Music in Valence-Arousal Space, in: Proceedings of the 28th ACM International Conference on Multimedia, MM ’20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 2945–2954. URL: https://dl.acm.org/doi/10.1145/3394171.3413776. doi:10.1145/3394171.3413776.
[3] R. A. Calvo, S. D’Mello, Affect Detection: An Interdisciplinary Review of Models, Methods, and Their Applications, IEEE Transactions on Affective Computing 1 (2010) 18–37. URL: https://ieeexplore.ieee.org/document/5520655. doi:10.1109/T-AFFC.2010.1.
[4] C. Rudin, C. Chen, Z. Chen, H. Huang, L. Semenova, C. Zhong, Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges, 2021. URL: http://arxiv.org/abs/2103.11251. doi:10.48550/arXiv.2103.11251, arXiv:2103.11251 [cs, stat].
[5] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, I. Sutskever, Zero-Shot Text-to-Image Generation, 2021. URL: http://arxiv.org/abs/2102.12092. doi:10.48550/arXiv.2102.12092, arXiv:2102.12092 [cs].
[6] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-Resolution Image Synthesis with Latent Diffusion Models, 2022. URL: http://arxiv.org/abs/2112.10752. doi:10.48550/arXiv.2112.10752, arXiv:2112.10752 [cs].
[7] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi, M. Sharifi, N. Zeghidour, C. Frank, MusicLM: Generating Music From Text, 2023. URL: http://arxiv.org/abs/2301.11325. doi:10.48550/arXiv.2301.11325, arXiv:2301.11325 [cs, eess].
[8] U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, D. Parikh, S. Gupta, Y. Taigman, Make-A-Video: Text-to-Video Generation without Text-Video Data, 2022. URL: http://arxiv.org/abs/2209.14792. doi:10.48550/arXiv.2209.14792, arXiv:2209.14792 [cs].
[9] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning Transferable Visual Models From Natural Language Supervision, 2021. URL: http://arxiv.org/abs/2103.00020. doi:10.48550/arXiv.2103.00020, arXiv:2103.00020 [cs].
[10] J. Li, D. Li, C. Xiong, S. Hoi, BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, 2022. URL: https://arxiv.org/abs/2201.12086v2.
[11] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, K. Simonyan, Flamingo: a Visual Language Model for Few-Shot Learning, 2022. URL: http://arxiv.org/abs/2204.14198. doi:10.48550/arXiv.2204.14198, arXiv:2204.14198 [cs].
[12] R. Sheffer, Y. Adi, I Hear Your True Colors: Image Guided Audio Generation, 2023. URL: http://arxiv.org/abs/2211.03089. doi:10.48550/arXiv.2211.03089, arXiv:2211.03089 [cs, eess].
[13] A. J. Piergiovanni, I. Noble, D. Kim, M. S. Ryoo, V. Gomes, A. Angelova, Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities, 2024. URL: http://arxiv.org/abs/2311.05698. doi:10.48550/arXiv.2311.05698, arXiv:2311.05698 [cs].
[14] C. Team, Chameleon: Mixed-Modal Early-Fusion Foundation Models, 2024. URL: http://arxiv.org/abs/2405.09818. doi:10.48550/arXiv.2405.09818, arXiv:2405.09818 [cs].
[15] H. Laurençon, L. Tronchon, M. Cord, V. Sanh, What matters when building vision-language models?, 2024. URL: http://arxiv.org/abs/2405.02246. doi:10.48550/arXiv.2405.02246, arXiv:2405.02246 [cs].
[16] K. Gadzicki, R. Khamsehashari, C. Zetzsche, Early vs Late Fusion in Multimodal Convolutional Neural Networks, in: 2020 IEEE 23rd International Conference on Information Fusion (FUSION), 2020, pp. 1–6. URL: https://ieeexplore.ieee.org/document/9190246. doi:10.23919/FUSION45008.2020.9190246.
[17] F. Yang, B. Ning, H. Li, An Overview of Multimodal Fusion Learning, in: Y. Chenggang, W. Honggang, L. Yun (Eds.), Mobile Multimedia Communications, Springer Nature Switzerland, Cham, 2022, pp. 259–268. doi:10.1007/978-3-031-23902-1_20.
[18] S. Poria, E. Cambria, R. Bajpai, A. Hussain, A review of affective computing: From unimodal analysis to multimodal fusion, Information Fusion 37 (2017) 98–125. URL: https://www.sciencedirect.com/science/article/pii/S1566253517300738. doi:10.1016/j.inffus.2017.02.003.
[19] S. Zhao, S. Wang, M. Soleymani, D. Joshi, Q. Ji, Affective Computing for Large-scale Heterogeneous Multimedia Data: A Survey, ACM Trans. Multimedia Comput. Commun. Appl. 15 (2019) 93:1–93:32. URL: https://dl.acm.org/doi/10.1145/3363560. doi:10.1145/3363560.
[20] H.-T. Hung, J. Ching, S. Doh, N. Kim, J. Nam, Y.-H. Yang, EMOPIA: A Multi-Modal Pop Piano Dataset For Emotion Recognition and Emotion-based Music Generation, 2021. URL: http://arxiv.org/abs/2108.01374. doi:10.48550/arXiv.2108.01374, arXiv:2108.01374 [cs, eess].
[21] C. A. Levitan, S. Charney, K. B. Schloss, S. E. Palmer, The Smell of Jazz: Crossmodal Correspondences Between Music, Odor, and Emotion, in: CogSci, volume 1, 2015, pp. 1326–1331. URL: https://www.academia.edu/download/84424911/paper0233.pdf.
[22] F. B. Escobar, C. Velasco, K. Motoki, D. V. Byrne, Q. J. Wang, The temperature of emotions, PLOS ONE 16 (2021) e0252408. URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0252408. doi:10.1371/journal.pone.0252408.
[23] C. Boscher, C. Largeron, V. Eglin, E. Egyed-Zsigmond, SENSE-LM: A Synergy between a Language Model and Sensorimotor Representations for Auditory and Olfactory Information Extraction, in: Y. Graham, M. Purver (Eds.), Findings of the Association for Computational Linguistics: EACL 2024, Association for Computational Linguistics, St. Julian’s, Malta, 2024, pp. 1695–1711. URL: https://aclanthology.org/2024.findings-eacl.119.
[24] S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes, A. Del Giorno, S. Gopi, M. Javaheripi, P. Kauffmann, G. de Rosa, O. Saarikivi, A. Salim, S. Shah, H. S. Behl, X. Wang, S. Bubeck, R. Eldan, A. T. Kalai, Y. T. Lee, Y. Li, Textbooks Are All You Need, 2023. URL: http://arxiv.org/abs/2306.11644. doi:10.48550/arXiv.2306.11644, arXiv:2306.11644 [cs].