<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Journal of Advanced Computer Science and Applications</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/TKDE.2020.3014166</article-id>
      <title-group>
        <article-title>Advancements and Challenges in Generative AI: Architectures, Applications, and Ethical Implications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Flora Amato</string-name>
          <email>flora.amato@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Benfenati</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Egidia Cirillo</string-name>
          <email>egidia.cirillo@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Maria De Filippis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mattia Fonisto</string-name>
          <email>mattia.fonisto@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Galli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Marrone</string-name>
          <email>stefano.marrone@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lidia Marassi</string-name>
          <email>lidia.marassi@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincenzo Moscato</string-name>
          <email>vincenzo.moscato@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Narendra Patwardhan</string-name>
          <email>narendra.patwardhan@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Moccardi</string-name>
          <email>alberto.moccardi@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Elia Pascarella</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio M. Rinaldi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristiano Russo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlo Sansone</string-name>
          <email>carlo.sansone@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristian Tommasino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electrical Engineering and Information Technology (DIETI), University of Naples Federico II</institution>
          ,
          <addr-line>Via Claudio 21, 80125 Naples</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Interdepartmental Center for Research on Management and Innovation in Healthcare (CIRMIS), University of Naples Federico II</institution>
          ,
          <addr-line>Naples</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>7</volume>
      <issue>2015</issue>
      <fpage>2420</fpage>
      <lpage>2422</lpage>
      <abstract>
        <p>This paper presents the architecture, classification, and major applications of Generative AI interfaces, specifically chatbots. It details how Generative AI interfaces work under various Generative AI approaches and describes their architectures and operation. The generative model is built with advanced machine learning techniques to produce dynamic, contextually relevant responses automatically, whereas the retrieval-based model depends on a predefined response library. The paper also discusses the use of Generative AI to populate Multimedia Knowledge Graphs (KGs), presenting technologies based on semantic analysis, deep learning, and NoSQL to integrate and retrieve data more effectively. The social and ethical challenges that come with the deployment of generative models are critically reviewed. These discussions highlight the balance that must be maintained between technological progress and necessity, motivating the call for ethical responsibility in developing AI. The paper presents a comprehensive review of state-of-the-art Generative AI, with special focus on the promises and pitfalls of Generative AI research in both natural language processing and knowledge management.</p>
      </abstract>
      <kwd-group>
        <kwd>artificial intelligence</kwd>
        <kwd>Generative AI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>A chatbot, also known as a conversational agent, is an
artificial intelligence (AI) software that can simulate a
conversation (or a chat) with a user through text or voice
interfaces [1]. Chatbots can use natural language
processing (NLP) and machine learning algorithms to understand
user inputs and generate appropriate responses, allowing
them to provide assistance, automate tasks, and perform
other functions without the need for human intervention.</p>
      <sec id="sec-1-1">
        <p>The term "chatbot", short for "chatterbot", was originally coined by Michael Mauldin in 1994 to describe these conversational programs in his attempt to develop a system capable of passing the Turing test [2].</p>
        <p>This work aims to explore the various techniques, approaches, and technologies that have been used to develop chatbots since the late 1990s; furthermore, we provide insights into the most common applications and use cases.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Architecture and Classification of Generative AI Interfaces</title>
      <sec id="sec-2-1">
        <p>As a modern approach to the architecture of Generative AI interfaces, we follow [3, 4, 5] and divide the intelligent interface structure proposed in the state of the art into four parts: the interface, the multimedia processor, the multimodal input analysis, and the response generator. In detail:</p>
        <list list-type="order">
          <list-item><p>The interface is responsible for managing the interaction between the chatbot and users, which involves receiving inputs in various forms, such as text or audio, and returning appropriate responses.</p></list-item>
          <list-item><p>The multimedia processor (optional) may be required to preprocess voice or video signals, convert them into text, or recognize the user’s tone to facilitate response generation.</p></list-item>
          <list-item><p>The multimodal input analysis unit handles classification and data pre-treatment, often using natural language understanding (NLU) techniques such as semantic parsing, slot filling, and intent identification.</p></list-item>
          <list-item><p>The response generator either associates a proper response with the given pre-processed input from a stored dataset or, using modern machine learning techniques, maps the normalized input to the output using a pre-trained model.</p></list-item>
        </list>
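<p>As an illustration only, the four components above can be sketched as a minimal pipeline; all function names and the toy intent logic are our own, not taken from any of the surveyed systems.</p>

```python
# Minimal sketch of the four-part interface architecture (illustrative only).

def multimedia_processor(raw_input):
    # Optional stage: a real system would transcribe audio/video to text;
    # here it is a pass-through for text input.
    return raw_input

def input_analysis(text):
    # Toy NLU: normalize the input and tag a coarse intent.
    normalized = text.lower().strip()
    intent = "greeting" if "hello" in normalized else "other"
    return {"text": normalized, "intent": intent}

def response_generator(analysis):
    # Retrieval-style lookup: map the detected intent to a stored response.
    responses = {"greeting": "Hi! How can I help?", "other": "Could you rephrase?"}
    return responses[analysis["intent"]]

def interface(user_input):
    # The interface chains the components and returns the reply to the user.
    return response_generator(input_analysis(multimedia_processor(user_input)))
```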
        <p>The response generator is the core component of a
chatbot where the actual question-and-answer process
takes place, and it can be considered as the "brain" of the
system. Based on the architecture of the response
generator, chatbot systems can be classified into two main
categories: retrieval-based chatbots, which select their
responses from a pre-defined set of possible outcomes,
and generative-based chatbots, which use ML
techniques to dynamically generate answers [6].</p>
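<p>The retrieval-based side of this split can be sketched in a few lines; the patterns and canned replies below are invented for illustration.</p>

```python
import re

# Toy retrieval-based responder: returns the canned reply whose pattern
# matches the user input (illustrative patterns, not from the paper).
PATTERNS = [
    (re.compile(r"\b(hi|hello)\b", re.I), "Hello! How can I help you?"),
    (re.compile(r"\bopening hours\b", re.I), "We are open 9am-5pm, Monday to Friday."),
]
FALLBACK = "Sorry, I did not understand that."

def retrieve_response(user_input):
    for pattern, response in PATTERNS:
        if pattern.search(user_input):
            return response
    return FALLBACK
```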
        <sec id="sec-2-1-1">
          <title>2.1. Retrieval-based chatbots</title>
          <p>The goal of retrieval-based chatbots is to "understand" the user input and choose the most suitable responses from a knowledge dataset. There are four sub-categories of retrieval-based chatbots, which can be distinguished based on the architecture of their knowledge dataset and retrieval techniques. These categories are template-based, corpus-based, intent-based, and RL-based [5].</p>
          <p>Template-based chatbots. Template-based chatbots select responses from a set of possible candidates by comparing the user input to certain query patterns.</p>
          <p>Corpus-based chatbots. Although template-based chatbots have shown effectiveness in certain cases, their fundamental architecture necessitates scanning through all potential outputs for each input until the appropriate response is located. As a result, this approach can be slow and unsuitable for applications with a large knowledge dataset.</p>
          <p>Intent-based chatbots. Intent-based chatbots utilize machine learning techniques to establish a connection between user inputs and pre-defined outputs. Typically, relevant data is collected and stored to establish associations between user intents (i.e., the conceptual meaning behind a user’s request) and appropriate responses. Next, a pre-trained model leverages this information to link normalized user inputs with the most probable user intent [7].</p>
          <p>RL-based chatbots. RL-based chatbots adopt reinforcement learning for response generation. Reinforcement learning itself is mainly based on the Markov decision process, i.e. a 4-tuple (S, A, P, R) where:</p>
          <list list-type="bullet">
            <list-item><p>S = (s_1, s_2, ..., s_n) is a set of states, called the state space;</p></list-item>
            <list-item><p>A = (a_1, a_2, ..., a_m) is a set of actions, called the action space;</p></list-item>
            <list-item><p>P_a(s, s′) = Pr(S_{t+1} = s′ given S_t = s, A_t = a) is the probability that action a, in the state s at step t, will lead to state s′ at step t + 1;</p></list-item>
            <list-item><p>R_a(s, s′) is the reward received after transitioning from state s to state s′ when action a is performed.</p></list-item>
          </list>
          <p>The goal of a Markov decision process is to find a function π(s) (generally called the policy) that associates, with every state s, the action π(s) = a which maximizes the overall reward, i.e. the following expectation value:</p>
          <p>V = E[ Σ_{t=0}^{∞} γ^t R_{π(s_t)}(s_t, s_{t+1}) ]   (1)</p>
          <p>where γ is a coefficient (the discount factor) between 0 and 1 [8]. In RL-based chatbots, each state s corresponds to a specific turn in the conversation and is usually represented by an embedded vector. After the chatbot is trained, it is able to select the most appropriate response (action) a to ensure that the conversation remains relevant and coherent [9].</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Generative-based chatbots</title>
        <p>Generative-based chatbots have the advantage of being able to generate responses dynamically, which can lead to more natural and flexible conversations with users. Generative chatbots can generate novel responses, which means that they are not limited to pre-defined responses like retrieval-based chatbots. This flexibility allows them to provide more personalized and relevant responses. Depending on the machine learning architecture used, we will discuss RNN-based chatbots and Transformer-based chatbots.</p>
      </sec>
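<p>The discounted-return objective of Eq. (1) can be made concrete with a small policy-evaluation sketch; the two-state chain, the reward of 1 per step, and all names are a made-up example, not from the paper.</p>

```python
# Policy evaluation for a deterministic MDP: repeatedly apply
# V(s) = R(s, s') + gamma * V(s'), with s' = next_state[s][policy[s]],
# which converges to the discounted return of Eq. (1) for a fixed policy.
def evaluate_policy(next_state, reward, policy, gamma, n_states, sweeps=200):
    V = [0.0] * n_states
    for _ in range(sweeps):
        V = [
            reward[s][next_state[s][policy[s]]]
            + gamma * V[next_state[s][policy[s]]]
            for s in range(n_states)
        ]
    return V

# Two states that hand the conversation back and forth, reward 1 per step:
# the discounted return is 1 / (1 - gamma) = 2 for gamma = 0.5.
values = evaluate_policy(
    next_state=[[1], [0]],    # next_state[s][a]
    reward=[[1, 1], [1, 1]],  # reward[s][s']
    policy=[0, 0],            # action chosen in each state
    gamma=0.5,
    n_states=2,
)
```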
      <sec id="sec-2-3">
        <p>RNN-based chatbots. One commonly used method for developing generation-based chatbots involves the use of two interconnected neural networks known as recurrent neural networks (RNNs). The first network, called the encoder, is trained to associate an input sentence with an intermediate vector called the context vector. The second network, called the decoder, takes the context vector as input and is trained to generate an output sentence, either by generating actual words or by using tokens. This approach is commonly referred to as "sequence-to-sequence" or Seq2Seq [6, 10]. As RNN-based chatbot responses are dynamically generated through machine learning models, they may be less precise and more uncertain than those of retrieval-based chatbots. For this reason, RNN-based chatbots are less commonly used in task- or knowledge-oriented scenarios and are instead more frequently used in entertainment and mental-health-related activities [5].</p>
        <p>Transformer-based chatbots. A Transformer is a recent type of neural network architecture used for NLU and chatbots. First introduced in [11], it is also used in other tasks such as language translation and text summarization. Transformers are based on the self-attention mechanism, which allows the model to learn which parts of the input sequence to attend to at each step of processing, based on the relevance of the other parts of the sequence to the current position. This is done through a process called scaled dot-product attention, where the model learns a set of weights to compute a weighted sum of the input sequence representations.</p>
        <p>An important language model based on the Transformer architecture is the Generative Pre-trained Transformer (GPT), which was developed by OpenAI in 2020 [12]. GPT serves as the underlying architecture for the ChatGPT chatbot, which has gained widespread recognition for its ability to provide detailed and articulate responses across a variety of domains [13].</p>
      </sec>
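<p>Scaled dot-product attention, as described above, fits in a few lines of dependency-free Python; this is a single-head sketch without the learned projection matrices a real Transformer applies to Q, K, and V.</p>

```python
import math

# Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
# Q, K, V are lists of equal-length float vectors (one per position).
def scaled_dot_product_attention(Q, K, V):
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return outputs
```

<p>With a query aligned to the first key, most of the attention mass falls on the first value vector.</p>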
    </sec>
    <sec id="sec-4">
      <title>3. Multiquery Retrieval Augmented Generation</title>
      <sec id="sec-4-1">
        <p>This methodological section delves into the implications of leveraging Generative Artificial Intelligence (AI) to streamline and revolutionize complex decision-making processes, augmenting the power of cutting-edge technologies and enhancing the classical Retrieval-Augmented Generation (RAG) models. Through a meticulous exploration of a multi-query and human-centred RAG application design, access to and understanding of sophisticated AI capabilities is guaranteed, bridging the gap between technical expertise and practical application. The culmination of this inquiry is a concise and robust architectural flow proposal, laying the groundwork for the seamless integration of multiquery RAG solutions into decision-making processes and offering further insights that extend beyond the confines of this study and pave the way for future advancements in the field.</p>
        <p>At the current forefront of Generative Artificial Intelligence (Gen-AI), streamlining complex decision-making processes by providing tools that are accessible and comprehensible to all users is vitally important. The core of this section is to propose an alternative to the classical RAG, introduced by Lewis et al. in 2020 [14], enhancing its capabilities with a multiquery approach and presenting a concise and solid architectural flow along with the main evaluation metrics.</p>
        <sec id="sec-4-1-1">
          <title>3.1. Methodology</title>
          <p>Question Generation Chain. The multiquery-RAG system distinguishes itself through its ability to generate multiple variations of the original user query, in a human-like fashion, through a specialized question generation chain that produces a fixed number of alternative queries capturing distinct viewpoints and nuances associated with the original question. This diversification of the query set, if correctly fine-tuned, plays a pivotal role in surmounting the limitations of distance-based similarity searches in vector databases, ensuring a more comprehensive and more efficient document retrieval process than the classical retrieval process.</p>
          <p>Answer Generation Chain. Following the retrieval of information (documents), the system proceeds to generate answers by synthesizing and formulating responses using the data extracted from the documents and leveraging large language models (LLMs). Contextualizing and elaborating on this information ensures that the responses are both accurate and easily understandable for non-experts, facilitating broader accessibility and utilization of the information among a wider audience.</p>
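<p>The two chains can be sketched as follows; the variant generator and the retriever/LLM callables are stand-ins to show the flow, not a specific library API.</p>

```python
# Sketch of a multiquery RAG flow: expand the query, retrieve per variant,
# deduplicate the union of contexts, then synthesize one answer.

def generate_variants(question, n=3):
    # Stand-in for the question generation chain: in practice an LLM
    # produces n paraphrases of the original question.
    return [f"{question} (alternative phrasing {i + 1})" for i in range(n)]

def multiquery_rag(question, retrieve, answer_llm, n_variants=3):
    queries = [question] + generate_variants(question, n_variants)
    contexts = []
    for q in queries:
        for doc in retrieve(q):      # retrieve documents for each variant
            if doc not in contexts:  # keep the union, without duplicates
                contexts.append(doc)
    return answer_llm(question, contexts)  # answer generation chain
```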
        </sec>
        <sec id="sec-4-1-2">
          <title>3.2. Evaluation Criteria</title>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <p>This section outlines the principal metrics [15] that are integral for evaluating a Retrieval-Augmented Generation (RAG) system, measuring different aspects of the system’s performance, as presented in Figure 1.</p>
        <p>[Figure 1: RAG evaluation criteria.]</p>
        <p>Context Precision. This metric evaluates the signal-to-noise ratio within the retrieved contexts, measuring how many of the retrieved documents are actually relevant with respect to the user’s query.</p>
      </sec>
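<p>As a simplified, order-insensitive illustration (evaluation frameworks such as RAGAS compute rank-aware variants), the two context metrics can be expressed as:</p>

```python
# Simplified set-based versions of the two context metrics.

def context_precision(retrieved, relevant):
    # Fraction of the retrieved contexts that are actually relevant
    # to the user's query (the signal-to-noise ratio of retrieval).
    if not retrieved:
        return 0.0
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved)

def context_recall(retrieved, ground_truth):
    # Fraction of the ground-truth contexts that were retrieved,
    # i.e. whether everything needed to answer is available.
    if not ground_truth:
        return 0.0
    found = sum(1 for doc in ground_truth if doc in retrieved)
    return found / len(ground_truth)
```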
      <sec id="sec-4-4">
        <p>Context Recall. This metric assesses whether all necessary information required to answer the query has been retrieved, ensuring that the system’s knowledge base covers all aspects needed to formulate a comprehensive and accurate response; it relies on a comparison between the retrieved contexts and the ground truths.</p>
        <p>Faithfulness. This metric quantifies the factual accuracy of the answers generated by the RAG system. It involves counting the number of correct factual statements made in the generated answers based on the retrieved contexts and comparing this count to the total number of statements in the answers.</p>
        <p>Answer Relevancy. This metric measures how well the generated answers address the user’s queries. For example, if a query asks for multiple pieces of information, the relevancy score reflects how completely the response addresses all elements of the query.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Multimedia Knowledge Graph population using Generative AI</title>
      <p>Knowledge Graphs (KGs) serve as potent repositories, adeptly organizing, connecting, and extracting insights from many data sources, embodying contemporary knowledge management principles in semantic web applications [16]. Despite their invaluable utility, realizing the full potential of KGs necessitates a systematic population with relevant information, a task fraught with challenges, mainly when data is scarce [17].</p>
      <p>Recent advancements, however, offer promising solutions. [18] and [19] present novel frameworks integrating semantic analysis, deep learning, and NoSQL technologies to extract entities from knowledge corpora, bridging the gap between textual and multimedia sources. Their approaches mark significant strides in enriching KGs with diverse data types, fostering more comprehensive knowledge representation and analysis.</p>
      <p>Meanwhile, Chen et al. [20] propose a generative approach to KG population, leveraging machine learning to establish relationships and reduce human intervention in the curation process. Training models to learn underlying data distributions and generate triplets regardless of entity pair co-occurrence in textual corpora paves the way for more efficient and scalable KG construction. This innovative approach streamlines the population process and broadens the scope of knowledge capture, enabling KGs to encapsulate a wider array of interconnected concepts and relationships.</p>
      <p>Manual curation, though traditional, is labor-intensive and impractical in the face of expanding data landscapes [21]. To address this, a data-centric architecture harnessing generative deep-learning models emerges, automating KG creation, particularly for multimedia instances. By synthesizing multimedia data, irrespective of absolute data scarcity, a dynamic, infinitely expandable pool of instances is ensured, underpinning model training and inference with a multimedia knowledge graph that evolves alongside data trends.</p>
      <p>Different knowledge graph population approaches with generative AI are based on standard steps. The first is grabbing information from curated textual sources. It is possible to enrich it by using Linked Open Data (LOD) and to base the image generation on the enhanced textual description, making the text as complete as possible. The next step combines the previously obtained textual statements and produces a representative multimedia instance of the input text via a generative text-image synthesis model. The last step consists of using a focused crawler, which allows a check on the quality of the generated image, exploiting different metrics useful to measure the degree of similarity of the generated image with respect to its textual description and to real images crawled from the web. If the image from the previous step exhibits metric values that surpass a threshold determined through experimental evaluation, it can be stored in the node of the multimedia knowledge base.</p>
      <p>In image generation for knowledge graph population, text-image synthesis models are developed to bridge the semantic gap between textual descriptions and corresponding visual representations. These models leverage cutting-edge generative strategies to produce high-quality images aligned with the provided textual prompts. The application of text-to-image models has improved greatly in recent years, migrating from Generative Adversarial Networks (GANs) to Latent Diffusion Models, such as Stable Diffusion [22]. A latent diffusion model refines a latent representation by applying diffusion steps in the latent space, gradually reducing noise and revealing the desired image. This iterative process involves adding noise and updating the latent code. The model implements a decoder network to reconstruct the image from the refined latent code.</p>
      <p>The evaluation phase of the quality of multimedia instances for the KG node is important. The evaluation process of text-to-image synthesis models involves assessing their accuracy in converting text inputs into synthetic images. Some quantitative metrics are used to assess not only the quality of the image with respect to the text but also the degree of realism of a generated image by comparing it to real images: Cosine Similarity, which compares the feature vectors by calculating the cosine between them; FID (Fréchet Inception Distance) [23], a numerical value that quantifies the similarity between the statistical distributions of real and generated images by computing the Fréchet distance between the two distributions; and CLIP score [24], a metric that captures the relationship between images and text, used to evaluate the model’s ability to rank images based on their relevance to a given textual description and vice versa.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Ethical and social challenges</title>
      <p>The recent advances in generative AI are revolutionizing many sectors thanks to the ability to create original content based on patterns learned from training data. Models such as those based on transformer architectures have already demonstrated significant success in various fields, including natural language processing, computer vision, and reinforcement learning. However, despite the advantages offered by generative models, their development and deployment raise concerns regarding ethical and environmental implications. Firstly, these models require massive computational resources and consume a large amount of energy during both training and execution. This raises concerns about the environmental impact of AI, especially considering the urgent need to reduce carbon emissions to address climate change. Additionally, there are ethical concerns regarding the use and management of training data. Since these models can generate original content, there is a risk that they may perpetuate biases or discriminations present in the training data, raising questions about fairness, privacy, and data security in the era of AI [25].</p>
      <p>The Hominis project, conducted at the University of Naples Federico II in collaboration with industrial partners (DeepKapha), aims to advance toward sustainable and programmable AI solutions [26]. The project focuses on creating a concrete sustainable generative model, addressing crucial issues related to data collection, key model components, and essential additions. One of the main goals of the project is to improve model efficiency without compromising performance, using techniques such as attention and linear layer optimization within the Transformer architecture. Hominis also aims to ensure the sanitization of public data and to develop data collection strategies that capture a wide range of multifaceted data. Additionally, the project involves developing tools for the community to analyze, curate, and critique datasets while ensuring fairness, privacy, and legality. The proposed methodologies, such as Universal Tokenization, Retrieval-Augmented Generation (RAG), the use of diffusion to improve model controllability, and the use of the muTransfer technique to optimize hyperparameters and reduce the carbon footprint associated with training, all aim to improve the efficiency, sustainability, and fairness of AI models. In particular, the approach of unifying data through Universal Tokenization can help better manage data diversity, while RAG can improve model relevance and accuracy, ensuring greater fairness in outcomes. Furthermore, the use of diffusion to improve model controllability helps ensure that AI outputs are transparent and understandable. Today, attention to sustainable, adaptable, and responsible AI is crucial to ensure that the benefits of artificial intelligence are evenly distributed and that negative impacts, such as the carbon footprint associated with model training, are minimized. In an era where sustainable and responsible AI is essential for our future, projects like Hominis represent a step in the right direction, helping ensure that the benefits of AI are accessible to all while minimizing negative impacts on the environment and society.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was partially supported by PNRR MUR Project PE0000013-FAIR. The FAIR project is committed to promoting an advanced vision of Artificial Intelligence, driving research and development in this crucial field while constantly keeping ethical, legal, and sustainability considerations in mind.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>