
ACM KDD Conference, August

1613-0073

Integration Use Case

Amirhossein Ghaffari

amirhossein.ghaffari@oulu.fi 0 1

Huong Nguyen

huong.nguyen@oulu.fi 0

Alaa Saleh

alaa.saleh@oulu.fi 0

Lauri Lovén

lauri.loven@oulu.fi 0

Ekaterina Gilman

ekaterina.gilman@oulu.fi 0

Smart City, Transportation, Federated Learning, Edge Computing, Generative AI, RAG

0 Center for Ubiquitous Computing, University of Oulu, Oulu, Finland 1 Infotech Oulu, University of Oulu, Oulu, Finland

2021

26 2024 249 253

This paper presents a system for predicting and warning about traffic accidents in smart cities, aimed at enhancing urban safety through advanced data analysis and explained warnings and reports. Our system emphasizes computational efficiency and data privacy, predicting traffic accident severity with good accuracy. By integrating real data with external knowledge sources, the system produces detailed, contextually relevant reports and warnings. Implemented with effective task orchestration, our system ensures seamless integration and resource management. Evaluation results demonstrate high accuracy and scalability, highlighting its potential for practical application in smart city environments. Future work will focus on further enhancing model efficiency, exploring transfer learning for broader applicability, and conducting real-world deployments to validate system performance.

When edge computing is integrated with AI known as

CEUR ceur-ws.org

1. Introduction

jected to live in urban areas [1]. Urbanization, driven by population growth and migration towards cities, presents both opportunities and challenges such as overpopulation and traffic congestion [2]. Developing smart cities is a strategic approach to mitigate these challenges.

A “smart city” integrates information and communication technology to enhance urban living [3]. This concept emphasizes the interconnection of community, people, and technology, aiming to prioritize human needs [4].

Urban mobility and transportation are significant challenges, with traffic congestion and accidents being major concerns. Annually, traffic accidents result in 1.35 million deaths globally, underscoring the critical need for effective accident prevention measures [5].

In large-scale Internet of Things (IoT) ecosystems, efficient data processing is crucial. Centralized cloud servers face latency and security challenges for many application domains, making real-time processing difficult [

Edge computing aims to address these limitations by bringing computational resources closer to data sources, enabling timely processing and reducing latency [7, 8].

© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

tem, containing two AI modules: first, a Federated Learning (FL) [18] model to predict traffic accident occurrences and estimate severity, and second, Generative Artificial Intelligence (GenAI) to generate reports and warnings. Moreover, we utilized k0s, a lightweight Kubernetes distribution, for efficient task orchestration [19]. The task orchestration capabilities of k0s are crucial for seamlessly integrating the FL models and Retrieval-Augmented Generation (RAG) processes across multiple edge nodes. This enables automated deployment, scaling, and management of tasks, ensuring high availability, fault tolerance, and robust performance monitoring for our accident prevention warning system.

The contributions of this work can be summarized as follows:

1. We integrate two different kinds of AI modules into a coherent distributed system supporting accident prevention. We comprehensively evaluate this system and analyze the related challenges and opportunities.

2. We orchestrate tasks and monitor our system, examining its feasibility for real-world smart city environments.

The remainder of this article is organized as follows. Section 2 discusses related work, while Section 3 describes the system design, and Section 4 details the implementation. Section 5 then provides a detailed system evaluation and metrics, Section 6 discusses our findings, implications, and future research directions, and Section 7 concludes the work.

2. Related Work

2.1. Intelligent Transportation System in Smart City

Intelligent Transportation Systems (ITS) are essential for the advancement of smart cities, with many recent studies dedicated to improving urban traffic management and safety. Here, we discuss several key works that have made significant contributions to this field. As an example, Hasan et al. [20] used the Google Distance Matrix and Directions APIs to provide advanced traffic jam alerts. Their Internet of Vehicles (IoV) module detects accidents and, with the assistance of the National Data Warehouse and a GPS module, notifies the nearest clinic. They developed an Android application for routing suggestions and employed an Arduino with a Sonar sensor, temperature sensor, gyroscope, piezo sensor, and GSM module as the core processing unit.

Working on one of the most trendy applications, Bortnikov et al. [15] developed a 3D Convolutional Neural Network (CNN) to recognize accidents automatically. They trained the CNN using a custom video game to create accident scenes with various weather and lighting conditions, adding noise to diversify the data. The model was then tested on real traffic videos from YouTube. The novelty of this research lies in the use of video games to generate datasets, which are challenging to replicate in real-life scenarios. Yu et al. [21], with the same aim, proposed a Deep Spatio-Temporal Graph Convolutional Network for traffic accident prediction on Beijing traffic data, which was collected hourly over three months and includes accident records (time and location), vehicle speeds, meteorological conditions, and points of interest. Recent research has considered informing other vehicles after detecting traffic accidents using IoT, IoV, and related technologies. Zhou et al. [22] proposed an accident detection algorithm based on spatio-temporal feature encoding with a multilayer neural network. This method first detects border frames as potential accident frames, then encodes the spatial relationships of detected objects to confirm an accident. The process involves using Histogram of Oriented Gradients and ordinal features initially, followed by CNN feature encoding and object relationship detection with a multilayer neural network. A trained Support Vector Machine then confirms the presence of an accident.

Another approach, aimed at reducing accidents before they occur, is the work of Uma and Eswari [16], who developed a prototype using a Raspberry Pi and Pi Camera, along with sensors to monitor the driver's eye movements, detect yawning, and identify toxic gases and alcohol consumption. This system, employing the Haar Cascade algorithm for face detection and calculation of the Eye Aspect Ratio and Mouth Aspect Ratio, estimates risk through analysis of these features. Besides, to identify accident hot spots, Le et al. [23] used Road Traffic Accident data over three years in Hanoi, Vietnam, to develop a GIS-based statistical analysis technique. This method assesses the influence of accident severity on temporal-spatial patterns, identifying accident hotspots in relation to specific times of day and seasons.

Beyond the potential cloud service support for autonomous vehicle applications mentioned in [24], edge computing is playing a pivotal role in reshaping traffic management in smart cities. Within this domain, Mohamed's [25] and Zhou's [26] research groups demonstrated substantial improvements in traffic management and reduced congestion durations through an edge-based model for real-time traffic data analysis. Besides, to achieve low latency and high prediction accuracy on vehicle identification at the edge, Wan et al. [27] eliminated redundant frames from collected videos and presented an approach for real-time video processing.

In a similar manner, Ke et al. [28] developed a multi-thread system for real-time detection of near-crash events in traffic, using video analytics on dashcams. Leveraging edge power, their system efficiently performs object detection and tracking directly from the video feeds on board. This approach involves removing irrelevant video to conserve bandwidth and storage while collecting diverse and valuable data for traffic safety, such as road user type, vehicle trajectory, vehicle speed, brake switch, and throttle. The approach from Ke et al. demonstrates considerable promise for widespread application due to its low cost, real-time processing, high accuracy, and broad compatibility with various vehicles and camera types.

Additionally, a recent work by Nguyen et al. [29] utilized Blockchain technology alongside edge computing to develop a reliable and transparent situational awareness system for autonomous vehicles. Their system broadcasts notifications and alternative route suggestions from the nearest edge station when congestion or accidents are detected by other vehicles, using various sensing data sources, including dashcam images and environmental factors like weather, temperature, and humidity. The use of Blockchain in their study ensures data validity and integrity, and facilitates collaboration among different service providers. However, despite the recognized vision and applications, Zhou et al. [30] emphasized that employing edge computing in ITS always comes with inherent challenges related to sensor failure and privacy protection, which must be addressed for effective implementation.

2.2. FL in ITS

Building on the challenges identified by Zhou et al. [30], particularly concerning privacy protection, FL has recently been used more widely in smart cities. Among the many applied domains within urban environments, FL applications in traffic systems are mostly leveraged for traffic monitoring and accident prediction.

FedGRU, an FL-based Gated Recurrent Unit (GRU) neural network [17], is one of the pioneering works on traffic flow prediction (TFP) with federated deep learning. It performs comparably to other advanced competing methods without compromising the privacy and security of data. Additionally, as shown by experiments, the joint announcement protocol proposed in that paper reduces communication overhead by 64.10% compared with centralized models, indicating the scalability of FedGRU to bigger networks.

With the same motivation to address the privacy exposure risk of centralized machine learning, Qi et al. [31] presented a fully decentralized FL network, utilizing a Blockchain-based FL architecture as opposed to the conventional vanilla framework. The authors employed the local differential privacy technique to protect vehicle location and utilized GRU to achieve accurate TFP. Qi et al. also conducted comparative analyses in terms of both performance and security, examining various machine learning models and contrasting scenarios with and without blockchain implementation. Concerning the monitoring of traffic congestion, typical systems begin by detecting vehicles and subsequently estimating traffic flow density. In their research, Xu et al. [32] employed remote sensing images for this purpose, while Chougule et al. [33] continuously used the estimated traffic density from intersection-captured images to dynamically adjust the duration of the green light and schedule the timing of signals across all lanes.

As one of the highlights in the narrow field of applying FL to risk detection in ITS, Yuan et al. [34] introduced FedRD, a framework combining edge-cloud computing, FL, and differential privacy techniques for intelligent road damage detection and warning. The framework not only improves detection performance and coverage area but also addresses privacy concerns through Individualized Differential Privacy with a pixelization technique. Comprehensive evaluations demonstrate FedRD's capability to deliver high detection accuracy and wider coverage while preserving user privacy, even in scenarios where edge devices have limited data. This groundbreaking effectiveness sets a new benchmark in the field.

2.3. GenAI in ITS

Recently, GenAI has garnered significant attention in several applications, including ITS, due to its advantages and flexibility. By analyzing data from various sources, such as roadside sensors, vehicles, and traffic signals, GenAI enhances urban operations by detecting patterns, identifying trends, and providing accurate predictions and advice. By leveraging natural language processing, GenAI can present these predictions in human-understandable language, making these technologies more accessible and practical for smart services [35]. See prior works [36, 37] for examples of how GenAI is integrated into many services within cities. As another example in ITS, Impedovo et al. [38] propose a deep generative model to predict weekday vehicular traffic flow to prevent accidents in the most critical areas and improve continuity by reducing traffic. More notably, RAG, first introduced by Lewis et al. in 2020 [39], stood out as a part of this GenAI world, representing a distinct approach to generating text, informed reasoning, and supporting decision-making.

Its application in ITS is not yet widespread; however, there are some notable works. For instance, Dai et al. [40] integrated RAG into autonomous driving systems to enhance decision-making processes. According to the authors, the use of RAG in their work addresses the problem of impractical content generated by today's mainstream foundation models, such as GPT-4 or LLaMA. It helps these models enhance the reliability of their outputs during the generation phase by dynamically retrieving accurate contextual information from external databases (e.g., updated traffic rules, driving experiences, or human preference). Similarly, Ding et al. [41] utilized RAG for more controlled generation of traffic scenarios. Specifically, RealGen [41] synthesizes new scenarios by combining behaviors from multiple retrieved examples in a gradient-free manner, using templates or tagged scenarios. This in-context learning framework provides versatile generative capabilities, including scenario editing, behavior composition, and the creation of critical scenarios, thus enhancing the adaptability and precision of synthetic data generation for various applications. Most recently, in his Master's thesis, Mohanan [42] evaluated eight embedding RAG models for a chatbot tailored to Indian Motor Vehicle Law.

As can be seen, prior research typically focuses on a single module, such as risk estimation or warning generation, limiting possible support for ITS. This raises an open question: “Is it possible to integrate all diverse components into a cohesive and comprehensive ITS framework?” This is where our work is positioned.

3. System Design

This article presents a system for predicting and preventing traffic accidents. It is capable of predicting possible accidents based on the traffic conditions and other available data, and provides detailed textual comments to the user explaining the grounds leading to such estimation. Figure 1 illustrates the overall system flow, highlighting the interplay between the key components: Federated Learning (FL) and Retrieval-Augmented Generation (RAG).

[Figure 1: Overall system flow. Severity estimation by FL (sensor data, preprocessing, accident severity prediction model, estimated accident severity) feeds accident report generation by RAG (query, semantic meaning model, embeddings, similarity search library, relevant chunks from a comprehensive analysis of US accident data, warning generation model, traffic accident report).]

This integrated system combines the strengths of RAG and FL to ensure high-quality outputs while maintaining data privacy and relevance. FL enhances the accident severity prediction model while maintaining data privacy. The RAG system integrates the warning generation model with the knowledge retrieval model to enhance the generation process with relevant external data, improving context and accuracy.

Our training approach starts from data preprocessing. The preprocessed dataset is then used to train the FL model for traffic accident risk estimation. The predictions, along with the sensors' real-time data, are utilized as input for the RAG model. The RAG model integrates advanced retrieval mechanisms with state-of-the-art language generation capabilities to produce detailed warnings and reports for traffic accidents.

To efficiently manage and deploy these components, we use a task orchestration tool. This tool ensures seamless integration and coordination among the various models, automates deployment, and scales the system as needed. Additionally, it facilitates robust performance monitoring, ensuring high availability and fault tolerance across the system.

3.1. Dataset

This study uses the US Accidents (2016-2023) dataset¹ [43] from Kaggle, distributed under the CC BY-NC-SA 4.0 license. This dataset comprises over 7.7 million (7,728,394) traffic accident records, covering 49 states of the USA from February 2016 to March 2023. The accident data were collected using multiple APIs that provide streaming traffic incident data captured by various entities, including the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. The data includes detailed information on accident severity, location, time, and weather conditions. This dataset was utilized to train the FL models for traffic accident prediction.

3.2. Federated Learning

Our application relies on an FL model for accident risk estimation. FL was selected based on two primary considerations: data privacy and collaborative enhancement.

1. Privacy: Addressing privacy concerns, vehicles in a real scenario do not transmit raw data, which could potentially reveal sensitive information. Instead, only model parameters are sent, ensuring that individual data remains secure and private. This cannot be done with traditional centralized learning, where all data need to be sent to a central server for training.

2. Collaboration: When a vehicle updates and shares its model parameters, it contributes to the overall learning process. This collective effort improves the overall model's performance, as it can learn from a wide range of diverse and localized inputs. The shared knowledge enables more accurate and robust risk estimation.

The training data features provide a detailed view of accident records, including the specifics of the accidents, the geographic locations, the prevailing weather conditions at the time of the accidents, and various environmental and contextual factors that may be relevant to analyzing the accidents. In a real scenario, the vehicle's onboard computing system uses these inputs to continuously update its local model, learning from real data.

Once the training is done, the model parameters are sent to the nearby edge server. The server, after receiving a sufficient number of models, aggregates them to obtain the global model, which is then sent back to the participating vehicles. When this whole process is complete, we finish one communication round and continue to the next round.

3.3. Retrieval-Augmented Generation

RAG combines an information retrieval component with a text generator model to provide situational information and guidance [44]. In the ITS context, RAG can integrate various external data sources to analyze and report traffic accidents, identifying risk factors and details [45]. This makes the system more dynamic and adaptable to new information. In our system (see Figure 1), RAG provides textual accident warnings to the end user, along with explanations of how the estimates were derived.

Knowledge retrieval model: It is designed to find the most relevant information from an external knowledge base in response to the query. This enhances the FL model output and sensor data with relevant information. We use SentenceTransformers² as a retrieval model based on similarity search.

Warning generation model: It is designed to generate new content using language models. It uses the information retrieved by the retrieval model and the FL output details to generate a response. For our system, we use gpt-3.5-turbo-0613³ to create contextually relevant warnings and detailed reports. The accident report includes the severity of the accident, the location and traffic control procedures, and guidance and actions.
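The communication round described in Section 3.2 (local training, upload of parameters, aggregation, redistribution) can be illustrated with a small sketch of the aggregation step. The paper does not name the aggregation rule, so the sample-weighted averaging below (in the style of FedAvg) is an assumption, and `federated_average`, the parameter names, and the toy values are hypothetical.

```python
from typing import Dict, List

# A model is represented here as a mapping from parameter name to a flat
# list of floats; real models would use tensors (assumption for brevity).
Params = Dict[str, List[float]]


def federated_average(client_params: List[Params], client_sizes: List[int]) -> Params:
    """Sample-weighted average of client parameters (FedAvg-style, assumed)."""
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client model")
    total = sum(client_sizes)
    global_params: Params = {}
    for name in client_params[0]:
        averaged = [0.0] * len(client_params[0][name])
        for params, size in zip(client_params, client_sizes):
            weight = size / total  # clients with more samples count more
            for i, value in enumerate(params[name]):
                averaged[i] += weight * value
        global_params[name] = averaged
    return global_params


# Three hypothetical vehicles, each uploading one tiny layer.
clients = [
    {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]},
    {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]},
    {"layer.weight": [5.0, 6.0], "layer.bias": [2.0]},
]
sizes = [100, 100, 200]
global_model = federated_average(clients, sizes)
```

In this sketch the edge server would send `global_model` back to every participating vehicle, which completes one communication round.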

3.4. Task Orchestration and Monitoring

Effective resource management and device health monitoring are essential for enhancing the responsiveness of smart city services. This requires comprehensive system monitoring that spans from edge devices to the cloud. The deployment of applications on edge devices necessitates advanced task orchestration platforms, which must be carefully selected based on specific requirements. Given that edge devices typically have limited resources, the chosen tool must operate smoothly under such constraints. For the proposed system, k0s⁴ has been selected. We selected k0s because of its minimal resource consumption on edge devices and its straightforward and rapid implementation process, supported by comprehensive documentation and active developer forums. It typically operates with as little as 1 CPU and 512 MB of RAM on each controller node and 1 GB of RAM on each worker node, which aligns well with the capabilities of edge devices. However, the minimum requirements increase with the number of worker nodes. Additionally, numerous monitoring options compatible with k0s are available. k0s is packaged as a single, self-extracting binary that embeds the Kubernetes binaries. Its benefits include having no OS-level dependencies, since everything can be, and is, statically compiled.

¹ https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents
² https://sbert.net/
³ https://platform.openai.com/docs/models/gpt-3-5-turbo
⁴ https://docs.k0sproject.io/stable/

4. System Implementation

4.1. Risk Estimation with FL

4.1.1. Preprocessing

The preprocessing phase for our system includes a series of essential data preparation steps to ensure the quality of the dataset for further analysis:

1. Data Cleaning: Duplicated and missing values were removed.

2. Feature Engineering: To enhance the informativeness of the dataset, a new feature called “Comfort_Index” is created following Equation 1.

$$\text{Comfort\_Index} = (\text{Temperature} - 32) \times (\text{Humidity} / 100) \qquad (1)$$

3. Data Resampling: To address the imbalance issue, both random oversampling and undersampling of the data were applied to ensure that each label had an equal distribution.

4. Data Transformation: Done according to feature type:
• Categorical Data: One-hot encoding was applied to categorical columns, except for “Street,” “State,” and the target label “Severity”.
• Boolean Data: Columns with two distinct values were binarized, converting them to 0 and 1.
• Numeric Data: Columns containing numeric data were left unchanged, preserving their original values.

5. Standardization: The dataset was then standardized with StandardScaler. This process ensured that all features had consistent scales and values within a particular range.

4.1.2. FL Training and Prediction

To simulate a real-world scenario using our chosen dataset, we distributed the data across several nodes and established certain assumptions. This section elaborates on those details.

Distribution: The data is divided into five equal parts, corresponding to five nodes in the system. We also make sure the number of samples of each label is distributed equally among clients.

Model Training: Each client trains its local model, consisting of three fully connected layers. Training specifications include the cross-entropy loss function, the Adam optimizer with a learning rate of 1e-3, and a batch size of 32. After ten training epochs, the locally trained models are aggregated by the server into a global model, and the global parameters are saved at each checkpoint, here at each communication round, before being sent back to the participants for training in the next round.

The FL training process concludes after ten communication rounds. At this stage, various model architectures, encompassing differing layer counts and hyperparameters, were evaluated over 50 communication rounds to observe the trend and convergence in performance. The selected model outperformed alternatives; models with reduced layers demonstrated inferior outcomes (3-4%), while configurations with additional layers, despite a 3% accuracy improvement, incurred prolonged training duration and converged to local, rather than global, optima. See Table 1 for details.

Using the RAG model, we retrieve text passages using an input sequence. During the generation of the target sequence, we include these passages as additional context. Our model leverages two components, which are implemented in LangChain⁵. The retriever retrieves relevant text snippets in response to a user's query or prompt, based on a knowledge source that is uploaded using a built-in document loader from LangChain.

In our system, we rely on the US traffic accident database as an external knowledge source, containing a comprehensive analysis of US traffic accident data [46]. This report provides insight into preventive measures and policy recommendations for decreasing traffic accidents in the US based on detailed analyses by state, time, and contributing factors such as weather. The retrieval process begins with loading documents using a tool in LangChain. This process is enhanced by a splitter tool, also integrated into LangChain, designed to segment extensive texts into smaller chunks based on a specified chunk size by examining characters recursively, which is crucial for the efficient handling of large textual data.

For the creation of text embeddings, we employ HuggingFaceEmbeddings, a specialized embedding model from the Hugging Face library⁶ within LangChain. This model transforms the segmented text chunks into numerical vectors, facilitating their computational handling.

⁵ https://www.langchain.com/
⁶ https://huggingface.co/

To store these embedding vectors in a vector store, we utilize the FAISS library, a robust vector database. It enables effective similarity search by identifying the text chunk vectors most similar to the question vector. This process is vital to determine which portions of the knowledge source are most pertinent to the input query. The stored vectors are retrieved at query time based on the k argument, which selects the top k most relevant text chunk vectors for each query. Table 2 summarizes the RAG parameters used.
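To make the top-k retrieval step concrete, the following sketch ranks text chunks by similarity to a query vector. It is a brute-force stand-in written in plain Python, not the FAISS or LangChain API. The use of cosine similarity, the function names, and the three-dimensional toy vectors are assumptions chosen for illustration; the actual index type and distance metric are not stated in the text.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_vec: List[float],
    chunk_vecs: List[List[float]],
    chunks: List[str],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k chunks whose vectors are most similar to the query."""
    scored = [
        (chunk, cosine_similarity(query_vec, vec))
        for chunk, vec in zip(chunks, chunk_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (hypothetical); a real system would obtain them from
# the embedding model applied to each text chunk and to the query.
chunks = ["rain reduces visibility", "speed limits by state", "fog and wet roads"]
chunk_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.8, 0.0, 0.6]]
query_vec = [1.0, 0.0, 0.2]
results = top_k_chunks(query_vec, chunk_vecs, chunks, k=2)
```

A vector store performs the same ranking with an index structure instead of scanning every chunk, which is what makes retrieval practical for large knowledge sources.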

The generator creates a more detailed, factual, and relevant response based on the original input and retrieved documents. The original input represents the severity of an accident, derived from the FL output and complemented by real-time sensor data. For the generation of coherent and contextually relevant text, the original input and the retrieved documents are fed into gpt-3.5-turbo-0613, a sophisticated pre-trained language model.
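As an illustration of how the generator input might be assembled, the sketch below combines an FL severity estimate, sensor readings, and retrieved chunks into a single prompt string. The function `build_warning_prompt`, the prompt wording, and the example values are hypothetical and are not the prompt used in the system; only the report contents (severity, location and traffic control procedures, guidance and actions) follow the description in Section 3.3.

```python
from typing import Dict, List


def build_warning_prompt(
    severity: int,
    sensor_data: Dict[str, object],
    retrieved_chunks: List[str],
) -> str:
    """Assemble one prompt from the FL output, sensor data, and context."""
    # Sort keys so the prompt is deterministic for the same inputs.
    sensor_lines = "\n".join(
        f"- {key}: {value}" for key, value in sorted(sensor_data.items())
    )
    context = "\n\n".join(retrieved_chunks)
    return (
        f"Estimated accident severity: {severity}\n\n"
        f"Real-time sensor data:\n{sensor_lines}\n\n"
        f"Reference context:\n{context}\n\n"
        "Write a warning and a detailed accident report covering the severity, "
        "the location and traffic control procedures, and guidance and actions. "
        "Use only the information given above."
    )


# Placeholder inputs (hypothetical values, not taken from the dataset).
prompt = build_warning_prompt(
    severity=3,
    sensor_data={"Visibility(mi)": 0.5, "Weather_Condition": "Fog"},
    retrieved_chunks=["<retrieved chunk 1>", "<retrieved chunk 2>"],
)
```

The resulting string would then be passed to the language model; keeping the severity, sensor data, and retrieved context in separate labelled blocks makes it easier to check later whether a generated report stays faithful to its inputs.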

Based on the content of these documents, the model generates coherent and contextually relevant text grounded in real-world information. Figure 2 illustrates an example of a traffic accident report generated by RAG.

4.3. Task Orchestration and Monitoring

As discussed in Sub-section 3.4, we opted for k0s, which is ideal for our needs and simple to implement. We used Lens IDE, a Kubernetes IDE, to manage the cluster and monitor the whole system. It allows for comprehensive oversight of nodes, pods, and resource monitoring. Monitoring involves tracking the usage of CPU, memory, storage, and network bandwidth, and monitoring device safety and functionality to detect any potential problem. We containerized our application using Docker and deployed it using Lens IDE and the k0s task orchestration tool. We used Cluster metrics in the Lens IDE to monitor the resources efficiently.

5. System Evaluation

To assess the system's performance, several key metrics were employed. We want to ensure that all the components work correctly both independently and in the integrated system. First, we monitored the accuracy of the FL model for risk estimation, assessing its ability to predict traffic accident severity. This evaluation utilized the dataset for training the model. Additionally, the quality and relevance of warnings and reports generated by the RAG model were assessed. The system's prompt responsiveness was also tested, particularly how quickly it can generate alerts and warnings based on incoming data. Furthermore, the resource management aspect was evaluated to ensure that the system's resource usage is optimized and well-maintained. The developed system was deployed and tested on a real cluster of three nodes with k0s equipped with the monitoring application.

5.1. Risk Estimation Evaluation

5.1.1. Accuracy

We monitor the training process of the FL model in terms of accuracy, loss, and convergence. The training for 50 communication rounds with 5 training clients takes up to 4.042 hours.

and the training loss in the lower graph. The model demonstrates convergence by approximately round 30 at 71.15%, as depicted in the upper plot. Initially, model accuracy exhibits an upward trend from round 0 to 30, albeit with fluctuations observed around rounds 15-17 and 21. Subsequently, after round 30, the risk estimation model appears to have reached a plateau in accuracy, having converged. This is also reflected in the lower graph of training loss.

[Figure: training accuracy (upper graph) and training loss (lower graph) over communication rounds.]

It is, however, possible for devices with limited power and resources to terminate the training process at an earlier stage, such as after round 10 or 20, with negligible tradeoffs in accuracy.

5.1.2. Total latency trends

The bar graph in Fig. 4, depicting the total latency for predictions, reveals a clear trend: as the number of inputs processed simultaneously increases, so does the time required for prediction.

[Figure 4: Time response (in seconds) by number of simultaneous inputs, ranging from 0.3931 s to 0.9463 s.]

Starting from a swift 0.3931 seconds for a single input, the latency moderately rises for batches of 10 and 100 inputs, reaching 0.4062 seconds, suggesting the model handles small to moderate increases in input size efficiently.

However, as input sizes increase to 1,000 and 10,000, the total latency grows more substantially, hitting 0.4487 seconds for 10,000 inputs. This increment continues, even more sharply, with the model taking 0.9463 seconds to predict outcomes for 100,000 inputs concurrently. Overall, this evaluation underscores the FL model's scalability in terms of total latency, for small input batches as well as larger ones. Nevertheless, it should be noted that the measured time can differ across devices.
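A total-latency measurement of the kind reported in Section 5.1.2 can be sketched as follows. The measurement procedure is not described in the text, so this is only one plausible setup: `measure_total_latency`, `dummy_predict`, and `make_batch` are hypothetical names, the predictor is a trivial stand-in for the trained FL model, and taking the best of several repeats is a design choice made here to reduce noise.

```python
import time
from typing import Callable, Dict, List, Sequence

Batch = List[List[float]]


def measure_total_latency(
    predict: Callable[[Batch], object],
    make_batch: Callable[[int], Batch],
    batch_sizes: Sequence[int],
    repeats: int = 3,
) -> Dict[int, float]:
    """Time one predict() call per batch size; keep the fastest repeat."""
    results: Dict[int, float] = {}
    for size in batch_sizes:
        batch = make_batch(size)  # built outside the timed region
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            predict(batch)
            best = min(best, time.perf_counter() - start)
        results[size] = best
    return results


def dummy_predict(batch: Batch) -> List[bool]:
    # Stand-in for the severity model: any callable taking a batch works.
    return [sum(row) > 0 for row in batch]


def make_batch(size: int) -> Batch:
    # Synthetic feature rows (hypothetical); real inputs would be the
    # preprocessed accident features.
    return [[0.1, -0.2, 0.3] for _ in range(size)]


latencies = measure_total_latency(dummy_predict, make_batch, [1, 10, 100, 1000])
```

Because the timings depend on the device, as noted above, absolute values from such a script are only comparable when collected on the same hardware.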

5.2. Accident Warning Report Evaluation

To evaluate the quality of the accident warning reports generated by RAG, we used correctness, relevance, and faithfulness as criteria to assess LLM outputs¹⁰. We used gpt-3.5-turbo-0613 for the evaluation task to contextually analyze and interpret generated reports according to the criteria.

Correctness is based on the LLM's internal knowledge. However, given the potential unreliability of the LLM's knowledge base, we enhanced the evaluation method by incorporating reference labels, which provide an external benchmark for correctness. The evaluation process produces a dictionary containing three fields: “score”, a binary integer (0 or 1) indicating compliance with the criteria; “value”, which is either “Y” (Yes) or “N” (No) based on the score; and “reasoning”, which outlines the LLM's chain of thought. Relevance evaluates the relevance and focus of the generated answer in relation to the provided prompt. Faithfulness assesses the factual consistency of the generated answer against the given context and reference documents. Using this approach, we ensure not only that the generated content meets the prompt's specific requirements but also that it remains true to the factual information provided in the reference material. Figure 5 illustrates an example of RAG output evaluation.
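The result dictionaries described above lend themselves to a simple consistency check and summary. The sketch below validates the three fields (“score”, “value”, “reasoning”) and computes a pass rate per criterion. The helper names and the sample results are hypothetical; they only mirror the dictionary shape described in the text and are not evaluation results from the paper.

```python
from typing import Dict, List

REQUIRED_KEYS = {"score", "value", "reasoning"}


def check_result(result: Dict[str, object]) -> None:
    """Raise ValueError if a result dictionary is malformed or inconsistent."""
    missing = REQUIRED_KEYS - result.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if result["score"] not in (0, 1):
        raise ValueError("score must be 0 or 1")
    expected_value = "Y" if result["score"] == 1 else "N"
    if result["value"] != expected_value:
        raise ValueError("value does not match score")


def pass_rates(results_by_criterion: Dict[str, List[Dict[str, object]]]) -> Dict[str, float]:
    """Fraction of evaluated reports that meet each criterion."""
    rates: Dict[str, float] = {}
    for criterion, results in results_by_criterion.items():
        for result in results:
            check_result(result)
        passed = sum(int(result["score"]) for result in results)
        rates[criterion] = passed / len(results) if results else 0.0
    return rates


# Made-up results for two reports, for illustration only.
sample = {
    "correctness": [
        {"score": 1, "value": "Y", "reasoning": "..."},
        {"score": 1, "value": "Y", "reasoning": "..."},
    ],
    "relevance": [
        {"score": 1, "value": "Y", "reasoning": "..."},
        {"score": 0, "value": "N", "reasoning": "..."},
    ],
}
rates = pass_rates(sample)
```

Aggregating the binary scores in this way gives one number per criterion, which is convenient when many generated reports are evaluated rather than a single example.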

Based on correctness, relevance, and faithfulness criteria, the evaluation shows that the output accurately represents an actual quote. Throughout the evaluation output, all necessary elements are addressed in a comprehensive, well-structured, and well-written manner. Based on the evaluation output, the response summarizes accident data and provides a comprehensive analysis of weather conditions at the time of the accident, including visibility and severity. Additionally, it provides recommendations for preventing accidents in the future relevant to the reference.

¹⁰ https://python.langchain.com/docs/guides/evaluation/string/criteria_eval_chain

[Figure 5: Example of RAG output evaluation.]

Correctness_criteria: {'reasoning': To determine if the submission meets the criteria, we need to evaluate the correctness, accuracy, and factual nature of the submission.
1. Check if the submission correctly presents the accident data, including the street, state, latitude, longitude, and various factors related to the accident.
2. Verify if the submission accurately describes the weather conditions at the time of the accident, including temperature, wind chill, humidity, pressure, visibility, wind direction, and precipitation.
3. Assess whether the submission accurately provides information about the severity of the accident, distance, sunrise/sunset, and comfort index.
4. Evaluate if the recommendations for future prevention are reasonable and relevant to the accident scenario.
Based on the above reasoning, the submission meets the criteria if all the above conditions are satisfied. 'score': 1, 'value': 'Y'}

Relevance_criteria: {'reasoning': To determine if the submission meets the criteria of relevance, we need to compare the content of the submission with the provided data. We will check if the submission accurately refers to a real quote from the text.
- The submission provides a detailed analysis of the accident data, including the street, state, and various accident factors. It also mentions the weather conditions, severity, and recommendations for future prevention based on the given data.
- The submission accurately reflects the information provided in the data.
- Therefore, the submission meets the criteria of relevance.
Based on the above reasoning, the conclusion is that the submission meets all the criteria. 'score': 1, 'value': 'Y'}

Faithfulness_criteria: The assistant's response is faithful to the reference context. It accurately summarizes the accident data provided in the user question and provides a detailed analysis of the accident. It also offers recommendations for future prevention. The response is comprehensive and covers all the relevant aspects of the accident data.

6. Discussion and Future Work

7. Conclusion
