
PeRAG: Multi-Modal Perspective-Oriented Verbalization with RAG for Inclusive Decision Making

Muhammad Saad Amin

0 1

Horacio Jesús Jarquín Vásquez

Franco Sansonetti

Simona Lo Giudice

Valerio Basile

Viviana Patti

0 Department of Computer Science, University of Turin, Italy
1 Department of Electrical and Computer Engineering, Aarhus University, Denmark
2 Dipartimento di Economia e Statistica "Cognetti de Martiis", University of Turin, Italy

2025

Urban policy makers require comprehensive insights into transportation issues and demographic distributions to design equitable and efficient infrastructure. However, analyzing multi-modal data (numeric and visual) while accounting for diverse perspectives remains challenging. To address this, we propose PeRAG, a novel pipeline combining multi-modal perspective-oriented verbalization with Retrieval-Augmented Generation (RAG). Our approach first converts numeric transportation/demographic data and population heatmaps into natural language descriptions using LLaMA, incorporating multiple policy-relevant perspectives. These verbalizations are then fed into the RAG system to generate context-aware, perspective-driven responses for urban planners. We demonstrate the effectiveness of PeRAG in generating actionable insights for transportation policy, bridging the gap between raw data and decision-making. Our experiments highlight the pipeline's ability to handle heterogeneous data modalities while adapting to diverse stakeholder viewpoints, offering a scalable solution for smart city analytics.

Keywords: Multi-modal Verbalization, Retrieval-Augmented Generation (RAG), Perspective-Aware NLP, Large Language Models (LLMs), Urban Transportation Analytics

CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24-26, 2025, Cagliari, Italy
* Corresponding author.
franco.sansonetti@unito.it (F. Sansonetti); simona.logiudice@unito.it (S. Lo Giudice); valerio.basile@unito.it (V. Basile); viviana.patti@unito.it (V. Patti)
ORCID: 0000-0002-7002-9373 (M. S. Amin); 0000-0001-8110-6832 (V. Basile); 0000-0001-5991-370X (V. Patti)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Urban policy makers face significant challenges in designing equitable transportation systems due to the complex interplay of demographic shifts, infrastructure constraints, and socio-economic disparities [1]. Raw data (e.g., transit logs, census metrics, heatmaps) is often siloed, requiring labor-intensive integration to derive insights [2, 3]. While NLP and computer vision techniques have been applied to urban analytics, they typically treat data modalities independently, ignoring the need for cross-modal reasoning (e.g., correlating heatmap patterns with numeric poverty indices) [4]. This limits their utility for policy decisions requiring holistic, interpretable inputs.

In recent years, advances in machine learning and NLP have enabled new forms of automated data interpretation, particularly in multimodal settings where information spans both structured and unstructured modalities [5]. Urban environments provide a rich case for multimodal reasoning: data can include numerical variables (e.g., population size, number of transport lines), visual artifacts (e.g., heatmaps of population density), and geographical descriptors (e.g., district boundaries). Integrating and interpreting these different modalities coherently is essential for supporting informed decision-making.

One of the emerging challenges in this context is perspective-aware verbalization, the task of transforming multimodal data into textual descriptions that reflect different analytical or stakeholder viewpoints [6]. For instance, the same urban dataset can be verbalized from a demographics perspective (“This area has a high population of elderly residents”) or a transportation accessibility perspective (“This zone has limited coverage of public transport lines despite high population density”). Generating such targeted descriptions from numeric and image data requires models that understand not only the input modalities but also the intended angle of interpretation [7]. This introduces both linguistic complexity (choosing appropriate vocabulary, structure, and focus) and reasoning complexity (determining what information is salient for a given perspective).

These challenges compound when integrated into retrieval-augmented generation (RAG) pipelines. Traditional RAG frameworks are typically designed for text-based retrieval from large knowledge bases; extending them to operate over generated textual representations of multimodal data introduces new issues: retrieval is only as effective as the fidelity and perspective alignment of the verbalized input, and generation must remain factual, grounded, and contextually relevant [8]. Moreover, multimodal verbalizations are often more compact and abstract than traditional long-form documents, which poses difficulties in relevance ranking and context-aware generation.

In this work, we investigate the following core research questions:

1. How can multimodal data (numeric and visual) be verbalized in a perspective-aware manner to support policy-level interpretation?
2. What are the linguistic and functional trade-offs between zero-shot and few-shot verbalization approaches in this context?
3. Can a lightweight, locally-deployable RAG pipeline (PeRAG) effectively answer urban policy questions when built on top of such verbalizations?
4. How do the factuality and utility of such a system compare to general-purpose LLMs, especially in high-stakes policy scenarios?

To address these questions, we present PeRAG, a novel framework that combines multimodal data verbalization with a perspective-aware Retrieval-Augmented Generation pipeline. Our work is based on a custom dataset for the city of Turin, comprising over 7,000 examples across multiple years (2012–2019), including 31 features covering demographics, transportation, and traffic. We verbalize both numeric and heatmap data into English summaries across several perspectives (e.g., demographics-focused, transport lines-focused, temporal shifts), using LLaMA-3.1-8B for the verbalization of numeric data, and LLaMA-3.2-11B-Vision for the verbalization of heatmap data in zero-shot and few-shot settings. These verbalizations serve as the retrievable memory in a Gemma-3-4B-IT-powered RAG system, which supports question answering on urban policy issues. All models are run locally to ensure data privacy and control.

Our key contributions are as follows:

• We conduct human evaluation and qualitative analysis to assess factuality and relevance, and compare PeRAG outputs against general-purpose LLMs.

The rest of this paper is organized as follows: Section 2 reviews related work in multimodal NLP, verbalization, and RAG systems. Section 3 describes the methodology, including dataset details, verbalization techniques, and system architecture. Section 4 outlines our experimental setup. Section 5 presents results from verbalization and QA evaluations. Section 6 offers a detailed analysis and discussion. Section 7 concludes the paper and outlines directions for future work.

2. Related Work

A major research avenue in knowledge-enhanced language modeling is Retrieval-Augmented Generation (RAG), in which a retriever module selects relevant textual passages from a knowledge base that are then fed into a generator to produce a grounded, informative response [8, 11]. This has been particularly effective in tasks like open-domain QA, summarization, and dialogue. Variants such as MuRAG [12] have explored incorporating multiple modalities into retrieval pipelines.

In our work, we adapt and extend the RAG architecture for perspective-aware generation by populating the retrieval index with natural language verbalizations that encode distinct viewpoints over the same input data. Unlike knowledge injection methods that incorporate triplet-based structured knowledge [13], we work purely with free-text verbalizations generated from multimodal data. The retriever retrieves relevant perspective-conditioned passages, and the generator uses them to compose contextually rich, stakeholder-specific responses. This results in a system, PeRAG (Perspective-aware RAG), that enables context-sensitive generation not just based on topical relevance but on the interpretive stance encoded in the input text passages. To the best of our knowledge, PeRAG represents the first instantiation of RAG tailored for multi-perspective decision support in urban governance contexts.

Although LLMs such as ChatGPT and GPT-3 [14] have shown great success in general-purpose generation tasks, their application in decision-making processes has been limited by a lack of specificity and contextual adaptation [15]. Generic outputs are often insufficient in high-stakes domains like urban planning, where conflicting group needs (e.g., between commuters, the elderly, and environmentally conscious citizens) must be mediated through nuanced communication strategies. Efforts like BLOOM [16] have underlined the importance of transparent, representative training data, particularly for multilingual settings. However, our implementation is currently focused on English language generation, which remains dominant in LLM infrastructure and evaluation. By operating entirely in English while incorporating multi-perspective reasoning, our approach can generalize to multilingual contexts in future iterations but already demonstrates strong utility in data-rich governance scenarios [17].

3. Methodology

Our methodology introduces a novel pipeline that bridges heterogeneous urban data and perspective-aware natural language generation using a tailored Retrieval-Augmented Generation (RAG) architecture. The following subsections detail our approach to homogenizing structured inputs, dataset preparation, verbalization strategies, system design, and evaluation.

Figure 1: PeRAG: Perspective inclusive pipeline with RAG

3.1. Homogenizing Heterogeneous Urban Data for RAG

Unlike conventional RAG systems that are designed to interface with a variety of knowledge representations (including tables, RDF triples, JSON schemas, and unstructured documents), our approach standardizes heterogeneous urban data into a unified format of unstructured textual narratives. This design choice fundamentally simplifies the retrieval mechanism and maximizes compatibility with LLM-based generation models. Rather than adapting the retriever to handle multiple data representations, we adopt a single retriever pipeline enabled by transforming structured data, including tables, geospatial indicators, and statistical measures, into natural language paragraphs. The resulting textual narratives are semantically enriched and explicitly crafted to reflect distinct analytical perspectives, ensuring that core domain-specific patterns are preserved while adapting the framing to match varied stakeholder viewpoints.

The homogenization approach offers several key advantages for urban policy applications. First, retrieval simplification is achieved through a unified representation that allows for a single dense retriever without requiring modality-specific modules, reducing system complexity and computational overhead. Second, our approach enables cross-modal comparability by facilitating reasoning across different data types, such as comparing demographics with transportation patterns through uniform verbal representations. Third, LLM compatibility is naturally reinforced by using natural language as both input and output, aligning with the intrinsic design of generative models and enabling seamless integration into query-response pipelines. Figure 1 outlines how PeRAG’s components (multi-modal data, verbalization, perspective inclusion, RAG modules, and evaluation) integrate within the pipeline.

3.2. Dataset Description

The dataset comprises 7,019 urban data records covering Turin’s geography, demography, and transportation systems from 2012 to 2019, offering a comprehensive longitudinal view of urban dynamics.

The data encompasses 3,850 census areas, which are portions of municipal territory organised in polygons, used by ISTAT¹ to divide the city into manageable, statistically meaningful areas. Demographic information about each census area is collected with respect to size and population distribution. Special attention is given to urban vulnerabilities, housing conditions, migration flows, and demographic changes in specific neighborhoods. Census areas can vary significantly in both size and demographic characteristics: they can be as small as a single street or encompass an entire residential block. For this reason, the census areas differ greatly from one another.

Census areas are the smallest territorial unit used for analysis and are grouped into 93 statistical zones. Statistical zones are aggregations of multiple census tracts and represent one of the intra-municipal territorial units into which the territory of the City of Turin is divided. In turn, the statistical zones are grouped into 9 districts, territorial subdivisions over which the local civil authority exercises its functions. This hierarchy of spatial units provides multiple levels of geographical granularity for analysis, enabling both fine-grained local insights and broader district-level policy evaluation. Additionally, the data for each census area is available for two reference years, 2012 and 2019, allowing for temporal comparisons across various dimensions. The dataset includes 31 structured features for each census-year tuple, systematically categorized into four primary domains.

Demographic information includes population density, gender distribution, age brackets, foreign residents, and the number of families, providing a comprehensive population profile. Additionally, the density of each demographic is calculated within a 500-meter buffer from the centroid of each census area. This approach accounts for the spatial distribution of density and makes the areas more comparable in terms of population concentration and access to services.

Public transport metrics include stop and line density, as well as connectivity indicators that measure how well each census area is linked to others in terms of accessibility and network coverage. Geographical identifiers encompass census codes, dimensions, statistical zones, district names, and boundaries that enable spatial analysis and policy targeting. Traffic and safety data document the number of accidents, vehicle involvement patterns, and the number of public transport incidents, supporting risk assessment and safety planning initiatives. This collection represents a significant expansion, enabling richer temporal and spatial analyses that capture urban evolution patterns and long-term policy impacts. The longitudinal scope allows for trend identification, seasonal pattern analysis, and evaluation of policy interventions over time.

The dataset was constructed by integrating multiple sources: all demographic data was obtained from the GeoPiemonte² portal, while public transport, traffic, and safety data were provided by Gruppo Torinese Trasporti (GTT)³, which manages public transport services including urban, suburban, and extra urban routes, as well as tram and metro lines.

¹ National Institute of Statistics: https://www.istat.it/
² https://geoportale.igr.piemonte.it/cms/
³ https://www.gtt.to.it/cms/

3.3. Perspective-Aware Verbalization of Urban Data

To enable retrieval over rich, interpretable textual data, we developed an Urban Data Verbalization System that translates structured urban records into fluent natural language narratives using large language models (LLMs). This system addresses the fundamental challenge of transforming quantitative urban data into qualitative insights that align with different stakeholder perspectives and analytical frameworks.

3.3.1. Verbalization

Our verbalization pipeline employs LLaMA-3.1-8B as the default model for processing numerical data and LLaMA-3.2-11B-Vision for processing heatmaps. The selection of these models allows us to maintain compatibility with other LLMs, ensuring both flexibility and reproducibility. We implement two primary verbalization strategies to balance generation quality with computational efficiency. Zero-shot verbalization allows the model to generate descriptions without specific examples, providing maximum creative freedom but potentially sacrificing consistency. Few-shot verbalization employs carefully curated single-shot examples that guide narrative style while preserving creative expression, resulting in more consistent and domain-appropriate outputs.

The system utilizes handcrafted prompts specifically designed to elicit structured yet non-hallucinatory summaries for each data record, ensuring factual accuracy while maintaining linguistic diversity. Two distinct prompt templates are employed: one for processing numerical tabular data using LLaMA-3.1-8B (see Table 6), and another for processing heatmap visualizations using LLaMA-3.2-11B-Vision (see Table 5). Complete prompt examples for both verbalization modalities are provided in Appendix C to ensure reproducibility. In both LLaMA configurations, generation is controlled through carefully tuned parameters: a temperature of 0.6 for optimal creativity balance, top-p sampling at 0.9 for response diversity, a repetition penalty of 1.2 to ensure coherence, and a maximum token length of 512 for the 8B version and 1024 for the 11B-Vision version to support concise yet informative descriptions.

Each structured record is transformed into multiple narrative versions conditioned on distinct stakeholder perspectives. These include accessibility-oriented planning focusing on mobility and inclusion, safety and equity perspectives highlighting transportation risks and distribution fairness, and demographic inclusion addressing the needs of diverse populations. This multi-perspective approach ensures that verbalizations transcend generic summaries and address the specific analytical needs of different urban stakeholders. Table 3, presented in Appendix A, provides an example of this type of verbalization, illustrating both a general narrative and its corresponding multi-perspective version.

3.3.2. Quality Assessment and Validation

Unlike conventional LLM-generated general texts, which often suffer from loss of specificity, repetitiveness, or context ignorance, our perspective-aware narratives emphasize trends, deficiencies, and socio-geographic factors of particular interest to diverse urban stakeholders. The annotation protocol involved a systematic evaluation across four key dimensions: (1) contextual relevance, i.e., whether the verbalization appropriately captures the urban context and stakeholder perspective; (2) information accuracy, i.e., alignment between the verbalized content and source data; (3) coverage of information aspects, i.e., completeness of perspective-specific elements in the verbalization; and (4) data factuality, i.e., the absence of hallucinations or fabricated information. Three expert annotators, including two postdoctoral researchers and one NLP researcher, independently evaluated a random sample of generated narratives for each dimension. Given the exploratory nature of this novel task and time constraints, a focused evaluation was conducted on a carefully selected subset of examples, with annotation disputes resolved through collaborative discussion among the research team. Their comprehensive assessment confirmed the validity, relevance, and framing alignment of perspective-aware verbalizations, providing empirical support for their use in downstream RAG generation tasks.

To mitigate potential ambiguities introduced during the natural language verbalization process, our approach incorporates several safeguards. First, the verbalization prompts explicitly instruct models to use exact numerical values without modification or approximation, preventing quantitative distortions. Second, the prompts restrict models from drawing conclusions, making assumptions, or interpreting data significance, thereby reducing interpretive ambiguity. Third, during the annotation process, evaluators specifically assessed verbalizations for information accuracy and data factuality, identifying instances where ambiguous phrasing might misrepresent the underlying data. Additionally, the multi-perspective approach inherently reduces ambiguity by providing explicit analytical framing, rather than generating generic descriptions that could be interpreted in multiple ways.

3.4. Perspective-Aware RAG (PeRAG)

PeRAG extends the traditional RAG paradigm to handle structured urban data through its verbalized form, creating a novel architecture specifically designed for perspective-aware policy support. The system integrates retrieval and generation components that work synergistically to provide contextually relevant and factually grounded responses to complex urban planning queries.

3.4.1. Retrieval Module

The retrieval module employs the all-mpnet-base-v2 sentence transformer for dense vector encoding, chosen for its superior performance on semantic similarity tasks and computational efficiency. Text chunking is implemented using a token-based approach with a chunk size set to 500 tokens and an overlap of 50 tokens to ensure semantic continuity across chunk boundaries. This strategy ensures that semantically related content remains within the same retrievable segment, preserving coherence and relevance across retrieval operations.

The retrieval mechanism operates through cosine similarity-based semantic ranking with configurable top-k retrieval, defaulting to 5 results to balance comprehensiveness with computational efficiency. The system maintains comprehensive provenance metadata for complete traceability, enabling users and analysts to verify the source of retrieved information and ensuring accountability in policy-relevant applications.

3.4.2. Generation Module

The generation module utilizes Gemma-3-4B-IT as the default model while supporting any causal decoder-based large language model to ensure adaptability across different computational environments. The module processes user queries alongside retrieved perspective-aligned narratives using carefully engineered prompts that structure the input format as query plus perspective narratives.
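The retrieval and prompt-assembly steps described in Sections 3.4.1 and 3.4.2 can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: `toy_embed` is a hypothetical bag-of-words stand-in for the all-mpnet-base-v2 encoder, whitespace splitting stands in for the model tokenizer, and the prompt layout in `build_prompt` is an assumed format for "query plus perspective narratives". Only the chunk size (500), overlap (50), cosine ranking, and default top-k of 5 come from the text.

```python
import math
from collections import Counter


def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def toy_embed(text):
    """Hypothetical stand-in for a dense encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, passages, k=5, embed=toy_embed):
    """Rank passages by cosine similarity to the query and keep the top k."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(query, passages):
    """Assumed prompt layout: retrieved narratives first, then the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Perspective narratives:\n{context}\n\nQuestion: {query}\nAnswer:"
```

In a real deployment, `embed` would be replaced by the sentence encoder and the resulting prompt passed to the generator; the overlap between consecutive chunks is what keeps a sentence that straddles a boundary retrievable from either side.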

Generation parameters are optimized for policy applications, with a temperature of 0.7 balancing creativity and factuality, and a 512-token limit ensuring brevity without sacrificing informational depth. The system demonstrates robust capability in responding to complex urban planning questions, supporting district-wise comparisons, demographic-transport correlations, safety and infrastructure assessments, and trend identification over temporal dimensions.

3.5. Implementation and System Efficiency

The full system is implemented in Python, leveraging PyTorch and Hugging Face Transformers for deep learning and natural language processing tasks, alongside SentenceTransformers for semantic retrieval capabilities. The implementation includes comprehensive batch processing capabilities with integrated performance monitoring to ensure scalable operation across large datasets. GPU acceleration with automatic device detection optimizes computational efficiency while maintaining compatibility across different hardware configurations.

The system architecture incorporates detailed logging for each transformation step, enabling comprehensive debugging and performance analysis. Key operational features include support for batch verbalization, which processes multiple records simultaneously; real-time querying capabilities for interactive policy analysis; and modular model swapping, allowing for easy adaptation to different language models or domain-specific requirements. This implementation approach ensures both research reproducibility and practical deployment feasibility for real-world urban policy applications. The source code for our PeRAG system, along with the various verbalization configurations, is publicly available at the following link⁴.

4. Experimentation

Our experimental evaluation is designed to assess the effectiveness of perspective-aware verbalization and the overall performance of the PeRAG system in supporting urban policy decision-making. We conduct experiments across two primary dimensions: verbalization quality assessment and end-to-end system performance evaluation. All experiments are performed on locally deployed models to ensure data privacy and reproducibility, using NVIDIA GPUs for computational acceleration.

The experimental framework evaluates our system against several key research questions established in the introduction: the effectiveness of perspective-aware verbalization compared to general approaches, the comparative performance of zero-shot versus few-shot verbalization strategies, the utility of PeRAG for urban policy question answering, and the factuality and relevance of system outputs compared to general-purpose large language models.

4.1. Verbalization Evaluation Protocol

We conduct a systematic comparison between general verbalization, i.e., a template-based approach, and perspective-aware verbalization approaches using our Turin dataset. General verbalization employs standard data-to-text generation without specific perspective conditioning, while perspective-aware verbalization generates targeted descriptions aligned with specific stakeholder viewpoints, including demographics-focused, transportation infrastructure-focused, temporal analysis, and deficiency assessment perspectives.

A random sample of 200 data records is selected for detailed verbalization analysis, ensuring representation across different districts, time periods, and demographic profiles. Our multi-modal dataset is processed through both zero-shot and few-shot verbalization strategies for each perspective type, generating a comprehensive corpus of verbalized descriptions for comparative evaluation.

For the verbalization quality assessment, two authors jointly annotated three representative examples in a structured meeting format, with any disagreements resolved through immediate discussion. While the limited sample size (n = 3) precluded formal inter-annotator agreement (IAA) calculation using Cohen’s or Fleiss’ Kappa, the collaborative annotation process ensured consistency in evaluation criteria application. Future work will expand the annotation sample size to enable robust inter-annotator reliability metrics.

4.2. System Performance Evaluation

We develop a comprehensive set of 25 urban policy-oriented questions that span different complexity levels and analytical requirements. The question set includes factual queries about specific demographic or transportation metrics, comparative questions requiring cross-district or temporal analysis, analytical questions demanding trend identification and causal reasoning, and policy-oriented questions seeking recommendations based on data insights.

Questions are categorized by type (factual, comparative, analytical, policy-oriented), complexity level (simple, moderate, complex), and required perspective alignment (demographics, infrastructure, temporal, deficiency-focused). This categorization enables a systematic assessment of system performance across different query types and complexity levels.

System performance is evaluated against multiple baseline approaches to assess the contribution of our perspective-aware framework. These baselines involve querying general-purpose LLMs without access to urban-specific data. For this purpose, we use the Gemini 2.0 Flash and GPT-4o Mini models. Additionally, we evaluate RAG systems using general (non-perspective-aware) verbalizations under both zero-shot and few-shot configurations. Each baseline is tested using the same set of questions and evaluation criteria to ensure a fair and consistent comparison.

⁴ Code and dataset are available at https://github.com/MasterHoracio/CLiC-it-HARMONIA.git.

4.3. Evaluation Metrics

In order to evaluate the performance of our proposed perspective-aware framework, as well as all the baseline approaches, we employ the Retrieval Augmented Generation Assessment (RAGAS) framework, specifically designed for reference-free evaluation of RAG pipelines [18]. This framework defines three main metrics. The first, faithfulness, measures whether the answer accurately reflects information that can be directly inferred from the given context. The second, answer relevance, evaluates whether the answer directly and appropriately responds to the given question, without being incomplete or redundant. Finally, the third metric, context relevance, assesses how well the context includes only the necessary information to answer the question, avoiding redundancy. For a detailed explanation, we refer the reader to [18].
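For orientation, the three metrics can be written compactly. The formulation below is a sketch of how we understand the definitions in the RAGAS paper [18]; the symbols are our own notation and are not taken from this article. Here $S$ is the set of statements an LLM extracts from the answer, $V \subseteq S$ the statements it judges to be supported by the context $c(q)$, $q_1, \dots, q_n$ are questions an LLM generates back from the answer, and $\mathrm{sim}$ is the cosine similarity between question embeddings.

```latex
\begin{align}
  \text{Faithfulness:}\quad      & F  = \frac{|V|}{|S|} \\
  \text{Answer relevance:}\quad  & AR = \frac{1}{n} \sum_{i=1}^{n} \mathrm{sim}(q, q_i) \\
  \text{Context relevance:}\quad & CR = \frac{\text{number of sentences extracted as relevant from } c(q)}{\text{total number of sentences in } c(q)}
\end{align}
```

Under this reading, faithfulness and context relevance both depend on the retrieved context $c(q)$, which is why neither can be computed for a model queried without retrieval.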

5. Results

Table 1 presents the evaluation results for the different configurations considered. The first section of the table (rows 2 and 3) shows the results obtained by directly querying the LLMs without providing any additional context. It is important to note that the faithfulness and context relevance metrics could not be computed in this case, as both require access to the retrieved context. Nevertheless, the answer relevance scores reveal low performance for both models. This can be attributed to the fact that most of the responses were of the type “I cannot answer the question due to lack of necessary data”. Specifically, GPT-4o responded this way in 21 out of 25 cases, while Gemini 2.0 did so in 18 out of 25. Overall, Gemini demonstrated marginally better performance in this setting.

Additionally, Table 1 also compares the performance of general verbalizations using zero-shot and few-shot configurations. These results are shown in the second section of the table (rows 4 and 5). As can be observed, the answer relevance scores are higher than those obtained by the previously evaluated LLMs, which can be attributed to the incorporation of relevant information retrieved by the retrieval module. When comparing the general verbalization settings, we observe that the few-shot configuration outperforms the zero-shot setting across all three evaluation metrics, with an average improvement of 6%. This gain is likely due to the higher quality and greater level of detail present in the verbalizations generated under the few-shot configuration.

Finally, we present the evaluation results of our proposed PeRAG system. As shown, it achieves the highest scores across all three evaluation metrics, with an average improvement of 20% compared to the best-performing general verbalization configuration. Overall, the highest metric score was obtained in faithfulness, indicating that the responses generated by the PeRAG system effectively leverage information inferred from the provided context.

On the other hand, the lowest score, both for PeRAG and previous configurations, was observed in the context relevance metric. This may be attributed to the diversity of information retrieved by the retriever module, which stems from the chunk partitioning strategy used. In particular, this strategy incorporated independent general and multi-perspective verbalizations for each district, zone, or census area.

6. Analysis

To gain deeper insight into the performance of our proposed PeRAG pipeline, this section presents a quantitative and qualitative analysis of the generated responses. In particular, we conduct a comparative evaluation of the answers produced by the RAG system using the different types of verbalizations. For this analysis, we randomly sample three questions from our set of 25, focusing on the demographic and transportation perspectives. The selection of three questions for detailed BERTScore analysis was determined by several practical constraints. First, generating reference factual answers for comparative evaluation requires extensive manual verification against the original Turin dataset, which is a time-intensive process involving careful cross-referencing of multiple data sources and temporal dimensions. Second, as this represents an initial exploration of a novel task combining multi-modal verbalization with perspective-aware RAG, we prioritized depth over breadth in the qualitative analysis to thoroughly examine the mechanisms underlying performance differences between general and perspective-aware verbalizations. Third, the computational overhead of generating responses across all verbalization configurations and computing detailed semantic similarity metrics scales considerably with the number of questions analyzed. The three selected questions were chosen to represent different complexity levels and analytical requirements.

For each of these questions, we generate a reference factual answer by manually extracting and synthesizing the relevant information directly from the original Turin dataset. The reference answer generation process involves several systematic steps: (1) identifying the specific data fields and temporal dimensions required to answer each question, (2) querying the structured dataset to retrieve exact numerical values for the relevant census areas, statistical zones, or districts, (3) performing necessary aggregations or comparisons across the 2012-2019 timeframe where temporal analysis is required, and (4) formulating a concise factual response that accurately reflects the quantitative findings without interpretive bias. For instance, for questions involving demographic trends, reference answers include precise population counts, percentage changes, and specific demographic categories affected, all derived directly from the census data. This manual reference generation process, while labor-intensive, provides ground-truth answers that serve as reliable baselines for evaluating the factual accuracy and completeness of system-generated responses through semantic similarity metrics.

expand the evaluation to cover the complete 25-question set, enabling more comprehensive statistical analysis of semantic similarity performance across different question types, complexity levels, and analytical perspectives. Additionally, we plan to incorporate multiple semantic similarity metrics beyond BERTScore to provide a more comprehensive assessment of response quality and factual alignment.

Table 2: Evaluation results based on BERTScore. The columns report the macro-average recall, precision, and F1 score across the three randomly selected questions. The prefixes ZS and FS indicate the zero-shot and few-shot configurations of the general verbalization.

Approach   Recall   Precision   F1
ZS-RAG     0.818    0.831       0.821
FS-RAG     0.837    0.852       0.846
PeRAG      0.851    0.873       0.862
We use An important consideration in our verbalization apthe BERTScore metric [19], a widely adopted measure proach is the management of potential linguistic ambiof semantic similarity between a generated text and a guities that could impact downstream RAG performance. reference [20]. Finally, we present a discussion highlight- Our analysis of generated verbalizations reveals that ing the strengths and weaknesses of the PeRAG pipeline perspective-aware conditioning significantly reduces incompared to general verbalizations. terpretive ambiguity compared to general verbalization

Table 2 presents the BERTScore evaluation results for approaches. For instance, when describing transportathe three randomly selected questions. The first section of tion infrastructure, general verbalizations might use amthe table (rows 2 and 3) reports the results for the general biguous terms like ‘adequate coverage’ or ‘reasonable verbalizations, where the few-shot configuration achieves accessibility’, whereas perspective-aware verbalizations the highest scores across all BERTScore metrics. These provide specific contextual framing, such as ‘limited acoutcomes are consistent with the trends observed in the cessibility for elderly residents due to sparse stop density reference-free evaluation metrics. The second section in residential areas’. This specificity not only reduces amof the table shows the results for our PeRAG pipeline, biguity but also enhances retrieval precision, as queries which consistently achieves the best performance across can be matched more accurately to relevant perspectiveall three metrics, further reinforcing the findings obtained conditioned content. However, we acknowledge that through the reference-free evaluation. some residual ambiguity remains inherent to natural lan

We acknowledge that the BERTScore analysis based guage representation, particularly in cases where numerion three questions represents a preliminary assessment cal thresholds are verbalized using qualitative descriptors of semantic similarity performance, and the limited sam- (e.g., ‘high density’ vs. specific population counts). Fuple size constrains the statistical generalizability of these ture work will explore hybrid approaches that preserve ifndings. The selection was necessitated by the substan- exact numerical values alongside natural language detial manual efort required for reference answer genera- scriptions to further minimize interpretive ambiguity. tion and verification against the multi-dimensional Turin To compare the outputs generated by our diferent condataset. Each reference answer requires careful extrac- figurations, Table 4 (included in Appendix B) presents tion and synthesis of information across multiple data a comparison between the response produced by our ifelds, temporal dimensions, and geographical units, fol- PeRAG pipeline and the one generated using the few-shot lowed by independent verification by domain experts. configuration of the general verbalization. This configWhile these three questions provide initial evidence of uration was selected due to its strong performance in PeRAG’s superior semantic alignment with ground truth both the reference-free metrics and the BERTScore. Addata, we recognize that broader systematic analysis is ditionally, both responses are contrasted with a reference essential for robust conclusions. Future work will im- answer constructed from factual information. The quesplement automated reference generation procedures and tion used in this analysis was selected from the set of three randomly chosen questions. comparative analysis reveals that few-shot verbalization

As shown in Table 4, the selected question involves strategies provide superior generation fidelity and pera temporal comparison of demographic characteristics spective alignment compared to zero-shot approaches, from 2012 to 2019. According to the reference answer, a despite increased computational overhead (RQ2). PeRAG, population decrease is observed across most demographic our lightweight locally-deployable RAG pipeline, efecgroups, including males, females, minors, foreigners, and tively answers urban policy questions by leveraging these working-age citizens. In contrast, the only group that multimodal verbalizations as retrievable memory, ensurexperienced population growth during this period was ing data privacy while maintaining system responsivesenior citizens. ness (RQ3). Human evaluation confirms that PeRAG ex

When comparing these findings to the output gen- hibits superior factuality and utility compared to generalerated by the PeRAG pipeline, we observe that it suc- purpose LLMs in high-stakes policy scenarios, with cessfully identified the overall downward trend across domain-specific grounding providing enhanced accumultiple demographic groups, highlighting that the re- racy and contextual relevance (RQ4). The framework duction was not evenly distributed. This aligns with the establishes a reproducible methodology for transforming factual data presented in the reference answer. Moreover, complex urban datasets into actionable policy insights, PeRAG accurately captured the groups that experienced demonstrating that specialized, domain-grounded AI sysdecline—such as the working-age population, minors, tems outperform general-purpose alternatives in critical and foreigners—and correctly identified an increase in decision-making contexts. the senior population, consistent with the reference.

However, the PeRAG response emphasized the Limitations The various perspectives explored in this reworking-age population as the most afected category, search, such as demographic, population, transportation, whereas the reference answer pointed to foreigners. This gender, and age, were derived from the dataset used in our discrepancy may be attributed to the nature of the multi- evaluation. However, these perspectives do not incorpoperspective verbalizations, which were generated at the rate public opinion. As ongoing work, we are expanding level of census areas, statistical zones, and districts. Con- these perspectives through a research survey aimed at sequently, when retrieving information using the re- integrating viewpoints that reflect public opinion of cititriever module (configured with = 5), it may not have zens and stakeholders of Turin. The annotation protocol, captured a fully comprehensive view across all nine dis- while systematic, was applied to a limited sample size tricts. This limitation has been corroborated by analyz- due to the exploratory nature of this novel task. The ing the retrieved chunks, where recalculating the values collaborative annotation approach, though ensuring conbased on the retrieved verbalizations indeed showed that sistency, does not provide quantitative measures of IAA. the working-age group experienced the largest decline. Future iterations of this work will implement larger-scale

Finally, Table 4 also includes the output of the gen- annotation studies with multiple independent annotators eral verbalization under the few-shot configuration. As and IAA metrics to strengthen the evaluation framework. shown, the response generated by the RAG system fails Additionally, we are working at enriching the evaluation to clearly identify the downward trends across the difer- framework. We plan to complement the reference-free ent demographic groups as well as the upward trend for evaluation metrics applied [21] by incorporating taskseniors. These results are consistent with those observed based evaluation protocols and comprehensive human in the reference-free evaluation metrics. Moreover, al- evaluation strategies to better assess the practical utility though the response is factually correct, it does not ad- of perspective-aware verbalizations in real-world urban dress the perspective implied by the question, highlight- planning contexts. ing the importance of incorporating perspective-aware verbalizations. Similar to the PeRAG pipeline, the retrieved chunks in this configuration also exhibit limi- Acknowledgments tations, indicating a potential area for improvement in future work.

The research is conducted at the Department of Com

puter Science, University of Turin, Italy, and is funded by the “HARMONIA” project - M4-C2, I1.3 Partenariati Estesi - Cascade Call - FAIR - CUP C63C22000770006 PE PE0000013 funded under the NextGenerationEU programme (PI: Viviana Patti).

7. Conclusion This research demonstrates that multimodal urban data

can be efectively verbalized through perspective-aware approaches to support policy-level interpretation, with our framework successfully processing over 7,000 examples across multiple analytical perspectives (RQ1). The J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, GAs: Automated evaluation of retrieval augmented G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, generation, in: N. Aletras, O. De Clercq (Eds.), G. Krueger, T. Henighan, R. Child, A. Ramesh, Proceedings of the 18th Conference of the EuroD. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, pean Chapter of the Association for Computational E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, Linguistics: System Demonstrations, Association C. Berner, S. McCandlish, A. Radford, I. Sutskever, for Computational Linguistics, St. Julians, Malta, D. Amodei, Language models are few-shot learners, 2024, pp. 150–158. URL: https://aclanthology.org/ in: Proceedings of the 34th International Confer- 2024.eacl-demo.16/. ence on Neural Information Processing Systems, [19] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, NIPS ’20, Curran Associates Inc., Red Hook, NY, Y. Artzi, Bertscore: Evaluating text generation USA, 2020, pp. 1–25. with BERT, in: 8th International Conference on [15] W. Liu, X. Wang, M. Wu, T. Li, C. Lv, Z. Ling, Z. Jian- Learning Representations, ICLR 2020, Addis Ababa, Hao, C. Zhang, X. Zheng, X. Huang, Aligning Ethiopia, April 26-30, 2020, OpenReview.net, 2020, large language models with human preferences pp. 1–41. URL: https://openreview.net/forum?id= through representation engineering, in: L.-W. SkeHuCVFDr.

Ku, A. Martins, V. Srikumar (Eds.), Proceedings [20] M. Hanna, O. Bojar, A fine-grained analyof the 62nd Annual Meeting of the Association sis of BERTScore, in: L. Barrault, O. Bojar, for Computational Linguistics (Volume 1: Long F. Bougares, R. Chatterjee, M. R. Costa-jussa, C. FedPapers), Association for Computational Linguis- ermann, M. Fishel, A. Fraser, M. Freitag, Y. Graham, tics, Bangkok, Thailand, 2024, pp. 10619–10638. R. Grundkiewicz, P. Guzman, B. Haddow, M. Huck, URL: https://aclanthology.org/2024.acl-long.572/. A. J. Yepes, P. Koehn, T. Kocmi, A. Martins, M. Mordoi:10.18653/v1/2024.acl-long.572. ishita, C. Monz (Eds.), Proceedings of the Sixth Con[16] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilic, ference on Machine Translation, Association for D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, Computational Linguistics, Online, 2021, pp. 507– M. Gallé, J. Tow, A. M. Rush, S. Biderman, A. Web- 517. URL: https://aclanthology.org/2021.wmt-1.59/. son, P. S. Ammanamanchi, T. Wang, B. Sagot, [21] D. Deutsch, R. Dror, D. Roth, On the limitations N. Muennighof, A. V. del Moral, O. Ruwase, R. Baw- of reference-free evaluations of generated text, in: den, S. Bekman, A. McMillan-Major, I. Beltagy, Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), ProceedH. Nguyen, L. Saulnier, S. Tan, P. O. Suarez, ings of the 2022 Conference on Empirical MethV. Sanh, H. Laurençon, Y. Jernite, J. Launay, ods in Natural Language Processing, Association M. Mitchell, C. Rafel, A. Gokaslan, A. Simhi, for Computational Linguistics, Abu Dhabi, United A. Soroa, A. F. Aji, A. Alfassy, A. Rogers, A. K. Arab Emirates, 2022, pp. 10960–10977. URL: https: Nitzav, C. Xu, C. Mou, C. Emezue, C. Klamm, //aclanthology.org/2022.emnlp-main.753/. doi:10. C. Leong, D. van Strien, D. I. Adelani, et al., BLOOM: 18653/v1/2022.emnlp-main.753. A 176b-parameter open-access multilingual language model, CoRR abs/2211.05100 (2022). URL: https://doi.org/10.48550/arXiv.2211.05100. doi:10.

48550/ARXIV.2211.05100. [17] R. Bommasani, D. A. Hudson, E. Adeli, R. B.

Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. S. Chatterji, A. S. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. FeiFei, C. Finn, T. Gale, L. E. Gillespie, K. Goel, N. D.

Goodman, S. Grossman, N. Guha, T. Hashimoto, P. Henderson, J. Hewitt, D. E. Ho, J. Hong, K. Hsu, J. Huang, T. Icard, S. Jain, D. Jurafsky, P. Kalluri, S. Karamcheti, G. Keeling, F. Khani, O. Khattab, P. W. Koh, M. S. Krass, R. Krishna, R. Kuditipudi, et al., On the opportunities and risks of foundation models, CoRR abs/2108.07258 (2021). URL: https: //arxiv.org/abs/2108.07258. arXiv:2108.07258. [18] S. Es, J. James, L. Espinosa Anke, S. Schockaert, RAIn 2018, census section 2535 in statistical zone 75 covered an area of 145,065.42 units, with a population of 192. The population percentage was approximately 0.0027%, of which 54.7% were female. Immigrants comprised 21.9% of the population, with 13.5% being female immigrants.

Minors accounted for 13.0%, seniors 22.4%. There were 5 stops and 5 lines stopping, with stops representing 2.6% coverage, 100% stop-to-line ratio, and 0.017% length coverage.

In 2018, census section 2535 in statistical zone 75 had a population of 192 individuals. Females accounted for 54.7% of the population. Immigrants represented 21.9% of residents, with female immigrants making up 13.5%. Children under 18 years old comprised 13.5% of the population, while seniors accounted for 22.4%.

The total population of census section 2535 was 192, representing approximately 0.0027% of the overall population. The area covered by this section was 145,065.42 units, indicating a low population density given the large area size.

Census section 2535 covers a substantial area of 145,065.42 units within statistical zone 75. Despite its size, the population remains small at 192 individuals.

Within census section 2535, there are 5 stops served by 5 lines. Stops cover 2.6% of the area, with an equal number of stops per line. The total length covered by stops corresponds to 0.017% of the area.

Immigrants make up 21.9% of the population in census section 2535, with female immigrants comprising 13.5%. This suggests a significant immigrant presence within the local population.

Age Group Perspective

Minors account for 13.5% of the population in census section 2535, while seniors make up 22.4%, indicating a population with a notable proportion of older adults.

Gender Distribution

Females represent 54.7% of the population in census section 2535, showing a slightly higher female population compared to males.
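As a rough illustration of how a single tabular record can yield several perspective-specific descriptions like those above, the sketch below fills fixed string templates. It is not the LLaMA-based verbalizer described in the paper: the `CensusRecord` field names, the wording in `TEMPLATES`, and the `verbalize` helper are all hypothetical, and the values are copied from the census section 2535 example.

```python
from dataclasses import dataclass, asdict


@dataclass
class CensusRecord:
    """One row of tabular urban data (field names are hypothetical)."""
    section: int
    zone: int
    year: int
    area: float            # unit is not stated in the source example
    population: int
    pct_female: float
    pct_immigrants: float
    pct_minors: float
    pct_seniors: float
    stops: int
    lines: int


# One illustrative template per perspective; the wording is an assumption.
TEMPLATES = {
    "population": (
        "Census section {section} has {population} residents on an area of "
        "{area:,.2f} units ({density:.5f} residents per unit of area)."
    ),
    "transportation": (
        "Census section {section} has {stops} stops served by {lines} lines "
        "({stops_per_line:.1f} stops per line)."
    ),
    "age": (
        "Minors account for {pct_minors}% and seniors for {pct_seniors}% "
        "of the population in census section {section}."
    ),
    "gender": (
        "Females represent {pct_female}% of the population "
        "in census section {section}."
    ),
}


def verbalize(rec: CensusRecord, perspective: str) -> str:
    """Render one record from the viewpoint of a single perspective."""
    fields = asdict(rec)
    # Derived indicators that some perspectives need.
    fields["density"] = rec.population / rec.area
    fields["stops_per_line"] = rec.stops / rec.lines
    return TEMPLATES[perspective].format(**fields)


# Values taken from the census section 2535 example above.
record = CensusRecord(
    section=2535, zone=75, year=2018, area=145065.42, population=192,
    pct_female=54.7, pct_immigrants=21.9, pct_minors=13.0, pct_seniors=22.4,
    stops=5, lines=5,
)

if __name__ == "__main__":
    for name in TEMPLATES:
        print(f"[{name}] {verbalize(record, name)}")
```

Because each template keeps the exact figures, every rendered sentence can be checked against the source row. This is the property that the hybrid numeric-plus-text verbalizations proposed as future work aim to preserve.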

A. Verbalization Examples

B. Example of Output Response

C. Verbalization Prompt Examples

To ensure reproducibility and transparency of our verbalization approach, we provide the complete prompt templates used for both numerical data (see Table 6) and heatmap verbalization (see Table 5).

C.1. Heatmap Verbalization Prompt (LLaMA-3.2-11B-Vision)

The following prompt template is used for generating narratives from population density heatmaps and other visual data representations:

C.2. Numerical Data Verbalization Prompt (LLaMA-3.1-8B)

The following prompt template is used for generating narratives from structured numerical urban data:

Question: From 2012 to 2019 the population in Turin has decreased. Considering all the census, does this decrease affect all the demographics? Which category is more affected and which one is less affected?

Reference answer: Between 2012 and 2019, the population of Turin declined by 4.32%, and this trend affected most demographic groups. Both male and female populations decreased at nearly the same rate (4.34% and 4.31% respectively), indicating no significant gender disparity. The number of minors saw a notable decline of 5.96%, while the foreign (legally registered) population experienced the most significant drop, decreasing by 6.48%, together with the number of working-age citizens (6.28%). In contrast, the senior population was the only group to increase, growing by 2.05%.

PeRAG response: Analyzing the demographic shifts across Turin’s statistical zones between 2012 and 2019, it is evident that the overall population declined during this period. However, the decrease was not evenly distributed across all demographic categories. The working-age population (19–64) experienced the most significant contraction, suggesting a notable loss of economic contributors. This trend was consistently observed across various districts. In contrast, the number of seniors slightly increased, indicating a growing aging population. The number of minors also declined, though to a lesser extent than the working-age group. Meanwhile, the foreign-born population registered a visible decline as well, potentially due to migration dynamics. In summary, while most demographic segments were negatively affected, the working-age group was the most impacted, whereas the senior population was the least affected, showing a modest increase.

FS-RAG response: Analyzing the provided census data reveals a complex picture regarding demographic shifts between 2012 and 2019 in Turin’s districts. While a precise overall population decrease for the entire city isn’t evident from this dataset, several districts demonstrate notable changes. District 5 (specifically areas like Basse di Stura and Le Vallette) shows a consistent population size (around 41 residents in 2019, mirroring 2012) with minimal fluctuation within the 500-meter radius. Conversely, District 1 exhibits a clear increase in population size between the two years. In 2012, Census Area 799 (Politecnico) held 52 residents, expanding to 1,478 within a 500-meter radius. Similarly, Census Area 3209 (Politecnico) grew from 212 residents to 2,784. Looking at specific categories, the *minor* population appears significantly impacted by this growth, increasing dramatically in District 1 (from 32 to 423) compared to a smaller change in District 5. Seniors also show increased numbers in District 1 (47 to 652) while remaining largely stable in District 5. The foreign national population experienced a relative increase in District 1 (2 to 104) compared to a slight increase in District 5 (13 to 20). Therefore, while certain districts remained static, the overall trend indicates that the *minor* population is experiencing the most pronounced demographic shift, particularly within