<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimodal sentiment analysis with Grad-CAM for urban revitalization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Serhii Dolhopolov</string-name>
          <email>dolhopolov@icloud.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tetyana Honcharenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kyiv National University of Construction and Architecture</institution>
          ,
          <addr-line>31, Air Force Avenue, Kyiv, 03037</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>The application of Artificial Intelligence (AI) in urban planning offers unprecedented opportunities for assessing public sentiment towards revitalization projects by analyzing the vast stream of multimodal data generated by citizens. However, the increasing complexity of these models, particularly those that fuse text, image, and geospatial data, often results in “black box” systems. This lack of transparency hinders their adoption by non-expert stakeholders like urban planners, who require trustworthy and interpretable insights to make informed decisions. To address this critical gap, this paper introduces a novel framework centered on an XAI Orchestrator. This system couples a high-performance multimodal sentiment analysis model with a suite of modality-specific explainability techniques, namely Gradient-weighted Class Activation Mapping (Grad-CAM) for visual saliency and SHapley Additive exPlanations (SHAP) for textual attribution. These disjointed technical outputs are then synthesized by a Large Language Model (LLM) composer into a cohesive, human-readable narrative, complete with a visual attribution map. Our quantitative evaluation demonstrates that the underlying predictive model achieves high classification accuracy, with a weighted F1-score of 0.9, validating the efficacy of the multimodal fusion approach. Qualitative case studies further reveal the framework's ability to generate clear and intuitive explanations, successfully deconstructing the model's predictions and grounding them in specific visual and textual evidence. By bridging the gap between prediction and interpretation, this work presents a viable methodology for deploying trustworthy AI systems in civic tech, fostering more transparent, accountable, and human-centric urban planning.</p>
      </abstract>
      <kwd-group>
        <kwd>Explainable AI (XAI)</kwd>
        <kwd>Multimodal Sentiment Analysis</kwd>
        <kwd>Urban Planning</kwd>
        <kwd>Grad-CAM</kwd>
        <kwd>SHAP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The revitalization of urban territories stands as a paramount challenge in contemporary global
development, profoundly shaping the socio-economic vitality, environmental sustainability, and
overall quality of life for an increasingly urbanized global population. For decades, the evaluation of
such large-scale initiatives was predominantly governed by a techno-economic rationale, where
success was quantified through tangible, lagging indicators: metrics of economic growth,
infrastructure completion timelines, and return on investment. While valuable, this paradigm often
neglected the most critical component of any urban ecosystem: its inhabitants. A modern,
human-centric approach to city planning has since emerged, championing the principle that the true efficacy
of urban transformation is not merely reflected in steel and concrete but is best measured through the
public’s perception, sentiment, and lived experience. In this evolving landscape of participatory
governance and smart city development, the ability to accurately, dynamically, and ethically assess
citizen sentiment is no longer a supplementary objective but a core requirement. Transforming the
subjective, often unstructured feedback of a diverse populace into actionable, data-driven intelligence
is fundamental to fostering urban environments that are not only functional and efficient but also
inclusive, equitable, and responsive to the genuine needs of their communities.</p>
      <p>The traditional mechanisms for gauging public opinion, such as formally structured municipal
surveys, town hall meetings, and public hearings, have long served as the bedrock of civic
engagement. These methods, however, are increasingly misaligned with the pace and complexity of
modern urban dynamics. They are inherently characterized by significant operational latency, often
taking months to plan, execute, and analyze, by which time public sentiment may have already
shifted. Furthermore, their high operational costs and reliance on limited sample sizes introduce
significant potential for demographic, selection, and response biases, frequently failing to capture the
nuanced, granular, and real-time pulse of community sentiment. Consequently, decisions based on
this data, while well-intentioned, may not fully represent the multifaceted and often conflicting views
present within a city.</p>
      <p>
        The digital transformation of civic life has precipitated a paradigm shift, unlocking an alternative
and profoundly richer source of public insight: a vast, unstructured, and continuous multimodal
stream of data organically generated by citizens. This digital torrent comprises geo-tagged social
media posts, intricate textual comments, and a formidable volume of digital imagery and video. This
wealth of information presents an unprecedented opportunity to monitor public opinion with a level
of granularity, scope, and immediacy previously unattainable. The imperative to transform this raw
digital exhaust into structured, actionable intelligence is a challenge that extends far beyond urban
planning. Parallel research efforts in critical domains such as emergency and disaster management
have pioneered methods for enhancing situational awareness by contextually enriching dense
mobility data from mobile phones with the sparse, yet semantically rich, content from geo-social
media platforms [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Such work demonstrates a growing recognition that fusing disparate data
sources is key to understanding complex human-environment interactions, whether in the context of
a natural disaster [
        <xref ref-type="bibr" rid="ref2 ref3">2–3</xref>
        ] or the daily life of a city.
      </p>
      <p>
        However, the effective utilization of this data demands the development of sophisticated
analytical techniques capable of navigating its inherent complexities. The data is not only massive in
volume but is also fundamentally multimodal, with each modality presenting unique challenges and
opportunities. Textual data from social media is often short, informal, and rife with colloquialisms,
sarcasm, and sentiment-laden emojis that defy traditional natural language processing. Visual data,
in the form of images and video, provides powerful, direct evidence of the urban condition – a
photograph of urban decay, a vibrant community event, or a dilapidated playground can convey a
more potent and immediate sentiment than a textual description alone. The third critical modality,
geospatial information, grounds this digital feedback in the physical world. Yet, as research by
Honcharenko et al. highlights, the formal representation of spatial and topological relations through
multidimensional data models remains a non-trivial technical challenge [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Successfully interpreting
this data requires more than just technical proficiency; it necessitates an approach that can augment
the inherent spatial awareness and cognitive decision-making capabilities of human stakeholders, a
complex process explored by Riabchun et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        In response to this data-rich environment, the field of Multimodal Sentiment Analysis (MSA) has
advanced significantly. Initial research efforts, which were predominantly focused on unimodal,
text-centric analyses, have given way to more sophisticated architectures designed to integrate and
interpret heterogeneous data streams. Foundational work in this area has explored powerful fusion
strategies, moving from early-stage feature concatenation to more advanced techniques. These
include the development of dedicated representation learning frameworks like MISA, which learns
both modality-invariant and modality-specific features [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and complex fusion mechanisms like the
Tensor Fusion Network, which explicitly models the intricate inter-modality dynamics between
language, visual, and acoustic signals [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Further research has advanced this frontier by exploring
multi-task learning frameworks to disentangle these complex signal characteristics [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and, more
recently, by leveraging the immense capabilities of Large Language Models (LLMs) and Large
Vision-Language Models (LVLMs) to handle text-centric multimodal tasks [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and incorporate external,
contextual world knowledge to resolve ambiguities in visual scenes [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>Despite these remarkable technical achievements, a persistent and critical challenge shadows the
field: the inherent opacity of the models themselves. Even as these systems achieve state-of-the-art
performance, they often function as inscrutable “black boxes.” This issue is further compounded by
systemic vulnerabilities within the data itself; research by Yang et al. has shown that even the most
robust models can be misled by spurious correlations stemming from latent label and context biases
within datasets, a problem they propose to mitigate using a causal inference framework [11]. The
“black box” nature of these models is not a mere academic curiosity; it represents a fundamental
barrier to their real-world adoption. The urgency of this problem is magnified by the rapid and
widespread proliferation of complex AI into disparate, high-stakes domains, with critical applications
ranging from automated software modernization [12] and the behavioral modeling of autonomous
robotic systems [13] to the multi-stage classification of construction site elements [14] and the
modeling of complex environmental systems using ensemble machine learning [15]. In all such
applications, predictions are delivered without transparent justification, severely limiting their utility
and struggling to earn the trust of the domain experts and non-expert stakeholders who are
ultimately responsible for making decisions.</p>
      <p>This fundamental trade-off between predictive accuracy and model transparency has catalyzed
the growth of a dedicated and vital research discipline: Explainable Artificial Intelligence (XAI),
which seeks to develop theories and methods for interpreting the decisions of complex models, as
comprehensively surveyed by Linardatos et al. [16]. The resulting “explainability gap” is arguably the
single most significant barrier to the responsible and effective deployment of AI in civic planning and
governance. This sentiment is echoed in technical research focused on specific explainability
techniques; for instance, Sharma and Kumar emphasize that methods like Grad-CAM are essential for
visualizing model activations and building a foundational understanding of model behavior in
classification tasks [17]. Furthermore, the work of Dhore et al. demonstrates that individual XAI
techniques often have their own limitations, such as noisy outputs from Grad-CAM, and that hybrid
approaches combining the strengths of multiple explainers can yield clearer and more reliable
insights [18]. Ultimately, if an urban planner, a policy-maker, or a community representative cannot
understand why an AI model has flagged a specific neighborhood as exhibiting a spike in negative
sentiment, the prediction, no matter how accurate, remains an unactionable and untrustworthy piece
of data.</p>
      <p>To address the critical challenges of model opacity and stakeholder trust, this paper introduces a
novel, integrated system for multimodal sentiment analysis explicitly designed for the domain of
urban revitalization. Our approach moves beyond the singular pursuit of predictive accuracy and
instead prioritizes the generation of transparent, human-intelligible, and actionable explanations.
The proposed framework, termed the “XAI Orchestrator,” is architected to ingest, process, and
interpret data from three distinct yet complementary modalities: the textual content of social media
posts, the visual information contained in associated images, and the critical context provided by
geospatial data. The inclusion of the geospatial modality is particularly crucial, as it grounds abstract
digital feedback in the tangible reality of the physical urban environment. However, as noted, this
integration is not without its own ethical complexities. The demonstrated ability of modern Large
Multimodal Models to infer precise geographic locations from user-posted imagery with startling
accuracy raises significant geo-privacy concerns that must be proactively addressed [19]. Our
framework’s emphasis on transparency is therefore not merely a technical feature but a core tenet for
the ethical and trustworthy deployment of such technologies in a civic context.</p>
      <p>The central innovation of our work is the concept of an “orchestrator” that intelligently
synergizes the outputs of multiple, state-of-the-art explainability techniques rather than relying on a
single method. By doing so, we aim to construct a more holistic and robust explanatory narrative that
is greater than the sum of its parts. For the visual modality, our system employs Gradient-weighted
Class Activation Mapping (Grad-CAM), a canonical technique for producing visual heatmaps that
highlight the specific regions within an image that were most salient to the model’s classification
decision. For the textual modality, we utilize SHapley Additive exPlanations (SHAP), a
game-theoretic approach that attributes a prediction to individual words or tokens, quantifying the precise
contribution of each to the final sentiment score.</p>
      <p>
        These specific techniques are prime examples of gradient-based feature attribution methods, a
class of explainers that has become central to the scientific endeavor of interpreting neural networks,
as detailed in the technical review by Wang et al. [20]. Our framework deliberately employs these
methods within their established classification context. However, we acknowledge that the
applicability and boundaries of these tools are an active and evolving area of research. For example,
recent work by Bachhawat has explored the novel challenge of generalizing Grad-CAM to
embedding networks, which do not produce the discrete class scores on which the method
traditionally relies [21]. This ongoing innovation underscores the dynamism of the XAI field. Our
approach is thus conceptually similar to integrated spatio-temporal topic-sentiment models proposed
in related GeoAI domains, which seek to overcome the limitations of sequential analysis by creating a
unified output [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The ultimate goal of our XAI Orchestrator is to translate the complex,
high-dimensional outputs of both the predictive model and its explainers into clear, composite insights
that augment the inherent spatial awareness and cognitive decision-making capabilities of human
stakeholders [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>The overarching goal of this research is to bridge the critical gap between the high predictive
power of modern multimodal AI systems and the pressing need for their transparent and trustworthy
application in urban planning. We contend that for AI to become a truly collaborative tool in civic
governance, its outputs must be scrutable, its reasoning legible, and its potential biases identifiable by
the human experts who bear the ultimate responsibility for urban outcomes. Our work is motivated
by the hypothesis that a multi-explainer, multimodal framework can provide a more comprehensive,
reliable, and actionable form of insight than any unimodal or single-explainer system alone.</p>
      <p>To that end, the primary contributions of this paper are threefold and are designed to address
specific gaps in the current body of research:</p>
      <list list-type="bullet">
        <list-item>
          <p>We first propose and implement a robust deep learning architecture specifically designed to
fuse textual, visual, and geographic features for superior sentiment classification in the urban
domain. This model serves as the high-performance predictive engine for which our
explainability framework provides interpretive oversight.</p>
        </list-item>
        <list-item>
          <p>Our core contribution is the introduction of the XAI Orchestrator, a system that synergizes
multiple, modality-specific explainability techniques (Grad-CAM for visual saliency, SHAP
for textual attribution) and integrates them with a Large Language Model (LLM)-based
composer. This orchestrator is designed to generate cohesive, multimodal explanations that
present a unified narrative explaining a given prediction.</p>
        </list-item>
        <list-item>
          <p>We demonstrate the efficacy of the complete system through a series of experiments on a
simulated but realistic dataset reflecting the challenges of urban sentiment analysis. The
validation moves beyond standard accuracy metrics to assess the quality and utility of the
generated explanations themselves, providing tangible examples of visual heatmaps, textual
attributions, and composed natural language summaries that justify the model’s conclusions.</p>
        </list-item>
      </list>
      <p>Through this work, we aim to present a viable and extensible methodology for building and
deploying explainable AI systems in the complex, high-stakes environment of urban revitalization.
By prioritizing interpretability, we seek to foster greater trust, facilitate more effective human-AI
collaboration, and ultimately contribute to the development of more equitable and responsive cities.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>The efficacy of any data-driven system for urban analysis is fundamentally predicated on the quality,
diversity, and integrity of its underlying data. To this end, our proposed framework is built upon a
robust and scalable data acquisition and preprocessing pipeline, engineered to systematically ingest,
cleanse, and structure heterogeneous data streams into an analysis-ready format. This foundational
layer, depicted in the preprocessing pipeline diagram (Figure 1), ensures that the subsequent
sentiment analysis and explainability models are supplied with data that is not only rich in contextual
information but also compliant with stringent ethical and privacy standards.
</p>
      <sec id="sec-2-1">
        <title>Data Acquisition and Preprocessing Pipeline</title>
        <p>The pipeline is designed to address the inherent challenges of multimodal urban data, which
originates from a variety of disparate and often unstructured sources, as depicted in Figure 1. The
initial stage of our methodology focuses on the acquisition of this diverse data. We identify four
primary sources of information. First, official municipal data provides a structured, authoritative
baseline, including administrative boundaries, infrastructure logs, and demographic statistics.
Second, public surveys offer a source of explicitly solicited citizen feedback, providing deep
qualitative insights, albeit on a limited scale. The third and most voluminous source is geotagged
social media posts (Geo-posts), an unsolicited stream of real-time public expression from platforms
such as X, Instagram, Telegram, and Facebook. Finally, direct image and video uploads from citizens
or dedicated inspection platforms offer high-fidelity visual evidence of the urban environment.</p>
        <p>Given the disparate formats and velocities of these data streams, a centralized Ingestion Bus
serves as the unified entry point into our system. This layer, implemented using a distributed event
streaming platform such as Apache Kafka, is responsible for collecting and queuing incoming data
from various APIs, scrapers, and direct uploads. This architecture decouples the data sources from
the processing logic, ensuring fault tolerance and scalability. Upon ingestion, all raw, unmodified
data is persisted in a Raw Data Lake. This repository acts as an immutable, chronological source of
truth, allowing for data lineage tracking and reprocessing should the downstream analytical
requirements evolve.</p>
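        <p>For illustration, a minimal sketch of this ingestion step is given below. It assumes the kafka-python client, a local broker address, an example topic name, and an example payload schema, none of which are prescribed by the framework itself.</p>
        <preformat>
# Minimal sketch of the ingestion bus: publish one raw geo-post to Kafka.
# Broker address, topic name, and payload fields are illustrative assumptions.
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

raw_post = {
    "source": "geo_social_media",
    "text": "The playground on Main Street is neglected and full of garbage.",
    "image_url": "https://example.org/uploads/playground.jpg",
    "lat": 50.4501,
    "lon": 30.5234,
    "timestamp": "2025-05-12T10:15:00Z",
}

# Each raw record is queued on a topic and later persisted to the raw data lake.
producer.send("raw-geo-posts", value=raw_post)
producer.flush()
        </preformat>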
        <p>Once ingested, the raw data is channeled into a multi-stage parallel processing pipeline where
each modality undergoes specialized preparation. For textual content, a Text NLP module performs a
series of linguistic normalization procedures. This includes cleaning raw text by removing HTML
tags and special characters, tokenization, stop-word removal, and lemmatization to reduce
inflectional variations. The primary objective of this stage is to convert unstructured textual data into
a structured format suitable for ingestion by advanced language models.</p>
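        <p>A condensed sketch of this normalization step is shown below, using spaCy for tokenization, stop-word removal, and lemmatization. The specific model name is an assumption, and HTML stripping and emoji handling are omitted for brevity.</p>
        <preformat>
# Sketch of the Text NLP stage: tokenization, stop-word removal, and lemmatization.
# The spaCy model name is an assumption; the production pipeline may differ.
import spacy

nlp = spacy.load("en_core_web_sm")

def normalize_text(raw_text: str) -> list[str]:
    """Return the lemmatized, stop-word-free tokens of a citizen post."""
    doc = nlp(raw_text)
    return [
        token.lemma_.lower()
        for token in doc
        if token.is_alpha and not token.is_stop
    ]

print(normalize_text("The playgrounds near the station were neglected for years."))
# Expected style of output: ['playground', 'near', 'station', 'neglect', 'year']
        </preformat>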
        <p>In parallel, the Image Analysis module handles the initial processing of visual data. This stage
involves standard computer vision preprocessing steps, including image resizing to a uniform
resolution required by our neural network encoders, pixel value normalization, and data
augmentation where necessary to improve model robustness. This ensures that all visual inputs are
standardized before feature extraction.</p>
        <p>A critical component of the pipeline is the Geo Normalization module. Geospatial information
arrives in various forms, including precise latitude-longitude coordinates, place names (e.g., “City
Hall”), or the bounding boxes of administrative districts. This module is responsible for resolving
these different formats into a unified, standardized coordinate system (e.g., WGS 84). It performs
geocoding for place names and calculates centroids for polygons, ensuring that every data point has a
consistent and machine-readable spatial representation.</p>
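        <p>The sketch below illustrates the two normalization paths described above, geocoding a place name and reducing a district polygon to its centroid, both expressed in WGS 84. The choice of geopy and Shapely, and the example place name, are illustrative assumptions.</p>
        <preformat>
# Sketch of the Geo Normalization stage: resolve place names and polygons
# to WGS 84 point coordinates. Library choices are illustrative assumptions.
from geopy.geocoders import Nominatim
from shapely.geometry import Polygon

geocoder = Nominatim(user_agent="urban-sentiment-demo")

def normalize_place_name(name: str) -> tuple[float, float]:
    """Geocode a free-text place name (e.g., 'City Hall, Kyiv') to (lat, lon)."""
    location = geocoder.geocode(name)
    if location is None:
        raise ValueError(f"Could not geocode: {name}")
    return (location.latitude, location.longitude)

def normalize_district(polygon_coords: list[tuple[float, float]]) -> tuple[float, float]:
    """Reduce an administrative-district polygon (lon/lat pairs) to its centroid."""
    centroid = Polygon(polygon_coords).centroid
    return (centroid.y, centroid.x)  # returned as (lat, lon) for consistency
        </preformat>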
        <p>Crucially, an ethical and privacy-preserving layer for PII Removal is integrated into the pipeline.
This dedicated anonymization module automatically identifies and redacts Personally Identifiable
Information from all data modalities. It employs a combination of named entity recognition (NER) for
text and object detection or blurring techniques for images to remove sensitive information such as
names, faces, addresses, and license plates, ensuring compliance with data protection regulations like
GDPR.</p>
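        <p>The text side of this anonymization step can be sketched as an NER-based redaction pass, as below. The entity label set is an assumption (in practice it would be extended with rule-based detectors for phone numbers, e-mail addresses, and similar identifiers), and the image-side face and license-plate blurring is not shown.</p>
        <preformat>
# Sketch of the PII-removal stage for text: redact named persons detected via NER.
# Label set and model name are illustrative; image blurring is handled separately.
import spacy

nlp = spacy.load("en_core_web_sm")
REDACTED_LABELS = {"PERSON"}  # extended with regex rules for phones, e-mails, etc.

def redact_pii(text: str) -> str:
    doc = nlp(text)
    redacted = text
    # Replace detected entities from the end of the string to keep offsets valid.
    for ent in reversed(doc.ents):
        if ent.label_ in REDACTED_LABELS:
            redacted = redacted[: ent.start_char] + "[REDACTED]" + redacted[ent.end_char :]
    return redacted

print(redact_pii("John Petrenko reported the broken bench near his house."))
        </preformat>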
        <p>The culmination of this pipeline is the structured storage of the processed and anonymized data
into two distinct, optimized databases. All structured metadata – including the cleaned text,
normalized geospatial coordinates, timestamps, and pointers to the processed image files – is stored
in a Metadata DB. This database is implemented using PostgreSQL with the PostGIS extension,
enabling efficient and powerful geospatial querying. Concurrently, the high-dimensional vector
representations (embeddings) generated by the downstream deep learning models are stored in a
dedicated Embeddings DB, optimized for high-speed similarity search. This dual-database
architecture separates structured metadata from high-dimensional feature vectors, ensuring optimal
performance for both relational queries and machine learning operations.</p>
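        <p>As an illustration of the kind of geospatial query this design enables, the sketch below retrieves processed posts within 500 metres of a point of interest from the metadata database. The table and column names are assumed for illustration only.</p>
        <preformat>
# Sketch of a geospatial metadata query against PostgreSQL/PostGIS: fetch processed
# posts within 500 m of a point of interest. Schema names are illustrative assumptions.
import psycopg2

conn = psycopg2.connect("dbname=urban_metadata user=analyst")
query = """
    SELECT post_id, cleaned_text, image_path, created_at
    FROM processed_posts
    WHERE ST_DWithin(
        geom::geography,
        ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
        %s
    );
"""
with conn, conn.cursor() as cur:
    cur.execute(query, (30.5234, 50.4501, 500))  # (lon, lat, radius in metres)
    nearby_posts = cur.fetchall()
        </preformat>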
      </sec>
      <sec id="sec-2-2">
        <title>The Multimodal Sentiment Analysis Model</title>
        <p>Following the initial data preprocessing, the core of our framework’s predictive capability resides in
the Multimodal Sentiment Analysis Model. This sophisticated engine is architected to transform the
prepared, unimodal data streams into high-dimensional, semantically rich representations, fuse them
into a unified vector, and ultimately render a sentiment classification. The complete architecture of
this model, from its modality-specific encoders to the final classification head, is illustrated in Figure
2. The model is logically partitioned into two primary stages: a set of parallel Multimodal Encoders
responsible for feature extraction and a sequential Feature Fusion Architecture that integrates these
features for prediction.</p>
        <p>The initial task of the model is to independently process each of the three data modalities – text,
image, and geospatial data – by projecting them from their raw, heterogeneous formats into a shared,
high-dimensional latent space. This process, known as embedding, is performed by three specialized
neural network encoders, each tailored to the unique characteristics of its corresponding data type.</p>
        <p>The Text Encoder is responsible for converting the preprocessed textual data into dense vector
representations. For this task, we employ a Transformer-based architecture, specifically a pre-trained
model from the BERT (Bidirectional Encoder Representations from Transformers) family. BERT is
exceptionally well-suited for this purpose due to its ability to capture deep, bidirectional contextual
relationships between words in a sentence. By leveraging its pre-training on vast linguistic corpora,
the model can generate a fixed-length embedding for each input text that encapsulates not only its
lexical content but also its nuanced semantic and syntactic structure.</p>
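        <p>A minimal sketch of this encoding step is given below, using the Hugging Face transformers library. The checkpoint name and the use of the [CLS] token as the sentence-level representation are assumptions for illustration.</p>
        <preformat>
# Sketch of the Text Encoder: obtain a fixed-length embedding from a pre-trained
# BERT model. Checkpoint name and [CLS] pooling choice are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed_text(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = encoder(**inputs)
    # Use the [CLS] token representation as the sentence-level embedding (768-d).
    return outputs.last_hidden_state[:, 0, :].squeeze(0)

vector = embed_text("The renovated square finally feels safe and welcoming.")
print(vector.shape)  # torch.Size([768])
        </preformat>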
        <p>Concurrently, the Image Encoder processes the visual data. This component is built upon a
state-of-the-art computer vision model, such as a deep Convolutional Neural Network (CNN) like ResNet
or a Vision Transformer (ViT). These architectures are designed to extract a rich hierarchy of visual
features, ranging from low-level patterns like edges and textures to high-level semantic concepts
such as objects, scenes, and their spatial arrangements. The output of the Image Encoder is a dense
feature vector that serves as a quantitative summary of the visual content pertinent to urban
sentiment.</p>
        <p>The most novel of the encoders is the Geo Encoder, which is tasked with interpreting the
normalized geospatial information. To move beyond the limitations of simple coordinate-based
features, we model the urban environment as a complex graph, where locations of interest (e.g.,
specific intersections, public squares, or building sites) are represented as nodes and their spatial
relationships (e.g., adjacency, street network connectivity, administrative containment) are
represented as edges. We employ a Graph Neural Network (GNN) to process this structured data. A
GNN is uniquely capable of learning from the relational topology of the graph, allowing it to generate
a context-aware embedding for each location that reflects its position and significance within the
broader urban fabric. This approach enables the model to capture spatial context that would be lost in
a simple coordinate representation.
</p>
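        <p>The sketch below illustrates this idea with a small graph convolutional network built on PyTorch Geometric. The node features, graph construction, and layer sizes are illustrative assumptions, not the deployed configuration.</p>
        <preformat>
# Sketch of the Geo Encoder: a small graph convolutional network over an urban
# graph whose nodes are locations of interest and whose edges encode adjacency.
# Node features, graph construction, and layer sizes are illustrative assumptions.
import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

class GeoEncoder(torch.nn.Module):
    def __init__(self, in_dim: int = 8, hidden_dim: int = 64, out_dim: int = 128):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, out_dim)

    def forward(self, data: Data) -> torch.Tensor:
        x = torch.relu(self.conv1(data.x, data.edge_index))
        return self.conv2(x, data.edge_index)  # one embedding per location node

# Toy graph: 3 locations (e.g., a square, an intersection, a building site), 8 features each.
x = torch.randn(3, 8)
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])  # undirected adjacency
embeddings = GeoEncoder()(Data(x=x, edge_index=edge_index))
print(embeddings.shape)  # torch.Size([3, 128])
        </preformat>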
      </sec>
      <sec id="sec-2-3">
        <title>Feature Fusion and Classification</title>
        <p>Following the generation of these unimodal representations, the next critical step is to integrate them
into a single, holistic feature vector that captures the combined sentiment signal. This is the primary
function of the Feature Fusion Layer. This layer receives the fixed-length embedding vectors from the
Text, Image, and Geo Encoders and synergizes them. While simple fusion techniques such as vector
concatenation provide a baseline, our framework employs a more sophisticated cross-attention
mechanism. This allows the representations of different modalities to interact and “attend” to one
another, enabling the model to dynamically weight the importance of each modality in the context of
the others. For instance, the model can learn to amplify the importance of a negative textual
sentiment when the accompanying image visually confirms the presence of urban decay.</p>
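        <p>A minimal sketch of such a fusion layer is shown below, in which the projected text embedding queries the image and geo embeddings through standard multi-head attention. The embedding dimensions, projection sizes, and single-query formulation are illustrative assumptions.</p>
        <preformat>
# Sketch of the cross-attention fusion layer: the text embedding queries the image
# and geo embeddings so each modality's contribution is weighted in context.
# Dimensions and projection sizes are illustrative assumptions.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.text_proj = nn.Linear(768, dim)    # BERT embedding size
        self.image_proj = nn.Linear(2048, dim)  # ResNet embedding size
        self.geo_proj = nn.Linear(128, dim)     # GNN embedding size
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, image, geo):
        query = self.text_proj(text).unsqueeze(1)                   # (B, 1, dim)
        context = torch.stack(
            [self.image_proj(image), self.geo_proj(geo)], dim=1     # (B, 2, dim)
        )
        fused, _ = self.attn(query, context, context)
        return fused.squeeze(1)                                     # (B, dim)

fusion = CrossAttentionFusion()
fused = fusion(torch.randn(4, 768), torch.randn(4, 2048), torch.randn(4, 128))
print(fused.shape)  # torch.Size([4, 256])
        </preformat>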
        <p>The conceptual goal of this fusion process is to project the combined multimodal data into a
well-structured latent space where semantically similar data points cluster together. As illustrated in
Figure 3, this process maps individual multimodal inputs – each comprising an image, text, and
geolocation – to a specific point in an abstract vector space. The model is trained such that inputs
corresponding to the same sentiment class (e.g., “Public Complaint,” “Infrastructure Failure,” or
“Community Appreciation”) form dense, separable clusters within this space. This conceptualization
of projecting data into modality-invariant or shared subspaces is a powerful paradigm in multimodal
learning.</p>
        <p>The culmination of the predictive model is the Sentiment Classifier. This final component,
typically a multi-layer perceptron (MLP) with one or more hidden layers, takes the unified, fused
feature vector as its input. Its function is to perform the final classification task by mapping this rich,
multimodal representation to a probability distribution across the predefined sentiment categories
(e.g., Positive, Neutral, Negative). The output of this classifier – the predicted sentiment label and its
associated confidence score – serves as the primary input for the XAI Orchestrator, which is detailed
in the subsequent section.
</p>
      </sec>
      <sec id="sec-2-4">
        <title>The XAI Orchestrator</title>
        <p>While the Multimodal Sentiment Analysis Model provides the core predictive capability of our
framework, its function as an opaque “black box” presents a significant barrier to its adoption in the
trust-dependent domain of urban planning. To surmount this obstacle, we introduce the most novel
component of our methodology: the XAI Orchestrator. This module, whose architecture is illustrated
in the “Explainability” section of Figure 2, is engineered to deconstruct the model’s complex
decision-making process into a set of transparent, human-intelligible components. Its primary function is not
to make predictions, but to explain them, thereby transforming the model from a mere analytical tool
into a collaborative partner for human experts. The orchestrator operates in two distinct stages: first,
it employs a suite of modality-specific explainers to generate attribution scores, and second, it uses a
Large Language Model to synthesize these scores into a cohesive, multimodal narrative.
</p>
      </sec>
      <sec id="sec-2-5">
        <title>Modality-Specific Explainers</title>
        <p>The initial step in generating an explanation is to attribute the model’s final prediction back to the
input features of each modality. To accomplish this, the XAI Orchestrator leverages three distinct,
state-of-the-art explainability techniques, each selected for its suitability to a specific data type.</p>
        <p>For generating visual explanations, we employ Gradient-weighted Class Activation Mapping
(Grad-CAM). This technique is applied directly to the final convolutional layer of the Image Encoder
(e.g., ResNet). Grad-CAM utilizes the gradients of the predicted sentiment class as they flow back into
this layer, effectively weighting the importance of each activation map. The result is a
class-discriminative localization map, or “heatmap,” which highlights the specific pixels and regions
within the input image that were most influential in the model’s decision. This provides a direct and
intuitive visual answer to the question: “Where in the image was the model looking when it made this
prediction?”</p>
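        <p>The sketch below gives a minimal Grad-CAM implementation over a ResNet-style image encoder, assuming access to its final convolutional block (here, layer4 of a torchvision ResNet-50) and to the class scores. The layer choice, preprocessing, and example class index are assumptions; the production code may differ.</p>
        <preformat>
# Minimal Grad-CAM sketch: gradients of the target class score are pooled over the
# final convolutional activations to produce a coarse saliency heatmap.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT).eval()
activations, gradients = {}, {}

def save_activation(module, inputs, output):
    activations["value"] = output

def save_gradient(module, grad_input, grad_output):
    gradients["value"] = grad_output[0]

target_layer = model.layer4
target_layer.register_forward_hook(save_activation)
target_layer.register_full_backward_hook(save_gradient)

def grad_cam(image_tensor: torch.Tensor, class_index: int) -> torch.Tensor:
    """Return a heatmap (H x W, values in [0, 1]) for the given class."""
    scores = model(image_tensor)                 # (1, num_classes)
    model.zero_grad()
    scores[0, class_index].backward()
    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)  # pooled gradients
    cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image_tensor.shape[2:], mode="bilinear")
    return (cam / (cam.max() + 1e-8)).squeeze()

heatmap = grad_cam(torch.randn(1, 3, 224, 224), class_index=657)  # illustrative input
        </preformat>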
        <p>For textual and geospatial explanations, we utilize methods that can provide feature-level
attribution scores for structured and sequential data. Our framework incorporates both SHapley
Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME). SHAP, a
method grounded in cooperative game theory, calculates the marginal contribution of each feature
(e.g., a specific word in the text or a particular attribute from the Geo Encoder, such as proximity to a
landmark) to the final prediction. It provides a theoretically robust and consistent measure of feature
importance. LIME, a model-agnostic technique, operates by creating a local, interpretable surrogate
model (such as a linear model) in the vicinity of the prediction being explained. It identifies the key
features that are most influential within that local decision boundary. By applying these methods, we
can precisely quantify the impact of individual words (e.g., “dilapidated,” “vibrant”) or geospatial
characteristics on the sentiment score, answering the question: “Which specific textual or spatial
features drove this prediction?”.</p>
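        <p>For the textual modality, the attribution step can be sketched as follows with the shap library. The prediction wrapper below is a deliberately simple stand-in for the real model’s text branch, and the default text masker and class ordering are assumptions made for illustration.</p>
        <preformat>
# Sketch of SHAP-based word attribution for the text modality. The predict_proba
# function is a toy stand-in for the real model; class order is [Neg, Neu, Pos].
import numpy as np
import shap

NEGATIVE_WORDS = {"neglected", "garbage", "broken", "dilapidated"}

def predict_proba(texts: list[str]) -> np.ndarray:
    """Stand-in for the multimodal model's text branch (illustrative only)."""
    rows = []
    for text in texts:
        hits = sum(word in text.lower() for word in NEGATIVE_WORDS)
        negative = min(0.9, 0.2 + 0.3 * hits)
        rows.append([negative, 1.0 - negative - 0.05, 0.05])
    return np.array(rows)

masker = shap.maskers.Text()  # default regex tokenizer, assumed sufficient here
explainer = shap.Explainer(predict_proba, masker)
shap_values = explainer(["The park is neglected and covered in garbage."])

# Per-token contributions towards the 'Negative' class (index 0 in this sketch):
for token, value in zip(shap_values.data[0], shap_values.values[0][:, 0]):
    print(f"{token!r}: {value:+.3f}")
        </preformat>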
      </sec>
      <sec id="sec-2-6">
        <title>LLM-based Explanation Composer</title>
        <p>The raw outputs from these modality-specific explainers – a visual heatmap from Grad-CAM and
numerical importance scores from SHAP/LIME – are powerful but disjointed. To be truly actionable
for a non-technical stakeholder, they must be synthesized into a single, coherent narrative. This is the
role of the LLM-based Explanation Composer.</p>
        <p>This central component of the orchestrator first programmatically converts the structured
outputs of the explainers into a detailed natural language prompt. For instance, the prompt might
encode information such as: “The model predicted ‘Negative’ sentiment with 92% confidence.
Grad-CAM analysis indicates a strong focus on the lower-left quadrant of the image, corresponding to a
pile of rubble. SHAP analysis of the text identified the words ‘neglected’ and ‘garbage’ as having the
highest negative contributions.”</p>
        <p>This rich, context-laden prompt is then fed into a pre-trained Large Language Model (e.g., Llama 3
or a model from the GPT series). The LLM is tasked with synthesizing this structured information
into a fluid, easy-to-understand narrative. It is specifically instructed to connect the findings from
different modalities, creating a story that explains the prediction. For example, it might generate a
summary like: “The system has identified a strongly negative sentiment for this report. The visual
evidence, which focuses on an area of accumulated waste and debris, corroborates the textual
description, where the words ‘neglected’ and ‘garbage’ were the primary drivers of the negative
classification.”</p>
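        <p>A minimal sketch of this composition step is given below, covering both the programmatic prompt construction and the call to an instruction-tuned LLM. The client library, model name, and prompt wording are assumptions; any OpenAI-compatible or locally hosted endpoint could play the same role.</p>
        <preformat>
# Sketch of the LLM Explanation Composer: structured explainer outputs are turned
# into a prompt and sent to an LLM. Client, model name, and wording are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes an API key / compatible endpoint is configured

explainer_outputs = {
    "prediction": "Negative",
    "confidence": 0.92,
    "visual_focus": "lower-left quadrant of the image, corresponding to a pile of rubble",
    "top_text_tokens": ["neglected", "garbage"],
}

prompt = (
    f"The model predicted '{explainer_outputs['prediction']}' sentiment with "
    f"{explainer_outputs['confidence']:.0%} confidence. Grad-CAM analysis indicates a strong "
    f"focus on the {explainer_outputs['visual_focus']}. SHAP analysis of the text identified "
    f"the words {', '.join(explainer_outputs['top_text_tokens'])} as having the highest "
    "negative contributions. Write a short, plain-language explanation for an urban planner "
    "that connects the visual and textual evidence."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
        </preformat>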
        <p>The final output of the XAI Orchestrator is a two-part, composite explanation. The first part is a
Visual Attribution Map, which is the original input image overlaid with the Grad-CAM heatmap to
visually ground the explanation. The second part is the textual summary generated by the LLM,
providing the narrative context. Together, these components deliver a holistic, multimodal, and
interpretable justification for the AI’s decision, designed to be directly usable by urban planners and
citizens.
</p>
      </sec>
      <sec id="sec-2-7">
        <title>System Architecture and Implementation</title>
        <p>To translate our methodological framework into a functional and scalable application, we have
designed a robust, multi-tiered system architecture. This architecture, illustrated in Figure 4,
delineates the end-to-end flow of information from the end-user to the core analytical services and
back. It is engineered to support interactive, real-time analysis for stakeholders such as urban
planners and citizens, ensuring that the insights generated by our models are both accessible and
actionable. The system is comprised of four primary layers: the User Interface (UI), the API Layer, the
Backend Service, and the Results Database.</p>
        <p>The primary point of interaction for end-users is the User Interface (UI). This layer is implemented
as a responsive web application, accessible on both desktop and mobile devices, to cater to the
distinct needs of its target audiences. For Urban Planners, the interface provides a comprehensive
dashboard featuring interactive geospatial visualizations, such as heatmaps of sentiment distribution
overlaid on a city map, alongside temporal charts and detailed drill-down capabilities for individual
data points. For Citizens, a simplified interface allows for the submission of reports and the viewing
of aggregated, anonymized sentiment trends within their local communities. This dual-purpose
design ensures that the system serves as both a professional analytical tool and a platform for civic
engagement.</p>
        <p>Mediating the communication between the user-facing application and the backend processing
engine is the API Layer. This layer is implemented using a high-performance web server, such as
Nginx, and exposes a well-defined REST/GraphQL API. The API serves as a formal contract, handling
all incoming requests from the UI – for example, requests to submit new data, query for historical
sentiment in a specific area, or retrieve the explanation for a particular prediction. Utilizing a
standardized API decouples the frontend from the backend, allowing each to be developed, scaled,
and maintained independently. This architectural choice is critical for ensuring system modularity
and long-term maintainability.</p>
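        <p>To illustrate the shape of this contract, the sketch below defines one REST endpoint using FastAPI as an example Python framework; the paper specifies only a REST/GraphQL API served behind Nginx, so the framework choice, route name, and payload schema are assumptions, and the backend call is left as a placeholder.</p>
        <preformat>
# Sketch of one REST endpoint of the API layer, using FastAPI for illustration.
# Route name, payload schema, and the stubbed response are assumptions.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="XAI Orchestrator API")

class AnalysisRequest(BaseModel):
    text: str
    image_url: str
    lat: float
    lon: float

@app.post("/analyze")
def analyze(request: AnalysisRequest) -> dict:
    """Forward a citizen report to the XAI Orchestrator Service and return the
    sentiment prediction together with its composed explanation."""
    # result = orchestrator.run(request)  # placeholder for the backend service call
    return {
        "sentiment": "Negative",        # stubbed values for illustration only
        "confidence": 0.92,
        "explanation_text": "...",
        "attribution_map_url": "...",
    }
        </preformat>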
        <p>The core computational logic of the system resides within the Backend Service, which we have
implemented as the XAI Orchestrator Service. This service, developed in Python, encapsulates the
entire analytical workflow detailed in the preceding sections. It houses the pre-trained multimodal
sentiment model and the full suite of explainability modules (Grad-CAM, SHAP, LIME, and the LLM
Composer). Upon receiving a request from the API layer, this service orchestrates the execution of
the necessary models, generates both the sentiment prediction and its corresponding multimodal
explanation, and packages the results for transmission back to the user. This service-based
architecture allows for the complex machine learning workloads to be isolated on dedicated
hardware, including GPUs for model inference, ensuring that the system can handle a high volume of
requests efficiently.</p>
        <p>Finally, the persistence and retrieval of all generated data are managed by the Results Database
(DB). This layer is implemented using a robust relational database management system, such as
PostgreSQL, which is well-suited for storing the structured outputs of our system. The database
schema is designed to store the final sentiment predictions, the textual explanations generated by the
LLM, pointers to the saved Visual Attribution Maps, and all associated metadata. By persisting these
results, the system can serve historical queries rapidly without needing to re-run the computationally
expensive models, ensuring a responsive and efficient user experience. This comprehensive,
multi-layered architecture provides the foundation for a scalable, reliable, and user-centric platform for
explainable multimodal urban analysis.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>To empirically validate the performance of our proposed framework, we conducted a series of
rigorous experiments on a comprehensive test dataset. The evaluation is designed to be twofold: first,
a quantitative assessment of the multimodal model’s predictive accuracy and the contribution of its
components; and second, a qualitative analysis of the explanations generated by the XAI
Orchestrator. This section is dedicated to the quantitative findings, which establish the model’s
efficacy and justify its architectural design.
</p>
      <sec id="sec-3-1">
        <title>Quantitative Performance Evaluation</title>
        <p>The primary objective of this evaluation is to measure the predictive power of the fully integrated
Multimodal Sentiment Analysis Model. We assess its ability to accurately classify citizen feedback
into three distinct sentiment categories: Negative, Neutral, and Positive.
</p>
      </sec>
      <sec id="sec-3-2">
        <title>Classification Accuracy</title>
        <p>To obtain a granular understanding of the model’s classification performance, we first generated a
confusion matrix, which provides a disaggregated view of its predictions against the ground-truth
labels. The resulting matrix, displayed in Figure 5, visualizes the distribution of correct and incorrect
classifications for each sentiment class. The strong diagonal entries – 728, 395, and 218 for Negative,
Neutral, and Positive classes, respectively – indicate a high rate of correct predictions across the
board. The model demonstrates exceptional proficiency in identifying negative sentiment, which is
often the most prevalent and critical category in urban monitoring applications.</p>
        <p>Analysis of the off-diagonal elements reveals the model’s specific confusion patterns. The most
notable misclassification occurs between the Negative and Neutral classes, where 46 Negative
instances were incorrectly predicted as Neutral, and 27 Neutral instances were misclassified as
Negative. This suggests a degree of semantic overlap in the data for these categories, where
expressions of mild dissatisfaction may be difficult to distinguish from neutral statements. The model
exhibits the least confusion when classifying Positive instances, with only 16 such cases being
misidentified.</p>
        <p>For a more rigorous and standardized assessment, we derived key performance metrics from the
confusion matrix, which are summarized in Table 1. The model achieved a high overall accuracy of
89.4%, confirming its general effectiveness. A closer examination of the per-class metrics reveals a
more nuanced performance profile. The model attains an exceptionally high precision of 0.95 for the
Negative class, indicating that when it predicts a sentiment as negative, it is very likely to be correct.
Its recall for this class is 0.89, suggesting it successfully identifies the vast majority of all true negative
instances. Conversely, the Positive class exhibits a very high recall of 0.93, meaning the model is
adept at finding positive feedback, but with a lower precision of 0.76, indicating it sometimes
misattributes positive sentiment to neutral or negative posts. The Neutral class shows a balanced
performance with a precision and recall of 0.88. The weighted average F1-Score, which accounts for
class imbalance, stands at a robust 0.9, providing a single, comprehensive measure of the model’s
high predictive capability.</p>
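        <p>As a brief illustration of how these figures are derived, the sketch below computes per-class precision, recall, F1, and the weighted average directly from a confusion matrix such as the one in Figure 5. The matrix itself is passed in as an argument, since only part of it is reported in the text.</p>
        <preformat>
# Sketch of the metric computation behind Table 1: per-class precision, recall, F1,
# and the support-weighted F1, derived from a confusion matrix (rows = true class).
import numpy as np

def metrics_from_confusion(cm: np.ndarray) -> dict:
    """cm[i, j] = number of class-i instances predicted as class j."""
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)
    recall = tp / cm.sum(axis=1)
    f1 = 2 * precision * recall / (precision + recall)
    support = cm.sum(axis=1)
    return {
        "accuracy": tp.sum() / cm.sum(),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "weighted_f1": np.average(f1, weights=support),
    }
        </preformat>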
        <sec id="sec-3-2-1">
          <title>Sentiment Class</title>
        </sec>
        <sec id="sec-3-2-2">
          <title>Precision</title>
        </sec>
        <sec id="sec-3-2-3">
          <title>Recall</title>
        </sec>
        <sec id="sec-3-2-4">
          <title>F1-Score</title>
        </sec>
        <sec id="sec-3-2-5">
          <title>Support</title>
          <p>Negative
Neutral
Positive</p>
        </sec>
        <sec id="sec-3-2-6">
          <title>Overall</title>
          <p>Accuracy
Weighted Avg
To validate our central hypothesis that a multimodal approach provides superior performance over
unimodal methods, we conducted a comprehensive ablation study. This study systematically
deconstructs our full model to isolate and measure the contribution of each modality. We evaluated
three distinct model configurations: (1) a Text-Only baseline, (2) a bimodal model combining Text
and Image data, and (3) our full Multimodal model incorporating Text, Image, and Geospatial data.</p>
          <p>The results of this study are presented in Figure 6, which plots the weighted F1-score for each
configuration. The findings provide clear and compelling evidence for the value of data fusion. The
Text-Only model, serving as our baseline, achieved a respectable F1-score of 0.76. The introduction of
the visual modality in the Text + Image configuration yielded a significant performance increase,
elevating the F1-score to 0.83. This substantial improvement underscores the rich, and often essential,
contextual information that images provide, which is entirely absent in a purely text-based analysis.</p>
          <p>The final configuration, our full Multimodal model, which integrates the geospatial data via the
Graph Neural Network encoder, achieved the highest F1-score of 0.9. This final incremental gain
demonstrates that spatial context, while perhaps more subtle than visual or textual cues, provides a
further layer of disambiguation that is critical for resolving complex cases and achieving peak
performance. These results empirically validate our architectural design, proving that each modality
offers a unique and complementary signal, and that their effective fusion is essential for building a
highly accurate and reliable urban sentiment analysis system.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Qualitative Analysis of Multimodal Explanations</title>
        <p>While the quantitative metrics presented in the preceding section establish the high predictive
accuracy of our model, they do not fully capture the primary contribution of this work: the
generation of transparent and trustworthy explanations. To demonstrate the practical utility and
interpretive power of our XAI Orchestrator, this section presents a qualitative analysis of its outputs
through two distinct case studies. These examples, representative of common scenarios in urban
monitoring, are designed to showcase the system’s ability to deconstruct its predictions and present
its reasoning in a human-intelligible format, thereby bridging the gap between a model’s prediction
and a stakeholder’s understanding.
</p>
      </sec>
      <sec id="sec-3-4">
        <title>Case Study 1 – Deconstructing a “Negative” Sentiment Prediction from</title>
      </sec>
      <sec id="sec-3-5">
        <title>Urban Decay Imagery</title>
        <p>The first case study examines a report containing an image of a building that suffered severe damage
during a rocket attack, which is a clear instance of negative urban sentiment. Figure 7 provides a
comprehensive breakdown of the visual explanation generated by our XAI Orchestrator for this
example. Panel 1 shows the original input image, depicting the aftermath of a structural collapse.
Panel 2 displays the raw Grad-CAM heatmap, where the warmer, red-hued regions indicate areas of
high model activation and importance for the final prediction. Panel 3 presents the intuitive overlay,
blending the heatmap with the original image to directly link the model’s focus to specific visual
features. Finally, Panel 4 utilizes the heatmap as a mask to isolate the most salient regions, providing
an unambiguous view of the model’s visual attention.</p>
        <p>The analysis clearly reveals that the model’s prediction was overwhelmingly driven by the large
pile of rubble and debris at the base of the damaged structure. However, a critical challenge arises
from the nature of the pre-trained image encoder. Trained on the general-purpose ImageNet dataset,
the encoder lacks the specific vocabulary and context of urban planning. Consequently, its raw,
class-based prediction for this salient region is technically correct but contextually meaningless for a
planner; it might classify the object as “ruin” or, more abstractly, “breakwater” due to visual
similarities in texture and form.</p>
        <p>This is the juncture at which the XAI Orchestrator’s LLM Explanation Composer becomes critical.
The composer receives a structured set of inputs: the final sentiment prediction (“Negative”), the
primary visual focus (the rubble), and the raw object classification (“breakwater”). The LLM then
synthesizes this information, re-contextualizing the naive technical label into a domain-relevant
explanation. A representative output from the composer would be: “The system has classified this
submission with a Negative sentiment. The decision was primarily based on the visual content of the
image, with the model’s attention concentrated on the extensive rubble and structural collapse at the
center of the frame, which it identifies as a clear indicator of urban decay and destruction.” This
composed explanation effectively translates the opaque internal logic of the model into an actionable
and trustworthy insight for a human expert.</p>
      </sec>
      <sec id="sec-3-6">
        <title>Case Study 2 – Synthesizing a “Neutral” Sentiment Prediction from</title>
      </sec>
      <sec id="sec-3-7">
        <title>Reconstruction Activity</title>
        <p>The second case study explores a more nuanced scenario: an image depicting a building undergoing
reconstruction, featuring a prominent construction crane. This scenario could be interpreted as
neutral (objective reporting of an activity) or positive (a sign of recovery). Figure 8 presents the visual
explanation for this case. The Grad-CAM analysis (Panels 2, 3, and 4) clearly indicates that the
model’s visual attention is localized almost exclusively on the construction crane and its immediate
operational area.</p>
        <p>In this instance, the power of the XAI Orchestrator lies in its ability to synthesize multiple sources
of evidence into a coherent narrative. Let us assume the accompanying text for this image was a
simple, factual statement: “Work is underway to restore the building on Main Street.” The XAI
Orchestrator would receive the following inputs for its LLM composer: (1) a “Neutral” sentiment
prediction, (2) a primary visual focus on the “crane” from Grad-CAM, and (3) the textual content. The
LLM Composer then fuses these elements to produce the final, multimodal explanation:
“A Neutral sentiment was predicted for this report. The model’s decision was primarily influenced
by the visual presence of a construction crane, which it identified as the key object of interest. This
visual focus on reconstruction activity is consistent with the factual, non-emotive language used in
the accompanying text.”</p>
        <p>These two case studies demonstrate the core functionality of the XAI Orchestrator. It not only
provides accurate, low-level attributions through methods like Grad-CAM but, more importantly,
leverages a Large Language Model to elevate these technical outputs into high-level, contextually
aware, and directly interpretable explanations. This process is essential for building stakeholder trust
and ensuring that the insights generated by the AI system are truly actionable.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>The increasing integration of artificial intelligence into the domain of urban planning presents a
profound paradox: while the predictive power of complex, data-driven models offers unprecedented
opportunities for responsive governance, their inherent opacity creates a critical barrier to
stakeholder trust and practical adoption. This “black box” problem is particularly acute in the context
of urban revitalization, where decisions directly impact communities and demand the highest
standards of transparency and accountability. The inability of decision-makers to understand the
reasoning behind an AI-generated insight renders that insight unactionable, perpetuating a gap
between technological capability and real-world utility.</p>
      <p>In this paper, we addressed this fundamental challenge by proposing and validating a novel,
end-to-end framework for explainable multimodal sentiment analysis. Our solution is built upon a
high-performance multimodal analysis model that effectively fuses textual, visual, and geospatial data to
achieve a nuanced understanding of citizen sentiment. The core innovation, however, is the XAI
Orchestrator, a dedicated module that couples this predictive engine with a suite of state-of-the-art
interpretability techniques. By synergizing modality-specific explainers – namely Grad-CAM for
visual saliency and SHAP for textual attribution – and synthesizing their outputs through a Large
Language Model composer, our framework deconstructs complex predictions into clear,
human-intelligible justifications.</p>
      <p>Our empirical evaluation demonstrated the dual efficacy of this approach. The quantitative results
confirmed the high predictive accuracy of the underlying model, achieving a weighted F1-score of
0.9. A rigorous ablation study further validated our multimodal design, showing a clear and
significant performance improvement with the integration of each data modality. Beyond predictive
power, our qualitative case studies revealed the interpretive strength of the XAI Orchestrator,
showcasing its ability to generate composite explanations that are both visually intuitive and
narratively coherent. The system successfully translated opaque model activations into clear,
actionable insights, directly linking sentiment predictions to specific visual and textual evidence.</p>
      <p>While this work establishes a viable framework for explainable urban analysis, it also illuminates
several promising avenues for future research. A logical next step is the development of
domain-specific encoders, fine-tuned on large-scale urban datasets, which could enhance the model’s ability
to recognize contextually relevant features beyond what is possible with general-purpose
pre-training. Further investigation into more advanced feature fusion mechanisms, such as
attention-based co-learning across modalities, may yield additional performance gains. Perhaps the most
critical future direction lies in the development of robust, quantitative metrics to formally evaluate
the quality, faithfulness, and utility of the generated explanations, moving the field beyond purely
qualitative assessment. Finally, the framework’s modular design invites the future integration of
additional data modalities, such as audio from public forums or sensor data, to create an even more
holistic understanding of the urban environment.</p>
      <p>In conclusion, this research presents more than just a novel architecture; it offers a
methodological blueprint for designing and deploying AI systems that are not only accurate but also
transparent and trustworthy. By prioritizing explanation alongside prediction, our work contributes
to the critical effort of transforming AI from a “black box” tool into a collaborative partner for urban
planners, policymakers, and the public. Ultimately, it is through such human-centric, explainable
systems that we can hope to build the smarter cities of the future – cities that are not only
technologically advanced but also more accountable, equitable, and responsive to the citizens they
serve.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>The authors gratefully acknowledge the support provided by the Ministry of Education and Science
of Ukraine. This research was conducted within the framework of the applied research project titled
“Methodology for determining tonality and classification of multimodal content in territorial
revitalization projects based on neural network methods” (State Registration Number: 0125U001683).
The project was executed under the national priority research direction of “Information and
Communication Technologies”.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Gemini 2.5 Pro, Claude 4.0 Sonnet, and GPT-5
in order to: accelerate the software code generation procedure and to improve the readability and
correctness of the English language. After using these tool(s)/service(s), the author(s) reviewed and
edited the content as needed and take(s) full responsibility for the publication’s content.</p>
      <p>International Conference on Multimedia, pp. 2282–2291, October 2024. https://doi.org/10.1145/3664647.3681403.</p>
      <p>[11] D. Yang, M. Li, D. Xiao, Y. Liu, K. Yang, Z. Chen, Y. Wang, P. Zhai, K. Li, and L. Zhang, “Towards
Multimodal Sentiment Analysis Debiasing via Bias Purification,” arXiv, pp. 1–18, March 2024.
https://doi.org/10.48550/arXiv.2403.05023.</p>
      <p>[12] O. Matsiievskyi, T. Honcharenko, O. Solovei, T. Liashchenko, I. Achkasov, and V. Golenkov,
“Using Artificial Intelligence to Convert Code to Another Programming Language,” 2024 IEEE
4th International Conference on Smart Information Systems and Technologies (SIST), pp. 379–385,
May 2024. https://doi.org/10.1109/SIST61555.2024.10629305.</p>
      <p>[13] O. Matsiievskyi, I. Achkasov, Y. Borodavka, and R. Mazurenko, “Behavioral model of
autonomous robotic systems using reinforcement learning methods,” CEUR Workshop
Proceedings, vol. 3896, pp. 560–568, October 2024. https://ceur-ws.org/Vol-3896/short14.pdf.</p>
      <p>[14] S. Dolhopolov, T. Honcharenko, O. Terentyev, V. Savenko, A. Rosynskyi, N. Bodnar, and E.
Alzidi, “Multi-Stage Classification of Construction Site Modeling Objects Using Artificial
Intelligence Based on BIM Technology,” 2024 35th Conference of Open Innovations Association
(FRUCT), pp. 179–185, April 2024. https://doi.org/10.23919/fruct61870.2024.10516383.</p>
      <p>[15] A. Neftissov, A. Biloshchytskyi, I. Kazambayev, S. Dolhopolov, and T. Honcharenko, “An
Advanced Ensemble Machine Learning Framework for Estimating Long-Term Average
Discharge at Hydrological Stations Using Global Metadata,” Water, vol. 17, no. 14, p. 2097, July
2025. https://doi.org/10.3390/w17142097.</p>
      <p>[16] P. Linardatos, V. Papastefanopoulos, and S. B. Kotsiantis, “Explainable AI: A Review of Machine
Learning Interpretability Methods,” Entropy, vol. 23, no. 1, p. 18, December 2020.
https://doi.org/10.3390/e23010018.</p>
      <p>[17] D. A. Sharma and K. S. NagendraKumar, “Explainable AI: Scene Classification and GradCam
Visualization,” International Journal of Advanced Research in Science, Communication and
Technology, vol. 4, no. 4, pp. 1–5, June 2024. https://doi.org/10.48175/ijarsct-19006.</p>
      <p>[18] V. Dhore, A. Bhat, V. Nerlekar, K. Chavhan, and A. Umare, “Enhancing Explainable AI: A Hybrid
Approach Combining GradCAM and LRP for CNN Interpretability,” arXiv, pp. 1–10, May 2024.
https://doi.org/10.48550/arXiv.2405.12175.</p>
      <p>[19] Y. Yang, S. Wang, D. Li, S. Sun, and Q. Wu, “GeoLocator: A Location-Integrated Large
Multimodal Model (LMM) for Inferring Geo-Privacy,” Applied Sciences, vol. 14, no. 16, p. 7091,
August 2024. https://doi.org/10.3390/app14167091.</p>
      <p>[20] Y. Wang, T. Zhang, X. Guo, and Z. Shen, “Gradient based Feature Attribution in Explainable AI:
A Technical Review,” arXiv, pp. 1–25, March 2024. https://doi.org/10.48550/arXiv.2403.10415.</p>
      <p>[21] M. Bachhawat, “Generalizing GradCAM for Embedding Networks,” arXiv, pp. 1–7, February
2024. https://doi.org/10.48550/arXiv.2402.00909.</p>
    </sec>
  </body>
</article>