<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimodal Spatio-Temporal Vehicle Speed Prediction Using Hexagonal Grids in Santiago, Chile</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Diego Silva</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Billy Peralta</string-name>
          <email>billy.peralta@unab.cl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Orietta Nicolis</string-name>
          <email>orietta.nicolis@unab.cl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andres Bronfman</string-name>
          <email>abronfman@unab.cl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Caro</string-name>
          <email>lcaro@uct.cl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hans Lobel</string-name>
          <email>halobel@uc.cl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Pontificia Universidad Católica de Chile, Departamento de Ciencias de Computación</institution>
          ,
          <addr-line>Santiago</addr-line>
          ,
          <country country="CL">Chile</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad Andres Bello, Facultad de Ingeniería</institution>
          ,
          <addr-line>Santiago, 7500971</addr-line>
          ,
          <country country="CL">Chile</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universidad Católica de Temuco, Departamento de Ingeniería Informática</institution>
          ,
          <addr-line>Temuco</addr-line>
          ,
          <country country="CL">Chile</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The rapid growth of e-commerce and the increasing need for logistical optimization in highly congested urban environments require advanced models for vehicle speed prediction. Traditional models often overlook the influence of the geographic environment and rely solely on historical speed data, limiting their accuracy in dynamic scenarios. In addition, most approaches use square grid structures, which introduce spatial distortions and fail to capture the connectivity of road networks effectively. In this work, we propose a multimodal model that integrates spatio-temporal information from GPS sensors with satellite imagery, leveraging HexConvLSTM and MLP neural networks to enhance predictive robustness. Unlike conventional methods, our approach utilizes a hexagonal grid representation, which provides a more uniform spatial structure and an improved neighborhood representation that aligns better with road topology than conventional square grids for modeling multidirectional traffic dynamics. This paper presents the implementation and evaluation of the model, highlighting its effectiveness in improving the accuracy of route planning for freight transportation in Santiago Centro. The results show that the multimodal approach significantly reduces the mean absolute error (MAE) to 2.296 on the test dataset, outperforming a baseline model based solely on spatiotemporal data by 8.3%. This research validates the benefits of incorporating visual data and hexagonal grid-based spatial modeling into traffic prediction and suggests exploring its applicability in other urban settings.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The rapid growth of e-commerce has transformed logistics into a critical factor for business
competitiveness. Fast and efficient deliveries are now an essential requirement for consumers, who increasingly
demand shorter delivery times [1]. In this context, optimizing the planning of transport routes has
become a key challenge, particularly in highly congested urban areas such as downtown Santiago, Chile.
From a modeling point of view, traffic prediction has evolved from traditional statistical techniques,
such as ARIMA and SARIMA, to more advanced deep learning techniques based on recurrent and
convolutional neural networks [2]. However, many of these models remain limited by their exclusive
reliance on historical speed data and GPS coordinates, failing to incorporate visual environmental
information, such as road layout, green space, and building density, which affects traffic flow and is
otherwise not encoded in GPS data.</p>
      <p>A fundamental limitation of conventional approaches lies in their inability to effectively capture the
interaction between urban infrastructure and traffic dynamics. Factors such as building density, the
presence of school zones, critical intersections, and recurrent congestion patterns are often ignored
in traditional prediction models [3]. As a result, these models struggle to anticipate fluctuations in
vehicle speed with sufficient accuracy, which affects decision-making in freight transportation logistics.
To address this gap, we explore a multimodal approach that integrates spatiotemporal data from GPS
sensors with satellite imagery, providing a more comprehensive representation of the urban traffic
environment.</p>
      <p>This work introduces a multimodal prediction model based on HexConvLSTM and MLP neural
networks. The proposed architecture leverages LSTM networks to capture temporal dependencies,
while satellite imagery is processed through a Multilayer Perceptron (MLP) to extract relevant urban
features. By integrating these two modalities, our approach improves vehicle speed estimation for
freight transportation in Santiago Centro, optimizing route planning and contributing to more efficient
urban logistics management. Here, vehicle speed denotes the cell-level average traffic velocity.</p>
      <p>Figure 1 illustrates the architecture of the proposed multimodal model for vehicle speed prediction,
integrating spatiotemporal data (x1, x2, ..., xt) from GPS sensors with visual information from a satellite
image (I). The HexConvLSTM network models spatiotemporal relationships, while the MLP extracts
features from I, combining both sources to predict the future speed (xt+1).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <title>2.1. Related work</title>
        <p>Prediction of vehicle speed in urban environments has been extensively studied using deep learning
techniques. Stienen et al. [4] proposed a deep neural network model that integrates satellite data,
meteorological information, and GPS trajectories to predict vehicle speed in regions with limited data
availability. Their approach demonstrated that combining these data sources improves prediction accuracy
in areas lacking extensive historical records. The results showed that their model reduced
the mean squared error compared to traditional methods, validating the importance of incorporating
environmental data into traffic forecasting.</p>
        <p>Guo et al. [5] developed NanoSight–YOLO, an optimized model for the detection of micro-vehicles
in satellite imagery. Their work implemented an architecture based on Faster R-CNN and attention
mechanisms to enhance the detection of small objects in highly congested urban environments. The
proposal stood out for its use of advanced precision optimization techniques, which achieved
improvements in recall and model accuracy, demonstrating the effectiveness of integrating computer vision
into traffic monitoring.</p>
        <p>Cheng et al. [6] explored the automatic detection of traffic regulators at intersections using a model
based on Conditional Variational Autoencoders (CVAE). Their approach combined GPS data with
satellite imagery to classify intersections into different categories based on the presence of traffic lights
or priority signs. Using LSTM and CNN networks, they improved the identification of critical points in
road infrastructure, facilitating their integration into traffic prediction systems.</p>
        <p>Chowdhury and Sarwat [7] introduced GeoTorchAI, a deep learning framework designed to process
spatio-temporal data in raster images and neural networks. Their methodology improved efficiency in
handling large-scale geospatial data, optimizing the segmentation and classification of satellite images for
traffic prediction applications. The use of model pretraining significantly reduced computational costs
without compromising prediction accuracy.</p>
        <p>Adamiak et al. [8] presented a method for detecting vehicles and estimating their speeds using
PlanetScope SuperDove satellite imagery. A Keypoint R-CNN model tracked vehicle trajectories
across the RGB bands, and the timing difference between bands was used to estimate speed. The validation
was carried out using drone footage and GPS data from highways in Germany and Poland.</p>
        <p>Sheehan et al. [9] explored the use of deep learning and high-resolution WorldView satellite imagery
for large-scale traffic monitoring in Barcelona. Using the YOLOv3 object detection model, the study
identified vehicles across the city, achieving a precision of 0.69 and a recall of 0.79, but faced challenges
in detecting vehicles on narrow streets, in shadows, and under obstructions.</p>
        <p>Kashyap et al. [10] reviewed recent advances in deep learning for traffic flow prediction, covering
architectures such as CNNs, RNNs, LSTMs, restricted Boltzmann machines (RBMs), and stacked
autoencoders (SAEs). These models leverage multiple layers to extract higher-level features from raw input
data. Similarly, Afandizadeh et al. [2] provided a detailed comparative analysis of deep learning (DL)
and classical models for traffic forecasting. The study highlights that while DL algorithms (such as
RNNs, CNNs, and LSTMs) offer higher accuracy and adaptability, classical models (such as ARIMA
and regression-based methods) remain valuable in structured, low-complexity environments. Finally,
Mystakidis et al. [11] explored advanced Traffic Congestion Prediction (TCP) methods, focusing on
statistical models, machine learning (ML), deep learning, and ensemble approaches. They evaluated
various forecasting techniques, considering both regression and classification metrics, and outlined a
step-by-step methodology commonly used in TCP research.</p>
        <p>While prior work has demonstrated the benefits of integrating satellite or spatiotemporal data with
deep learning architectures, our method is the first to explicitly combine a HexConvLSTM model
operating on hexagonal grids with a visual MLP that processes satellite imagery, producing a unified
multimodal model for short-term speed prediction.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. HexConvLSTM</title>
        <p>Prediction of vehicle speeds in urban environments is essential for optimizing traffic flow, a task
commonly tackled using deep learning approaches such as ConvLSTM and Transformers. However, these
approaches often assume a square grid representation, introducing distortions in spatial connectivity.
Unlike square grids, the hexagonal structure offers better connectivity, as each cell has six equidistant
neighbors instead of four or eight [12]. Recently, Bahamondes et al. [13] proposed HexConvLSTM, a
neural network based on ConvLSTM adapted to hexagonal grid sequences, optimizing the representation
of vehicular traffic and improving prediction accuracy.</p>
        <p>The proposed method consists of three key stages: (i) Hexagonal Grid Representation, where raw
traffic speed data are mapped onto a structured hexagonal grid using the H3 spatial indexing system.
We used H3 resolution level 9, corresponding to hexagons with an average edge length of approximately
174 meters, balancing spatial resolution with data sparsity; (ii) Preprocessing for Compatibility,
involving upsampling, padding, and shifting operations to transform the hexagonal structure into
a format suitable for ConvLSTM while preserving its original neighborhood relationships; and (iii)
Hexagonal-Constrained Convolution, where a custom convolutional kernel enforces hexagonal
neighborhood relationships by masking non-adjacent cells in the input tensor, ensuring that only valid hex
neighbors contribute to the convolution. This ensures that feature extraction respects the inherent
properties of hexagonal data distributions.</p>
        <p>The HexConvLSTM architecture consists of a sequence of ConvLSTM layers adapted with a
hexagonal kernel constraint, followed by fully connected layers for final speed prediction. The ConvLSTM
component captures spatio-temporal dependencies in vehicle movement, leveraging recurrent
convolutional operations to model long-term traffic patterns. Meanwhile, the hexagonal transformation
ensures that the model exploits the benefits of hexagonal connectivity while remaining compatible with
conventional deep learning frameworks.</p>
        <p>This architecture incorporates hexagonal grid structures by introducing hex-aware
preprocessing and masking techniques while retaining compatibility with standard ConvLSTM
implementations, enabling seamless integration into existing traffic forecasting pipelines. Figure 2 shows the
proposed HexConvLSTM architecture and its data processing. Details can be reviewed in [13].</p>
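        <p>As an illustrative sketch only (not the implementation of [13]), the masking idea in stage (iii) can be expressed in a few lines of NumPy. The specific pair of excluded corners below is an assumption: which corners are dropped depends on the offset convention used to embed the hexagonal grid in a rectangular array.</p>

```python
import numpy as np

def hex_mask_3x3() -> np.ndarray:
    """Boolean mask for a 3x3 kernel keeping only the centre cell and its
    six hexagonal neighbours (7 positions in total)."""
    mask = np.ones((3, 3), dtype=bool)
    # Assumed axial-style embedding: the top-right and bottom-left corners
    # of the 3x3 patch are not hex-adjacent to the centre cell.
    mask[0, 2] = False
    mask[2, 0] = False
    return mask

def masked_kernel(weights: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out kernel weights at non-adjacent positions so that only
    valid hexagonal neighbours contribute to the convolution."""
    return np.where(mask, weights, 0.0)
```

        <p>Applying such a mask before every convolution step is what makes the operation hexagonally constrained while still using a standard square-kernel implementation.</p>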
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed method</title>
      <p>The proposed model combines deep learning techniques for multimodal vehicle speed prediction in
urban environments. The developed architecture integrates two complementary approaches: (1) a
HexConvLSTM network to model the spatiotemporal dynamics of GPS sensor data and (2) a CNN/MLP
to extract relevant features from satellite images. Each component and its integration into the final
model are detailed below.</p>
      <p>The model consists of two main branches that process different types of information before being
merged into a final prediction layer. Figure 3 illustrates the overall system architecture.</p>
      <sec id="sec-3-1">
        <title>3.1. HexConvLSTM Branch for Spatiotemporal Data</title>
        <p>The first branch of the model processes GPS sensor data using a HexConvLSTM network, a variant of
ConvLSTM designed to operate on a hexagonal grid instead of a square mesh. This approach enhances
spatial connectivity between cells and reduces distortion in the representation of traffic patterns.</p>
        <p>The processing flow in this branch begins with an input tensor of shape (T, 44, 15, 1), where T
represents the length of the temporal sequence and (44, 15) corresponds to the hexagonal grid. The data
is then processed by a ConvLSTM2D layer with 128 filters and ReLU activation, constrained to a hexagonal
kernel of size (5, 3) to preserve spatial dependencies. Batch normalization is applied to enhance stability
and accelerate convergence during training. Subsequently, a final convolutional layer with a (3, 3)
kernel and a single filter refines spatiotemporal features. Finally, the output is reshaped to (1, 44, 15, 1),
ensuring compatibility with the multimodal integration framework.</p>
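        <p>The shape bookkeeping of this branch can be checked with a small NumPy stand-in: a single hexagonally masked "same" convolution over the (44, 15) grid, whose output is reshaped to (1, 44, 15, 1) as described above. The (5, 3) mask pattern below is hypothetical; the true adjacency pattern follows [13].</p>

```python
import numpy as np

def hex_masked_conv2d(x: np.ndarray, kernel: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """'Same' 2D convolution of an (H, W) grid with a masked kernel, a toy
    stand-in for the hexagonally constrained ConvLSTM2D convolution."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    k = np.where(mask, kernel, 0.0)        # non-hex positions contribute 0
    xp = np.pad(x, ((ph, ph), (pw, pw)))   # zero padding keeps the shape
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

# Hypothetical (5, 3) hex mask; two corners removed for illustration only.
mask = np.ones((5, 3), dtype=bool)
mask[0, 2] = mask[4, 0] = False

x = np.random.rand(12, 44, 15, 1)                         # (T, 44, 15, 1)
last = hex_masked_conv2d(x[-1, :, :, 0], np.random.randn(5, 3), mask)
y = last.reshape(1, 44, 15, 1)                            # fusion-ready shape
```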
      </sec>
      <sec id="sec-3-2">
        <title>3.2. CNN/MLP Branch for Satellite Images</title>
        <p>The second branch of the model leverages both a Multilayer Perceptron (MLP) and convolutional
neural networks (CNNs) to extract spatial features from satellite images. Given the relatively small
and static nature of the input data (target representation: 44 × 15), an MLP can offer a computationally
efficient alternative by avoiding unnecessary spatial convolutions while still capturing relevant feature
structures.</p>
        <p>The processing flow in this branch begins with RGB input images resized to (224, 224, 3) pixels. The
images are then flattened into a one-dimensional vector, followed by two fully connected layers with
512 and 2048 neurons, both using ReLU activation. Finally, the output layer is adjusted to match the
hexagonal grid, consisting of 660 neurons with a linear activation function. Flattening preserves spatial
context because each pixel index maps to a fixed geo-coordinate, letting the MLP learn location-specific
weights.</p>
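        <p>A minimal NumPy sketch of this branch's forward pass, with randomly initialized placeholder weights (a trained model would load learned values), illustrates how the layer sizes chain together:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(z, 0.0)

# Layer sizes from the text: flatten(224*224*3) -> 512 -> 2048 -> 660 (linear).
W1 = rng.normal(0.0, 0.01, (224 * 224 * 3, 512))
W2 = rng.normal(0.0, 0.01, (512, 2048))
W3 = rng.normal(0.0, 0.01, (2048, 660))

def mlp_branch(image: np.ndarray) -> np.ndarray:
    """Flatten the RGB image, apply two ReLU layers, then a linear
    660-unit head matching the 44 x 15 hexagonal grid."""
    v = image.reshape(-1)      # (150528,)
    h1 = relu(v @ W1)          # (512,)
    h2 = relu(h1 @ W2)         # (2048,)
    return h2 @ W3             # (660,) with linear activation

out = mlp_branch(rng.random((224, 224, 3)))
```

        <p>Note that 660 = 44 × 15, so the output vector can be reshaped directly onto the hexagonal grid.</p>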
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Multimodal Fusion and Training Regime</title>
        <p>Once the two branches finish their forward passes, their feature maps are added element-wise to produce
a tensor of shape (1, 44, 15, 1) that exactly mirrors the input hexagonal grid. Keeping this layout intact
simplifies downstream error visualisation and ensures that no spatial information is lost during fusion.</p>
        <p>The HexConvLSTM branch is first trained on the GPS-only subset and then frozen; initial tests showed
that letting its weights update in the multimodal stage worsened validation accuracy. Consequently,
the only trainable parts in the full network are (i) a lightweight MLP fed with the flattened 224 × 224 ×
3 satellite image, converting it into a 660-element vector that matches the grid, and (ii) the fusion bias
term.</p>
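        <p>Under these assumptions the fusion step reduces to a reshape plus an element-wise addition; the sketch below (with a scalar standing in for the trainable fusion bias) makes the shape contract explicit:</p>

```python
import numpy as np

def fuse(hex_out: np.ndarray, mlp_out: np.ndarray, bias: float = 0.0) -> np.ndarray:
    """Reshape the 660-dim visual vector onto the (1, 44, 15, 1) hex grid
    and add it element-wise to the frozen HexConvLSTM output."""
    assert mlp_out.shape == (660,) and hex_out.shape == (1, 44, 15, 1)
    return hex_out + mlp_out.reshape(1, 44, 15, 1) + bias

pred = fuse(np.zeros((1, 44, 15, 1)), np.arange(660.0))
```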
        <p>For comparison, we also tested a CNN-based visual branch (InceptionV3, EfficientNetB7, Xception),
where the image retains its spatial structure and a global-average-pooling layer feeds a dense layer of
1024 units, followed by a 660-dimensional output. This branch is fine-tuned end-to-end, including the
custom regression head.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Data collection and preprocessing</title>
      <p>This study focuses on predicting vehicle speeds in urban environments by integrating spatiotemporal
data from GPS sensors with visual features extracted from satellite imagery. The data pipeline consists
of two primary stages: (1) data collection, which involves vehicle trajectories and satellite images; and
(2) data preprocessing and treatment.</p>
      <sec id="sec-4-1">
        <title>4.1. Data Collection</title>
        <p>Two primary sources of information were used for model construction, ensuring a comprehensive and
multimodal approach to vehicle speed prediction by integrating both spatiotemporal and visual data.</p>
        <p>The first source was GPS Sensor Data, provided by the Transport and Logistics Center of Universidad
Andrés Bello (CTL-UNAB). This dataset recorded the speed of freight vehicles operating in downtown
Santiago and included essential attributes such as date, time, latitude, longitude, speed, and vehicle
direction. The data spans from January 4th to July 25th, 2020, covering a total of 157 days, with the
exception of April, for which no records are available. Measurements were taken at an hourly frequency
between 8:00 a.m. and 7:00 p.m., resulting in 12 time steps per day. In total, approximately 22 million
records were collected, providing a rich temporal dataset that captures variations in traffic conditions
across different hours of the day, days of the week, and seasons of the year.</p>
        <p>The second source of data consisted of Satellite Images, extracted from Google Earth Engine using
the Earth Engine Python library (ee). These images represented the urban environment with high spatial
resolution, capturing road networks, infrastructure, and other environmental features that influence
vehicle speed and traffic flow. The images were specifically selected to align with the GPS sensor
locations, ensuring a meaningful correlation between visual and numerical data. The region of interest
was defined based on the highest density of GPS records, covering an area of central Santiago with
heavy traffic activity.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Data Preprocessing</title>
        <p>Data preprocessing was essential to ensure the quality and representativeness of the information fed
into the model. To achieve this, a series of steps were carried out to refine and structure the data
effectively.</p>
        <p>Geospatial filtering was applied to select only records within the study area, defined between
the coordinates [-33.4331, -70.6253] and [-33.4524, -70.6655]. This selection ensured that the dataset
accurately represented the urban region of interest and excluded extraneous data points that could
introduce noise into the predictions. From an initial dataset of approximately 22 million GPS records,
only those relevant to the study area were retained for further processing. Additionally, records with a
speed of zero were removed, as they did not contribute useful information for velocity prediction. The
dataset was further refined by excluding incomplete data entries, ensuring consistency in the features
used by the model.</p>
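        <p>The filtering rules above can be summarized as a single predicate; a minimal sketch applying the stated bounding box and dropping zero-speed records:</p>

```python
# Bounding box from the text: lat in [-33.4524, -33.4331], lon in [-70.6655, -70.6253].
LAT_MIN, LAT_MAX = -33.4524, -33.4331
LON_MIN, LON_MAX = -70.6655, -70.6253

def keep_record(lat: float, lon: float, speed: float) -> bool:
    """Retain a GPS record only if it lies inside the study area and
    reports a strictly positive speed."""
    inside = LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX
    return inside and speed > 0.0

records = [
    (-33.44, -70.65, 32.5),   # inside the box and moving: kept
    (-33.44, -70.65, 0.0),    # stationary: dropped
    (-33.50, -70.70, 40.0),   # outside the box: dropped
]
kept = [r for r in records if keep_record(*r)]
```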
        <p>To enhance spatial representation, the h3 library [15] was employed to transform GPS coordinates
into a hexagonal grid, where each hexagonal cell aggregated multiple velocity readings. This conversion
optimized spatial segmentation by reducing the distortions introduced by traditional square grids, which
often fail to capture continuous spatial relationships effectively. The hexagonal structure provided
a more precise spatial representation, improving the model’s ability to learn traffic patterns across
different areas.</p>
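        <p>Conceptually, the per-cell aggregation amounts to averaging all speed readings that index to the same hexagon. A minimal sketch follows; the H3 cell ids below are hypothetical placeholders, and in practice they would be produced by the h3 library when indexing each latitude/longitude pair at resolution 9:</p>

```python
from collections import defaultdict

def mean_speed_per_cell(records):
    """Average the speed readings falling into each hexagonal cell.
    `records` is an iterable of (h3_cell_id, speed_kmh) pairs."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cell, speed in records:
        sums[cell] += speed
        counts[cell] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}

cells = mean_speed_per_cell([
    ("89abc0000000001", 30.0),   # hypothetical cell ids, for illustration
    ("89abc0000000001", 50.0),
    ("89abc0000000002", 20.0),
])
```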
        <p>Normalization was performed using MinMax Scaling, which transformed velocity values into a
standardized range between 0 and 1. This process improved model stability by ensuring numerical
consistency across input features and preventing large disparities in scale that could hinder the learning
process. The final training dataset consisted of 1,306 sequences, each containing 12 time steps
representing hourly velocity readings, while validation and test sets contained 270 and 272 sequences, respectively.
Each sequence corresponded to a grid of 44×15 hexagonal cells, preserving the spatial-temporal structure
of the data.</p>
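        <p>MinMax scaling and its inverse (needed to report predictions back in physical speed units) are straightforward; a minimal sketch:</p>

```python
import numpy as np

def minmax_scale(x: np.ndarray):
    """Scale values to [0, 1], returning the statistics needed to invert
    the transform later."""
    lo, hi = float(x.min()), float(x.max())
    return (x - lo) / (hi - lo), lo, hi

def minmax_inverse(x_scaled: np.ndarray, lo: float, hi: float) -> np.ndarray:
    return x_scaled * (hi - lo) + lo

speeds = np.array([10.0, 25.0, 40.0, 70.0])   # toy velocity readings
scaled, lo, hi = minmax_scale(speeds)
```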
        <p>Parallel to the preprocessing of GPS data, satellite images were processed to align with the input
requirements of the neural network. Each image was resized to 224×224 pixels, a commonly used
dimension in deep learning applications that balances computational efficiency with sufficient detail
retention. The images, originally obtained in multiple resolutions, were uniformly adjusted and
converted to RGB format to maintain color consistency across different captures. Subsequently, the
images were flattened and normalized using MinMax Scaling before being reshaped back into their
original format. These preprocessing steps ensured compatibility with the neural network and facilitated
multimodal integration by standardizing both spatial and temporal inputs.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>To evaluate the performance of the proposed model, experiments were conducted on a dataset obtained
from GPS sensors and satellite imagery in the city of Santiago, Chile. The evaluation focused on
comparing the multimodal model based on HexConvLSTM + MLP with traditional approaches, such
as the exclusive use of HexConvLSTM networks. The results were analyzed using standard time
series prediction metrics and visualization of errors on spatial maps. Demo code is available at
https://github.com/dsilvaa8/multimodal.</p>
      <sec id="sec-5-1">
        <title>5.1. Experimental setting</title>
        <p>5.1.1. Hardware Specifications
The experiments were performed on a virtual machine with the following resources: a GPU setup
composed of one Tesla T4 and three Tesla P40 cards, totaling 80 GB of graphics memory, and 125 GB
of RAM.
5.1.2. Model Training and Evaluation
The model was trained using a data partitioning scheme with 70% for training, 15% for validation, and
15% for testing. The optimization process focused on minimizing the Mean Squared Error (MSE) loss.</p>
        <p>To improve training stability, several optimization strategies were implemented. Early stopping was
applied to halt training if the validation loss did not improve for 15 consecutive epochs. Additionally,
we decreased the learning rate by a factor of 0.5 if the validation loss did not improve within 5 epochs.
The model was optimized using the Adam optimizer, with an empirically tuned initial learning rate of 0.0002.</p>
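        <p>The reduce-on-plateau rule can be replayed offline to see when reductions would fire; a sketch with the stated hyperparameters (initial rate 0.0002, factor 0.5, patience 5):</p>

```python
def reduce_lr_on_plateau(losses, lr=2e-4, factor=0.5, patience=5):
    """Halve the learning rate whenever the validation loss fails to
    improve for `patience` consecutive epochs."""
    best = float("inf")
    wait = 0
    for loss in losses:
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor
                wait = 0
    return lr

# One improvement followed by five stagnant epochs triggers one halving.
final_lr = reduce_lr_on_plateau([1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9])
```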
        <p>The model’s performance was evaluated using standard time-series prediction metrics: Mean Absolute
Error (MAE), Root Mean Squared Error (RMSE), and the Coefficient of Determination (R2).</p>
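        <p>For reference, the three evaluation metrics have simple closed forms; a NumPy sketch on a toy example:</p>

```python
import numpy as np

def mae(y, yhat):
    """Mean Absolute Error."""
    return float(np.mean(np.abs(y - yhat)))

def rmse(y, yhat):
    """Root Mean Squared Error."""
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def r2(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)

y = np.array([10.0, 20.0, 30.0, 40.0])      # toy ground-truth speeds
yhat = np.array([12.0, 18.0, 33.0, 39.0])   # toy predictions
```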
      </sec>
      <sec id="sec-5-2">
        <title>5.2. CNN Model Selection</title>
        <p>To evaluate the predictive capability of different convolutional neural network (CNN) architectures,
an extensive experiment was conducted, comparing multiple models in terms of training and test loss.
Widely used architectures in the literature were analyzed, including VGG16, Xception, EfficientNetB7,
InceptionV3, and InceptionResNetV2.</p>
        <p>Table 1 summarizes the averaged results obtained after three training iterations for each model. Two
key metrics are reported: validation RMSE and train RMSE, which reflect the model’s
generalization ability and fit to the training data. EfficientNetB7, Xception, and InceptionV3 obtained the best
performance in terms of validation RMSE.</p>
        <p>Table 2 presents the average test set performance of these three best-performing CNN architectures
in Table 1, evaluated over three independent iterations.</p>
        <p>The results indicate that InceptionV3 consistently yields the best performance, achieving the lowest
values for MAE (2.542) and RMSE (6.109), while matching EfficientNetB7 in terms of the coefficient of
determination (R2 = 0.825).</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. MLP Parameter Selection</title>
        <p>To assess the impact of the number of neurons on model accuracy, experiments were conducted by
varying the number of units in each layer of the MLP network, considering a total of two layers. Table 3
presents the results of different configurations in terms of training and validation RMSE. Notably, the
best configuration from the first layer was used in the second layer.</p>
        <p>Validation set results indicate that the combination of 512 and 2048 neurons in the first and second
layers, respectively, provides the best balance between accuracy and computational efficiency.
Specifically, this configuration achieves a validation RMSE of 6.26 and a train RMSE of 5.11, demonstrating
a high generalization capacity without significant overfitting.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Model Comparison</title>
        <p>Table 4 presents the results obtained for each model on both the training and test sets. The reported
values correspond to the average performance across three independent runs for each model.</p>
        <p>The results show that the multimodal model based on HexConvLSTM + MLP achieves superior
performance across all metrics compared to other approaches. Specifically, it reduces the mean absolute
error by 8.3% compared to HexConvLSTM and provides a marginal improvement over the CNN +
HexConvLSTM model.</p>
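        <p>As a sanity check on the reported figures, the 8.3% reduction together with the multimodal MAE of 2.296 implies a HexConvLSTM-only baseline of roughly 2.50; note that this baseline value is derived arithmetically here, not quoted from the tables:</p>

```python
def pct_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction, in percent."""
    return 100.0 * (baseline - improved) / baseline

# Implied baseline: 2.296 / (1 - 0.083) ~= 2.50 (derived, not reported).
implied_baseline = 2.296 / (1 - 0.083)
reduction = pct_reduction(implied_baseline, 2.296)
```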
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Error Heat-Map Visualization</title>
        <p>To visualize the error distribution, heatmaps representing the MAE in each hexagonal grid cell within
the study area were generated. Figure 4 illustrates the errors in the test set.</p>
        <p>The spatial analysis reveals that the highest errors are concentrated in areas with high variability in
vehicle speed, such as intersections and major avenues. In contrast, in regions with more stable traffic
flow, the model achieves more accurate predictions.</p>
        <p>The conducted experiments validate the hypothesis that combining spatiotemporal and visual data
enhances vehicle speed prediction. The proposed model demonstrates advantages in terms of accuracy
and stability, and the results suggest that future improvements could be achieved by incorporating
additional dynamic data, such as weather conditions and real-time traffic events.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>The results obtained in this study confirm that the proposed multimodal model, based on the combination
of HexConvLSTM and MLP, outperforms conventional approaches in vehicle speed prediction. In
terms of MAE and RMSE, the multimodal model achieved a significant error reduction compared
to HexConvLSTM and CNN, validating the hypothesis that integrating satellite imagery improves
predictive accuracy.</p>
      <p>The comparative analysis demonstrates that incorporating visual information from the urban
environment through satellite images allows the model to capture spatial patterns that traditional models
do not consider. The proposed architecture improves predictions in areas with regular traffic conditions,
although challenges were observed in maintaining accuracy during abrupt speed fluctuations caused by
unpredictable events, such as accidents or sudden congestion.</p>
      <p>Additionally, the use of hexagonal grids in the HexConvLSTM branch offers a potentially improved
spatial representation of GPS data, mitigating some of the distortions commonly associated with
square-grid structures. This feature has been crucial to ensuring model stability in urban traffic
analysis. A somewhat surprising observation was that the MLP outperformed more advanced
CNN-based architectures such as Inception. This result is likely due to the static nature of the satellite image,
where convolutional models may not fully exploit their inherent translational invariance. Given the
relatively small resulting feature map size (44×15), the advantages of convolutional operations become
less pronounced, reducing the expected performance gap between CNNs and fully connected networks.</p>
      <p>Previous studies in the literature have explored traffic prediction using LSTM, CNN, and hybrid
models with geospatial data. However, most of them do not explicitly integrate sensor data with
satellite images in a multimodal framework. Compared to previous works, our model offers
a more comprehensive integration of spatiotemporal information. Unlike approaches that rely solely
on historical traffic data, it incorporates the geographic context of the road environment,
providing a more dynamic and context-aware prediction.</p>
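      <p>The late-fusion idea described above can be sketched in a few lines. Everything here (feature sizes, the 128-unit hidden layer, random stand-in weights) is illustrative only and is not the paper's actual architecture; only the 44×15 grid size comes from the text:</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative feature dimensions (not the paper's real sizes)
N_CELLS  = 44 * 15   # flattened hexagonal grid, matching the 44x15 map
T_FEAT   = 32        # feature vector from the temporal (HexConvLSTM-like) branch
IMG_FEAT = 64        # feature vector extracted from the satellite image

def mlp_head(x, w1, b1, w2, b2):
    # Two-layer fusion head: ReLU hidden layer, linear speed output per cell
    h = np.maximum(0.0, x @ w1 + b1)
    return h @ w2 + b2

# Fake branch outputs standing in for the two modality encoders
temporal_feat = rng.standard_normal(T_FEAT)
image_feat    = rng.standard_normal(IMG_FEAT)

# Late fusion: concatenate both modalities, then regress one speed per cell
fused = np.concatenate([temporal_feat, image_feat])
w1 = rng.standard_normal((T_FEAT + IMG_FEAT, 128)) * 0.1
b1 = np.zeros(128)
w2 = rng.standard_normal((128, N_CELLS)) * 0.1
b2 = np.zeros(N_CELLS)

speeds = mlp_head(fused, w1, b1, w2, b2)
print(speeds.shape)  # (660,)
```
      </preformat>
      <p>Concatenation before the regression head is the simplest fusion choice; it lets the static image features modulate every cell's prediction without requiring the two branches to share a spatial layout.</p>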
      <p>Despite the positive results, our method has several limitations that open promising research avenues.
First, generalisability is still unproven: the model was trained solely on downtown Santiago traffic, so
its behaviour in cities with different network layouts or demand patterns must be validated. Second,
prediction accuracy may deteriorate where only low-resolution imagery is available or where rapid
infrastructure changes outpace the satellite update cycle, calling for dynamic image-quality checks.
Finally, the hexagonal tessellation, though uniform and rotation-invariant, aggregates roads of different
functional classes and directions within a single cell, blurring lane- or direction-specific congestion
(e.g., a stalled freeway lane next to a free-flowing local road). Consequently, the current design is best
suited to area-level tasks such as fleet dispatch or hotspot screening; applications needing direction
separation should combine the grid with road-graph or edge-level GNN features, an integration we
leave for future work.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>This study developed a multimodal predictive model that integrates spatiotemporal data from GPS
sensors with satellite imagery, leveraging HexConvLSTM and MLP neural networks. The model
was trained and evaluated using traffic data from downtown Santiago, demonstrating significant
improvements in prediction accuracy compared to conventional approaches that rely solely on historical
data.</p>
      <p>Overall, the results indicate that incorporating satellite imagery into traffic prediction models
enhances the accuracy of vehicle speed estimations. Specifically, the HexConvLSTM + MLP multimodal
model achieved lower Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) than
traditional methods, highlighting the benefits of combining spatial and temporal information. Furthermore,
the proposed methodology is adaptable to other urban environments, provided that data preprocessing
and hyperparameter tuning are adjusted accordingly.</p>
      <p>For future work, we aim to assess the generalization of the model across different urban settings with
varying traffic conditions. Additionally, we plan to integrate meteorological data, urban events, and
social media information to improve the model’s adaptability to sudden traffic fluctuations. From a
technical perspective, we will explore attention-based models and Graph Neural Networks (GNNs) to
better capture complex relationships within geospatial data. Furthermore, we intend to incorporate the
YOLO network for satellite image processing, enabling more precise identification of road structures,
vehicle densities, and other key environmental features that influence traffic flow. This enhancement
will refine the integration of visual data, further improving the model’s predictive performance in
dynamic urban scenarios.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>O. Nicolis and B. Peralta acknowledge support from ANID–Fondecyt grants 1241881 and 1241882.
B. Peralta and H. Lobel appreciate the support of the National Center for Artificial Intelligence CENIA
FB210017, Basal ANID.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT for translation, grammar and spelling
checks, and for paraphrasing and rewording. After using this tool, the authors reviewed and edited the
content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
  </back>
</article>