=Paper=
{{Paper
|id=Vol-3946/BMDA_paper5
|storemode=property
|title=A Scalable Model for Vessel-Generated Underwater Noise: Enhancing Efficiency through Parallelisation
|pdfUrl=https://ceur-ws.org/Vol-3946/BMDA-5.pdf
|volume=Vol-3946
|authors=Giulia Rovinelli,Esteban Zimányi,Marta Simeoni,Davide Rocchesso,Alessandra Raffaetà
}}
==A Scalable Model for Vessel-Generated Underwater Noise: Enhancing Efficiency through Parallelisation==
A Scalable Model for Vessel-Generated Underwater Noise:
Enhancing Efficiency through Parallelisation
Giulia Rovinelli1,* , Esteban Zimányi2 , Marta Simeoni1,3 , Davide Rocchesso4 and
Alessandra Raffaetà1
1
Ca’ Foscari University of Venice, Venice, Italy
2
Université Libre de Bruxelles, Bruxelles, Belgium
3
European Centre for Living Technology (ECLT), Venice, Italy
4
Università degli Studi di Milano Statale, Milano, Italy
Abstract
Underwater noise pollution by shipping activities is widely recognised as a significant threat to marine life. The noise emitted
by vessels can have various detrimental effects on fish and marine ecosystems. Therefore, accurately estimating and analysing
vessel-generated underwater noise is a critical challenge for the protection and conservation of marine environments. For this
reason, we have built a model for the spatio-temporal characterisation of underwater noise generated by vessels. This paper
builds on this model by optimising the code pipeline, implementing table partitioning and leveraging parallelisation techniques.
These enhancements allow us to explore various partitioning methods while significantly improving the computational
performance and enabling more efficient analysis of underwater noise. Our approach not only improves the computational
efficiency but also preserves the accuracy of the noise calculations, offering a more scalable solution for large datasets.
Keywords
Spatio-temporal databases, underwater noise, parallelisation techniques
1. Introduction Moreover, direct measurement of underwater noise is not
always feasible, particularly in remote regions or deep
Underwater noise generated by human activities, espe- waters. In these cases, acoustic models are employed to
cially from shipping, is known to produce short and long simulate sound propagation. However, these models also
term effects on marine animal species. This noise pollu- require a wide range of input data, including detailed
tion can disrupt the natural acoustic environment, lead- environmental parameters and vessel-specific character-
ing to several adverse consequences. Some of the nega- istics, in addition to huge computational effort to handle
tive impacts include interference with communication, the complex calculations involved. For this reason, the
changes in behaviour, stranding, and increased mortality development of sound propagation models that balance
rates [1, 2]. Therefore, characterising underwater noise accuracy with computational efficiency is essential. Such
is crucial for monitoring the health of aquatic life, assess- models must be capable of providing reliable predictions
ing potential risks, and providing valuable information while minimising resource consumption, enabling their
to ecologists and policy makers. This enables the devel- application on larger scales or in data-intensive scenarios.
opment of effective strategies to maintain a productive In this work, building upon the model developed in [3]
and healthy ecosystem. However, measuring underwater and refined in [4], we introduce several enhancements
noise is a complex and computationally demanding task. aimed at improving the efficiency of its implementation.
In addition to the installation of hydrophones, which re- Specifically, we optimise the computational pipeline to
quires specific resources and expertise for proper deploy- handle large-scale spatio-temporal datasets more effec-
ment and calibration, the analysis of the collected data is tively while preserving the results of the previous model.
equally challenging. Once the data is acquired, it must be The optimisations include the restructuring of the code
processed to extract meaningful insights, a process that to cope with time consuming operations and the imple-
can require substantial computational power, especially mentation of table partitioning using PostgreSQL [5] and
when monitoring large areas or extended time periods. Citus [6], as well as leveraging parallelisation techniques
to improve processing speed and scalability. The frame-
Published in the Proceedings of the Workshops of the EDBT/ICDT 2025 work has been implemented in MobilityDB [7], an open-
Joint Conference (March 25-28, 2025), Barcelona, Spain source platform for managing and analysing geospatial
*
Corresponding author. trajectory data. Our framework enables various analyses
$ giulia.rovinelli@unive.it (G. Rovinelli); esteban.zimanyi@ulb.be
(E. Zimányi); simeoni@unive.it (M. Simeoni);
to estimate the impact of fishing activities on underwater
davide.rocchesso@unimi.it (D. Rocchesso); raffaeta@unive.it noise pollution.
(A. Raffaetà) To demonstrate the potential of the developed system,
Copyright © 2025 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
CEUR
ceur-ws.org
Workshop ISSN 1613-0073
Proceedings
we focus on the fishing activities in the Northern Adri- the source levels to all the other vessels, we need to re-
atic Sea, one of the most heavily exploited areas of the late the sound pressure level to the engine horsepower,
Mediterranean Sea, where underwater noise pollution the latter being available in our dataset. If we assume
is a recognised consequence of intensive fishing activity. that a constant fraction of engine power gets converted
The dataset used in this study includes AIS data from Ital- into acoustic power (i.e. acoustic power scales linearly
ian and Croatian fishing vessels for June 2020. Moreover, with horsepower), then 3 dB are added per doubling in
to determine the acoustic features of the vessel engines engine power. We adopt such a linear progression on
and refine the propagation model, we use direct acoustic logarithmic scale of engine power and the resulting value
measurements from the Interreg project SOUNDSCAPE1 , is denoted with 𝑆𝐿0 . For example, for engines between
which conducted acoustic monitoring in the Northern 100 Hp and 835 Hp, considering a frequency of 63 Hz, we
Adriatic Sea from March 2020 to June 2021. obtain a range between 123 dB and 136 dB.
The paper is organised as follows. Section 2 overviews Differences in source level may result from variations
the sound propagation model introduced in [3] and re- in speed. Specifically, as noted in [9], the intrinsic factor
fined in [4]. Section 3 focuses on the optimisation of of speed can influence the broadband source level of ships
the computational pipeline to enhance the model per- according to the following relation:
formance. Section 4 discusses the implementation of {︃
data partitioning techniques with PostgreSQL and ex- 𝑆𝐿0 if 𝑣 ≤ 𝑣0
𝑆𝐿 = (1)
plores the integration of the Citus extension to enable 𝑆𝐿0 + 15.39 𝑑𝐵 × 𝑙𝑜𝑔10 𝑣0 if 𝑣 > 𝑣0
𝑣
distributed processing. Finally, Section 5 presents some
concluding remarks. where 𝑣0 = 3.9 kn corresponds to the speed of the ref-
erence boat and 𝑣 is the actual speed of the vessel.
Trawling vessels typically generate higher levels of ra-
2. Underwater Noise Model diated noise compared to free-running vessels operating
In this section, we briefly describe the model for under- under the same machinery settings. While published data
water sound propagation based on our previous work [3] on the radiated noise from operating trawling vessels are
and significantly refined in [4] w.r.t. several aspects. limited, some studies have reported increases in radiated
The basic objective of noise modelling is to assess noise ranging from 5 dB to 15 dB during trawling activ-
how much noise a particular activity will generate in ities. Specifically, it is noted that the effect of trawling
the surrounding area. Specifically, the aim is to model is minimal below 100 Hz and increases with frequency.
the received noise level (RL) at a given point (or points), Accordingly, we assign an increase of 5 dB at 63 Hz when
based on the sound source level (SL) of the noise source, the vessel is trawling.
and the amount of sound energy which is lost as the To account for transmission loss, we adopt a combi-
sound wave propagates from the source to the receiver nation of spherical propagation and mode stripping [10].
(transmission loss or propagation loss, TL). The principal The resulting formula is:
sources of underwater noise are machinery, propellers, {︃
and cavitation. Our AIS dataset includes some data of 20 𝑙𝑜𝑔10 (𝑟) if 𝑟 ≤ 𝑟trans
𝑇𝐿 =
the fishing boats, such as the length overall (LOA) of the 15 𝑙𝑜𝑔10 (𝑟) + 5 𝑙𝑜𝑔10 (𝑟trans ) if 𝑟 > 𝑟trans
boat, the horsepower of the engine and also the fishing (2)
gear used. However, the dataset does not include direct The 15 𝑙𝑜𝑔10 (𝑟) dependence on range is known as mode
measurements of the sound pressure levels of the fishing stripping because it results from the gradual erosion of
vessels. So, we infer such values considering the general steep ray paths (high-order modes) after multiple bottom
literature about underwater noise and the measurements reflections. To determine 𝑟trans , we refer to the reference
provided by the SOUNDSCAPE project [8], which con- boat. At 63 Hz the transition is expected to occur at
ducted acoustic monitoring in the Northern Adriatic Sea around 400 m, approximately 10 times the water depth.
from March 2020 to June 2021. In particular, we use the Environmental absorption features may affect the
measurements of a hydrophone located in the middle transmission loss, especially for large distances and high
of the Adriatic Sea, taken on March 31, 2021 between frequencies. To take into account all the environmental
5:40 pm and 5:55 pm. Here, there is a unique fishing aspects that influence the sound propagation underwa-
vessel crossing nearby the hydrophone and taken as the ter, we add a term proportional to distance from the
reference boat. This allows us, by linear regression on source [11]:
sound pressure level measurements, to assign a vessel 𝑇 𝐿𝑡𝑜𝑡 = 𝑇 𝐿 + 𝛼 × 𝑟 (3)
with an 835 Hp engine, when not trawling, an estimated
At frequency 63 Hz, 𝛼 is on the order of 10−6 dB/m.
source level of 136 dB at 63 Hz. In order to associate
The classic sonar equation [12] provides an estimation
1
https://www.italy-croatia.eu/web/soundscape of the received noise level (𝑅𝐿) by subtracting the trans-
mission loss (𝑇 𝐿) from the sound source level (𝑆𝐿). How- 3. Noise Modelling Optimisation
ever, it does not consider the ambient (or background)
noise, which is present in the marine environment. The In this section, we first describe the setting of our experi-
𝑅𝐿 exceeding the ambient noise is the following: ment concerning the implementation of the underwater
noise model presented in [4]. Then, we propose some
𝑅𝐿 = 𝑆𝐿 − 𝑇 𝐿𝑡𝑜𝑡 − 𝐴𝑁 (4) optimisations of the process, and discuss the benefits
obtained in terms of time efficiency.
The SOUNDSCAPE measurements [13, 8] are also used For our experiment we focus on June 2020, one of the
to estimate the ambient noise. In particular, we employed months with the highest fishing activity in 2020. Dur-
the exceedance level 𝐿90 , which indicates the sound level ing this period, there are 642 fishing vessels, generating
that is exceeded 90% of the time. As mentioned in [13], 9, 841, 079 AIS data points and completing 7, 462 trips.
𝐿90 can be referred to as common natural acoustic con- Since the AIS data are limited to the Northern Adriatic
ditions. To account for spatial and temporal variability, Sea, we consider the projected coordinate system for
we partitioned the Northern Adriatic Sea into a 1 km × Italy, specifically the spatial reference identifier (SRID)
1 km grid and assigned noise values based on 𝐿90 mea- 6876. To process this data and build our model, we used
surements at hydrophone stations. These values were a machine that features 32 Intel(R) Xeon(R) CPU E5-4610
interpolated using the Inverse Distance Weighting (IDW) v2 processors running at 2.30 GHz, offering multithread
in QGIS2 , producing maps that capture the heterogeneous performance. It is equipped with 256 GB of DDR4 ECC
underwater acoustic environment. RAM and it utilises a 500 GB RAID 5 storage configu-
The implementation of the model to calculate the un- ration. On this machine we deployed PostgreSQL 16.6,
derwater noise generated by vessels is succinctly de- PostGIS 3.5, and MobilityDB 1.3.
scribed below (for more details, see [4]). First, the North- By using the approach from [4] recalled above, the
ern Adriatic Sea is partitioned into a regular grid com- reconstruction of the fishing vessels trajectories takes
posed of square spatial cells (1km×1km). This grid, con- 46 minutes, while the pipeline to calculate the underwa-
sisting of 43, 508 cells, is enriched with the ambient noise ter noise propagation requires approximately 44 hours.
and some environmental features (such as the sea surface The latter running time, referred as Original Pipeline in
temperature or the salinity) which are essential for noise Figure 3, is the target of our optimisations.
calculation. Then, starting from AIS data, we reconstruct We now outline the improvements to such a pipeline
the vessels trajectories and we deploy them in a spatio- to enhance efficiency, support scalability, and reduce
temporal database [14]. These trajectories are equipped computational overhead. One of the most costly oper-
with semantic information, such as the acoustic charac- ations is the selection of the cells affected by the noise
teristics of the vessel engines and the activities conducted propagation. In fact, every 60 seconds we get all the
along their paths, which are used to infer how the noise fishing vessel positions, compute the noise generated by
spreads in the area of interest. The entire trajectories the vessels (SL) and then propagate it. To accomplish
reconstruction and their semantic enrichment leverage this task for each point we build a buffer using the prop-
the temporal and spatio-temporal types of MobilityDB, agation radius 𝑟. Then, we perform a JOIN operation
as well as the functions provided by this spatio-temporal with the table Grid storing the grid cells, followed by
database. Subsequently, using the spatio-temporal func- an ST_Intersects operation to determine the cells af-
tions of MobilityDB, we apply a sampling process on the fected by the noise, i.e., those inside the buffer. Since the
vessel’s trajectory at one-minute intervals to determine ST_Intersects operation involves the geometry type,
the boat’s positions at specific temporal instants. For it inherently requires computationally expensive spatial
each position 𝑝, we estimate the decibels produced by operations, which can significantly impact the model
the vessel, based on its activity and speed. Next, we cal- performance. To avoid this computational overhead, we
culate the propagation radius 𝑟, i.e. the distance at which make two significant changes: (i) restructure the table
the noise generated by the fishing vessel gets drowned Grid and (ii) use a bounding box instead of a buffer in
into ambient noise, and we construct a buffer 𝑏 with ra- noise propagation. The aim is to find the cells involved
dius 𝑟 around 𝑝. Then, we select all the grid cells whose in the noise propagation without using the expensive
centroids fall within 𝑏 and compute the distance between operation ST_Intersects.
the sampled point 𝑝 and these centroids. This distance is
used to determine the received noise in the selected cells. Grid table restructuring. We add two new attributes
Finally, by grouping by cell id and time, we combine all to the cell of the grid: grid_r and grid_c, which indi-
the received sound levels to obtain the total noise level cate the row and column numbers within the grid. Hence,
to be associated with the cell. starting from the lower-left corner, the grid cells are num-
bered sequentially, so they are identified as (1, 1), (1, 2)
2
https://qgis.org/en/site/ and so on. This grid-based system allows for an efficient
identification of the cells within a bounding box, without results as the implementation described in Section 2. In
the need for costly spatial operations. The table Grid fact, the new execution time for June 2020 is reduced to
includes also the 𝑥 and 𝑦 coordinates of the cell centroid, 7 hours, making the code over six time faster than the
which will be used for calculating sound propagation. original version, saving 37 hours of execution time (see
The structure of the table Grid is as follows. Figure 3, where this is called Optimised Pipeline).
CREATE TABLE Grid (
grid_id integer PRIMARY KEY,
grid_r integer NOT NULL, 4. Partitioning and Parallelisation
grid_c integer NOT NULL,
centroid_x double precision, To further optimise the performance of the pipeline we
centroid_y double precision, present an analysis of various partitioning and paral-
elevation real, lelisation techniques. In particular, selecting the cells
ambient_noise real,
affected by noise propagation for each point 𝑝 (Step 3
alpha tfloat );
CREATE INDEX idx_grid_r ON Grid (grid_r); in Figure 1) remains a computationally expensive opera-
CREATE INDEX idx_grid_c ON Grid (grid_c); tion. This complexity arises from the need to perform a
JOIN operation between the table PointBoundingBox,
Note that we also add two indexes to the table Grid
which contains each vessel position along with its sound
on the columns grid_r and grid_c, to improve the
propagation bounding box, encompassing over 4 million
efficiency of spatial query operations.
points, and the table Grid, which consists of 43,508 cells.
Consequently, the JOIN involves a computational effort
Bounding box for Noise Propagation. To compute equivalent to approximately 4 million × 43 thousand
the total received noise level for each cell of our grid, we operations, making it inherently costly.
proceed as illustrated in Figure 1. After reconstructing In Section 4.1, we examine table partitioning tech-
the vessel trajectories from the AIS data, we get the posi- niques in PostgreSQL, applying both range and hash
tions of all the fishing vessels at the same time instants, partitioning strategies. In Section 4.2, we extend this ap-
i.e., every 60 seconds (Step 1 in Figure 1). For each point proach by combining PostgreSQL partitioning with mul-
𝑝, we determine the cell 𝑐 it belongs to, by comparing the tidimensional tiling, focusing on the spatial dimension.
coordinates of 𝑝 with the grid cell boundaries which are Finally, in Section 4.3, we leverage the Citus extension of
computed by adding or subtracting 500 meters from the PostgreSQL to apply sharding and take advantage of its
coordinates of the cell centroid. We calculate the noise parallel query execution capabilities.
generated by the fishing vessel obtained by adding to the
sound level associated with the horsepower of the boat, a
contribution related to the actual speed of the vessel in 𝑝 4.1. PostgreSQL Partitioning
(see Equation (1)), and the noise due to the fishing activ- The first technique we explore to enhance the execu-
ity, if it occurs in 𝑝. Then, we compute the propagation tion of our code is Table Partitioning in PostgreSQL. This
radius 𝑟 (expressed in meters) and we build the sound method consists in dividing a logically large table into
propagation bounding box (Step 2 in Figure 1), defined by smaller physical segments, with each partition being an
the minimum and maximum row and column identifiers independent table that stores a specific subset of the orig-
that enclose all the cells affected by the noise generated inal data. PostgreSQL natively supports three forms of
by the vessel at 𝑝. These boundaries are obtained simply partitioning [5]: (i) Range partitioning, where the table
by adding or subtracting 𝑟 from the row and columns is divided into ranges based on a key column or set of
identifiers of the cell 𝑐, grid_r and grid_c. Thanks to columns, with each partition containing non-overlapping
the row and column identifiers of the grid cell we avoid ranges of values; (ii) List partitioning, which explicitly
the use of the ST_Intersects operation, which is very assigns specific key value(s) to each partition, allowing
time consuming. This approach allows retrieving the precise control over data distribution; and (iii) Hash par-
cells involved in the noise calculation in just 10 seconds titioning, where the table is divided by applying a hash
for the entire dataset of June 2020. Next, we select all function to the partition key.
cells inside the bounding box and compute the distance Table partitioning offers several advantages that signif-
between 𝑝 and the cell centroids (Step 3 in Figure 1). We icantly improve both performance and data management.
use this distance to estimate the transmission loss, which It enhances query execution by allowing the database
allows us to determine the received noise level in the management system to filter out irrelevant partitions,
selected cells. By grouping by cell id and time, we com- thus speeding up query processing, especially for large
bine all the contributions of the points of the different datasets. Additionally, partitioning simplifies data man-
trajectories (Step 4 in Figure 1), thus obtaining for each agement tasks such as archiving, purging, backup and
cell the received noise level (RL). These optimisations led restore operations. Furthermore, data loading is also
to a more time-efficient pipeline that produces the same
Figure 1: Main steps in the calculation of the noise maps.
more efficient since it can be parallelised, and indexing
Next, we create four time-based partitions corresponding
becomes faster as partitions reduce the scope of the data
to the four weeks of June 2020. After inserting the data
being indexed [15]. into the partitioned table, the entries are automatically
routed to the appropriate partition. Some statistics re-
Range Partitioning on Time. To enhance the per- garding the number of rows in each partition, along with
formance of our pipeline, as we have already remarked, their disk usage, are presented in Table 1 (left).
we can improve the time execution of the JOIN oper- The query we want to optimise, which involves the
ation between the table PointBoundingBox and the partitioned table PointBb_RangePart, is the following.
table Grid. To accomplish this task we partition the SELECT eg.grid_r,eg.grid_c,pbb.trip_id,pbb.time,
table PointBoundingBox, which is defined as follows: pbb.db_boat, SQRT(POWER(pbb.x-eg.centroid_x,2) +
POWER(pbb.y-eg.centroid_y,2)) AS dist,
CREATE TABLE PointBoundingBox AS ( eg.elevation,eg.ambient_noise,
SELECT point_id,trip_id,mmsi,x,y,time,db_boat, valueAtTimestamp(eg.alpha,time::DATE) AS alpha
grid_r-radius AS r_min, FROM PointBb_RangePart pbb, Grid eg
grid_r+radius AS r_max, WHERE eg.grid_r>=r_min AND eg.grid_c>=c_min AND
grid_c-radius AS c_min, eg.grid_r<=r_max AND eg.grid_c<=c_max;
grid_c+radius AS c_max
FROM UnnestTripWithCell ); This query returns, for each spatio-temporal point
(pbb.x, pbb.y, pbb.time), the cells that are affected
where the point_id identifies the spatio-temporal point,
by the noise generated at that point by the fishing vessel,
trip_id is the identifier of the trip to which the point
and computes the distance between the point and the
belongs, mmsi refers to the vessel performing the trip,
centroids of these cells (Step 3 in Figure 1).
x and y are the coordinates of the point, time specifies
The query plan involves a combination of paral-
the date and hour of the point, and db_boat denotes
lel and sequential scans to optimise the data retrieval
the decibel level generated by the vessel at that point,
process. The first step is a parallel append opera-
based on its speed and activity. The remaining attributes
tion, which processes multiple partitions of the table
represent the row and column identifiers used to con-
PointBb_RangePart in parallel. Each partition (corre-
struct the sound propagation bounding box including all
sponding to a different time range) is accessed through
the cells affected by the noise generated by the vessel at
a parallel sequential scan. The second part of the plan
point_id.
involves a bitmap heap scan on the table Grid, where
We partition the table PointBoundingBox into four
rows are selected based on conditions that compare the
partitions based on time ranges to reflect the recurring
grid’s row and column identifiers with the corresponding
weekly pattern: fishing activity is intense from Monday
bounding box identifiers from the partitions. Specifically,
to Thursday, while significantly lower from Friday to Sun-
the query checks that the cells, identified by row grid_r
day. Additionally, this partitioning ensures a balanced
and column grid_c, lie within the minimum and maxi-
disk usage across the partitions (see Table 1). We can
mum row and column values of the bounding box. This
create the partitioned table as follows.
comparison is optimised through bitmap index scans on
CREATE TABLE PointBb_RangePart(LIKE PointBoundingBox) idx_grid_r and idx_grid_c, each filtering the data
PARTITION BY RANGE(time);
based on the row and column values. In essence, the
query plan performs a parallel scan of partitioned data,
Table 1 4.2. Space Tiling and Partitioning
Statistics for the partitions by range on the time column (left)
and by hash on the mmsi column (right). Multidimensional tiling is a technique that partitions an
𝑛-dimensional domain into tiles of varying dimensions.
Range Partitioning Hash Partitioning This approach has several applications. For instance, mul-
N. partition Disk Usage Rows Disk Usage Rows tidimensional tiling can be applied to partition and/or
1 81 MB 888,849 97 MB 1,090,196 distribute datasets across a cluster of servers. One key
2
3
115 MB
90 MB
1,269,182
990,148
87 MB
94 MB
977,776
1,059,732
advantage of this partitioning mechanism is that it pre-
4 93 MB 1,019,383 92 MB 1,039,858 serves spatial and temporal proximity, unlike traditional
hash-based partitioning methods. This distribution re-
duces the amount of data that needs to be exchanged
followed by an efficient indexed search of the grid, en- between nodes during query processing, a process com-
suring faster query execution by narrowing down the monly known as reshuffling [15].
relevant data points through partitioning and indexing. In our work, we focus on tiling with respect to the
By partitioning the table PointBb_RangePart while spatial dimension. Specifically, we partition the positions
leaving the rest of the code unchanged, the entire pipeline of vessels based on their spatial locations. The tiling can
now completes in just 2 hours and 25 minutes. The com- be either regular, where all tiles are of equal size in each
putation of sound propagation is 18.2 times faster than dimension, or adaptive, where the size of the cells may
the first implementation (which took 44 hours) and 2.9 vary across dimensions. In the first case, we employ a
times faster than the optimised version without partition- regular tiling, constructing a uniform grid consisting of
ing (which took 7 hours). 4 × 3 cells, as shown in Figure 2a. To generate this grid,
we used the MobilityDB function spaceTiles. The grid
Hash Partitioning on MMSI. As a second partition- size was manually tuned to balance the trade-off between
ing experiment, we use the hash partitioning on the mmsi the number of partitions and the data distribution within
column of the table PointBoundingBox. We aim to di- each partition. Then we create the partitioned table along
vide the table PointBoundingBox into four partitions with the corresponding tables for the space tiles, by using
based on a hash function. The partitioned table can be the List partitioning technique.
created as follows. CREATE TABLE PointBoundingBox_RegGrid(LIKE
PointBoundingBox) PARTITION BY LIST(TileId);
CREATE TABLE PointBoundingBox_HashPart (LIKE CREATE TABLE PointBb_RegGrid_1 PARTITION OF
PointBoundingBox) PARTITION BY HASH(mmsi); PointBoundingBox_RegGrid FOR VALUES IN (1);
CREATE TABLE PointBb_HashPart_1 PARTITION OF
PointBoundingBox_HashPart FOR VALUES WITH ( Only the creation of the first tile is specified. Once the
MODULUS 4, REMAINDER 0);
data is inserted into the partitioned table, the entries are
We have only reported the creation of the first hash automatically directed to their corresponding partitions.
partition. Next, we insert the values into the table The limitation of this type of tiling is that it does not
PointBoundingBox_HashPart, which are automati- ensure balanced workload distribution across the tiles.
cally distributed across the partitions. Table 1 (right) A possible solution to this issue is to use an adap-
presents some statistics on the number of rows and the tive grid, as illustrated in Figure 2b. In this case, we
disk usage of each partition. In this case, we can ob- create a grid that divides the region based on the distri-
serve that the data distribution across the four partitions bution of vessel points in the Northern Adriatic Sea. It is
is more balanced compared to the partitions obtained worth noting that some cells are smaller, as they contain
through time-based range partitioning. a higher density of data points. Then, we partition the ta-
Now we use table PointBoundingBox_HashPart, ble PointBoundingBox according to the adaptive grid
instead of table PointBb_RangePart, in the query we structure. The process of creating the partitioned table,
want to optimize, presented in the previous subsection. along with the corresponding tables for the spatial tiles,
The query plan is the same as that described for range follows the same steps as for the regular grid.
partitioning and consists of a Parallel Seq Scan across Table 2 presents statistics on the number of rows in
the four partitions of the hash-partitioned table and a each tile, as well as their respective disk usage, for both
Bitmap Heap Scan on the table Grid. The execution time the regular and adaptive grids. The table clearly shows
for June 2020 is 2 hours and 20 minutes, which is slightly that the data partitioned according to the adaptive grid
faster than the range partitioning approach. exhibits a more balanced distribution across the tiles
compared to the regular tiling. However, certain tiles
(specifically, tiles 1, 2, and 12) contain noticeably fewer
data points, because they mostly cover the mainland.
The query we aim to optimise is the one presented
(a) Regular grid. (b) Adaptive grid.
Figure 2: Partitioning of vessel trip data with a regular grid and an adaptive grid.
Table 2 age the power of a distributed system while maintaining
Statistics for the partitions by list on the tileId column. compatibility with existing PostgreSQL tools. By using
sharding and replication Citus scales PostgreSQL across
Regular Grid Adaptive Grid
several servers. Sharding is a method employed in dis-
Tile Disk Usage Rows Disk Usage Rows tributed systems to divide data horizontally across multi-
1 18 MB 168,403 32 kB 0 ple servers or nodes. It involves splitting a large dataset
2 95 MB 882,540 4000 kB 35,144 into smaller, more manageable pieces known as shards.
3 61 MB 571,392 53 MB 494,276 Each shard holds a portion of the data, and collectively,
4 34 MB 314,051 92 MB 859,769
5 77 MB 715,011 48 MB 445,491 they represent the entire dataset. Citus enables timeseries
6 23 MB 212,307 40 MB 373,537 data to be scaled by combining PostgreSQL single-node
7 6176 kB 54,928 44 MB 404,563 declarative table partitioning with its distributed shard-
8 32 kB 0 18 MB 165,596 ing capabilities, creating a scalable time-series database.
9 117 MB 1,090,983 64 MB 596,401
10 17 MB 152,747 68 MB 633,210 To optimise our pipeline, we first apply PostgreSQL
11 688 kB 5,200 15 MB 143,051 range partitioning based on time, followed by distributing
12 32 kB 0 1944 kB 16,524 the partitions using Citus sharding mechanism. Here, we
utilise Citus in a single-node cluster configuration,where
a single PostgreSQL server employs Citus to locally shard
in Section 4.1. The query plan, like the previous ones, the data (with the coordinator also acting as a worker).
combines parallel and sequential scans to optimise data This configuration has been implemented on the machine
retrieval. The first step is a parallel append opera- described in Section 3 running Citus 12.1.6. As outlined in
tion, which processes multiple partitions of the table Section 4.1 we want to partition the PointBoundingBox
PointBoundingBox_RegGrid concurrently. This is table based on time ranges. The partitions can be defined
followed by a bitmap heap scan on the table Grid, where using the following Citus function.
rows are selected based on conditions that compare the SELECT create_time_partitions (
grid’s row and column identifiers with the corresponding table_name := ‘PointBoundingBox_RangePart’,
bounding box identifiers from the partitions. By tiling the partition_interval := ‘1 week’,
space with the regular grid, the full pipeline is executed start_from := ‘2020-06-01 00:00:00’,
end_at := ‘2020-06-30 23:59:59’ );
in 2 hours 46 minutes, while using the adaptive grid it
completes in just 2 hours and 16 minutes, which slightly The function above creates weekly partitions start-
improves the techniques in Section 4.1. ing from the dates specified. Furthermore, the tables
PointBoundingBox and Grid are distributed using Ci-
4.3. Using Citus for parallelisation tus functions as follows.
SELECT create_distributed_table(
Citus3 is an extension of PostgreSQL designed to ease ‘PointBoundingBox_RangePart’, ‘point_id’);
horizontal scaling, making it suitable for handling large SELECT create_reference_table(‘Grid’);
datasets across multiple machines. It distributes both
data and queries across a cluster, allowing users to lever- The first function distributes the table
PointBoundingBox into multiple horizontal shards on
3
https://www.citusdata.com/ the point_id column. The second function distributes
Citus Adaptive Grid 3 h 30 min
Table 3 Citus Range Partitioning on Time 4h
Statistics for the partitions by range on column time with Adaptive Grid 2 h 16 min
Regular Grid 2 h 46 min
Citus. Hash Partitioning on MMSI 2 h 20 min
Range Partitioning on Time 2 h 25 min
Partition Table Disk Usage Rows Optimised Pipeline 7h
Original Pipeline 44 h
PointBoundingBox_RangePart_p2020w23 104 MB 704,384 0 10 20 30 40
PointBoundingBox_RangePart_p2020w24 137 MB 948,703 Execution Time (hours)
PointBoundingBox_RangePart_p2020w25 148 MB 1,033,152
PointBoundingBox_RangePart_p2020w26 145 MB 1,002,853
PointBoundingBox_RangePart_p2020w27 73 MB 478,470 Figure 3: Execution times (in hours) of the implementations.
the table Grid into a single shard and replicates the 5. Concluding Remarks
shard to every worker node. Tables distributed in the
second way are called reference tables and are employed Monitoring underwater noise pollution caused by hu-
to store data that requires frequent access by multiple man activities is crucial for preserving a healthy marine
nodes within a cluster. Table 3 presents statistics on ecosystem. In this paper, we presented several optimi-
the number of rows in each partition, along with their sations to the underwater noise propagation pipeline
respective disk usage. presented in [3, 4]. The goal was to enhance efficiency,
The objective, as in the previous cases, is to optimise support scalability and reduce computational overhead.
the query described in Section 4.1. When executed us- Figure 3 collects the results of our experiments on
ing Citus, the query plan reveals that the workload is June 2020 described in the previous sections. A clear
distributed across multiple tasks, with a total of 32 tasks improvement is observed between the original pipeline
created. Each task is assigned to a specific execution implementation presented in [3, 4] and the optimisations
node, ensuring efficient parallel processing. Within each proposed in this work. In particular, the space tiling tech-
task, a gathering operation takes place, using multiple nique based on an adaptive grid provided the best result,
worker threads to further parallelise the workload. The which is over 19 times faster than the original running
query plan performs two main operations: the Parallel time. The pipeline incorporating Citus (single-node) did
Append retrieves data from multiple partitioned tables, not yield better performance compared to partitioning
and the Bitmap heap scan identifies the relevant grid cells alone, mainly due to distribution planning overhead.
by verifying that their positions fall within the bounding As future work, we would like to investigate the Ci-
box. This step is optimised by index-based filtering on tus deployment in a multi-node cluster, to fully lever-
the row and column attributes, further enhancing the per- age its distributed processing capabilities. Additionally,
formance. Using Citus the entire pipeline is executed in we aim to conduct experiments with different partition
4 hours. The computation of sound propagation is 1.75 numbers (e.g., 2, 4, 8, 16) to determine whether perfor-
times faster than the optimised pipeline without parti- mance improves as the number of partitions increases,
tioning in Section 3) but it takes about 1.65 times longer or if overhead dominates at some point. Moreover, in
than the partitioned PostgreSQL version (presented in addition to space tiling with both regular and adaptive
Section 4.1). grids, quadtree-based spatial partitioning could be ex-
We also utilise Citus for the space tiling pre- plored. Finally, we plan to analyse the entire year of 2020
sented in Section 4.2. Specifically, we partition the to gain deeper insights into how partitioning and paral-
PointBoundingBox table according to the adaptive grid lelisation perform with a larger volume of data, where
structure and distribute it using the Citus function pre- their advantages are likely to become more pronounced.
viously discussed. The query plan is clearly similar to This work enhances our original underwater sound
the case described above, with the workload distributed propagation model with greater computational efficiency,
across multiple tasks. The main difference lies in the pres- offering a scalable solution for modelling underwater
ence of 12 partitioned tables. The execution time for the noise. By balancing estimation accuracy with computa-
entire pipeline, using Citus and distributing the points tional effort, it can provide a convenient alternative to
according to the adaptive grid, is 3 hours and 30 minutes, existing approaches, which often rely on hydrophone
which is slightly faster than the partitioning by the time measurements or acoustic simulations and require ex-
column. However, the pipeline incorporating Citus did tensive input data along with significant computational
not yield better performance compared to partitioning resources to manage complex calculations.
alone. As detailed in Cubukcu et al. [6], a single-node
Citus configuration does not provide immediate perfor-
mance benefits. Thus, single-node Citus is slightly slower
than single server PostgreSQL due to distributed query
planning overhead.
Acknowledgments M. Radulović, N. Rako-Gospić, D. Sabbatini, G. Sol-
dano, J. Tęgowski, T. Vučur-Blazinić, P. Vukadin,
This publication was supported by the European Union J. Zdroik, F. Madricardo, First assessment of under-
- Next Generation EU - Project ECS000043 - Innovation water sound levels in the Northern Adriatic Sea at
Ecosystem Program "Interconnected Northeast Innovation the basin scale, Scientific Data 10 (2023) 137.
Ecosystem (iNEST)", CUP H43C22000540006. [9] C. Chion, D. Lagrois, J. Dupras, A Meta-Analysis to
This work took place within the framework of the Understand the Variability in Reported Source Lev-
DoE 2023-2027 (MUR, AIS.DIP.ECCELLENZA2023_27.FF els of Noise Radiated by Ships From Opportunistic
project). Studies, Frontiers in Marine Science 6 (2019) 714.
[10] M. Ainslie, Principles of Sonar Performance Mod-
elling, Springer Praxis Books, Springer Berlin, Hei-
References delberg, Berlin, Heidelberg, 2010.
[1] H. Slabbekoorn, N. Bouton, I. van Opzeeland, A. Co- [11] C. Erbe, A. Duncan, K. J. Vigness-Raposa, Introduc-
ers, C. ten Cate, A. N. Popper, A noisy spring: the tion to Sound Propagation Under Water, Springer
impact of globally rising underwater sound levels International Publishing, Cham, 2022, pp. 185–216.
on fish, Trends in ecology & evolution 25 (2010) [12] R. J. Urick, Principles of underwater sound 3rd
419–427. edition, Peninsula Publising Los Atlos, California
[2] R. Williams, A. Wright, E. Ashe, L. Blight, R. Bruint- 22 (1983) 23–24.
jes, R. Canessa, C. Clark, S. Cullis-Suzuki, D. Dakin, [13] M. Picciulin, A. Petrizzo, F. Madricardo, A. Bar-
C. Erbe, P. Hammond, N. Merchant, P. O’Hara, banti, M. Bastianini, I. Biagiotti, S. Bosi, M. Cen-
J. Purser, A. Radford, S. Simpson, L. Thomas, turelli, A. Codarin, I. Costantini, V. Dadić, R. Falkner,
M. Wale, Impacts of anthropogenic noise on marine T. Folegot, D. Galvez, I. Leonori, S. Menegon, H. Mi-
life: Publication patterns, new discoveries, and fu- hanović, S. Muslim, A. Pari, S. Pari, G. Pleslić,
ture directions in research and management, Ocean M. Radulović, N. Rako-Gospić, D. Sabbatini,
& Coastal Management 115 (2015) 17–24. J. Tegowski, P. Vukadin, M. Ghezzo, First basin
[3] G. Rovinelli, D. Rocchesso, M. Simeoni, A. Raffaetà, scale spatial–temporal characterization of under-
Using semantic trajectories for spatio-temporal water sound in the Mediterranean Sea, Scientific
characterisation of underwater noise, in: Proceed- Reports 13 (2023) 22799.
ings of the 6th International Workshop on Big Mo- [14] B. Brandoli, A. Raffaetà, M. Simeoni, P. Adibi, F. K.
bility Data Analytics (BMDA 2024) - EDBT/ICDT Bappee, F. Pranovi, G. Rovinelli, E. Russo, C. Sil-
Workshops, volume 3651, 2024. vestri, A. Soares, S. Matwin, From multiple aspect
[4] G. Rovinelli, D. Rocchesso, M. Simeoni, E. Zimányi, trajectories to predictive analysis: a case study on
A. Raffaetà, Spatio-temporal characterisation of un- fishing vessels in the Northern Adriatic sea, GeoIn-
derwater noise through semantic trajectories, arXiv formatica 26 (2022) 551–579.
(2025). URL: http://arxiv.org/abs/2501.11131. [15] M. Sakr, A. Vaisman, E. Zimányi, Mobility Data
[5] The PostgreSQL Global Development Group, Science, Data-Centric Systems and Applications,
PostgreSQL 16.6 Documentation, 2024. URL: 1 ed., Springer Cham, 2025. Due: 04 March 2025
https://www.postgresql.org/files/documentation/ (Hardcover), 04 March 2026 (Softcover), 04 March
pdf/16/postgresql-16-A4.pdf. 2025 (eBook).
[6] U. Cubukcu, O. Erdogan, S. Pathak, S. Sannakkay-
ala, M. Slot, Citus: Distributed postgresql for data-
intensive applications, in: Proceedings of the 2021
International Conference on Management of Data,
2021, pp. 2490–2502.
[7] E. Zimányi, M. Sakr, A. Lesuisse, MobilityDB: A mo-
bility database based on PostgreSQL and PostGIS,
ACM Trans. Database Syst. 45 (2020).
[8] A. Petrizzo, A. Barbanti, G. Barfucci, M. Bastian-
ini, I. Biagiotti, S. Bosi, M. Centurelli, R. Chavanne,
A. Codarin, I. Costantini, M. Cukrov Car, V. Dadić,
F. M. Falcieri, R. Falkner, G. Farella, M. Felli, C. Fer-
rarin, T. Folegot, R. Gallou, D. Galvez, M. Ghezzo,
A. Kruss, I. Leonori, S. Menegon, H. Mihanović,
S. Muslim, A. Pari, S. Pari, M. Picciulin, G. Pleslić,