=Paper= {{Paper |id=Vol-2322/BMDA_3 |storemode=property |title=Performance Evaluation of MongoDB and PostgreSQL for Spatio-temporal Data |pdfUrl=https://ceur-ws.org/Vol-2322/BMDA_3.pdf |volume=Vol-2322 |authors=Antonios Makris,Konstantinos Tserpes,Giannis Spiliopoulos,Dimosthenis Anagnostopoulos |dblpUrl=https://dblp.org/rec/conf/edbt/MakrisTSA19 }} ==Performance Evaluation of MongoDB and PostgreSQL for Spatio-temporal Data== https://ceur-ws.org/Vol-2322/BMDA_3.pdf
Performance Evaluation of MongoDB and PostgreSQL for Spatio-temporal Data

Antonios Makris, Dept. of Informatics and Telematics, Harokopio University of Athens, Athens, Greece, amakris@hua.gr
Konstantinos Tserpes, Dept. of Informatics and Telematics, Harokopio University of Athens, Athens, Greece, tserpes@hua.gr
Giannis Spiliopoulos, MarineTraffic, London, United Kingdom, giannis.spiliopoulos@marinetraffic.com
Dimosthenis Anagnostopoulos, Dept. of Informatics and Telematics, Harokopio University of Athens, Athens, Greece, dimosthe@hua.gr

ABSTRACT
Several modern-day problems need to deal with large amounts of spatio-temporal data, and in order to meet application requirements, more and more systems are adapting to the specificities of such data. The most prominent case is perhaps data storage systems, which have developed a large number of functionalities to efficiently support spatio-temporal data operations. This work is motivated by the question of which of those data storage systems is better suited to address the needs of industrial applications. In particular, the work conducted set out to identify the most efficient data store in terms of response time, comparing a representative of each of the two main categories (NoSQL and relational), i.e. MongoDB and PostgreSQL. The evaluation is based upon real business scenarios and their subsequent queries, as well as their underlying infrastructures, and concludes in confirming the superiority of PostgreSQL. Specifically, PostgreSQL is four times faster in terms of response time in most cases and presents an average speedup of around 2 for the first query, 4 for the second and 4.2 for the third in a five-node cluster. We also observe that the average response time is roughly halved with the use of indexes in almost all cases, although the reduction is significantly smaller in PostgreSQL.

© 2019 Copyright held by the author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on CEUR-WS.org.

1 INTRODUCTION
The volume of spatial data that modern-day systems generate has met staggering growth during the last few years. Managing and analyzing these data is becoming increasingly important, enabling novel applications that may transform science and society. For example, mysteries are unravelled by harnessing the 1 TB of data generated per day by NASA's Earth Observing System [1], or the more than 140 GB of raw science spatial data generated every week by the Hubble space telescope [2]. At the same time, numerous business applications are emerging by processing the 285 billion points per year regarding aircraft movements gathered from the Automatic Dependent Surveillance Broadcast (ADS-B) system [3], the 60 MB of AIS and weather data collected every second by MarineTraffic's on-line monitoring service [4], or the 4 million geotagged tweets produced daily on Twitter [5].

Distributed database systems have proven instrumental in the effort to deal with this data deluge. These systems are distinguished by two key characteristics: a) system scalability: the underlying database system must be able to manage and store a huge amount of spatial data and to allow applications to efficiently retrieve it; and b) interactive performance: very fast response times to client requests. Some systems that natively meet those requirements for spatial data are Hadoop-GIS [6] and SpatialHadoop [7], MD-HBase [8], GeoMesa [9], SMASH [10], and systems that use spatial resilient distributed datasets (SRDDs) such as SpatialSpark [11], GeoTrellis [12] and GeoSpark [13].

The plethora of available systems and underlying technologies has left researchers and practitioners alike puzzled as to the best option to employ in order to solve the big spatial data problem at hand. The query and data characteristics only add to the confusion. It is imperative for the research community to contribute to the clarification of the purposes, and to highlight the pros and cons, of certain distributed database platforms. This work aspires to contribute towards this direction by comparing two such platforms for a particular class of requirements, i.e. those for which the response time of complex spatio-temporal queries is of high importance.

In particular, we compare the performance, in terms of response time, of a scalable document-based NoSQL datastore, MongoDB [14], and an open source object-relational database system (ORDBMS), PostgreSQL [15], with the PostGIS extension. PostGIS is a spatial extender that adds support for geographic objects. The performance is measured using a set of spatio-temporal queries that mimic real case scenarios, performed on a dataset provided by MarineTraffic^1. The systems were evaluated in different scenarios: a 5-node cluster setup versus a 1-node implementation, and with the use of indexes versus without. Each database system was deployed on EC2 instances on Amazon Web Services (AWS)^2, and for storing/retrieving the data we used an Amazon S3 bucket.

^1 MarineTraffic is an open, community-based maritime information collection project, which provides information services and allows tracking the movements of any ship in the world. It is available at: https://www.marinetraffic.com
^2 Amazon Web Services, https://aws.amazon.com/

The results show that PostgreSQL with the PostGIS extension outperforms MongoDB in all queries. Specifically, PostgreSQL is four times faster in terms of response time in most cases and presents an average speedup of around 2 for the first query, 4 for the second and 4.2 for the third in a 5-node cluster. The 5-node cluster setup outperforms the one-node implementation
in each system, although the reduction is much more noticeable in PostgreSQL. Finally, the results demonstrate that indexing reduces query execution times by about half in almost all cases, while the reduction achieved by indexes is significantly smaller in PostgreSQL.

The document is structured as follows: Section 2 provides details about related work on spatio-temporal systems and benchmark analysis; Section 3 gives a technology overview; Section 4 describes the evaluation of the spatio-temporal database systems used; Section 5 presents the experimental results; and Section 6 presents the final conclusions of this study and future work.

2 RELATED WORK
The volume of spatial data is increasing exponentially on a daily basis. Geospatial services such as GPS systems, Google Maps and NASA's Earth Observing System produce terabytes of spatial data every day, and in combination with the growing popularity of location-based services and map-based applications, there is an increasing demand for spatial support in database systems. There are challenges in managing and querying spatial data at this massive scale, such as the high computational complexity of spatial queries and the efficient handling of their big-data nature. There is a need for interactive performance in terms of response time and for a scalable architecture. Benchmarks play a crucial role in evaluating the performance and functionality of spatial databases, both for commercial users and for developers.

In [16], several database benchmarks are presented, such as Wisconsin, which was developed for the evaluation of relational database systems [16]; AS3AP, which contains a mixed workload of database transactions, queries and utility functions; and SetQuery, which supports more complex queries and is designed to evaluate decision-support systems. The above benchmarks measure the performance of a system in general, but there are also benchmarks explicitly designed to evaluate the capabilities of spatio-temporal databases, such as SEQUOIA 2000 [17] and the Paradise Geo-Spatial DBMS (PGS-DBMS) [18]. SEQUOIA 2000 proposes a set of 11 queries to evaluate performance, while PGS-DBMS presents 14 queries (the first nine queries are the same in both benchmark systems). Although the above benchmarks seem adequate for evaluating a spatial database, things change when the evaluation includes the temporal factor: only a few queries from either benchmark have a temporal component. The 3-Dimensional spatio-temporal benchmark expands these benchmarks into 3 dimensions in order to simulate real-life scenarios. Its main difference from the other systems is the addition of two new features: temporal processing/temporal updates and three-dimensional support.

Another benchmark for spatial database evaluation, Jackpine, is presented in [19]. While a number of other benchmarks are limited to a specific database or application, Jackpine offers one important feature: portability, in the sense that it can support any database with a JDBC driver implementation. It supports micro benchmarking, i.e. a number of spatial queries, analysis and loading functions covering spatial relationships, and macro benchmarking with queries that address real-world problems. It also includes all the vector queries from the SEQUOIA 2000 benchmark.

In [20], the authors compare five Spark-based spatial analytics systems (SpatialSpark, GeoSpark, Simba, Magellan, LocationSpark) using five different spatial queries (range query, kNN query, spatial joins between various geometric datatypes, distance join, and kNN join) and four different datatypes (points, linestrings, rectangles, and polygons). In order to evaluate these modern, in-memory spatial systems, real-world datasets are used, and the experiments focus on the major features supported by the systems. The results show the strengths and weaknesses of the compared systems; in particular, GeoSpark seems to be the most complete spatial analytics system because of the data types and queries it supports.

In [9], two distributed database technologies are presented and evaluated: GeoMesa, which focuses on geotemporal indexes, and Elasticsearch, a document-oriented data store that can handle arbitrary data and may maintain a geospatial index. In general, GeoMesa is an open-source, distributed, spatio-temporal database built on a number of distributed cloud data storage systems, including Accumulo, HBase, Cassandra, and Kafka. It can provide spatio-temporal indexing for BigTable and its clones (HBase, Apache Accumulo) using space-filling curves to project multi-dimensional spatio-temporal data onto the single-dimensional linear key space imposed by the database. On the other hand, Elasticsearch uses Z-order spatial-prefix-based indexes that work for all types of vector data (points, lines and polygons), as well as a balanced KD-tree that works better for point data. For batch processing, GeoMesa leverages Apache Spark, and for streaming geospatial event processing, Apache Storm and Apache Kafka.

A computing system for processing large-scale spatial data, called GeoSpark, is presented in [13]. Apache Spark is an in-memory cluster computing system that provides a data abstraction called Resilient Distributed Datasets (RDDs), which are collections of objects partitioned across a cluster. Its main drawback is that it does not support spatial operations on data. This gap is filled by GeoSpark, which extends the core of Apache Spark to support spatial data types, spatial indexes and computations. GeoSpark provides a set of out-of-the-box Spatial Resilient Distributed Dataset (SRDD) types that support geometrical and distance operations, along with spatial data index strategies that partition the SRDDs using a grid structure and thereafter assign grids to machines for parallel execution.

SMASH [10] is a highly scalable cloud-based solution; the technologies involved in the SMASH architecture are GeoServer, GeoMesa, Apache Accumulo, Apache Spark and Apache Hadoop. SMASH is a collection of software components that work together to create a complete framework that can tackle issues such as fetching, searching, storing and visualizing data directly for the demands of traffic analytics. For spatio-temporal indexing of geospatial data, GeoMesa and GeoServer are used. GeoMesa provides spatio-temporal indexing on top of the Accumulo BigTable DBMS and is able to provide high levels of spatial querying and data manipulation, leveraging a highly parallel indexing strategy that uses a geohashing algorithm (a three-dimensional Z-order curve). GeoServer is used to serve maps and vector data to geospatial clients and allows users to share, process and edit geospatial data.

Finally, [6] presents a system called Hadoop-GIS, a scalable and high-performance spatial data warehousing system that can efficiently perform large-scale spatial queries on Hadoop. It provides spatial data partitioning for task parallelization through MapReduce, an index-driven spatial query engine to support various types of spatial queries (point, join, cross-matching and nearest neighbor), an expressive spatial query language that extends HiveQL with spatial constructs, and boundary handling to generate correct results. In order to achieve high performance, the system partitions time-consuming spatial query components into smaller tasks and processes them in parallel while preserving the correct query semantics.
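The space-filling-curve idea used by systems such as GeoMesa can be illustrated with a minimal sketch: interleaving the bits of discretized longitude and latitude yields a single Z-order (Morton) key, so that points close in space tend to receive nearby scalar keys. The bit resolution, the coordinate ranges and the encoding below are illustrative assumptions, not GeoMesa's actual key layout.

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two non-negative integers (Morton/Z-order).

    x and y are coordinates already discretized to `bits`-bit cell indices;
    the result is a single scalar key usable in a linear key space.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # even bit positions carry x
        key |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions carry y
    return key


def discretize(value: float, lo: float, hi: float, bits: int = 16) -> int:
    """Map a coordinate in [lo, hi] to an integer cell index."""
    cells = (1 << bits) - 1
    return int((value - lo) / (hi - lo) * cells)


# Example: two nearby positions (lon/lat, roughly in the Mediterranean)
# mapped onto a single linear key each.
key_a = z_order_key(discretize(14.25, -6.0, 37.0), discretize(35.90, 30.0, 46.0))
key_b = z_order_key(discretize(14.26, -6.0, 37.0), discretize(35.91, 30.0, 46.0))
```

A real system would additionally interleave a discretized timestamp to obtain a three-dimensional key, as the geohashing strategy described for SMASH above suggests.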
3 TECHNOLOGY OVERVIEW
In order to evaluate the set of spatio-temporal queries, two different systems are employed and compared: MongoDB and PostgreSQL with the PostGIS extension.

MongoDB [21] is an open source document-based NoSQL datastore, commercially supported by 10gen. Although MongoDB is non-relational, it implements many features of relational databases, such as sorting, secondary indexing, range queries and nested document querying. Operators such as create, insert, read, update and remove are supported, as well as manual indexing, indexing on embedded documents and indexes on location-based data. In such systems, data are stored in collections of documents, which are entities that provide some structure and encoding for the managed data. Each document is essentially an associative array of scalar values, lists or nested arrays. Every document has a unique special key, "ObjectId", used for explicit identification; this key and the corresponding document are conceptually similar to a key-value pair. MongoDB documents are serialized naturally as JavaScript Object Notation (JSON) objects and stored internally using BSON, a binary encoding of JSON [22]. As in all NoSQL systems, there are no schema restrictions in MongoDB, and it can support semi-structured data and multi-attribute lookups on records that may have different kinds of key-value pairs [23]. In general, documents are semi-structured files like XML, JSON, YAML and CSV. For data storage there are two options: a) nesting documents inside each other, which can work for one-to-one or one-to-many relationships, and b) referencing documents, in which the referenced document is only retrieved when the user requests the data inside it. To support spatial functionality, data are stored in GeoJSON, a format for encoding a variety of geographical data structures [24]. GeoJSON supports: a) geometry types such as Point, LineString, Polygon, MultiPoint, MultiLineString and MultiPolygon; b) Feature, a geometric object with additional properties; and c) FeatureCollection, a set of features. Each GeoJSON document is composed of two fields: i) type, the shape being represented, which informs a GeoJSON reader how to interpret the "coordinates" field, and ii) coordinates, an array of points whose particular arrangement is determined by the "type" field. The geographical representation needs to follow the GeoJSON format structure in order for a geospatial index to be set on the geographic information. MongoDB first computes geohash values for the coordinate pairs and then indexes these geohash values. Indexing is an important factor in speeding up query processing. MongoDB provides BTree indexes to support specific types of data and queries: Single Field, Compound, Multikey, Text, Hashed and Geospatial indexes. To support efficient queries on geospatial coordinate data, MongoDB provides two special indexes: the 2d index, which uses planar geometry when returning results, and the 2dsphere index, which uses spherical geometry. A 2dsphere index supports queries that calculate geometries on an earth-like sphere and can handle all geospatial queries: queries for inclusion, intersection and proximity. It supports four geospatial query operators for spatio-temporal functionality, $geoIntersects, $geoWithin, $near and $nearSphere, and uses the WGS84 reference system for geospatial queries on GeoJSON objects.

On the other hand, PostgreSQL is an open source object-relational database system (ORDBMS). A special extension called PostGIS is available, which integrates several geo-functions and supports geographic objects. The PostGIS implementation is based on "light-weight" geometries, and its indexes are optimized to reduce disk and memory usage. The interface language of the PostgreSQL database is standard SQL. PostGIS has the most comprehensive geo-functionality, with more than one thousand spatial functions. It supports the geometry types Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon and GeometryCollection. Several spatial operators are available for geospatial measurements such as area, distance, length and perimeter. PostgreSQL supports several types of indexes, such as BTree, Hash, Generalized Inverted Indexes (GIN) and the Generalized Search Tree (GiST), here in the form of R-tree-over-GiST. The default index type is BTree, which works with all datatypes and can be used efficiently for equality and range queries. For general balanced tree structures and high-speed spatial querying, PostgreSQL uses GiST indexes, which can index geometric data types as well as full-text search.

4 EVALUATING SPATIO-TEMPORAL DATABASES

4.1 Dataset Overview
In order to evaluate the performance of the spatio-temporal databases, we employed a dataset (11 GB) provided to us by the community-based AIS vessel tracking system (VTS) of MarineTraffic. The dataset provides information on 43,288 unique vessels and contains 146,491,511 AIS records in total, each comprising 8 attributes, as described in Table 1. The area that our dataset covers is bounded by a rectangle within the Mediterranean Sea. The vessels were monitored over a 3-month period, starting on May 1st, 2016 and ending on July 31st, 2016.

Feature             | Description
ship_id             | Unique identifier for each ship
latitude, longitude | Geographical location in decimal degrees
status              | Current position status
speed               | Speed over ground in knots
course              | Course over ground in degrees, with 0 corresponding to north
heading             | Ship's heading in degrees, with 0 corresponding to north
timestamp           | Full UTC timestamp
Table 1. Dataset Attributes

4.2 Use Case - Queries
To test the performance of each spatial database, we utilize a set of queries that mimic real-world scenarios. We employed three complex spatio-temporal queries, selected because they contain both spatial and temporal predicates:

(1) Find the coordinates of different amounts of vessels from 1 May 2016 to 31 July 2016 (the entire time window) within the whole bounded area, Q1
                                                                              Polygons        Records Retuned Amount of ships
                                                                              polygon2.S      15.502.808         8.854
                                                                              polygon1.S      11.564.115         10.730
                                                                              polygon2.F      5.194.874          6.185
                                                                              polygon1.F      745.902            4.182
                                                                                               Table 4. Q3 results

      (a) Polygon1.F                          (b) Polygon1.S


                                                                       is by using replica set. While MongoDB offers standard primary-
                                                                       secondary replication, it is more common to use MongoDB’s
                                                                       replica sets. A replica set is a group of multiple coordinated in-
                                                                       stances that host the same dataset and work together to ensure
                                                                       superior availability. In a replica set, only one node acts as pri-
      (c) Polygon2.F                          (d) Polygon2.S           mary that receives all write operations while the other instances
                                                                       called secondaries and apply operations from the primary. All
Figure 1. Geographical polygons with trajectories of the               data are replicated from the primary to secondary nodes. If the
vessels                                                                primary node ever fails or becomes unavailable or maintained,
                                                                       one of the replicas will automatically be elected through a con-
                                                                       sortium as the replacement. After the recovery of failed node, it
              Amount of ships Records returned                         joins the replica set again and works this time as a secondary
              43.288           146.491.511                             node.
              21.644           72.349.832                                  The problem with replica set configuration is the selection
              10.822           36.928.530                              of a member to perform a query in case of read requests. The
              5.481            18.909.184                              question is which node to choose, the primary or a secondary
                     Table 2. Q1 results                               and if a request queries a secondary, which one should be used. In
                                                                       MongoDB it is possible to control this choice with read preferences
                                                                       solution. When an application queries a replica set, there is the
      Time window      Records Returned Amount of ships                opportunity to trade off consistency, availability, latency and
      2 months         95.332.760          37.368                      throughput for each kind of query. This is the problem that read
      1 month          48.884.829          31.531                      preferences solve: how to specify the read preferences among
      10 days          14.362.160          22.403                      above trade offs, so the request falls to the best member of the
      1 day            1.142.337           11.122                      replica set for each query. For the distribution of queries, the
                                                                       read preference provides the solution. The categories of read
                       Table 3. Q2 results
                                                                       preferences are: i) PRIMARY: Read from the primary, ii) PRIMARY
                                                                       PREFERRED: Read from the primary if available, otherwise read
                                                                       from a secondary, iii) SECONDARY: Read from a secondary, iv)
                                                                       SECONDARY PREFERRED: Read from a secondary if available,
   (2) Find coordinates of vessels for different time windows
                                                                       otherwise from the primary and finally v) NEAREST: Read from
       within the whole bounded area, Q2
                                                                       any available member. Because our goal is to achieve maximum
   (3) Find coordinates of vessels for different geographical poly-
                                                                       throughput and an evenly distribution of load across the members
gons within the entire time window, Q3.
   Specifically, Q1 fetches coordinates for an increasing number of vessels, as shown in Table 2. The query is performed for all unique vessels in the dataset, then for half of them, for 1/4 and finally for 1/8 of them, in order to examine the scalability of the database systems. Q2 fetches vessel coordinates for different time windows (1 day, 10 days, 1 month and 2 months), as shown in Table 3. Q3 fetches coordinates within different geographical polygons, as shown in Table 4. The polygons (polygon1.F, polygon2.F, polygon1.S, polygon2.S) are bounding boxes within the Mediterranean Sea. Figure 1 shows a graphical representation of these polygons together with the trajectories of the vessels inside them. We used QGIS³, a tool that visualizes, analyzes and publishes geospatial information.

4.3 System Architecture
We have deployed a MongoDB cluster that contains a master node server (primary) and four replication slaves (secondaries) in Replica Set mode. One way to achieve replication in MongoDB […] of the set, we used the NEAREST read preference and set the value secondary_acceptable_latency_ms very high, at 500 ms. This value is used in MongoDB to track each member's ping time and to query only the "nearest" member, or any random member that is no more than the default value of 15 ms "farther" than it. In general, load balancing and query distribution are among the most important factors in distributed systems. An even distribution of load across the nodes of the system reduces the probability that a node turns into a hotspot, and this property also acts as a safeguard for system reliability [25].
   Additionally, we have deployed a PostgreSQL cluster that contains a master server and four slaves in Streaming Replication mode. The slaves keep an exact copy of the master's data, apply the stream to their own data, and stay in "Hot Standby" mode, ready to be promoted to master in case of failure. Streaming replication is a good tactic for data redundancy between nodes, but for load balancing a mechanism is required that splits the requests between the copies of the data. For this reason we use Pgpool-2⁴, a middleware that works between PostgreSQL servers and a PostgreSQL database client.

³ QGIS: A Free and Open Source Geographic Information System, https://qgis.org/en/site/
⁴ Pgpool-2, http://www.pgpool.net/mediawiki/index.php/Main_Page
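The NEAREST read-preference configuration described above can be sketched as a MongoDB connection string. The host names, replica-set name and database name below are hypothetical, and `localThresholdMS` is the connection-string spelling of the latency window (older drivers exposed it as `secondary_acceptable_latency_ms`), so the exact option name is version-dependent:

```python
# Hypothetical 5-member replica set; the options mirror the settings in the
# text: NEAREST read preference and a 500 ms acceptable-latency window
# (well above MongoDB's 15 ms default).
hosts = ",".join(f"node{i}:27017" for i in range(1, 6))
options = {
    "replicaSet": "rs0",            # assumed replica-set name
    "readPreference": "nearest",
    "localThresholdMS": "500",      # latency window for eligible members
}
opts = "&".join(f"{k}={v}" for k, v in options.items())
mongo_uri = f"mongodb://{hosts}/ais?{opts}"   # 'ais' database name is assumed
print(mongo_uri)
```

A driver given this URI spreads reads across every member whose measured ping time is within 500 ms of the fastest one, which is the even-load behavior the text aims for.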
Pgpool-2 examines each query and, if the query is read-only, forwards it to one of the slave nodes; otherwise, in the case of a write, it is forwarded to the master server. With this configuration the read load is split between the slave nodes of the cluster, thus achieving improved system performance. Pgpool-2 provides the following features. Connection Pooling: connections are saved and reused whenever a new connection with the same properties arrives, thus reducing connection overhead and improving system throughput. Replication: replication creates a real-time backup on physical disks, so that the service can continue without stopping servers in case of a disk failure. Load Balancing: the load on each server is reduced by distributing SELECT queries among multiple servers, thus improving the system's overall throughput. Limiting Exceeding Connections: connections are rejected once a limit on the maximum number of concurrent connections is reached.

4.4 Data Ingestion
The dataset used for the experiments was initially in CSV format. As we mentioned above, in MongoDB the geographical representation needs to follow the GeoJSON format structure in order to be able to set a geospatial index on the geographic information; thus the first step was the conversion of the data into the appropriate format. For data ingestion we used the mongoimport tool to import the data into the MongoDB database. The total size the dataset occupies in the MongoDB collection is 116 GB, and each record has a size of about 275 bytes.
   In PostgreSQL there is a COPY mechanism for bulk loading data that can achieve good throughput. The data must be in CSV format and the command accepts a number of delimiters. The next step after loading the data into the database is the conversion of the latitude and longitude columns to PostGIS POINT geometry. These columns must be converted into geometry data which can subsequently be queried spatially. We created a column called the_geom using a predefined PostGIS function; in essence it contains the POINT geometry created from the latitude and longitude of each record. The spatial reference ID (SRID) of the geometry instance (latitude, longitude) is 4326 (WGS84). The World Geodetic System (WGS) is the (three-dimensional) geographic coordinate system defined for GeoJSON and used by GPS to express locations on the earth. The latest revision is WGS 84 (also known as WGS 1984, EPSG:4326) [26]. The total size the dataset occupies in the PostgreSQL database is 32 GB and each row's size is about 96 bytes, while in MongoDB it is almost 4 times larger.
   The reason for this behavior is that the data stored in MongoDB are in GeoJSON format, and each record consists of many extra characters and a unique automatically created id called ObjectId. Thus, each record is significantly bigger in size than it was in its original CSV format. On the other hand, in PostgreSQL the data were ingested into the database as CSV, with the addition of the the_geom column that contains the POINT geometry of each latitude and longitude pair.

4.5 Cluster Setup - AWS
To benchmark MongoDB and PostgreSQL we deployed each database system on Amazon Web Services (AWS) EC2 instances. For storing the dataset, we used an S3 bucket, also provided by AWS. The configuration used is described below.
   MongoDB. The MongoDB cluster consists of 5 x r4.xlarge instances within a Virtual Private Cloud (Amazon VPC). One node is the primary and the others play the role of the replicas. We installed MongoDB version 3.6.5 in Replica Set mode. Each instance operates on Amazon Linux 2 AMI and consists of 4 CPUs x 2.30 GHz, 30.5 GB DDR4 RAM, 500 GB of general purpose SSD storage of type EBS, up to 10 Gigabit network performance and IPv6 support. Amazon Elastic Block Store (Amazon EBS) provides persistent block storage volumes for use with Amazon EC2 instances in the AWS Cloud. Each EBS volume is automatically replicated within its Availability Zone to protect from component failures, thus offering high availability and durability. The instances are also EBS-optimized, which means that they provide additional throughput for EBS I/O and, as a result, improved performance.
   PostgreSQL. Exactly the same configuration is used for PostgreSQL. We installed PostgreSQL 9.5.13 and PostGIS 2.2.1 in streaming replication mode. One node is the master while the others play the role of slaves. An extra instance for Pgpool-2 was also deployed.

5 EXPERIMENTS
This section contains a detailed experimental evaluation that examines the run-time performance of the spatio-temporal queries Q1, Q2 and Q3 in MongoDB and PostgreSQL through seven experiments. Five consecutive separate calls were conducted in order to gather the experimental results and collect the average response times of the above queries in the different scenarios. We compare the response time in a 5-node cluster versus a 1-node implementation. We also test the settings when indices are used against the case when they are not.
   For Q1 a regular BTree index is created in MongoDB and PostgreSQL on the attribute "ship_id". The size of the index varies between the two database systems: in MongoDB the index size is about 6 GB, while in PostgreSQL it is 3.1 GB. For Q2 we also implemented a BTree index, of size 6 GB in both DBMSs, on the attribute "timestamp". Finally, for Q3 we implemented an index on the "$geometry" field in MongoDB, with size 6 GB. As mentioned above, the data are stored in MongoDB as GeoJSON, and the "$geometry" field that is created contains the coordinate values, latitude and longitude. Because these data are geographical, we created an index of type 2dsphere, which supports geospatial queries. In PostgreSQL we created an index of type GiST, with size 8.2 GB, on the field the_geom, which contains the POINT geometry created from the latitude and longitude of each record. For high-speed spatial querying, PostgreSQL uses GiST indexes, which can index geometric data types. The index size varies between the two database systems even for the same indexed attribute: the two systems store data differently, and their notions of an "index" differ too.
   Figure 2 illustrates the average response time for the set of queries Q1, Q2 and Q3 in the 5-node cluster for MongoDB and PostgreSQL. It is quite clear that PostgreSQL outperforms MongoDB by a large extent in all queries. The response time is almost 4 times lower in some cases (Q2, Q3) compared to MongoDB. Only in Q1 does the response time present smaller fluctuations between the DBMSs. Exactly the same behavior is observed in the 1-node implementation, as shown in Figure 3. The response time is significantly lower in the case of PostgreSQL.
   Subsequently, we compared the average response time for the set of queries in the 5-node cluster versus the 1-node implementation for the two database systems, as shown in Figure 4 and Figure 5 respectively. The results show that the average response
Figure 2. Average response time of Q1 (a), Q2 (b) and Q3 (c) in the 5-node cluster, MongoDB versus PostgreSQL. (Y-axis: response time in seconds; x-axes: number of vessels for Q1, time window for Q2, polygon for Q3.)
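The CSV-to-GeoJSON conversion step described in Section 4.4 can be sketched in a few lines of Python; the column and field names (ship_id, timestamp, lon, lat, geometry) are assumptions, since the exact AIS schema is not listed here, and GeoJSON stores positions in [longitude, latitude] order:

```python
import csv, io, json

def csv_to_geojson_docs(csv_text):
    """Turn AIS CSV rows into MongoDB-ready documents whose 'geometry'
    field is a GeoJSON Point, so a 2dsphere index can be built on it."""
    docs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        docs.append({
            "ship_id": int(row["ship_id"]),
            "timestamp": row["timestamp"],
            "geometry": {            # GeoJSON mandates [lon, lat] order
                "type": "Point",
                "coordinates": [float(row["lon"]), float(row["lat"])],
            },
        })
    return docs

sample = "ship_id,timestamp,lon,lat\n4721,2016-01-01T00:00:04,23.63,37.94\n"
print(json.dumps(csv_to_geojson_docs(sample)))
```

Documents of this shape, written out as JSON, are what mongoimport would then load into the collection.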

Figure 3. Average response time of Q1 (a), Q2 (b) and Q3 (c) in the 1-node implementation, MongoDB versus PostgreSQL.
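The PostGIS side of the ingestion described in Section 4.4 (adding the_geom with SRID 4326 and populating it from the coordinate columns) can be sketched as two SQL statements; the table name positions and the lon/lat column names are assumptions:

```python
# SQL sketch of the the_geom conversion described in Section 4.4. The table
# and lon/lat column names are assumptions; SRID 4326 (WGS84) is from the text.
conversion_sql = [
    # AddGeometryColumn(table, column, srid, geometry type, dimension)
    "SELECT AddGeometryColumn('positions', 'the_geom', 4326, 'POINT', 2);",
    # Build each POINT from the lon/lat pair (PostGIS expects x = lon, y = lat)
    "UPDATE positions SET the_geom = ST_SetSRID(ST_MakePoint(lon, lat), 4326);",
]
print("\n".join(conversion_sql))
```

After this step every row carries a geometry value that spatial predicates (and a GiST index) can operate on directly.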

Figure 4. Average response time of Q1 (a), Q2 (b) and Q3 (c), 5-node cluster versus 1-node implementation, in MongoDB.
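For reference, Q1 and Q2 can be written as simple MongoDB filter documents over the indexed attributes; the ship ids and the time window below are illustrative values, not those of Tables 2 and 3:

```python
# Q1: positions of a chosen set of vessels; Q2: positions in a time window.
# The ids and dates are illustrative stand-ins.
ship_ids = [4721, 4899, 5120]
q1_filter = {"ship_id": {"$in": ship_ids}}            # served by the BTree index on ship_id
q2_filter = {"timestamp": {"$gte": "2016-01-01T00:00:00",   # 10-day window
                           "$lt": "2016-01-11T00:00:00"}}   # served by the index on timestamp
print(q1_filter)
print(q2_filter)
```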

Figure 5. Average response time of Q1 (a), Q2 (b) and Q3 (c), 5-node cluster versus 1-node implementation, in PostgreSQL.
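Q3, the polygon query, can be expressed in MongoDB as a $geoWithin filter over the GeoJSON field; the field name geometry and the ring coordinates are hypothetical stand-ins for the Mediterranean bounding boxes of Table 4:

```python
# A $geoWithin filter over a GeoJSON Polygon. The ring must be closed
# (first point repeated last) and uses [lon, lat] order; the values are
# illustrative, not the actual Table 4 bounding boxes.
ring = [[10.0, 35.0], [20.0, 35.0], [20.0, 40.0], [10.0, 40.0], [10.0, 35.0]]
q3_filter = {
    "geometry": {                                  # assumed GeoJSON field name
        "$geoWithin": {
            "$geometry": {"type": "Polygon", "coordinates": [ring]}
        }
    }
}
print(q3_filter)
```

With pymongo this document would be passed to collection.find(); a 2dsphere index on the field lets the server answer the containment test without a full collection scan.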
Figure 6. Average response time of Q1 (a), Q2 (b) and Q3 (c) in the 5-node PostgreSQL cluster with indexing versus without.
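The three PostgreSQL indexes used in these experiments can be sketched as plain DDL; the table name positions and the index names are assumptions:

```python
# Index DDL corresponding to the experiments: BTree on ship_id (Q1),
# BTree on timestamp (Q2), GiST on the PostGIS geometry column (Q3).
index_ddl = {
    "Q1": "CREATE INDEX idx_ship_id ON positions (ship_id);",
    "Q2": 'CREATE INDEX idx_ts ON positions ("timestamp");',
    "Q3": "CREATE INDEX idx_geom ON positions USING GIST (the_geom);",
}
for q, stmt in index_ddl.items():
    print(q, stmt)
```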

Figure 7. Average response time of Q2 (a) and Q3 (b) in the 5-node MongoDB cluster with indexing versus without.
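The corresponding MongoDB index definitions can be sketched as index-specification documents (passed to createIndex in the mongo shell, or create_index in pymongo); the field names repeat the assumptions used earlier:

```python
# Index specifications mirroring the experiments: ascending indexes on
# ship_id (Q1) and timestamp (Q2), and a 2dsphere index for Q3.
index_specs = {
    "Q1": {"ship_id": 1},
    "Q2": {"timestamp": 1},
    "Q3": {"geometry": "2dsphere"},   # assumed name of the GeoJSON field
}
print(index_specs)
```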

Figure 8. Average speedup of PostgreSQL over MongoDB in Q1 (a), Q2 (b) and Q3 (c) in the 5-node cluster with the use of indexes. (Y-axis: speedup.)
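The speedups plotted in Figure 8 are plain ratios of the measured average response times; a minimal sketch with hypothetical timings:

```python
def speedup(t_mongo, t_postgres):
    """Speedup of PostgreSQL over MongoDB: ratio of average response times."""
    return t_mongo / t_postgres

# Hypothetical averages in seconds, not the measured values from the figures.
print(round(speedup(2000.0, 500.0), 1))
```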


time presents a small reduction when the experiments are performed on the 5-node MongoDB cluster versus the 1-node implementation. The reduction is much more noticeable in the case of PostgreSQL. The reason is that in the MongoDB replica set, the client requests are distributed using the Nearest read preference option, which can achieve maximum throughput and an even distribution of load across the members of the set. In the case of PostgreSQL, with the use of Pgpool-2 the read-request load is distributed between the nodes of the cluster in a much more effective way, thus improving the system's overall throughput.
   In the next set of experiments we examined how indexes affect the response time of query execution. As shown in Figure 6, the response time is significantly lower, almost halved, with the use of indexes in PostgreSQL. An index allows the database server to find and retrieve specific rows much faster than without an index. In Figure 7, concerning MongoDB, we excluded Q1 because the response time without an index was extremely high (> 4 hours). In the remaining queries the response time is also significantly reduced. Efficient query execution in MongoDB is supported by indexing: MongoDB can use indices to limit the number of documents it must inspect; otherwise a scan operation is performed on every document in a collection to select those documents that match the query statement (collection scan). Figure 8 presents the average speedup of PostgreSQL compared to MongoDB in the 5-node cluster with the use of indexes. As shown, the average speedup is almost 2 for Q1, 4 in the case of Q2 and 4.2 in Q3.

6 CONCLUSIONS
In this paper, we analyzed and compared the performance, in terms of response time, of two different database systems: a document-based NoSQL datastore, MongoDB, and an open-source object-relational database system, PostgreSQL with the PostGIS extension. Each database system was deployed on Amazon Web Services (AWS) EC2 cluster instances. We employed a replica set and a streaming replication cluster setup for the MongoDB and PostgreSQL systems respectively. For the evaluation of the two systems, we employed a set of spatio-temporal queries that mimic real-world scenarios and feature spatial and temporal predicates, using a dataset which was provided to us by the community-based AIS vessel tracking system (VTS) of MarineTraffic.
   The performance is measured in terms of response time in                               [11] Simin You, Jianting Zhang, and Le Gruenwald. Large-scale spatial join query
different case scenarios; in a 5 node cluster versus 1 node imple-                             processing in cloud. In 2015 31st IEEE International Conference on Data Engi-
mentation and with the use of indexes versus without. The results
show that PostgreSQL outperforms MongoDB in all cases and
queries, with an average speedup of around 2 in the first query,
4 in the second query and 4.2 in the third query on a five-node
cluster. Also, the five-node replica set deployment in MongoDB,
as well as the five-node streaming replication deployment in
PostgreSQL, outperforms the single-node deployment of the
respective system. Moreover, as demonstrated by the experimental
results, the use of indexes roughly halves the average response
time in almost all cases, although the reduction is noticeably
smaller in PostgreSQL. Finally, the dataset size occupied in the
system database is 4x smaller in the case of PostgreSQL, since it
stores data in a more space-efficient way.
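The reported speedups are simply the ratio of MongoDB's average response time to PostgreSQL's for each query. A minimal sketch of that computation, using illustrative timings (hypothetical values, not the measured results from the experiments) chosen to reproduce the reported ratios:

```python
# Illustrative per-query average response times in seconds
# (hypothetical values, not the measured results from the paper).
mongodb_times = {"q1": 2.0, "q2": 8.0, "q3": 8.4}
postgres_times = {"q1": 1.0, "q2": 2.0, "q3": 2.0}

def speedups(slower, faster):
    """Per-query speedup: ratio of the slower system's time to the faster's."""
    return {q: slower[q] / faster[q] for q in slower}

print(speedups(mongodb_times, postgres_times))
# → {'q1': 2.0, 'q2': 4.0, 'q3': 4.2}
```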
   Our future plans include extending our system architecture to
what is called a Sharded Cluster. A Sharded Cluster consists
of shards, each of which contains a subset of the sharded data.
Sharding is a method for distributing data across multiple ma-
chines: every machine holds only a portion of the sharded data,
and each shard is replicated to secondary nodes for data redun-
dancy and fault tolerance. We plan to evaluate and compare the
two types of cluster and draw conclusions about which system
is best suited for each case.
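The routing idea behind sharding can be sketched with a hashed shard key, one common partitioning strategy: each record's key is hashed, and the hash determines the owning shard, so the same key always routes to the same machine. A minimal illustration, assuming a hypothetical five-shard cluster and the helper name `shard_for` (a bare modulo over a stable hash; this is not how MongoDB actually places chunks, which uses range- or hash-based chunk splitting with a balancer):

```python
import hashlib

NUM_SHARDS = 5  # hypothetical cluster size, matching the five-node setup

def shard_for(shard_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Route a record to a shard by hashing its shard key.

    Uses a stable hash (md5) so routing is deterministic across
    processes; purely illustrative, not MongoDB's chunk placement.
    """
    digest = hashlib.md5(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# The same key always routes to the same shard, while different
# keys spread across the cluster.
assert shard_for("vessel-247") == shard_for("vessel-247")
assert 0 <= shard_for("vessel-247") < NUM_SHARDS
```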
ACKNOWLEDGMENTS
The research work was supported by the Hellenic Foundation
for Research and Innovation (HFRI) and the General Secretariat
for Research and Technology (GSRT), under the HFRI PhD Fel-
lowship grant (GA. no. 2158).
   This work has been developed within the framework of the
MASTER project, which has received funding from the European Union’s
Horizon 2020 research and innovation programme under the
Marie Skłodowska-Curie grant agreement No 777695.
   This work was supported in part by MarineTraffic which pro-
vided data access for research purposes.
