Vecstra: An Efficient and Scalable Geo-spatial In-Memory Cache

Yiwen Wang∗
supervised by Marcos Antonio Vaz Salles
Database Management System Group
University of Copenhagen, Denmark
y.wang@di.ku.dk

∗Work performed in the context of the Future Cropping partnership, supported by Innovation Fund Denmark.

Proceedings of the VLDB 2018 Ph.D. Workshop, August 27, 2018. Rio de Janeiro, Brazil. Copyright (C) 2018 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT
Nowadays, huge amounts of geospatial data are generated and widely used in a variety of areas. Accordingly, an extensive range of database systems support geospatial services, but they fall behind in coping with increasingly demanding requirements for high throughput, low latency, and efficient resource usage. Meanwhile, in-memory databases have become more prevalent due to the increasing capacity and decreasing price of memory. To take advantage of the fast I/O speed and overhead reduction of memory-based storage, in-memory geospatial data management systems have been created to overcome the drawbacks of disk-based geospatial databases. However, to the best of our knowledge, current in-memory caches are not fully featured and do not completely utilize the benefits of widely used geospatial standards. To fill this gap, we specialize to the API of standard geospatial services and seek to achieve by this method a much higher-performing in-memory geospatial cache implementation than is possible by combining generic components such as a geospatial application server and an SQL database. To that end, we build Vecstra, an efficient and scalable OGC standard-compliant in-memory geospatial cache. Furthermore, we conduct experiments on Vecstra and analyze the preliminary results to formulate research opportunities.

1. INTRODUCTION
Geospatial data usage has increased considerably in today's society in many areas, such as precision agriculture [22] and social media stream querying [38]. Meanwhile, the explosion of the Internet of Things (IoT) has recently helped increase the generation of geospatial data [20]. In agriculture, huge amounts of data from satellites and sensors can provide massive geospatial information about terrain and crop status. Integrating a range of such data and providing the resulting information enables farmers to tailor their management to the local and actual conditions in the field. Given the importance and wide societal relevance of spatial data, many platforms provide geospatial services, necessitating specialized data management solutions. GeoServer [12], for example, is the backend of a number of such services and builds its geospatial database on PostGIS [17].

On the other hand, due to the increasing capacity and decreasing price of memory, it has become feasible to cache substantial amounts of geospatial data in memory. Storing data in memory is especially valuable for running concurrent workloads, as it eliminates the inevitable bottlenecks caused by disk-centric architectures [21]. To take advantage of the benefits of object caching, web services widely use caches such as Memcached [15] and Redis [9]. Moreover, more and more classic database providers have developed in-memory features and systems; for instance, the in-memory database TimesTen [27] owned by Oracle, and the memory-optimized OLTP engine Hekaton [13] in SQL Server. Multi-core servers also help to achieve performance benefits and can yield significant gains: multi-core processors can handle multiple requests simultaneously, which helps reduce latency when large numbers of concurrent requests occur, a very likely situation with skew in geospatial services [1].

Despite the popularity of geospatial databases such as PostGIS, however, their design has benefited little from recent developments in in-memory databases. PostGIS still uses disk as its primary storage, which makes it hard to reduce the expensive cost of geospatial queries caused by disk I/O and system overheads. Since geospatial services are experiencing growing popularity, this problem is made worse by large numbers of concurrent requests.
In-memory database techniques and multi-core hardware hold promise to offer a solution to this problem, while at the same time increasing throughput and resource utilization.

Existing Systems. Currently, many geospatial databases provide services to clients, as summarized in Figure 1. There are disk-based databases for batch processing and multiple types of spatial queries, such as SpatialHadoop [14] and Hadoop-GIS [2]. With the availability of larger and cheaper main memory, data management systems such as MemSQL [35] have been built on main memory to improve response times; MemSQL is an in-memory database that stores data and allows for quick analysis and processing. To support small interactive transactional requests, a number of specialized systems have been developed.

Figure 1: Related Work Summary.
Examples include GeoServer, which, as discussed above, is based on PostGIS, and GeoMesa [19], a spatio-temporal distributed database; the same holds for GeoWave [3] and H2GIS [6]. There are also disk-based caches, such as GeoWebCache [28], which integrates with GeoServer and can store different types of maps, and MapCache [7], a disk cache similar to GeoWebCache. Even though many such geospatial systems exist, only a few are based on main memory. To the best of our knowledge, two systems stand out in this category. One is HyperSpace [32], a geospatial main-memory database system that processes geospatial queries very quickly. The other is SpaceBase [40], a system that handles concurrent geospatial queries with low latency based on R-Tree indexing. These two in-memory caches, however, fail to support widespread kinds of geospatial data such as raster data. They also have limited support for spatial data structures, which can substantially affect processing efficiency in main memory. Moreover, latency and efficient resource usage on modern multi-core machines are not discussed for these caches.
For temporally skewed geospatial queries, keeping the system scalable when large numbers of users and high request volumes occur, a highly likely scenario, should be taken into consideration as well.

Contribution. To fill the gap described above, and inspired by the criticism of "one size fits all" in [39], we specialize the design of our cache to particular geospatial standards. We build Vecstra, an in-memory cache compliant with OGC standards [8] that supports spatial data structures to achieve high performance in geospatial services. Among the many standards developed by the OGC, the Web Feature Service (WFS) [42] and the Web Coverage Service (WCS) [41] are two of the most widely used. WFS serves geographic feature information and is mostly used in web-based client applications for GIS data editing; WCS provides information on coverages. Known WFS/WCS implementations include open-source systems such as rasdaman [4], MapServer [26], and GeoServer. Commercial implementations include Envitia¹, Pyxis WorldView Studio², and ESRI ArcGIS, among others. Most current disk-based geospatial systems implement these two standards; however, in-memory storage systems such as HyperSpace and SpaceBase fail to implement complete services. Given the popularity of these OGC standards, it would be ideal if a cache provided support for them, as this would allow the cache to be directly integrated into existing disk-based geospatial systems. One approach could be to build on or extend an existing SQL-based system, such as HyPer or SpaceBase, that provides the SQL features required for the data. However, since we provide a specialized interface, we have the opportunity to deliver a much better implementation. The targeted advantages of Vecstra are:

1. Use of tuned in-memory data structures for geospatial data, leading to lower latency;

2. Thread management that is optimized for the interface of a geospatial cache supporting operations over geospatial layers, leading to lower overhead and better scalability;

3. An open-source system that is standards-compliant, offering integrated WFS and WCS interfaces. In the future, it is possible to extend Vecstra with the standard WPS as well.

¹https://www.envitia.com/
²https://www.pyxisglobe.com/

The remainder of this paper is organized as follows. In Section 2, we illustrate our solution to build Vecstra and present preliminary experimental results. In Section 3, based on these results, we discuss research challenges, and we conclude in Section 4.

2. SOLUTION
To explore a solution to the problems in Section 1, we build a scalable, fully functional in-memory geospatial cache called Vecstra that provides highly efficient services and low-latency responses to geospatial data requests, operating on a multi-core computer in the interest of high performance. This section focuses on how we build Vecstra, covering API design, architecture, and experiments.

2.1 API design
Vecstra is an in-memory storage cache that provides services through which clients obtain geospatial data for calculation and analysis purposes. Vecstra implements the standard WFS and WCS interfaces, which we analyzed to collect the functions that Vecstra should support. Abiding by these APIs is advantageous, as it allows us to specialize the whole system to particular operations as opposed to providing a generic implementation.

Every API method in Vecstra is invoked by user requests to an instance v of Vecstra and is designed as follows. The methods v.wfs/wcs::Add(l) and v.wfs/wcs::Remove(l) add and remove vector or raster layers in v, respectively. The method v.wfs/wcs::GetCapabilities provides basic information on all vector or raster layers and services in v, while v.wfs::GetFeature(l, bbox) and v.wcs::GetCoverage(l, bbox), respectively, provide the full information in the intersection area of a vector or raster layer l and a bounding box bbox in v.
Finally, v.wfs::DescribeFeatureType(l) and v.wcs::DescribeCoverage(l) obtain the description and geo-information of vector or raster layers in v.

The API opens up opportunities for specialization of the implementation. For example, updates are always at layer granularity, but queries typically read small parts of a few layers that intersect a window in space. We observe that services often rely on PostGIS, which allows them to satisfy linearizability, so to keep consistent semantics we end up with the requirement to maintain linearizability for concurrent operations in Vecstra. As we will see in the next section, we can exploit these characteristics in the design of Vecstra to support the API above more efficiently than alternative generic designs based on an SQL geospatial database.
2.2 Architecture

Figure 2: Architecture of Vecstra.

Figure 2 illustrates the architecture of Vecstra. Vecstra contains both vector and raster data, the two primary types of spatial data in GIS and the ones most commonly used in geospatial services [10]. Based on the results in [37], we choose R-Tree indices for the spatial intersection of vector data, and we currently use multi-dimensional array indices for raster data. Essentially, query windows target a number of layers, and update operations are initially coarse-grained at layer granularity. Because query operations usually do not read all layers, and in order to support layer-level update operations, the indices are partitioned by layers, i.e., every index corresponds to one layer.

To exploit multi-core benefits, multi-threading is used to gain performance. Concurrent request threads are assigned to different cores, and the multiple threads running in Vecstra execute both queries and updates. Index structures such as the high-concurrency locking R-Tree proposed in [25] can be exploited to utilize concurrent threads. However, as shown in Section 2.1, modification requests in Vecstra are coarse-grained at layer granularity, so fine-grained concurrency in indices has been unnecessary so far. As future work, we can explore other possibilities in which several particular layers are combined into a single index. Currently, we utilize a concurrent hash map [33] to store the indices of layers, which supports full concurrency of query operations and high expected concurrency for update operations. In addition, to minimize synchronization costs, a thread performs indexing on a given layer without synchronization and then uses the concurrent hash map to synchronously store the built index in a single operation. Based on the OGC standards, and to integrate with current systems such as GeoServer, an HTTP communication interface is provided to users. Users issue GET requests to obtain WCS and WFS services from Vecstra.
2.3 Preliminary Results
Prototype and Workload. The Vecstra prototype implements the architecture of Figure 2, is based on C++11, and is compiled with gcc 5.4.0. Realistic data are used for the experiments: field polygon, soil, and rain distribution layers serve as vector data, and a topographic layer of Denmark, DTK/Kort25, serves as raster data. The workload is designed based on a precision farming scenario [21]. Queries are generated for fields, and update requests are for layers. Currently, all queries and requests are generated uniformly at random, but in the future we can design a skewed request mechanism based on a real request log. We conduct our experiments on an AWS EC2 r4.4xlarge instance running Ubuntu Server 16.04 LTS.

Results. At present, we use a single read-write lock to achieve atomicity in Vecstra, which is not expected to be scalable. Therefore, our experiments focus only on queries. We analyze the server-side time to show in which parts the latency is located and what can be improved in future work. For vector data, most of the latency is located in geometry intersection (around 50%-64%), and for raster data, the highest portion of latency is located in data extraction (around 89%-95%).
3. RESEARCH CHALLENGES
The work presented so far was done in the first nine months as a Ph.D. candidate. Based on the preliminary results from the evaluation of Vecstra, we describe future research directions to overcome the shortcomings described in Section 2.

Optimization for Geospatial Window Queries. A geospatial range query is a very common database operation that retrieves geospatial information located between an upper and a lower boundary [30]. As the server-side evaluation results show, the refine step of geometric intersection accounts for the largest part of WFS range query latency, and data extraction takes the majority of WCS range query processing time. Spatial indexing is used to support spatial selection [16]. Currently, the minimum bounding rectangle (MBR) is used as the geometric key of our vector data, and most of the vector range query time is spent in intersection calculation. [18] implemented non-leaf and leaf MBR optimization techniques in Oracle Spatial; a new optimization approach on top of this technique to reduce intersection time is one of our goals.

Our raster data is currently stored contiguously and linearly in an array, from left to right and from top to bottom. We calculate the intersected cells by decomposing the range query into a set of range sections and then extracting the data block-by-block. As observed, the latency of raster data extraction is relatively high; to make this process faster, a better data partitioning method should be used [31]. Q-Trees and R-Trees are the two most common indexes in commercial database systems [5]. Because most geospatial range queries are rectangular, we consider exploring better range indexing for raster data. [29] uses space-filling curves to index the data, but the proposed curve-design method is heuristic. Therefore, an efficient curve-exploration approach and evaluating the best combination of curves in an in-memory cache are challenges that we will pursue as an avenue for future work.

Efficient Updates in a Multi-Core Cache. Concurrency control should be used to manage requests so as to ensure the semantics of Vecstra. However, the scalability of our current shared-everything implementation is still limited by potential synchronization bottlenecks in the presence of update skew towards certain layers of geospatial data [23]. A composable thread management approach [24] will be explored in future work. We can also utilize lock-free mechanisms to tackle this problem. Unlike work on conventional updates to spatial data structures [36], such a mechanism would exploit the concurrent updates of Vecstra and recent work on scalable commutativity [11] for maximal performance.

Design of an Integration Language for Geospatial Web Services. Vecstra currently provides WCS and WFS to users; however, queries target only raster or only vector data, the same as in current geospatial service databases. We want to go further in this respect, that is, to design an integration language to combine geospatial data. For example, to stay standards-compliant, we can implement the Web Processing Service (WPS) [34], which helps users define how a process executes and handle its output. The widely used Web Map Service (WMS) standard [43] is also an option. With such support, a user could obtain geospatial information combining both raster and vector data from one single query, which provides a better view in a convenient way. Further research can build on this to achieve a better balance between expressivity and performance; for instance, we can provide specialized support for a subset of scripts, such as support for adapting data to scale in zoomable maps.
4. CONCLUSION
In this paper, we investigated current geospatial data management systems and discussed their challenges. To address these challenges, our goal is to build an efficient and scalable OGC standard-compliant in-memory geospatial cache. We specialized to the API of standard geospatial services and sought to achieve by this method a much higher-performing in-memory geospatial cache implementation than is possible by combining generic components such as a geospatial application server and an SQL database. To that end, we built an initial version of Vecstra and obtained preliminary results with it. Finally, we described the problems that need to be solved in our future research and proposed possible methods: first, finding an optimized window query approach; second, supporting efficient updates in the multi-core cache; and third, designing an integration language for geospatial services.
5. REFERENCES
[1] J. Ahmed, M. Y. Siyal, S. Najam, and Z. Najam. Challenges and Issues in Modern Computer Architectures. In Fuzzy Logic Based Power-Efficient Real-Time Multi-Core System. Springer, 2017.
[2] A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang, and J. H. Saltz. Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce. Proc. VLDB, 6(11), 2013.
[3] A. Annex. GeoWave provides geospatial and temporal indexing on top of Accumulo and HBase, 2018.
[4] P. Baumann, A. Dehmel, P. Furtado, R. Ritsch, and N. Widmann. The multidimensional database system RasDaMan. In ACM SIGMOD Record, volume 27, 1998.
[5] N. Black. Geo-location custom indexes, 2018. US Patent 9,875,321.
[6] E. Bocher, G. Petit, N. Fortin, and S. Palominos. H2GIS: a spatial database to feed urban climate issues. In ICUC9, 2015.
[7] T. Bonfort. MapCache: The Fast Tiling Server From The MapServer Project. 2013.
[8] M. Botts, G. Percivall, C. Reed, and J. Davidson. OGC sensor web enablement: Overview and high level architecture. In GeoSensor Networks. 2008.
[9] J. L. Carlson. Redis in Action. Manning Publications Co., 2013.
[10] K.-T. Chang. Introduction to Geographic Information Systems. McGraw-Hill Science/Engineering/Math, 2015.
[11] A. T. Clements, M. F. Kaashoek, N. Zeldovich, R. T. Morris, and E. Kohler. The scalable commutativity rule: Designing scalable software for multicore processors. ACM TOCS, 32, 2015.
[12] J. Deoliveira. GeoServer: uniting the GeoWeb and spatial data infrastructures. In Proc. GSDI-10, 2008.
[13] C. Diaconu, C. Freedman, E. Ismert, P.-A. Larson, P. Mittal, R. Stonecipher, N. Verma, and M. Zwilling. Hekaton: SQL Server's memory-optimized OLTP engine. In Proc. ACM SIGMOD, 2013.
[14] A. Eldawy and M. F. Mokbel. SpatialHadoop: A MapReduce framework for spatial data. In ICDE. IEEE, 2015.
[15] B. Fitzpatrick. Distributed caching with memcached. Linux Journal, 2004.
[16] R. H. Güting. An introduction to spatial database systems. The VLDB Journal, 3, 1994.
[17] S. Holl and H. Plum. PostGIS. GeoInformatics, 3, 2009.
[18] Y. Hu, S. Ravada, R. Anderson, and B. Bamba. Topological relationship query processing for complex regions in Oracle Spatial. In Proc. 20th International Conference on Advances in Geographic Information Systems. ACM, 2012.
[19] J. N. Hughes, A. Annex, C. N. Eichelberger, A. Fox, A. Hulbert, and M. Ronquest. GeoMesa: a distributed architecture for spatio-temporal fusion. In SPIE: Geospatial Informatics, Fusion, and Motion Video Analytics V, volume 9473, 2015.
[20] A. P. Iyer and I. Stoica. A scalable distributed spatial index for the internet-of-things. In Proc. ACM SoCC, 2017.
[21] M. E. Jacobsen. A Scalable Geospatial System for Range Queries in Data-Driven Precision Agriculture. Master's thesis, University of Copenhagen, Denmark, 2017.
[22] J. H. Jeppesen, E. Ebeid, R. H. Jacobsen, and T. S. Toftegaard. Open geospatial infrastructure for data management and analytics in interdisciplinary research. Computers and Electronics in Agriculture, 145:130–141, 2018.
[23] J. H. Jeppesen, R. H. Jacobsen, R. N. Jørgensen, and T. S. Toftegaard. Towards data-driven precision agriculture using open data and open source software. In International Conference on Agricultural Engineering, 2016.
[24] R. Johnson, I. Pandis, and A. Ailamaki. Eliminating unscalable communication in transaction processing. The VLDB Journal, 23, 2014.
[25] M. Kornacker and D. Banks. High-concurrency locking in R-trees. In VLDB, volume 95, 1995.
[26] B. Kropla. Beginning MapServer: Open Source GIS Development. Apress, 2006.
[27] T. Lahiri, M.-A. Neimat, and S. Folkman. Oracle TimesTen: An In-Memory Database for Enterprise Applications. IEEE Data Eng. Bull., 36, 2013.
[28] Y.-F. Nie, H.-L. Liu, and H. Xu. Research on GeoWebCache tile map service middleware. Science of Surveying and Mapping, 36, 2011.
[29] S. Nishimura and H. Yokota. Quilts: Multidimensional Data Partitioning Framework Based on Query-Aware and Skew-Tolerant Space-Filling Curves. In Proc. ACM SIGMOD, 2017.
[30] M. Orru, R. Paolillo, A. Detti, G. Rossi, and N. B. Melazzi. Demonstration of OpenGeoBase: The ICN NoSQL spatio-temporal database. In IEEE LANMAN, 2017.
[31] B.-U. Pagel, H.-W. Six, H. Toben, and P. Widmayer. Towards an analysis of range query performance in spatial data structures. In Proc. PODS, 1993.
[32] V. Pandey, A. Kipf, D. Vorona, T. Mühlbauer, T. Neumann, and A. Kemper. High-performance geospatial analytics in HyPerSpace. In Proc. ACM SIGMOD, 2016.
[33] J. Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O'Reilly Media, Inc., 2007.
[34] P. Schut and A. Whiteside. OpenGIS Web Processing Service. OGC project document, 2007.
[35] N. Shamgunov. The MemSQL In-Memory Database System. In IMDM@VLDB, 2014.
[36] D. Šidlauskas, S. Šaltenis, and C. S. Jensen. Processing of extreme moving-object update and query workloads in main memory. The VLDB Journal, 23, 2014.
[37] B. Sowell, M. V. Salles, T. Cao, A. Demers, and J. Gehrke. An experimental analysis of iterated spatial joins in main memory. Proc. VLDB, 6, 2013.
[38] A. Stefanidis, A. Crooks, and J. Radzikowski. Harvesting ambient geospatial information from social media feeds. GeoJournal, 78, 2013.
[39] M. Stonebraker and U. Cetintemel. "One size fits all": An idea whose time has come and gone. In Proc. IEEE ICDE, 2005.
[40] T. P. Universe. Introducing SpaceBase: a New Realtime Spatial Data Store, 21 March, 2012.
[41] OGC. Web Coverage Service, 2010.
[42] OGC. Web Feature Service. URL: http://www.opengeospatial.org/standards/wfs, 2005.
[43] OGC. Web Map Service, 2004.