Vecstra: An Efficient and Scalable Geo-spatial In-Memory Cache

Yiwen Wang∗
supervised by Marcos Antonio Vaz Salles
Database Management System Group
University of Copenhagen, Denmark
y.wang@di.ku.dk

∗Work performed in the context of the Future Cropping partnership, supported by Innovation Fund Denmark.

Proceedings of the VLDB 2018 Ph.D. Workshop, August 27, 2018. Rio de Janeiro, Brazil. Copyright (C) 2018 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT
Nowadays, huge amounts of geospatial data are generated and widely used in a variety of areas. Accordingly, an extensive range of database systems support geospatial services, but they fall behind in coping with increasingly demanding requirements for high throughput, low latency, and efficient resource usage. Meanwhile, in-memory databases have become more prevalent due to the increasing capacity and decreasing price of memory. To take advantage of the fast I/O speed and overhead reduction of memory-based storage, in-memory geospatial data management systems have been created to overcome the drawbacks of disk-based geospatial databases. However, to the best of our knowledge, current in-memory caches are not fully featured and do not completely utilize the benefits of widely used geospatial standards. To fill this gap, we specialize to the API of standard geospatial services and seek to achieve by this method a much higher-performing in-memory geospatial cache implementation than is possible by combining generic components such as a geospatial application server and an SQL database. To that end, we build Vecstra, an efficient and scalable OGC standard-compliant in-memory geospatial cache. Furthermore, we conduct experiments on Vecstra and analyze the preliminary results to formulate research opportunities.

1. INTRODUCTION
Geospatial data usage has increased considerably in today's society in many areas, such as precision agriculture [22] and social media stream querying [38]. Meanwhile, the explosion of the Internet of Things (IoT) has recently helped increase the generation of geospatial data [20]. In agriculture, huge amounts of data from satellites and sensors can provide massive geospatial information about terrain and crop status. Integrating a range of such data and providing the resulting information enables farmers to tailor their management to the local and actual conditions in the field. Given the importance and wide societal relevance of spatial data, many platforms provide geospatial services, necessitating specialized data management solutions. GeoServer [12], for example, is the backend of a number of such services and builds its geospatial database on PostGIS [17].

On the other hand, due to the increasing capacity and decreasing price of memory, it has become feasible to cache substantial amounts of geospatial data in memory. Storing data in memory is especially valuable for running concurrent workloads, as it eliminates the inevitable bottlenecks caused by disk-centric architectures [21]. To take advantage of the benefits of object caching, web services widely use caches such as Memcached [15] and Redis [9]. Moreover, more and more classic database providers have developed in-memory features and systems; for instance, the in-memory database TimesTen [27] owned by Oracle, and the memory-optimized OLTP engine Hekaton [13] in SQL Server. Multi-core servers also help to achieve performance benefits and can yield significant gains: multi-core processors can handle multiple requests simultaneously, which helps reduce latency when large numbers of concurrent requests occur, a very likely situation with skew in geospatial services [1].

Despite the popularity of geospatial databases such as PostGIS, however, their design has benefited little from recent developments in in-memory databases. PostGIS still uses disk as its primary storage, which makes it hard to reduce the expensive cost of geospatial queries caused by disk I/O and system overheads. Since geospatial services are experiencing growing popularity, this problem is made worse by large numbers of concurrent requests.
In-memory database techniques and multi-core hardware hold promise to offer a solution to this problem, while at the same time increasing throughput and resource utilization.

Existing Systems. Currently, many geospatial databases provide services to clients, as summarized in Figure 1. There are disk-based databases for batch processing and multiple types of spatial queries, such as SpatialHadoop [14] and Hadoop-GIS [2]. With the availability of larger and cheaper main memory, data management systems such as MemSQL [35] have been built on main memory to improve response times; MemSQL is an in-memory database that stores data and allows for quick analysis and processing. To support small interactive transactional requests, a number of specialized systems have been developed.

Figure 1: Related Work Summary.
Examples include GeoServer, which, as discussed above, is based on PostGIS, and GeoMesa [19], a spatio-temporal distributed database; the same holds for GeoWave [3] and H2GIS [6]. There are also disk-based caches, such as GeoWebCache [28], which integrates with GeoServer and can store different types of maps, and MapCache [7], a disk cache similar to GeoWebCache. Even though many such geospatial systems exist, only a few are based on main memory. To the best of our knowledge, two systems stand out in this category. One is HyperSpace [32], a geospatial main-memory database system that processes geospatial queries very quickly. The other is SpaceBase [40], a system that handles concurrent geospatial queries with low latency based on R-Tree indexing. These two in-memory caches, however, fail to support widespread kinds of geospatial data such as raster data. They also have limited support for spatial data structures, which can substantially affect processing efficiency in main memory. Moreover, latency and efficient resource usage on modern multi-core machines are not discussed for these caches.
For temporally skewed geospatial queries, keeping the system scalable when large numbers of users and high request volumes occur, a highly likely scenario, should be taken into consideration as well.

Contribution. To fill the gap described above, and inspired by the criticism of "one size fits all" in [39], we specialize the design of our cache to particular geospatial standards. We build Vecstra, an in-memory cache compliant with OGC standards [8] that supports spatial data structures to achieve high performance in geospatial services. Among the many standards developed by the OGC, the Web Feature Service (WFS) [42] and the Web Coverage Service (WCS) [41] are two of the most widely used. WFS serves geographic feature information and is mostly used in web-based client applications for GIS data editing; WCS provides information on coverages. Known WFS/WCS implementations include open-source systems such as rasdaman [4], MapServer [26], and GeoServer. Commercial implementations include Envitia¹, Pyxis WorldView Studio², and ESRI ArcGIS, among others. Most current disk-based geospatial systems implement these two standards; however, in-memory storage systems such as HyperSpace and SpaceBase fail to implement complete services. Given the popularity of these OGC standards, it would be ideal if a cache provided support for them, as this would allow the cache to be directly integrated into existing disk-based geospatial systems. One approach could be to build on or extend an existing SQL-based system, such as HyPer or SpaceBase, that provides the SQL features required for the data. However, since we provide a specialized interface, we have the opportunity to deliver a much better implementation. The targeted advantages of Vecstra are:

1. Use of tuned in-memory data structures for geospatial data, leading to lower latency;

2. Thread management that is optimized for the interface of a geospatial cache supporting operations over geospatial layers, leading to lower overhead and better scalability;

3. An open-source system that is standards-compliant, offering integrated WFS and WCS interfaces. In the future, it is possible to extend Vecstra with the standard WPS as well.

¹https://www.envitia.com/
²https://www.pyxisglobe.com/

The remainder of this paper is organized as follows. In Section 2, we illustrate our solution to build Vecstra and present preliminary experimental results. In Section 3, based on these results, we discuss research challenges, and we conclude in Section 4.

2. SOLUTION
To explore a solution to the problems in Section 1, we build a scalable, fully functional in-memory geospatial cache called Vecstra that provides highly efficient services and low-latency responses to geospatial data requests, operating on a multi-core computer in the interest of high performance. This section focuses on how we build Vecstra, covering API design, architecture, and experiments.

2.1 API design
Vecstra is an in-memory storage cache that provides services through which clients obtain geospatial data for calculation and analysis purposes. Vecstra implements the standard WFS and WCS interfaces, which we analyzed to collect the functions that Vecstra should support. Abiding by these APIs is advantageous, as it allows us to specialize the whole system to particular operations as opposed to providing a generic implementation.

Every API method in Vecstra is invoked by user requests to an instance v of Vecstra and is designed as follows. The methods v.wfs/wcs::Add(l) and v.wfs/wcs::Remove(l) add and remove vector or raster layers in v, respectively. The method v.wfs/wcs::GetCapabilities provides basic information on all vector or raster layers and services in v, while v.wfs::GetFeature(l, bbox) and v.wcs::GetCoverage(l, bbox), respectively, provide the full information in the intersection area of a vector or raster layer l and a bounding box bbox in v.
Finally, v.wfs::DescribeFeatureType(l) and v.wcs::DescribeCoverage(l) obtain the description and geo-information of vector or raster layers in v.

The API opens up opportunities for specialization of the implementation. For example, updates are always at layer granularity, but queries typically read small parts of a few layers that intersect a window in space. We observe that services often rely on PostGIS, which allows them to satisfy linearizability, so to keep consistent semantics we end up with the requirement to maintain linearizability for concurrent operations in Vecstra. As we will see in the next section, we can exploit these characteristics in the design of Vecstra to support the API above more efficiently than alternative generic designs based on an SQL geospatial database.
2.2 Architecture

Figure 2: Architecture of Vecstra.

Figure 2 illustrates the architecture of Vecstra. Vecstra contains both vector and raster data, the two primary types of spatial data in GIS and the ones most commonly used in geospatial services [10]. Based on the results in [37], we choose R-Tree indices for the spatial intersection of vector data, and we currently use multi-dimensional array indices for raster data. Essentially, query windows target a number of layers, and update operations are initially coarse-grained at layer granularity. Because query operations usually do not read all layers, and in order to support layer-level update operations, the indices are partitioned by layers, i.e., every index corresponds to one layer.

To exploit multi-core benefits, multi-threading is used to gain performance. Concurrent request threads are assigned to different cores, and the multiple threads running in Vecstra execute both queries and updates. Index structures such as the high-concurrency locking R-Tree proposed in [25] can be exploited to utilize concurrent threads. However, as shown in Section 2.1, modification requests in Vecstra are coarse-grained at layer granularity, so fine-grained concurrency in indices has been unnecessary so far. As future work, we can explore other possibilities in which several particular layers are combined into a single index. Currently, we utilize a concurrent hash map [33] to store the indices of layers, which supports full concurrency of query operations and high expected concurrency for update operations. In addition, to minimize synchronization costs, a thread performs indexing on a given layer without synchronization and then uses the concurrent hash map to synchronously store the built index in a single operation. Based on the OGC standards, and to integrate with current systems such as GeoServer, an HTTP communication interface is provided to users. Users issue GET requests to obtain WCS and WFS services from Vecstra.
2.3 Preliminary Results
Prototype and Workload. The Vecstra prototype implements the architecture of Figure 2, is based on C++11, and is compiled with gcc 5.4.0. Realistic data are used for the experiments: field polygon, soil, and rain distribution layers serve as vector data, and a topographic layer of Denmark, DTK/Kort25, serves as raster data. The workload is designed based on a precision farming scenario [21]. Queries are generated for fields, and update requests are for layers. Currently, all queries and requests are generated uniformly at random, but in the future we can design a skewed request mechanism based on a real request log. We conduct our experiments on an AWS EC2 r4.4xlarge instance running Ubuntu Server 16.04 LTS.

Results. At present, we use a single read-write lock to achieve atomicity in Vecstra, which is not expected to be scalable. Therefore, our experiments focus only on queries. We analyze the server-side time to show in which parts the latency is located and what can be improved in future work. For vector data, most of the latency is located in geometry intersection (around 50%-64%), and for raster data, the highest portion of latency is located in data extraction (around 89%-95%).
3. RESEARCH CHALLENGES
The work presented so far was done in the first nine months as a Ph.D. candidate. Based on the preliminary results from the evaluation of Vecstra, we describe future research directions to overcome the shortcomings described in Section 2.

Optimization for Geospatial Window Queries. A geospatial range query is a very common database operation that retrieves geospatial information located between an upper and a lower boundary [30]. As the server-side evaluation results show, the refine step of geometric intersection accounts for the largest part of WFS range query latency, and data extraction takes the majority of WCS range query processing time. Spatial indexing is used to support spatial selection [16]. Currently, the minimum bounding rectangle (MBR) is used as the geometric key of our vector data, and most of the vector range query time is spent in intersection calculation. [18] implemented non-leaf and leaf MBR optimization techniques in Oracle Spatial; a new optimization approach on top of this technique to reduce intersection time is one of our goals.

Our raster data is currently stored contiguously and linearly in an array, from left to right and from top to bottom. We calculate the intersected cells by decomposing the range query into a set of range sections and then extracting the data block-by-block. As observed, the latency of raster data extraction is relatively high; to make this process faster, a better data partitioning method should be used [31]. Q-Trees and R-Trees are the two most common indexes in commercial database systems [5]. Because most geospatial range queries are rectangular, we consider exploring better range indexing for raster data. [29] uses space-filling curves to index the data, but the proposed curve-design method is heuristic. Therefore, an efficient curve-exploration approach and evaluating the best combination of curves in an in-memory cache are challenges that we will pursue as an avenue for future work.

Efficient Updates in a Multi-Core Cache. Concurrency control should be used to manage requests so as to ensure the semantics of Vecstra. However, the scalability of our current shared-everything implementation is still limited by potential synchronization bottlenecks in the presence of update skew towards certain layers of geospatial data [23]. A composable thread management approach [24] will be explored in future work. We can also utilize lock-free mechanisms to tackle this problem. Unlike work on conventional updates to spatial data structures [36], such a mechanism would exploit the concurrent updates of Vecstra and recent work on scalable commutativity [11] for maximal performance.

Design of an Integration Language for Geospatial Web Services. Vecstra currently provides WCS and WFS to users; however, queries target only raster or only vector data, the same as in current geospatial service databases. We want to go further in this respect, that is, to design an integration language to combine geospatial data. For example, to stay standards-compliant, we can implement the Web Processing Service (WPS) [34], which helps users define how a process executes and handle its output. The widely used Web Map Service (WMS) standard [43] is also an option. With such support, a user could obtain geospatial information combining both raster and vector data from one single query, which provides a better view in a convenient way. Further research can build on this to achieve a better balance between expressivity and performance; for instance, we can provide specialized support for a subset of scripts, such as support for adapting data to scale in zoomable maps.
4. CONCLUSION
In this paper, we investigated current geospatial data management systems and discussed their challenges. To address these challenges, our goal is to build an efficient and scalable OGC standard-compliant in-memory geospatial cache. We specialized to the API of standard geospatial services and sought to achieve by this method a much higher-performing in-memory geospatial cache implementation than is possible by combining generic components such as a geospatial application server and an SQL database. To that end, we built an initial version of Vecstra and obtained preliminary results with it. Finally, we described the problems that need to be solved in our future research and proposed possible methods: first, finding an optimized window query approach; second, supporting efficient updates in the multi-core cache; and third, designing an integration language for geospatial services.
5. REFERENCES
[1] J. Ahmed, M. Y. Siyal, S. Najam, and Z. Najam. Challenges and Issues in Modern Computer Architectures. In Fuzzy Logic Based Power-Efficient Real-Time Multi-Core System. Springer, 2017.
[2] A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang, and J. H. Saltz. Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce. Proc. VLDB, 6(11), 2013.
[3] A. Annex. GeoWave provides geospatial and temporal indexing on top of Accumulo and HBase, 2018.
[4] P. Baumann, A. Dehmel, P. Furtado, R. Ritsch, and N. Widmann. The multidimensional database system RasDaMan. In ACM SIGMOD Record, volume 27, 1998.
[5] N. Black. Geo-location custom indexes, 2018. US Patent 9,875,321.
[6] E. Bocher, G. Petit, N. Fortin, and S. Palominos. H2GIS: a spatial database to feed urban climate issues. In ICUC9, 2015.
[7] T. Bonfort. MapCache: The Fast Tiling Server From The MapServer Project. 2013.
[8] M. Botts, G. Percivall, C. Reed, and J. Davidson. OGC sensor web enablement: Overview and high level architecture. In GeoSensor Networks. 2008.
[9] J. L. Carlson. Redis in Action. Manning Publications Co., 2013.
[10] K.-T. Chang. Introduction to Geographic Information Systems. McGraw-Hill Science/Engineering/Math, 2015.
[11] A. T. Clements, M. F. Kaashoek, N. Zeldovich, R. T. Morris, and E. Kohler. The scalable commutativity rule: Designing scalable software for multicore processors. ACM TOCS, 32, 2015.
[12] J. Deoliveira. GeoServer: uniting the GeoWeb and spatial data infrastructures. In Proc. GSDI-10, 2008.
[13] C. Diaconu, C. Freedman, E. Ismert, P.-A. Larson, P. Mittal, R. Stonecipher, N. Verma, and M. Zwilling. Hekaton: SQL Server's memory-optimized OLTP engine. In Proc. ACM SIGMOD, 2013.
[14] A. Eldawy and M. F. Mokbel. SpatialHadoop: A MapReduce framework for spatial data. In ICDE. IEEE, 2015.
[15] B. Fitzpatrick. Distributed caching with memcached. Linux Journal, 2004.
[16] R. H. Güting. An introduction to spatial database systems. The VLDB Journal, 3, 1994.
[17] S. Holl and H. Plum. PostGIS. GeoInformatics, 3, 2009.
[18] Y. Hu, S. Ravada, R. Anderson, and B. Bamba. Topological relationship query processing for complex regions in Oracle Spatial. In Proc. 20th International Conference on Advances in Geographic Information Systems. ACM, 2012.
[19] J. N. Hughes, A. Annex, C. N. Eichelberger, A. Fox, A. Hulbert, and M. Ronquest. GeoMesa: a distributed architecture for spatio-temporal fusion. In SPIE: Geospatial Informatics, Fusion, and Motion Video Analytics V, volume 9473, 2015.
[20] A. P. Iyer and I. Stoica. A scalable distributed spatial index for the internet-of-things. In Proc. ACM SoCC, 2017.
[21] M. E. Jacobsen. A Scalable Geospatial System for Range Queries in Data-Driven Precision Agriculture. Master's thesis, University of Copenhagen, Denmark, 2017.
[22] J. H. Jeppesen, E. Ebeid, R. H. Jacobsen, and T. S. Toftegaard. Open geospatial infrastructure for data management and analytics in interdisciplinary research. Computers and Electronics in Agriculture, 145:130–141, 2018.
[23] J. H. Jeppesen, R. H. Jacobsen, R. N. Jørgensen, and T. S. Toftegaard. Towards data-driven precision agriculture using open data and open source software. In International Conference on Agricultural Engineering, 2016.
[24] R. Johnson, I. Pandis, and A. Ailamaki. Eliminating unscalable communication in transaction processing. The VLDB Journal, 23, 2014.
[25] M. Kornacker and D. Banks. High-concurrency locking in R-trees. In VLDB, volume 95, 1995.
[26] B. Kropla. Beginning MapServer: Open Source GIS Development. Apress, 2006.
[27] T. Lahiri, M.-A. Neimat, and S. Folkman. Oracle TimesTen: An In-Memory Database for Enterprise Applications. IEEE Data Eng. Bull., 36, 2013.
[28] Y.-F. Nie, H.-L. Liu, and H. Xu. Research on GeoWebCache tile map service middleware. Science of Surveying and Mapping, 36, 2011.
[29] S. Nishimura and H. Yokota. Quilts: Multidimensional Data Partitioning Framework Based on Query-Aware and Skew-Tolerant Space-Filling Curves. In Proc. ACM SIGMOD, 2017.
[30] M. Orru, R. Paolillo, A. Detti, G. Rossi, and N. B. Melazzi. Demonstration of OpenGeoBase: The ICN NoSQL spatio-temporal database. In IEEE LANMAN, 2017.
[31] B.-U. Pagel, H.-W. Six, H. Toben, and P. Widmayer. Towards an analysis of range query performance in spatial data structures. In Proc. PODS, 1993.
[32] V. Pandey, A. Kipf, D. Vorona, T. Mühlbauer, T. Neumann, and A. Kemper. High-performance geospatial analytics in HyPerSpace. In Proc. ACM SIGMOD, 2016.
[33] J. Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O'Reilly Media, Inc., 2007.
[34] P. Schut and A. Whiteside. OpenGIS Web Processing Service. OGC project document, 2007.
[35] N. Shamgunov. The MemSQL In-Memory Database System. In IMDM@VLDB, 2014.
[36] D. Šidlauskas, S. Šaltenis, and C. S. Jensen. Processing of extreme moving-object update and query workloads in main memory. The VLDB Journal, 23, 2014.
[37] B. Sowell, M. V. Salles, T. Cao, A. Demers, and J. Gehrke. An experimental analysis of iterated spatial joins in main memory. Proc. VLDB, 6, 2013.
[38] A. Stefanidis, A. Crooks, and J. Radzikowski. Harvesting ambient geospatial information from social media feeds. GeoJournal, 78, 2013.
[39] M. Stonebraker and U. Cetintemel. "One size fits all": An idea whose time has come and gone. In Proc. IEEE ICDE, 2005.
[40] T. P. Universe. Introducing SpaceBase: a New Realtime Spatial Data Store, 21 March, 2012.
[41] OGC. Web Coverage Service, 2010.
[42] OGC. Web Feature Service. URL: http://www.opengeospatial.org/standards/wfs, 2005.
[43] OGC. Web Map Service, 2004.