ceur-ws.org/Vol-2399/paper10.pdf Optimized Spatio-Temporal Data Structures for Hybrid Transactional and Analytical Workloads on Columnar In-Memory Databases Keven Richly supervised by Prof. Dr. h.c. Hasso Plattner Hasso Plattner Institute, University of Potsdam August-Bebel-Str. 88 Potsdam, Germany keven.richly@hpi.de ABSTRACT and the inherent inaccuracy of the data. For these reasons, Rapid advances in location-acquisition technologies have led it is a nontrivial task to manage and store vast amounts of to large amounts of trajectory data. This data is the founda- these data, which are rapidly accumulated. Especially, if we tion for a broad spectrum of services driven and improved consider hybrid transactional and analytical workloads (so- by trajectory data mining. However, for hybrid transac- called HTAP or mixed workloads) on spatio-temporal data, tional and analytical workloads, the storing and process- which are challenging concerning space and time complexity. ing of rapidly accumulated trajectory data is a non-trivial The most common format to store trajectory data is the task. In this paper, we propose an approach to optimize sample point format. Here, each observed location is stored the trajectory data management capabilities for relational as a tuple with the following attributes: trajectory iden- database systems. Based on the observations that the re- tifier, moving object identifier, multi-dimensional location, lational database structure is well-suited to store trajectory and timestamp. In trajectory query processing, we distin- data in the sample point format and that the access pat- guish the four query types: (i) trajectory-based queries, (ii) terns for trajectory data change over time, we develope a spatio-temporal range queries, (iii) KNN queries, and (iv) concept for optimized spatio-temporal data structures for in- top-k queries [7]. Trajectory-based queries refer to the tra- memory column stores and describe a workload-aware tiered jectory of a single moving object and return the entire trajec- data compression approach. The first evaluations of the ap- tory, a specific segment of the trajectory, or related informa- proach demonstrate that the observed data access patterns tion like the length of the trajectory. Relational databases of di↵erent real-world use cases are supporting the proposed are able to store data in the sample point format and pro- system architecture. cess queries of the four mentioned types. For that reason, it could be promising to investigate this research area. Fur- thermore, relational data management systems have some 1. INTRODUCTION further advantages that could be used for trajectory data In recent years, rapid advances in location-acquisition tech- (e.g., standardized query language, highly optimized data nologies have led to large amounts of time-stamped loca- processing capabilities). To process spatio-temporal data tion data. Positioning technologies like GPS-based or com- efficiently, we have to develop optimized data structures for munication network-based systems enable the tracking of spatio-temporal data. various moving objects. This data is the foundation for a Due to the large amount of trajectory data, which have to wide spectrum of applications [6, 10, 9]. A trajectory is be stored in various use cases, we also have to evaluate di↵er- represented by a series of chronologically ordered sampling ent compression mechanisms for trajectory data to reduce points. Each sampling point contains spatial information, the data footprint. Particularly for in-memory databases, which is represented by a multidimensional coordinate in a we have to utilize the available memory efficiently. Addi- geographical space, and temporal information, which is rep- tionally, in di↵erent use cases we can observe that a spe- resented by a timestamp. Additionally, an object identifier cific trajectory segment is less accessed over time. Also, the assigns each sampling point to a specific moving object and query types and granularity of queries, which access a spe- the corresponding trajectory. Thereby, the duration and cific segment change over time. Similar characteristics can sampling rate depends on the application. Based on the be observed in time-series data analysis. For that reason, characteristics of spatio-temporal trajectory data, there ex- it is necessary to analyze how we can adopt compression ist four key challenges: the data volume, the high update mechanisms for such kind of data access patterns to further rate (data velocity), the query latency of analytical queries, reduce the memory consumption and increase the perfor- mance by minimizing the number of sample points. 2. RESEARCH ISSUES In this section, we want to highlight the research aspects, Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019. Los Angeles, California. Copyright (C) 2019 for this paper by its authors. that we derived from the observations mentioned in the pre- Copying permitted for private and academic purposes. vious section. We propose a concept to integrate storage mechanisms for trajectory data into a relational in-memory strongly depends on the addressed use cases: (i) scalability, database with particular attention to leverage the capabil- (ii) footprint reduction, (iii) elasticity, (iv) efficiency, or (v) ities of the columnar database layout and the adoption to query performance [7]. observed data access patterns. The objective is to reduce the data footprint and to better utilize the available mem- ory bandwidth. We accordingly focus on data structures and compression techniques. 1. Optimized temporal and spatial data structures for columnar in-memory databases: The sample point for- mat, which is used to store trajectory data by the ma- jority of spatio-temporal data management systems [7], is well suited for the table schema of relational database systems. Also, di↵erent database systems integrated optimized spatial data types [4]. In di↵erent applica- tions, we could observe that the performance of spatio- temporal range queries is strongly dominated by the temporal scan operation. For that reason, it is nec- essary to develop new concepts, which eliminate this Figure 1: A classification model for trajectory data bottleneck. The di↵erent types of time awareness that management systems various use cases provide should be considered [7]. Ad- ditionally, we should evaluate the spatial functions of Although the various systems are designed for di↵erent database systems with regard to trajectory data man- infrastructures and optimized for di↵erent use cases, simi- agement. larities in the general structure can be determined and sum- marized in a general blueprint. Most trajectory data man- 2. Compression techniques for trajectory data: Zhang et agement systems are optimized for spatio-temporal range al. [13] evaluated the compression ratio and quality of queries or k-nearest neighbor queries. They use a hybrid various trajectory simplification algorithms. Colum- partitioning approach to efficiently skip data partitions dur- nar in-memory databases and databases, in general, ing the query execution. To archive spatio-temporal data lo- use additional data compression techniques apart form cality the data is partitioned by the temporal dimension first delta encoding (e.g., dictionary or run-length encod- and the spatial dimension second. The trajectory of a mov- ing). For that reason, it is necessary to evaluate the ing object is not stored together in one partition. The sys- e↵ect of these compression techniques in the context of tems divide a trajectory into di↵erent trajectory segments columnar in-memory databases and analyze the poten- and distribute them on the data partitions on the basis of the tials of compression mechanisms that combine tradi- partitioning scheme. Inside a data partition, the observed tional database compression approaches with special- locations are often stored in a sample point data format, ized trajectory simplification algorithms. whereby the sample points are ordered by the trajectory 3. Workload-aware tiered data compression: The struc- identifier to increase the performance of trajectory-based tures of various trajectory data management systems queries. To compress the data all systems use delta com- have a simar blueprint (see Figure 2). In this blueprint, pression. Additionally, a wide range of systems applies fur- the trajectory data is stored in data partitions. In gen- ther compression techniques for all data partitions. To op- eral, the systems do not distinguish the di↵erent data timize the query performance, the trajectory data manage- partitions and treat all equally. In various applications ment systems use a layered index structure. As displayed in and especially in systems, which analysze times series, Figure 2, the structure consists of a three-level hybrid index. we could observe that the data access pattern for a The first layer is a temporal index structure, which divides database entry changed over time [1, 5]. In time series the temporal dimensions into di↵erent time intervals. There data management, an approach is to increase the com- are di↵erent implementations for this index (e.g., skip-list, pression ratio for old values [5]. Considering the mas- range index). For each time interval in the first layer, there sive amounts of trajectory data, which are tracked by exists a spatial index. In most cases, the spatial index is various companies (e.g., transportation network com- represented by a traditional KD-tree, quad-tree or oct-tree. panies), a tiered compression approach could reduce The last layer is normally a tree structure (e.g., b+ tree), the data footprint reasonably. which index the di↵erent spatio-temporal partitions. There are also several systems, which already integrate data structures to store spatial data and spatio-temporal data in relational database systems [4, 12]. Integrating tra- 3. RELATED WORK jectory data management into relational database systems As shown in Figure 1, there are several systems, which fo- has the advantages that there is a standardized query lan- cus on the storage, indexing, and compression of trajectory guage, the spatio-temporal data can be combined with other data. Various systems were primarily designed to store spa- data (e.g., business data), and it is possible to leverage the tial data points but they are also used for trajectory data. highly optimized data processing capabilities of these sys- With the increased availability of spatio-temporal data man- tems. agement system were built for di↵erent system infrastruc- tures, which leverage the specific characteristics of trajec- tory data. The optimization focus of the developed system our proposed system, these changes lead to a new optimal ordering of data tuples in chunks or compression technique. Additionally, we developed an approach to store times- tamps more efficiently in columnar databases to address the previously mentioned problem that the temporal scan per- formance often dominates the execution time of trajectory database queries. For that reason, we propose a separate columns approach. The values for year, month, day, hour, minute, second, and fractional second are each stored in a separate column. This approach has some advantages over traditional timestamp concepts. Since there are only sev- eral possible values for the di↵erent columns, dictionary en- coding is e↵ective due to saturated and small dictionaries. Another aspect of reducing the data footprint is that in one chunk only a subset of the di↵erent columns (e.g., minutes or seconds) is modified. For that reason, we could mini- mize the memory consumption of the unmodified columns Figure 2: General blueprint of trajectory manage- by applying run-length encoding. Also, we could improve ment systems the bandwidth utilization for temporal scan operations. A disadvantage is that for between queries, it is necessary to scan eventually multiple columns. 4. RESERACH PLAN 4.1.2 Workload-aware tiered data compression In this section, we describe the approach we propose to Based on the observation that the data access patterns optimize the capabilities of relational databases to store for a trajectory segment change over time, we propose a and process trajectory data. Furthermore, we explain our tiered data compression approach. To select a suitable com- planned evaluation of the concept. pression configuration, detailed knowledge about the data characteristics and the workload are required. Most cur- 4.1 Approach rent databases use simple heuristics to choose a compres- The foundation of the proposed approach is the chunk sion scheme for a given column [2]. However, we propose concept implemented by the relational in-memory research a heuristic that selects a compression configuration, which database Hyrise [3]. As displayed in Figure 3, a database defines the applied compression technique for columns on a table is divided into chunks, which are temporally ordered per-chunk basis. horizontal partitions of a certain size. A chunk contains Depending on the workload, the determined configuration fragments of all columns of a database table, which are defines for each chunk, if the spatio-temporal data should called segments. Besides efficient multiprocessing and chunk be stored uncompressed, lossless compressed with a specific pruning, this structure enables the creation of auxiliary data compression algorithm or simplified based on a trajectory structures (e.g., indices) on a per-chunk basis. Additionally, simplification algorithm. Additionally, it should also spec- it provides the possibility to change the order criteria be- ify the compression ratio of the simplification algorithm per tween di↵erent chunks (e.g., trajectory identifier, location) chunk. In general, we expect that recent chunks are uncom- and enables the application of varying trajectory simplifi- pressed or only compressed with lossless compression tech- cation algorithms. We can apply the same concept to the niques. In contrast, the compression ratio increases with the encodings of segments. For that reason, some segments of a age of a chunk (see Figure 3). Besides reducing the mem- column can be unencoded, others dictionary-encoded, and ory footprint, the selected configuration has also impact on further segments can be encoded with additional compres- the query performance as well as the accuracy of the re- sion techniques [2]. sults. We see potential to optimize the compression schema for a given memory budget and error-based application con- 4.1.1 Optimized temporal and spatial data structures straints (e.g., accuracy metrics [11]). It might be prefer- for columnar in-memory databases able to lose performance or accuracy for less often accessed These capabilities enable the workload-aware selection of chunks to leverage the gained space to apply only light- compression techniques. Furthermore, it is possible to adopt weight compression methods on frequently accessed chunks. the ordering of data tuples per chunk. For example, if we assume a workload that consists of mainly trajectory-based 4.2 Planned Evaluation queries on recent trajectories and in contrast mainly spatio- For the evaluations of our approach, we plan to use two temproal range queries on past trajectories it would be ben- real-world use cases. The first one is in the sports sector eficial to order the newer chunks by trajectory identifier and and includes positional information of soccer players during the older ones by location (see Figure 3). Also, space-filling professional soccer games [8]. The data is collected with spe- curves might be used to improve the scan performance of cialized camera systems, which track around 3.5 million data spatial filter queries. Moreover, this approach allows adop- points per game. By analyzing this data, we could observe tion to changes in the data distribution or the workload. data access patterns that support our approach [6]. In the Zhang et al. [14] also address this topic in TrajSpark by post-processing mainly the data of the last game is analyzed. using a time-decay model to monitor the data distribution Additionally, algorithms are used to detect and classify spe- and adopt the used indexing schema correspondingly. In cific game events (e.g., passes) on the fine-grained trajectory Table T [3] M. Dreseler, J. Kossmann, M. Boissier, S. Klauck, Column T.Trajid Column T.Trajobj Column T.Trajlocation Column T.Trajtime M. Uflacker, and H. Plattner. Hyrise re-engineered: Chunk #1 immutable An extensible database system for research in Segment a Segment b Segment c Segment d Sorted by: relational in-memory data management. In 22nd Location unencoded run length- encoded dictionary- encoded dictionary- encoded Trajectory Simplification: International Conference on Extending Database Strong Technology (EDBT), pages 313–324, 3 2019. Chunk #2 immutable [4] V. Pandey, A. Kipf, D. Vorona, T. Mühlbauer, Sorted by: Segment a unencoded Segment b dictionary- Segment c dictionary- Segment d dictionary- Location T. Neumann, and A. Kemper. High-performance Trajectory Simplification: encoded encoded encoded medium geospatial analytics in hyperspace. In Proceedings of the 2016 International Conference on Management of … Data, pages 2145–2148. ACM, 2016. Chunk #n-1 immutable Segment a Segment b Segment c Segment d Sorted by: [5] T. Pelkonen, S. Franklin, J. Teller, P. Cavallaro, unencoded unencoded unencoded unencoded Object Identifier Trajectory Simplification: Q. Huang, J. Meza, and K. Veeraraghavan. Gorilla: A - fast, scalable, in-memory time series database. Chunk #n immutable Proceedings of the VLDB Endowment, Segment a Segment b Segment c Segment d Sorted by: 8(12):1816–1827, 2015. unencoded unencoded unencoded unencoded Time Trajectory Simplification: - [6] K. Richly. Leveraging spatio-temporal soccer data to define a graphical query language for game recordings. In IEEE International Conference on Big Data, Big Data 2018, Seattle, WA, USA, December 10-13, 2018, Figure 3: Depiction of the storage layout for an ex- pages 3456–3463, 2018. emplary table T with n chunks and four trajectory relevant attributes [7] K. Richly. A survey on trajectory data management for hybrid transactional and analytical workloads. In IEEE International Conference on Big Data, Big Data data. The trajectory data of other games of a season is not 2018, Seattle, WA, USA, December 10-13, 2018, pages analyzed in that depth. All analyses on data of past seasons 562–569, 2018. are typically only on an event-based level. The second one [8] K. Richly, M. Bothe, T. Rohlo↵, and C. Schwarz. is a dataset of a transportation network company, which in- Recognizing compound events in spatio-temporal cludes several million driver positions per hour. Due to the football data. In International Conference on Internet amount of the collected data, it is only possible to perform of Things and Big Data (IoTBD), 4 2016. analysis on compressed data or subsets of the data. For vari- [9] K. Richly, F. Moritz, and C. Schwarz. Utilizing ous use cases (e.g., driver-passenger assignment) only recent artificial neural networks to detect compound events data is analyzed. Also, older trajectories might not repre- in spatio-temporal soccer data. In SIGKDD17 sent the current traffic situation. Workshop on Mining and Learning from Time Series (MiLeTS), 2017. 5. CONCLUSIONS [10] K. Richly, R. Teusner, A. Immer, F. Windheuser, and In this paper, we have presented our approach to store L. Wolf. Optimizing routes of public transportation trajectory data in a relational database system. Based on systems by analyzing the data of taxi rides. In our observation that data points of trajectories are less fre- Proceedings of the 1st International ACM quent accessed over time, we developed a workload-aware SIGSPATIAL Workshop on Smart Cities and Urban tired data compression approach to reduce the data foot- Analytics, UrbanGIS@SIGSPATIAL 2015, Bellevue, print and better utilize the available memory bandwidth. WA, USA, November 3-6, 2015, pages 70–76, 2015. Additionally, we presented an optimized data layout for tem- [11] P. Sun, S. Xia, G. Yuan, and D. Li. An overview of poral data in the context of in-memory column stores. The moving object trajectory compression algorithms. first evaluations of the approach demonstrate that the ob- Mathematical Problems in Engineering, 2016, 2016. served data access patterns of di↵erent real-world use cases [12] H. Wang, K. Zheng, J. Xu, B. Zheng, X. Zhou, and are supporting the proposed system architecture. Upcoming S. Sadiq. Sharkdb: An in-memory column-oriented tasks include the further analysis of trajectory compression trajectory storage. In Proceedings of the 23rd ACM techniques for columnar databases, the implementation of international conference on conference on information workload-aware metrics to select compression techniques for and knowledge management, pages 1409–1418. ACM, chunks, assessing the impact of di↵erent configurations and 2014. an extensive evaluation with real-world systems. [13] D. Zhang, M. Ding, D. Yang, Y. Liu, J. Fan, and H. T. Shen. Trajectory simplification: an experimental 6. REFERENCES study and quality analysis. Proceedings of the VLDB [1] Reducing the footprint of main memory htap systems: Endowment, 11(9):934–946, 2018. Removing, compressing, tiering, and ignoring data. In [14] Z. Zhang, C. Jin, J. Mao, X. Yang, and A. Zhou. PhD Workshop at VLDB, volume 2175 of CEUR Trajspark: A scalable and efficient in-memory Workshop Proceedings. CEUR-WS.org, 8 2018. management system for big trajectory data. In [2] M. Boissier and M. Jendruk. Workload-driven and Asia-Pacific Web (APWeb) and Web-Age Information robust selection of compression schemes for column Management (WAIM) Joint Conference on Web and stores. In 22nd International Conference on Extending Big Data, pages 11–26. Springer, 2017. Database Technology (EDBT), pages 674–677, 3 2019.