ceur-ws.org/Vol-2399/paper10.pdf


      Optimized Spatio-Temporal Data Structures for Hybrid
      Transactional and Analytical Workloads on Columnar
                     In-Memory Databases

                                                               Keven Richly
                                            supervised by Prof. Dr. h.c. Hasso Plattner
                                           Hasso Plattner Institute, University of Potsdam
                                                       August-Bebel-Str. 88
                                                        Potsdam, Germany
                                                         keven.richly@hpi.de

ABSTRACT                                                                 and the inherent inaccuracy of the data. For these reasons,
Rapid advances in location-acquisition technologies have led             it is a nontrivial task to manage and store vast amounts of
to large amounts of trajectory data. This data is the founda-            these data, which are rapidly accumulated. Especially, if we
tion for a broad spectrum of services driven and improved                consider hybrid transactional and analytical workloads (so-
by trajectory data mining. However, for hybrid transac-                  called HTAP or mixed workloads) on spatio-temporal data,
tional and analytical workloads, the storing and process-                which are challenging concerning space and time complexity.
ing of rapidly accumulated trajectory data is a non-trivial                 The most common format to store trajectory data is the
task. In this paper, we propose an approach to optimize                  sample point format. Here, each observed location is stored
the trajectory data management capabilities for relational               as a tuple with the following attributes: trajectory iden-
database systems. Based on the observations that the re-                 tifier, moving object identifier, multi-dimensional location,
lational database structure is well-suited to store trajectory           and timestamp. In trajectory query processing, we distin-
data in the sample point format and that the access pat-                 guish the four query types: (i) trajectory-based queries, (ii)
terns for trajectory data change over time, we develope a                spatio-temporal range queries, (iii) KNN queries, and (iv)
concept for optimized spatio-temporal data structures for in-            top-k queries [7]. Trajectory-based queries refer to the tra-
memory column stores and describe a workload-aware tiered                jectory of a single moving object and return the entire trajec-
data compression approach. The first evaluations of the ap-              tory, a specific segment of the trajectory, or related informa-
proach demonstrate that the observed data access patterns                tion like the length of the trajectory. Relational databases
of di↵erent real-world use cases are supporting the proposed             are able to store data in the sample point format and pro-
system architecture.                                                     cess queries of the four mentioned types. For that reason,
                                                                         it could be promising to investigate this research area. Fur-
                                                                         thermore, relational data management systems have some
1.    INTRODUCTION                                                       further advantages that could be used for trajectory data
   In recent years, rapid advances in location-acquisition tech-         (e.g., standardized query language, highly optimized data
nologies have led to large amounts of time-stamped loca-                 processing capabilities). To process spatio-temporal data
tion data. Positioning technologies like GPS-based or com-               efficiently, we have to develop optimized data structures for
munication network-based systems enable the tracking of                  spatio-temporal data.
various moving objects. This data is the foundation for a                   Due to the large amount of trajectory data, which have to
wide spectrum of applications [6, 10, 9]. A trajectory is                be stored in various use cases, we also have to evaluate di↵er-
represented by a series of chronologically ordered sampling              ent compression mechanisms for trajectory data to reduce
points. Each sampling point contains spatial information,                the data footprint. Particularly for in-memory databases,
which is represented by a multidimensional coordinate in a               we have to utilize the available memory efficiently. Addi-
geographical space, and temporal information, which is rep-              tionally, in di↵erent use cases we can observe that a spe-
resented by a timestamp. Additionally, an object identifier              cific trajectory segment is less accessed over time. Also, the
assigns each sampling point to a specific moving object and              query types and granularity of queries, which access a spe-
the corresponding trajectory. Thereby, the duration and                  cific segment change over time. Similar characteristics can
sampling rate depends on the application. Based on the                   be observed in time-series data analysis. For that reason,
characteristics of spatio-temporal trajectory data, there ex-            it is necessary to analyze how we can adopt compression
ist four key challenges: the data volume, the high update                mechanisms for such kind of data access patterns to further
rate (data velocity), the query latency of analytical queries,           reduce the memory consumption and increase the perfor-
                                                                         mance by minimizing the number of sample points.

                                                                         2. RESEARCH ISSUES
                                                                           In this section, we want to highlight the research aspects,
Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019. Los
Angeles, California. Copyright (C) 2019 for this paper by its authors.   that we derived from the observations mentioned in the pre-
Copying permitted for private and academic purposes.                     vious section. We propose a concept to integrate storage
mechanisms for trajectory data into a relational in-memory           strongly depends on the addressed use cases: (i) scalability,
database with particular attention to leverage the capabil-          (ii) footprint reduction, (iii) elasticity, (iv) efficiency, or (v)
ities of the columnar database layout and the adoption to            query performance [7].
observed data access patterns. The objective is to reduce
the data footprint and to better utilize the available mem-
ory bandwidth. We accordingly focus on data structures
and compression techniques.

     1. Optimized temporal and spatial data structures for
        columnar in-memory databases: The sample point for-
        mat, which is used to store trajectory data by the ma-
        jority of spatio-temporal data management systems [7],
        is well suited for the table schema of relational database
        systems. Also, di↵erent database systems integrated
        optimized spatial data types [4]. In di↵erent applica-
        tions, we could observe that the performance of spatio-
        temporal range queries is strongly dominated by the
        temporal scan operation. For that reason, it is nec-
        essary to develop new concepts, which eliminate this         Figure 1: A classification model for trajectory data
        bottleneck. The di↵erent types of time awareness that        management systems
        various use cases provide should be considered [7]. Ad-
        ditionally, we should evaluate the spatial functions of         Although the various systems are designed for di↵erent
        database systems with regard to trajectory data man-         infrastructures and optimized for di↵erent use cases, simi-
        agement.                                                     larities in the general structure can be determined and sum-
                                                                     marized in a general blueprint. Most trajectory data man-
     2. Compression techniques for trajectory data: Zhang et         agement systems are optimized for spatio-temporal range
        al. [13] evaluated the compression ratio and quality of      queries or k-nearest neighbor queries. They use a hybrid
        various trajectory simplification algorithms. Colum-         partitioning approach to efficiently skip data partitions dur-
        nar in-memory databases and databases, in general,           ing the query execution. To archive spatio-temporal data lo-
        use additional data compression techniques apart form        cality the data is partitioned by the temporal dimension first
        delta encoding (e.g., dictionary or run-length encod-        and the spatial dimension second. The trajectory of a mov-
        ing). For that reason, it is necessary to evaluate the       ing object is not stored together in one partition. The sys-
        e↵ect of these compression techniques in the context of      tems divide a trajectory into di↵erent trajectory segments
        columnar in-memory databases and analyze the poten-          and distribute them on the data partitions on the basis of the
        tials of compression mechanisms that combine tradi-          partitioning scheme. Inside a data partition, the observed
        tional database compression approaches with special-         locations are often stored in a sample point data format,
        ized trajectory simplification algorithms.                   whereby the sample points are ordered by the trajectory
     3. Workload-aware tiered data compression: The struc-           identifier to increase the performance of trajectory-based
        tures of various trajectory data management systems          queries. To compress the data all systems use delta com-
        have a simar blueprint (see Figure 2). In this blueprint,    pression. Additionally, a wide range of systems applies fur-
        the trajectory data is stored in data partitions. In gen-    ther compression techniques for all data partitions. To op-
        eral, the systems do not distinguish the di↵erent data       timize the query performance, the trajectory data manage-
        partitions and treat all equally. In various applications    ment systems use a layered index structure. As displayed in
        and especially in systems, which analysze times series,      Figure 2, the structure consists of a three-level hybrid index.
        we could observe that the data access pattern for a          The first layer is a temporal index structure, which divides
        database entry changed over time [1, 5]. In time series      the temporal dimensions into di↵erent time intervals. There
        data management, an approach is to increase the com-         are di↵erent implementations for this index (e.g., skip-list,
        pression ratio for old values [5]. Considering the mas-      range index). For each time interval in the first layer, there
        sive amounts of trajectory data, which are tracked by        exists a spatial index. In most cases, the spatial index is
        various companies (e.g., transportation network com-         represented by a traditional KD-tree, quad-tree or oct-tree.
        panies), a tiered compression approach could reduce          The last layer is normally a tree structure (e.g., b+ tree),
        the data footprint reasonably.                               which index the di↵erent spatio-temporal partitions.
                                                                        There are also several systems, which already integrate
                                                                     data structures to store spatial data and spatio-temporal
                                                                     data in relational database systems [4, 12]. Integrating tra-
3.     RELATED WORK                                                  jectory data management into relational database systems
   As shown in Figure 1, there are several systems, which fo-        has the advantages that there is a standardized query lan-
cus on the storage, indexing, and compression of trajectory          guage, the spatio-temporal data can be combined with other
data. Various systems were primarily designed to store spa-          data (e.g., business data), and it is possible to leverage the
tial data points but they are also used for trajectory data.         highly optimized data processing capabilities of these sys-
With the increased availability of spatio-temporal data man-         tems.
agement system were built for di↵erent system infrastruc-
tures, which leverage the specific characteristics of trajec-
tory data. The optimization focus of the developed system
                                                                 our proposed system, these changes lead to a new optimal
                                                                 ordering of data tuples in chunks or compression technique.
                                                                   Additionally, we developed an approach to store times-
                                                                 tamps more efficiently in columnar databases to address the
                                                                 previously mentioned problem that the temporal scan per-
                                                                 formance often dominates the execution time of trajectory
                                                                 database queries. For that reason, we propose a separate
                                                                 columns approach. The values for year, month, day, hour,
                                                                 minute, second, and fractional second are each stored in a
                                                                 separate column. This approach has some advantages over
                                                                 traditional timestamp concepts. Since there are only sev-
                                                                 eral possible values for the di↵erent columns, dictionary en-
                                                                 coding is e↵ective due to saturated and small dictionaries.
                                                                 Another aspect of reducing the data footprint is that in one
                                                                 chunk only a subset of the di↵erent columns (e.g., minutes
                                                                 or seconds) is modified. For that reason, we could mini-
                                                                 mize the memory consumption of the unmodified columns
Figure 2: General blueprint of trajectory manage-                by applying run-length encoding. Also, we could improve
ment systems                                                     the bandwidth utilization for temporal scan operations. A
                                                                 disadvantage is that for between queries, it is necessary to
                                                                 scan eventually multiple columns.
4.    RESERACH PLAN                                              4.1.2    Workload-aware tiered data compression
  In this section, we describe the approach we propose to
                                                                    Based on the observation that the data access patterns
optimize the capabilities of relational databases to store
                                                                 for a trajectory segment change over time, we propose a
and process trajectory data. Furthermore, we explain our
                                                                 tiered data compression approach. To select a suitable com-
planned evaluation of the concept.
                                                                 pression configuration, detailed knowledge about the data
                                                                 characteristics and the workload are required. Most cur-
4.1     Approach                                                 rent databases use simple heuristics to choose a compres-
   The foundation of the proposed approach is the chunk          sion scheme for a given column [2]. However, we propose
concept implemented by the relational in-memory research         a heuristic that selects a compression configuration, which
database Hyrise [3]. As displayed in Figure 3, a database        defines the applied compression technique for columns on a
table is divided into chunks, which are temporally ordered       per-chunk basis.
horizontal partitions of a certain size. A chunk contains           Depending on the workload, the determined configuration
fragments of all columns of a database table, which are          defines for each chunk, if the spatio-temporal data should
called segments. Besides efficient multiprocessing and chunk     be stored uncompressed, lossless compressed with a specific
pruning, this structure enables the creation of auxiliary data   compression algorithm or simplified based on a trajectory
structures (e.g., indices) on a per-chunk basis. Additionally,   simplification algorithm. Additionally, it should also spec-
it provides the possibility to change the order criteria be-     ify the compression ratio of the simplification algorithm per
tween di↵erent chunks (e.g., trajectory identifier, location)    chunk. In general, we expect that recent chunks are uncom-
and enables the application of varying trajectory simplifi-      pressed or only compressed with lossless compression tech-
cation algorithms. We can apply the same concept to the          niques. In contrast, the compression ratio increases with the
encodings of segments. For that reason, some segments of a       age of a chunk (see Figure 3). Besides reducing the mem-
column can be unencoded, others dictionary-encoded, and          ory footprint, the selected configuration has also impact on
further segments can be encoded with additional compres-         the query performance as well as the accuracy of the re-
sion techniques [2].                                             sults. We see potential to optimize the compression schema
                                                                 for a given memory budget and error-based application con-
4.1.1    Optimized temporal and spatial data structures          straints (e.g., accuracy metrics [11]). It might be prefer-
         for columnar in-memory databases                        able to lose performance or accuracy for less often accessed
   These capabilities enable the workload-aware selection of     chunks to leverage the gained space to apply only light-
compression techniques. Furthermore, it is possible to adopt     weight compression methods on frequently accessed chunks.
the ordering of data tuples per chunk. For example, if we
assume a workload that consists of mainly trajectory-based       4.2 Planned Evaluation
queries on recent trajectories and in contrast mainly spatio-       For the evaluations of our approach, we plan to use two
temproal range queries on past trajectories it would be ben-     real-world use cases. The first one is in the sports sector
eficial to order the newer chunks by trajectory identifier and   and includes positional information of soccer players during
the older ones by location (see Figure 3). Also, space-filling   professional soccer games [8]. The data is collected with spe-
curves might be used to improve the scan performance of          cialized camera systems, which track around 3.5 million data
spatial filter queries. Moreover, this approach allows adop-     points per game. By analyzing this data, we could observe
tion to changes in the data distribution or the workload.        data access patterns that support our approach [6]. In the
Zhang et al. [14] also address this topic in TrajSpark by        post-processing mainly the data of the last game is analyzed.
using a time-decay model to monitor the data distribution        Additionally, algorithms are used to detect and classify spe-
and adopt the used indexing schema correspondingly. In           cific game events (e.g., passes) on the fine-grained trajectory
 Table T                                                                                                               [3] M. Dreseler, J. Kossmann, M. Boissier, S. Klauck,
        Column T.Trajid   Column T.Trajobj   Column T.Trajlocation   Column T.Trajtime
                                                                                                                           M. Uflacker, and H. Plattner. Hyrise re-engineered:
     Chunk #1                                                                                   immutable                  An extensible database system for research in
         Segment a          Segment b           Segment c             Segment d                 Sorted by:                 relational in-memory data management. In 22nd
                                                                                                 Location
         unencoded         run length-
                             encoded
                                                dictionary-
                                                 encoded
                                                                      dictionary-
                                                                       encoded           Trajectory Simplification:        International Conference on Extending Database
                                                                                                  Strong
                                                                                                                           Technology (EDBT), pages 313–324, 3 2019.
     Chunk #2                                                                                   immutable
                                                                                                                       [4] V. Pandey, A. Kipf, D. Vorona, T. Mühlbauer,
                                                                                                Sorted by:
         Segment a
         unencoded
                           Segment b
                           dictionary-
                                                Segment c
                                                dictionary-
                                                                      Segment d
                                                                      dictionary-
                                                                                                 Location                  T. Neumann, and A. Kemper. High-performance
                                                                                         Trajectory Simplification:
                            encoded              encoded               encoded                   medium                    geospatial analytics in hyperspace. In Proceedings of
                                                                                                                           the 2016 International Conference on Management of
                                             …

                                                                                                                           Data, pages 2145–2148. ACM, 2016.
     Chunk #n-1                                                                                 immutable

         Segment a         Segment b            Segment c             Segment d                 Sorted by:
                                                                                                                       [5] T. Pelkonen, S. Franklin, J. Teller, P. Cavallaro,
         unencoded         unencoded            unencoded             unencoded               Object Identifier
                                                                                         Trajectory Simplification:
                                                                                                                           Q. Huang, J. Meza, and K. Veeraraghavan. Gorilla: A
                                                                                                     -
                                                                                                                           fast, scalable, in-memory time series database.
     Chunk #n                                                                                   immutable                  Proceedings of the VLDB Endowment,
         Segment a         Segment b            Segment c             Segment d                 Sorted by:                 8(12):1816–1827, 2015.
         unencoded         unencoded            unencoded             unencoded                    Time
                                                                                         Trajectory Simplification:
                                                                                                     -
                                                                                                                       [6] K. Richly. Leveraging spatio-temporal soccer data to
                                                                                                                           define a graphical query language for game recordings.
                                                                                                                           In IEEE International Conference on Big Data, Big
                                                                                                                           Data 2018, Seattle, WA, USA, December 10-13, 2018,
Figure 3: Depiction of the storage layout for an ex-
                                                                                                                           pages 3456–3463, 2018.
emplary table T with n chunks and four trajectory
relevant attributes                                                                                                    [7] K. Richly. A survey on trajectory data management
                                                                                                                           for hybrid transactional and analytical workloads. In
                                                                                                                           IEEE International Conference on Big Data, Big Data
data. The trajectory data of other games of a season is not                                                                2018, Seattle, WA, USA, December 10-13, 2018, pages
analyzed in that depth. All analyses on data of past seasons                                                               562–569, 2018.
are typically only on an event-based level. The second one                                                             [8] K. Richly, M. Bothe, T. Rohlo↵, and C. Schwarz.
is a dataset of a transportation network company, which in-                                                                Recognizing compound events in spatio-temporal
cludes several million driver positions per hour. Due to the                                                               football data. In International Conference on Internet
amount of the collected data, it is only possible to perform                                                               of Things and Big Data (IoTBD), 4 2016.
analysis on compressed data or subsets of the data. For vari-                                                          [9] K. Richly, F. Moritz, and C. Schwarz. Utilizing
ous use cases (e.g., driver-passenger assignment) only recent                                                              artificial neural networks to detect compound events
data is analyzed. Also, older trajectories might not repre-                                                                in spatio-temporal soccer data. In SIGKDD17
sent the current traffic situation.                                                                                        Workshop on Mining and Learning from Time Series
                                                                                                                           (MiLeTS), 2017.
5.      CONCLUSIONS                                                                                                   [10] K. Richly, R. Teusner, A. Immer, F. Windheuser, and
   In this paper, we have presented our approach to store                                                                  L. Wolf. Optimizing routes of public transportation
trajectory data in a relational database system. Based on                                                                  systems by analyzing the data of taxi rides. In
our observation that data points of trajectories are less fre-                                                             Proceedings of the 1st International ACM
quent accessed over time, we developed a workload-aware                                                                    SIGSPATIAL Workshop on Smart Cities and Urban
tired data compression approach to reduce the data foot-                                                                   Analytics, UrbanGIS@SIGSPATIAL 2015, Bellevue,
print and better utilize the available memory bandwidth.                                                                   WA, USA, November 3-6, 2015, pages 70–76, 2015.
Additionally, we presented an optimized data layout for tem-                                                          [11] P. Sun, S. Xia, G. Yuan, and D. Li. An overview of
poral data in the context of in-memory column stores. The                                                                  moving object trajectory compression algorithms.
first evaluations of the approach demonstrate that the ob-                                                                 Mathematical Problems in Engineering, 2016, 2016.
served data access patterns of di↵erent real-world use cases                                                          [12] H. Wang, K. Zheng, J. Xu, B. Zheng, X. Zhou, and
are supporting the proposed system architecture. Upcoming                                                                  S. Sadiq. Sharkdb: An in-memory column-oriented
tasks include the further analysis of trajectory compression                                                               trajectory storage. In Proceedings of the 23rd ACM
techniques for columnar databases, the implementation of                                                                   international conference on conference on information
workload-aware metrics to select compression techniques for                                                                and knowledge management, pages 1409–1418. ACM,
chunks, assessing the impact of di↵erent configurations and                                                                2014.
an extensive evaluation with real-world systems.                                                                      [13] D. Zhang, M. Ding, D. Yang, Y. Liu, J. Fan, and
                                                                                                                           H. T. Shen. Trajectory simplification: an experimental
6.      REFERENCES                                                                                                         study and quality analysis. Proceedings of the VLDB
 [1] Reducing the footprint of main memory htap systems:                                                                   Endowment, 11(9):934–946, 2018.
     Removing, compressing, tiering, and ignoring data. In                                                            [14] Z. Zhang, C. Jin, J. Mao, X. Yang, and A. Zhou.
     PhD Workshop at VLDB, volume 2175 of CEUR                                                                             Trajspark: A scalable and efficient in-memory
     Workshop Proceedings. CEUR-WS.org, 8 2018.                                                                            management system for big trajectory data. In
 [2] M. Boissier and M. Jendruk. Workload-driven and                                                                       Asia-Pacific Web (APWeb) and Web-Age Information
     robust selection of compression schemes for column                                                                    Management (WAIM) Joint Conference on Web and
     stores. In 22nd International Conference on Extending                                                                 Big Data, pages 11–26. Springer, 2017.
     Database Technology (EDBT), pages 674–677, 3 2019.