=Paper= {{Paper |id=Vol-1671/paper4 |storemode=property |title=Distributed Moving Objects Database Based on Key-Value Stores |pdfUrl=https://ceur-ws.org/Vol-1671/paper4.pdf |volume=Vol-1671 |authors=Hong Van Le |dblpUrl=https://dblp.org/rec/conf/vldb/Le16 }} ==Distributed Moving Objects Database Based on Key-Value Stores== https://ceur-ws.org/Vol-1671/paper4.pdf
              Distributed Moving Objects Database Based on
                            Key–Value Stores

                                                                Hong Van Le
                                    Supervised by Atsuhiro Takasu, takasu@nii.ac.jp
                              SOKENDAI (The Graduate University for Advanced Studies), Japan
                                                              l-van@nii.ac.jp

Proceedings of the VLDB 2016 PhD Workshop, September 9, 2016, New Delhi, India. Copyright (c) 2016 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT

Moving objects databases (MOD) have been studied intensively and extensively in the database field. Recently, the emerging Big Data trend, which refers to collections of large, complex, and rapidly growing geographical data collected from sensors and GPS-enabled devices, has posed new requirements for MOD: the ability to manage massive volumes of data, support for low-latency spatio-temporal queries, and the need for a high insertion throughput rate. Although key-value stores handle large-scale data efficiently, they are not equipped with effective functions for supporting spatio-temporal data. In this project, we aim to build a distributed MOD that fulfills all these requirements. We focus on the design of an efficient distributed spatio-temporal index that can support indexing and querying moving object data in key-value stores.

1. INTRODUCTION

Recent rapid improvements in positioning technologies such as satellites, the Global Positioning System (GPS), sensors, and wireless networks have resulted in an explosion of data generated by moving objects. For instance, traffic management systems in busy cities such as Tokyo have been collecting large numbers of location updates from probe cars such as taxis, buses, and GPS-enabled devices at a rate of multiple updates per minute from each vehicle. The collected data often include location data (longitude and latitude) and time data (a timestamp).

Many systems, in both scientific research and daily life, have taken advantage of the data collected from moving objects. For example, intelligent transportation systems are exploiting the massive volume of sensor data from probe cars and GPS-enabled devices to provide efficient route planning and traffic balancing based on the current traffic situation. People can take advantage of these systems by continuously sending data about their current location and time, then receiving relevant real-time analyses of their traffic situation. Such systems require a MOD that can store massive volumes of data, support high insertion throughputs, and respond to queries in real time.

There are three main challenges for a MOD to be useful for these systems. The Volume Challenge: the database should have high scalability, fault tolerance, and availability while dealing with large volumes of collected data. The High Computational Complexity Challenge: many queries involve geometric computations, such as very expensive logical operations on spatial relationships, that are continuously changing. The Real-Time Processing Challenge: because many agents keep registering their location updates continuously, the database must have high insertion throughput to handle the volume of data. It must also guarantee satisfactory performance on queries, but as the dataset becomes bigger, query time can increase dramatically.

Some systems have been proposed, such as SpatialHadoop [6] and Hadoop-GIS [1], that are based on MapReduce, a parallel, distributed, and scalable framework for processing large volumes of data. These systems support high-performance spatial queries, but so far have not provided support for temporal constraints with spatial data. Moreover, such systems are suited to batch processing; they still have a high latency compared with the requirements of a real-time system. Key-value stores (KVS) such as HBase (https://hbase.apache.org), with their scalability, fault tolerance, availability, and random real-time read/write capability, have shown promise. However, they do not have native support for spatial and spatio-temporal queries.

Spatial support has been extended to KVS. For instance, MD-HBase [11] layers a multidimensional index over the KVS by using Z-order and multidimensional index structures. However, it does not support spatial queries concerned with evolution, in which spatial relationships change over time. GeoMesa [7] introduces a spatio-temporal index structure based on Geohash (http://geohash.org) on top of Apache Accumulo (http://accumulo.apache.org). It interleaves Geohash and timestamps into the index design and achieves acceptable results when storing data using 35-bit Geohash strings. Nonetheless, the performance of its proposed spatio-temporal index depends significantly on the number of Geohash characters in the rowkey and on the resolution level; hence, further study of their impact on queries and datasets is required.

The goal of our work is to build a complete MOD based on KVS that overcomes the above challenges. We first proposed a novel two-tier index structure to handle spatial data in HBase. Our experimental results indicate that queries using the proposed index outperformed queries with other indexes and queries in a MapReduce-based system. We then indexed moving objects data by presenting a lightweight spatio-temporal index structure based on STCode [9], an encoding algorithm that treats time in the same way as spatial information. However, time is not just another dimension but a special one beside the two spatial dimensions, so it is necessary to explore new methods for incorporating temporal information into spatial indexes. We discuss this further in Section 3 and briefly describe our research plan in Section 4.

2. KEY-VALUE STORES

In our project, we build the database on the architecture of KVS that are modeled after Google's BigTable [3], such as Cassandra, Accumulo, and HBase. These KVS share some important characteristics inherited from BigTable. In this section, we describe these characteristics for HBase, as we use it in our experiments.

HBase is a distributed, scalable database that takes advantage of HDFS. Tables in HBase comprise rows and columns like those in other databases, but they can scale to large numbers of rows and columns. To store such large tables in a distributed cluster, tables are split into a number of smaller chunks called regions, which are stored in servers called RegionServers.

HBase is also a KVS: within a table, data are stored as a sorted list of key-value pairs organized into rows that are uniquely identified by their rowkeys, which therefore play an important role when searching or scanning. Rowkey design is one of the most important aspects of schema design in HBase, as in other KVS.

The physical data model of HBase is column-oriented. Rows in HBase are composed of columns, and columns are grouped into column families. Columns in a family are stored together in a low-level storage file called an HFile. The column family forms the basic unit of physical storage to which HBase features such as compression can be applied. Hence, proper design of column families is also essential when storing and processing data on HBase.
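As a minimal illustration of this data model, the sketch below writes and reads one row through the HBase Java client. It uses the HBase 1.x client API, which differs slightly from the 0.98 API used in our experiments; the table name, family name, and rowkey are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class KvsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("objects"))) {
                // The rowkey uniquely identifies the row; "loc" is a column family.
                Put put = new Put(Bytes.toBytes("wx4g0b"));
                put.addColumn(Bytes.toBytes("loc"), Bytes.toBytes("lat"), Bytes.toBytes("39.92"));
                put.addColumn(Bytes.toBytes("loc"), Bytes.toBytes("lon"), Bytes.toBytes("116.38"));
                table.put(put);
                // Random real-time read by rowkey.
                Result r = table.get(new Get(Bytes.toBytes("wx4g0b")));
                System.out.println(Bytes.toString(
                    r.getValue(Bytes.toBytes("loc"), Bytes.toBytes("lat"))));
            }
        }
    }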
3. COMPLETED AND ON-GOING WORK

3.1 Spatial Index

3.1.1 Two-tier index structure

Data stored in HBase are accessed by a single rowkey. However, spatial data are represented by two coordinates (longitude and latitude), which are equally important in defining a location. Geohash provides a solution for transforming longitude/latitude coordinates into unique codes. Figure 1 shows how to generate a Geohash code for a spatial point. Geohash recursively performs binary partitioning to divide each coordinate range into two equal parts, assigning bit 0 if the location is in the left part and bit 1 if it is in the right part. It then interleaves the bits from the two dimensions and uses base 32 to encode the bit sequence into a hash code.

Figure 1: Geohash generation
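The sketch below implements this encoding procedure. It is a standard Geohash encoder written for illustration, not our production code; the alphabet shown is the standard Geohash base-32 alphabet.

    /** Minimal Geohash encoder following the procedure of Figure 1 (sketch). */
    public class Geohash {
        private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

        public static String encode(double lat, double lon, int precision) {
            double[] latRange = {-90.0, 90.0};
            double[] lonRange = {-180.0, 180.0};
            StringBuilder hash = new StringBuilder();
            boolean evenBit = true; // even bit positions encode longitude
            int bit = 0, ch = 0;
            while (hash.length() < precision) {
                double[] range = evenBit ? lonRange : latRange;
                double value = evenBit ? lon : lat;
                double mid = (range[0] + range[1]) / 2;
                ch <<= 1;
                if (value >= mid) { ch |= 1; range[0] = mid; } // right half -> bit 1
                else { range[1] = mid; }                        // left half  -> bit 0
                evenBit = !evenBit;
                if (++bit == 5) { // every 5 bits become one base-32 character
                    hash.append(BASE32.charAt(ch));
                    bit = 0; ch = 0;
                }
            }
            return hash.toString();
        }
    }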
There are several advantages to using Geohash for storing spatial data in a KVS. Points that share the same Geohash prefix are close to each other geographically. If we store objects in lexicographical order of their rowkeys in the KVS, objects that are close to each other spatially will also be close to each other in the database, and we can scan the relevant objects efficiently by using prefix filters.
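For example, a scan restricted to a single Geohash cell can be expressed with a prefix filter in the HBase client, as in the following sketch (the prefix "wx4g" and the table handle are hypothetical; PrefixFilter terminates the scan once the rowkeys have passed the prefix):

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.PrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GeohashScan {
        /** Scan all rows whose Geohash rowkey starts with the given prefix. */
        public static void scanCell(Table table, String prefix) throws IOException {
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes(prefix));                  // jump to the cell
            scan.setFilter(new PrefixFilter(Bytes.toBytes(prefix)));  // stop after it
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    // each returned row lies inside the Geohash cell "prefix"
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }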
However, using Geohash alone does not guarantee efficient spatial query processing. For instance, a prefix filter may give insufficient results when finding the k nearest neighbors of a point that is near the border of the prefix's rectangle. Geohash uses Z-order to order the rectangles, as shown in Figure 2, so when the system scans all points ranging from the rectangle with Geohash 1 to the rectangle with Geohash 4, it will also scan the unrelated rectangles 2 and 3. To obtain accurate query results and prune unnecessary scanning, we utilized the R-Tree, a multidimensional index structure for geographical data, as the secondary index tier.

Figure 2: Z-order of 1-character Geohash codes

To bridge the two tiers, Geohash and R-Tree, we propose a novel data structure, the binary Geohash rectangle-partition tree (BGRP Tree) [10]. We first partition regions into rectangles using their longest common prefixes. Because overlap between these rectangles would lead to redundant scans if we inserted them directly into the R-Tree, we use the BGRP Tree to further partition the rectangles into subrectangles until there is no overlap between them. Finally, we insert all subrectangles into the R-Tree. When processing a spatial query, the system finds the rectangles in the R-Tree that may contain query results. The scan is then conducted only on the rectangles found, allowing us to prune the scanning of unrelated regions.
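The sketch below illustrates this second tier. The STRtree implementation from the JTS library stands in for the R-Tree, and the class and method names are illustrative rather than those of our actual implementation: each non-overlapping subrectangle produced by the BGRP Tree is registered with the rowkey prefix that covers it, and a range query returns only the prefixes worth scanning.

    import java.util.List;
    import org.locationtech.jts.geom.Envelope;
    import org.locationtech.jts.index.strtree.STRtree;

    /** Sketch of the secondary index tier over non-overlapping subrectangles. */
    public class SecondTier {
        private final STRtree rtree = new STRtree();

        /** Register a subrectangle with the Geohash prefix that covers it.
         *  The STRtree is built lazily on the first query, so all
         *  subrectangles must be inserted before querying. */
        public void addSubrectangle(Envelope rect, String geohashPrefix) {
            rtree.insert(rect, geohashPrefix);
        }

        /** Return the prefixes whose rectangles intersect the query rectangle;
         *  only these need to be scanned in the KVS. */
        @SuppressWarnings("unchecked")
        public List<String> candidatePrefixes(Envelope query) {
            return (List<String>) rtree.query(query);
        }
    }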
3.1.2 Experimental results

We built a cluster of 64 nodes. Each node had a virtual core, 8 GB of memory, and a 64 GB hard drive, and ran CentOS 7.0 (64-bit). We set up one HMaster, 60 RegionServers, and a three-node ZooKeeper quorum using Apache HBase 0.98.7 with Apache Hadoop 2.4.1 and ZooKeeper 3.4.6. Replication was set to two. To conduct queries on SpatialHadoop, we installed SpatialHadoop v2.1, which shipped with Apache Hadoop 1.2.1, and configured one master and 64 slaves.

We evaluated insert performance using the Yahoo! Cloud Serving Benchmark, varying the number of workload clients from one to 64. Because real spatial data are often skewed, we chose a Zipfian distribution to generate the longitude and latitude of each input point. We observed that the system could sustain a peak throughput of 600 K inserts per second with eight workload clients generating data. However, when the number of clients was increased to 32 or 64, we observed a drop in performance. This was caused by the cost of splitting regions when a region becomes a hot spot. Moving objects data are naturally skewed, but as shown in Figure 2, Geohash divides the space into equal buckets, so prefixes in dense areas (e.g., u, s, w) are used more frequently than prefixes in sparse areas (e.g., 0, 1, 2, 3), and the regions of prefixes in dense areas can easily become hot spots.

Figure 3: kNN queries with the T-Drive dataset

We executed k-nearest-neighbor (kNN) queries using two real datasets, T-Drive [13] and OpenStreetMap (OSM, http://www.openstreetmap.org). Figure 3 shows the performance of kNN queries on the T-Drive dataset, which is more than 700 MB with 17,762,390 records. With both datasets, we observed that parallel kNN using the two-tier index outperformed all kNN queries that used Geohash alone. Note that our index design operates about 60 to 90 times faster than SpatialHadoop. This is because HBase does not require the startup, cleanup, shuffling, and sorting tasks of MapReduce. Another reason is that we store the kNN procedure on every RegionServer beforehand, so we need only invoke that procedure locally on each server. In contrast, MapReduce sends a procedure to the slave servers for every query, requiring more time for network communication.

3.2 Spatio-Temporal Index

A spatial index is important for indexing moving objects, but we cannot apply it directly to create a moving-objects index because of the evolution of objects over time. In our current approach, temporal information is simply treated as an extra dimension on top of the spatial hash. For instance, STCode [9] is an algorithm that encodes spatial information (longitude, latitude) and temporal information into unique codes. Latitude is mapped to λ ∈ (0, 180) and longitude to ϕ ∈ (0, 360). Temporal information is represented in minutes within one year, m ∈ (0, 527,040); the minute values (527,040 = 366 × 24 × 60) cover the whole year, even in a leap year. STCode is generated by applying the rule used with Geohash to divide the range of each dimension into two equal parts and interleave the bits from the three dimensions, finally using base 64 instead of base 32 to encode the bit sequence. Figure 4 shows an example of generating a two-character STCode for a point.

Figure 4: STCode generation
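A sketch of this encoding is shown below. The dimension order and the base-64 alphabet are assumptions for illustration; the exact tables are defined in [9], and for use as a rowkey the alphabet must be in ascending byte order so that lexicographic ordering of codes matches spatio-temporal proximity.

    /** Sketch of STCode generation: binary-partition latitude, longitude, and
     *  minute-of-year in turn, interleave the three bit streams, and emit one
     *  base-64 character per 6 bits (alphabet assumed; see [9] for the exact table). */
    public class STCode {
        private static final String BASE64 =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_";

        public static String encode(double lat, double lon, int minuteOfYear, int precision) {
            double[][] ranges = {
                {0.0, 180.0},      // latitude, shifted by +90
                {0.0, 360.0},      // longitude, shifted by +180
                {0.0, 527040.0}    // minutes in a (leap) year
            };
            double[] vals = {lat + 90.0, lon + 180.0, (double) minuteOfYear};
            StringBuilder code = new StringBuilder();
            int bit = 0, ch = 0, dim = 0;
            while (code.length() < precision) {
                double mid = (ranges[dim][0] + ranges[dim][1]) / 2;
                ch <<= 1;
                if (vals[dim] >= mid) { ch |= 1; ranges[dim][0] = mid; }
                else { ranges[dim][1] = mid; }
                dim = (dim + 1) % 3;   // rotate through the three dimensions
                if (++bit == 6) {      // 6 bits -> one base-64 character
                    code.append(BASE64.charAt(ch));
                    bit = 0; ch = 0;
                }
            }
            return code.toString();
        }
    }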
Because a KVS keeps rowkeys in lexicographical order, we can use the STCode as the rowkey in HBase. Since STCode cannot encode the year, we obtain the same STCode for points at the same place and time but in different years. To handle this limitation, we stored the year as a column family in HBase. We chose a column family because it is the secondary information identifying a row beside the rowkey, and because it is the basic unit of physical storage in HBase. By using the year as a column family, we can distinguish data from different years and store or compress all data for a year in the same or nearby physical storage, thus improving performance when searching for nearby points in the database.
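A sketch of this schema using the HBase client API follows; the table name, qualifier, and example values are hypothetical.

    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class YearFamilySchema {
        /** Create a table keyed by STCode with one column family per year. */
        public static void createTable(Admin admin) throws Exception {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("st_points"));
            desc.addFamily(new HColumnDescriptor("2015"));
            desc.addFamily(new HColumnDescriptor("2016"));
            admin.createTable(desc);
        }

        /** Store a point: the rowkey is its STCode, the family its year. */
        public static void putPoint(Table table, String stCode, String year, String objectId)
                throws Exception {
            Put put = new Put(Bytes.toBytes(stCode));
            put.addColumn(Bytes.toBytes(year), Bytes.toBytes("oid"), Bytes.toBytes(objectId));
            table.put(put);
        }
    }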
4. RESEARCH PLAN

In this section, we describe the key future directions for meeting the requirements of a MOD and overcoming some limitations of our previous work, as discussed in Section 3.

Space-filling curves: Transforming two-dimensional spatial data into one-dimensional data is essential for storing moving objects data in a KVS. In previous work, we used Z-order because of its simplicity for both encoding and decoding, which allows it to support a high insertion throughput. However, the quality measures for space-filling curves in [8] indicate that Z-order performs worst. Therefore, we would like to study the performance of other space-filling curves, including the Hilbert order, βΩ order, and AR²W² order, in the context of indexing spatial data in KVS, and to analyze the relationship between the quality measures of the curves and query performance in databases.

Indexing current and near-future movement: Temporal information has a special feature: monotonicity. For example, if the evolution of an object is represented by a set of records (o_i, s_i, t_i), where s_i is the location of object o_i at timestamp t_i, then for any two consecutive records (o_i, s_i, t_i) and (o_i, s_{i+1}, t_{i+1}) of an object o_i, we always have t_{i+1} > t_i. Because of this feature, some moving objects data can be considered obsolete if their timestamps are older than a predefined threshold. To handle these obsolete entries, pack and purge operations could be employed to save index and storage space. Purging obsolete entries and reorganizing the index after purge operations must be considered when designing index structures. Maintaining temporal and spatial information separately affects purge operations less than it affects other queries, which often concern recent data. Therefore, instead of indexing temporal information as just another dimension, as STCode does, incorporating it as a special dimension is a promising and challenging approach in our research.
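As one example of machinery a KVS already offers for purging, HBase supports a per-column-family time-to-live, sketched below; whether such a coarse mechanism fits a spatio-temporal index, and how the index should be reorganized afterwards, is part of what we plan to study (the family name and threshold are hypothetical).

    import org.apache.hadoop.hbase.HColumnDescriptor;

    public class TtlPurge {
        /** Configure a column family so that HBase itself purges obsolete cells:
         *  cells older than the TTL are dropped at the next major compaction,
         *  which also reorganizes the remaining data on disk. */
        public static HColumnDescriptor locationFamily() {
            HColumnDescriptor family = new HColumnDescriptor("loc");
            family.setTimeToLive(30 * 24 * 60 * 60); // keep 30 days of updates (example)
            return family;
        }
    }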
Density-based partition: While Z-order, like some other space-filling curves, divides the surface into equal buckets, moving objects data are naturally skewed. This leads to several problems in handling spatial queries. First, as mentioned in Section 3.1, the load imbalance of data insertion into distributed storage can reduce the insertion throughput. Second, it also causes many false-positive scans when the scanned bucket is bigger than the query range. On the other hand, if we divide the query range into a large number of small buckets to scan, the increased I/O would cause significant performance deterioration. To overcome these problems, we should partition the space according to the density of the data, as sketched below.
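A minimal sketch of the idea, assuming a hypothetical per-prefix counter countPoints(), is given below: dense Geohash cells are recursively refined until every bucket holds at most a threshold number of points, so dense areas get small buckets and sparse areas large ones.

    import java.util.ArrayList;
    import java.util.List;

    /** Sketch of a density-based partition over Geohash cells. */
    public class DensityPartition {
        static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

        public static List<String> partition(String prefix, long threshold) {
            List<String> buckets = new ArrayList<>();
            if (countPoints(prefix) <= threshold || prefix.length() >= 12) {
                buckets.add(prefix); // sparse enough: keep one coarse bucket
            } else {
                for (char c : BASE32.toCharArray()) {
                    buckets.addAll(partition(prefix + c, threshold)); // refine dense cell
                }
            }
            return buckets;
        }

        /** Hypothetical accessor, e.g., a maintained counter per prefix. */
        static long countPoints(String prefix) { return 0L; }
    }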
5. RELATED WORK

Storing and querying geospatial data are supported by some NoSQL databases, such as document-oriented databases (CouchDB, MongoDB) and network-oriented databases (Neo4j); [4] provides an extensive survey of NoSQL geographic databases. Although MongoDB supports 2D and 2Dsphere indexes for spatial data, it deals only with point vectors, and its input and output are limited to GeoJSON. GeoCouch, the spatial extension of CouchDB, cannot filter on both spatial and nonspatial criteria in a query. The spatial library of Neo4j, Neo4j Spatial, supports many types of spatial data and almost fully supports spatial functions, but still requires several improvements to index and query temporal information together with spatial information.

With the development of large main memories, many researchers have proposed main-memory indexes to deal with the trade-off between high update rates and low-latency location-based queries. TwinGrid [12] and MOVIES [5] maintain two separate index structures: an update index and a query index. The update index stores all updates arriving during a time period; the query index is a read-only index for answering queries. TwinGrid periodically replaces the query index with the update index, while MOVIES accumulates the update index and rebuilds the query index from the accumulated update index. These systems can support high update rates but suffer from the staleness of query results. ToSS-it [2] uses Voronoi diagrams to index spatial data and a distribute-first, build-later approach to distribute data to servers. It provides high query throughput and scalability, but when data are skewed, it takes time to update the index.

6. CONCLUSION

This paper discusses the challenges of building a MOD that can support large volumes of data and provide both low-latency queries and high insertion throughput. We have proposed an index structure to support spatial data on KVS and observed improved query performance compared with other index structures and with a state-of-the-art framework based on MapReduce. We are in the early stages of integrating temporal information into spatial index structures to support moving objects data. We have described the problems left to solve and the approaches and plans for achieving the goal of this PhD project.

7. REFERENCES

[1] A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang, and J. Saltz. Hadoop-GIS: A high performance spatial data warehousing system over MapReduce. Proceedings of the VLDB Endowment, 6(11):1009-1020, 2013.
[2] A. Akdogan, C. Shahabi, and U. Demiryurek. ToSS-it: A cloud-based throwaway spatial index structure for dynamic location data. In Mobile Data Management (MDM), 2014 IEEE 15th International Conference on, volume 1, pages 249-258. IEEE, 2014.
[3] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):4, 2008.
[4] C. de Souza Baptista, C. E. S. Pires, D. F. B. Leite, and M. G. de Oliveira. NoSQL geographic databases: An overview. Geographical Information Systems: Trends and Technologies, 73, 2014.
[5] J. Dittrich, L. Blunschi, and M. A. V. Salles. MOVIES: Indexing moving objects by shooting index images. GeoInformatica, 15(4):727-767, 2011.
[6] A. Eldawy and M. F. Mokbel. A demonstration of SpatialHadoop: An efficient MapReduce framework for spatial data. Proceedings of the VLDB Endowment, 6(12):1230-1233, 2013.
[7] A. Fox, C. Eichelberger, J. Hughes, and S. Lyon. Spatio-temporal indexing in non-relational distributed databases. In Big Data, 2013 IEEE International Conference on, pages 291-299. IEEE, 2013.
[8] H. Haverkort and F. van Walderveen. Locality and bounding-box quality of two-dimensional space-filling curves. Computational Geometry, 43(2):131-147, 2010.
[9] J. Ježek and I. Kolingerová. STCode: The text encoding algorithm for latitude/longitude/time. In Connecting a Digital Europe Through Location and Place, pages 163-177. Springer, 2014.
[10] H. V. Le and A. Takasu. An efficient distributed index for geospatial databases. In Database and Expert Systems Applications: 26th International Conference, DEXA 2015, Valencia, Spain, September 1-4, 2015, Proceedings, volume 9261, page 28. Springer, 2015.
[11] S. Nishimura, S. Das, D. Agrawal, and A. El Abbadi. MD-HBase: A scalable multi-dimensional data infrastructure for location aware services. In Mobile Data Management (MDM), 2011 12th IEEE International Conference on, volume 1, pages 7-16. IEEE, 2011.
[12] D. Šidlauskas, K. A. Ross, C. S. Jensen, and S. Šaltenis. Thread-level parallel indexing of update intensive moving-object workloads. In Advances in Spatial and Temporal Databases, pages 186-204. Springer, 2011.
[13] J. Yuan, Y. Zheng, C. Zhang, W. Xie, X. Xie, G. Sun, and Y. Huang. T-Drive: Driving directions based on taxi trajectories. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 99-108. ACM, 2010.