Processing Moving Object Data Streams with Data Stream Management Systems Tobias Brandt Supervised by Marco Grawunder University of Oldenburg, Germany tobias.leo.brandt@uol.de ABSTRACT programming language and simplify the connection to typ- With the wide spread of cheap and mobile GPS sensors ical data sources. Maintenance of queries on data streams as well as mobile data connections, live streams from mov- is made simple due to the ease to change and update the ing objects are becoming a huge data source. The services query text. Hence, queries can be adapted to new require- based on these data streams, for example, for connected ments quickly. These features make them a useful tool to cars, vessels or smartphone users, need real-time results for easily create and change queries for many different use cases queries based on the current or even near-future positions and are therefore a good choice for rapid prototyping sys- of the moving objects. Spatio-temporal data from moving tems in the field of data stream processing and analysis. objects cannot just be treated as a crowd of points with To cope with the requirements of data streams, DSMSs timestamps, but must be seen as points in a trajectory with support continuous queries and use a data-driven approach. non-measured points in between. In this paper I present New results are incrementally calculated when new data ar- my work on the management of such real-time trajectories rives at the system. This increases the demand for quick within a Data Stream Management System (DSMS) to en- calculations, wherefore data is typically kept in-memory. able simple, flexible and efficient in-memory moving object Unfortunately, a potentially infinite data stream cannot be query processing. hold in-memory. A typical solution DSMSs provide are win- dows. These reduce the amount of data hold in-memory to a smaller part of the data streams, e. g., all elements from 1. INTRODUCTION the last hour or the last 100 elements. Moving objects in the real world are everywhere and have Even though DSMSs already tackle lots of the challenges been there for a while: pedestrians and cars on the streets, that occur with data stream processing, they lack features vessels on the oceans and airplanes in the sky. Comparably necessary to work with moving objects data. Two of these new is the huge amount of data these objects are produc- features are (1) continuous location interpolation and near- ing. Many of these send their location regularly to a central future prediction and (2) fast moving objects index struc- server or other facility, may it be the smartphone user with a tures for windows with high fluctuation. In this work, I Location-based Service (LBS) or a vessel with an Automatic concentrate on point data, hence, moving regions are not in Identification System (AIS) sender. the scope. That is because most moving object data of inter- This data can be used to answer questions and solve real- est today, such as vessels and pedestrians, can be simplified world problems. For example, vessels can be warned about to point objects without loosing too much precision in the congested or currently dangerous areas based on their own queries. Moving or evolving regions in contrast introduce a position as well as the positions of other vessels. As more whole new palette of challenges. data can be shared via live-streams and as the results of such One important feature of moving objects data streams is queries are required with minimal delay, traditional systems that the objects move continuously but are only measured that first store and then query the data streams are not once in a while. Therefore, the objects have unknown lo- ideal. They typically run short-term queries on static data cations in between the measurements, which can and some- sets that are stored on the hard drive. times need to be used for querying. Imagine a query where Data Stream Management Systems (DSMSs) especially a vessel needs to know all vessels around it. Another vessel, target data streams. They offer solutions for many data which last known location is (temporal and spatial) far away stream related challenges, provide query languages to define but will probably be within that range on querytime, should queries without the need to write code in a general purpose be included in the answer, hence, its location needs to be automatically predicted to the future by the query. This sce- nario is particularly important for satellite AIS where it is normal to have hours between location updates [3]. Stream- ing data differs from static time-series data: in static data, all measured locations of a moving object are known. In con- trast, in a streaming environment, it is not possible to know if and when the next location update of a moving object will Proceedings of the VLDB 2017 PhD Workshop, August 28, 2017. Munich, Germany. arrive. Copyright (c) 2017 for this paper by its authors. Copying permitted for Challenges with this approach are that interpolation and private and academic purposes. prediction depends on the use case and always comes with an uncertainty. When new data about an interpolated object is 12:00 12:00 available, old query results may need to be updated. When 11:55 2 and how to update results by more precise ones is a non- 1 11:58 trivial question within a DSMS. To my best knowledge, 12:00 there is no work that tackles these challenges in the field of X data streams for moving objects. 11:58 The second feature mentioned above are moving objects 3 12:05 windows and suitable index structures. As not all data can be stored, it needs to be decided which data is still needed for processing and which is not and how and when old data can be wiped. Typical window concepts such as time-based windows can be extended by windows especially for moving objects. A possible window type could be a distance-based CC BY-SA 2.0, Editors of http://map.openseamap.org/ window. It would store all data within a certain distance of a single moving object, e. g., the last kilometer of every ob- Figure 1: Range query for the orange vessel. ject. The requirements for the underlying index structures differ both from pure temporal as well as pure spatial index structures. the middle sent its last location at 12 o’clock and wants to 2. OBJECTIVES AND CHALLENGES know which vessels are in its range at that point in time. The overall goal of this work is to create and implement The trajectory of the other vessels one to three are visual- concepts to allow DSMSs to process spatio-temporal data ized with the arrows and circles. The circles are the measure- from moving objects. To reach this goal, I am tackling the ments where the correct location of the vessel is known. The challenges of including location interpolation and prediction arrows in between visualize a simple interpolation. It is as- as well as window concepts for moving object data streams. sumed that the path between the measurements is straight. That is not necessarily the case but it is a reasonable and 2.1 Objectives simple approximation. The dashed lines with the striped circles are predictions of the future trajectory. • Location Interpolation Moving objects naturally The need to interpolate and predict locations arises from move continuously but are only measured once in a the fact that the moving objects do not measure and send while. Within a data stream, they seem to be hopping their location synchronized with each other, but at different from point to point, resulting in delayed and possi- time intervals. In the figure, the last known location of bly wrong results. I aim to introduce location inter- the orange vessel in the middle was captured at 12 o’clock. polation into the data stream processing to that the For this point in time, the locations of all other vessels are objects move continuously in time and space. The in- needed to answer the query correctly. Unfortunately, the terpolation should also be used for short-term location known, i. e., measured, locations of the other vessels are not prediction. at 12 o’clock, but slightly before or after this point in time. If only the last known location would be used to answer the • Result Uncertainty As prediction always comes with query, the result would be wrong: Vessel 1 would not be a degree of uncertainty, the results are uncertain as within the result but should be, Vessel 2 would be within well and can change when new data from a moving the result but should not and the location of Vessel 3 would objects arrives. The accuracy or uncertainty of a re- be wrong. sult should be transparent to the user and updates of Hence, interpolation and prediction is necessary to an- results should be possible. swer the query approximately correct. When doing so, a few challenges arise. The interpolation has to work in an in- • Moving Object Windows Window definitions that cremental manner with limited knowledge about the data, are especially useful for moving objects data should be as not all past data can be stored in a streaming environ- introduced. ment. The accuracy of a query result needs to be known to • Spatio-temporal Indexes The data within the win- the user or further processing steps. When the prediction dows need to be hold in spatio-temporal index struc- was wrong, old results for a query may need to be updated tures optimized for high data fluctuation. (e. g., if Vessel 1 takes a different path than predicted). In a streaming scenario, the approximate result may already be The concepts created to solve the goals should be eval- used for further processing or a following result, for exam- uated by implementing them into an open source DSMS. ple, for 12:03, has already been processed. The questions Scenarios with AIS data from vessels should show that the on how to integrate uncertain results and updates to them concepts work with real-world data and queries. within a DSMS need to be tackled. 2.2 Challenges for Location Interpolation 2.3 Challenges for Moving Object Windows Typical spatio-temporal queries include neighborhood and Windows are necessary to reduce the infinite data stream range queries. An example query could be “Continuously re- to a finite set of data. The data within a window can be port all vessels within a range of 10 km around vessel X”. kept in-memory and never needs to be permanently written Such a query is depicted in Figure 1. The orange vessel in to a hard drive. Next to the performance improvements, the concept of windows has more benefits. They are based implement the spatio-temporal operations. While doing so, on the assumption that in many cases, queries are only or they have to behave like normal operators to the outside so mostly interested in the current data and not in the data that other operators can seamlessly use the output. Two from the distant past. To define a window according to the example operators that are needed are a range and a k - use case, the domain needs to be known. A typical window nearest neighbors (k NN) operator. Both search for other could, for example, hold all data from the last hour. moving objects close to a certain object. The external be- In the domain of moving objects, the requirements for havior of these operators is similar to other operators. They windows can differ depending on the scenario. The defi- receive stream elements, process them and send their results nition of a window only by the time and not by the space as stream elements to the next operator. dimension is often not enough. Imagine, for example, a win- Internally, these operators need to use location interpola- dow where the speed of all objects within a data stream is tion and prediction to compute correct results and annotate diverse and variable (e. g., slow vessels and fast planes within these results with the level of (un)certainty in the meta data one stream). It could be necessary to have from each ob- of a streaming object. The interpolation should be done by ject at least the last kilometer within the window. With a a framework within the DSMS that allows to interchange time-based window this would be difficult to achieve. Addi- algorithms, as the interpolation algorithm can change from tionally, compression could be introduced to moving objects case to case. windows. Patroumpas et al. [7] show that trajectory data can be compressed without loosing much accuracy. Hence, 3.2 Current and Future Work new window concepts for moving objects are useful. Currently, prototype versions for moving object range and Windows reduce the amount of the required (in-memory) k NN queries are available. A prototype implementation storage space. Nevertheless, it opens up new questions about within a DSMS was developed. For spatial querying with how to clean up the memory, for example, when to delete moving object windows, an index structure based on Geo- old data. This question gets more complicated as window Hashes [6] was developed. It showed better performance indexes have to be shared between multiple queries. Imag- than an implementation based on QuadTrees. This could ine a data stream of all vessels on the North Sea. As spatial be due to better insertion and deletion performance with queries can be heavily accelerated when using an index, the the GeoHash implementation. However, these results are data in the windows are indexed. Thereby, due to memory very preliminary, as the test setup needs to be better de- limitations, creating multiple nearly identical indexes must scribed and results need to be analyzed further to find out be avoided. Subsequently, one index is shared between mul- the reasons for the differences. tiple queries. Nevertheless, such a sharing makes the de- The correct integration of those queries as well as moving cision when to delete old data more complicated, as the object windows is ongoing work. The results are currently window requirements from the queries can be different. only partly usable for other query operators. Interpolating Additionally, not every index structure is suitable for a and predicting locations are in the concept phase. Develop- spatio-temporal data stream index. In contrast to more tra- ment and implementation of these will be a major part of ditional Geographic Information System (GIS) applications, the future work. the fluctuation in the data is very high. New data needs to be inserted and at the same time old data needs to be re- 3.3 Planned Evaluation moved on a high frequency. It is possible that the whole The concepts that are created in this PhD project will set of data within a window can be swapped within minutes be implemented into the open source DSMS Odysseus 1 [1]. or even seconds, which, for example, distinguishes the re- Odysseus offers a rich set of operators and a query language. quirements from static time-series data. Traditional index- For the purpose of this work it already supports protocols structures that require heavy reorganization when data gets used in the maritime domain such as AIS and can talk to changed are probably not suitable for this environment. In common data sources such as RabbitMQ out of the box. In this work it needs to be evaluated if index structures for contrast to streaming frameworks such as Apache Flink2 or this purpose are useful as it is only an improvement if the Heron3 , it is not necessary to program in a general program- indexing needs less time than it saves while querying the ming language such as Java to create new queries. data. With that implementation, the feasibility of the concepts In this PhD project I aim to create suitable window con- will be evaluated. Using an iterative approach, the concepts cepts, implement them and choosing an efficient index struc- can be adjusted if uncovered challenges occur while evaluat- ture for this very dynamic environment. ing. The implementation will be used with scenarios in the maritime context, especially with AIS data. An example 3. RESEARCH PLAN query could be to continuously query if a vessel is heading In this section, the main approaches to overcome the chal- to an area that will be congested during its transit. lenges from above are described. For the given scenarios with moving objects, timely query results are necessary, wherefore the performance of the solu- 3.1 Approach tions will be measured. The latency and throughput of the Query processing on data streams is typically done with queries will be used to compare different implementations, operator graphs. The elements of the data stream are send e. g., different approaches for spatio-temporal indexes. from one operator to the next, each operator doing a specific 1 http://odysseus.offis.uni-oldenburg.de/, last ac- task. Joins, projections and selections are typical examples cessed on 03/21/2017 for such operators. 2 https://flink.apache.org/, last accessed on 05/24/2017 When adding support for moving objects data, this mod- 3 https://twitter.github.io/heron/, last accessed on ular architecture should be exploited. New operators can 05/24/2017 4. RELATED WORK 5. CONCLUSION General purpose and open source streaming systems such This paper describes the motivation, challenges and ap- as Apache Flink and Apache Storm4 as well as commer- proaches of processing data from moving objects in DSMSs. cial systems such as IBM InfoSphere Streams5 (short: IBM A major challenge are asynchronous updates of locations of Streams) offer high performance and distributed stream pro- multiple moving objects. To serve timely query results, e. g., cessing, but have only limited support for moving objects. for a range query, locations of objects need to be interpo- While spatio-temporal data is supported in some systems lated and predicted. The integration of interpolated values (e. g., with IBM Streams [2]), location interpolation and into a DSMS provides some challenges that are tackled with moving object windows are, as of my knowledge, not. this PhD project. First approaches to solve these are ex- Zhang et al. [9] use Apache Storm to process fast data plained. The planned evaluation uses AIS data from vessels streams from moving objects. They focus on a distributed for continuous queries. spatial index which speeds up range and k NN queries as well as spatial joins. One main difference to this work is 6. ACKNOWLEDGMENTS that they do not interpolate and predict locations to have We thank the Ministry of Science and Culture of Lower temporal correct results. Saxony, Germany for supporting us with the graduate school The open source project GeoMesa6 works with spatio- Safe Automation of Maritime Systems (SAMS). temporal data, e. g., from moving objects [6]. The project develops indexes based on space-filling curves. These allow quick access to spatial or spatio-temporal data within sorted 7. REFERENCES [1] H.-J. Appelrath, D. Geesen, M. Grawunder, key-value stores such as Apache Accumolo7 . GeoMesa does T. Michelsen, and D. Nicklas. Odysseus: A highly not specifically address streaming data, data stream man- customizable framework for creating efficient event agement capabilities or location interpolation. stream management systems. In Proceedings of the 6th The RxSpatial library [8] is an extension for the Microsoft ACM International Conference on Distributed SQL Server Spatial Library and adds support for moving Event-Based Systems, DEBS ’12, pages 367–368, New objects. It adds the RUM-tree, which is an extension of the York, NY, USA, 2012. ACM. R-tree for frequent updates. Additionally, RxSpatial allows [2] A. Biem, E. Bouillet, H. Feng, A. Ranganathan, continuous spatial queries, e. g., to observe if a moving object A. Riabov, O. Verscheure, H. Koutsopoulos, and is close to another. As of my best knowledge, the library C. Moran. Ibm infosphere streams for scalable, does not take into account the time of the updates but uses real-time, intelligent transportation services. In the newest updates of every moving object. Interpolation Proceedings of the 2010 ACM SIGMOD International and prediction are not used. Conference on Management of Data, SIGMOD ’10, Interpolation for moving objects is a difficult challenge for pages 1093–1104, New York, NY, USA, 2010. ACM. moving regions (e. g., described by Heinz et al. [5]). This is [3] M. A. Cervera, A. Ginesi, and K. Eckstein. especially complex as these regions can change their shape Satellite-based vessel automatic identification system: over time. In this work I want to concentrate on moving A feasibility and performance analysis. International points, which is a way simpler version of that problem. Nev- Journal of Satellite Communications and Networking, ertheless, the perfect interpolation method is not the goal of 29(2):117–142, 2011. this work but the integration of interpolation and prediction [4] R. H. Güting, T. Behr, and C. Düntgen. Secondo: A into the stream processing of moving objects data. platform for moving objects database research and for Secondo [4] is a database system especially for moving publishing and integrating research implementations. objects. It has a spatial and temporal algebra with which Fernuniv., Fak. für Mathematik u. Informatik, 2010. queries for moving objects can be formulated. As it is a database, it is not optimized for data streams, e. g., it does [5] F. Heinz and R. H. Güting. Robust high-quality not support windows, does not run mainly in-memory and interpolation of regions to moving regions. hence does not have to solve the problem of cleaning up old Geoinformatica, 20(3):385–413, July 2016. data. Nevertheless, it gives useful insights on the handling [6] H. V. Le. Distributed moving objects database based of moving objects data. on key-value stores. In Proceedings of the VLDB 2016 Patroumpas et al. [7] use AIS data in a streaming envi- PhD Workshop co-located with the 42nd International ronment to detect complex events such as unexpected stops Conference on Very Large Databases (VLDB 2016), of vessels. They compress the data to important points in New Delhi, India, September 9, 2016., 2016. the trajectory of the vessels without loosing much accuracy. [7] K. Patroumpas, E. Alevizos, A. Artikis, M. Vodas, They also address errors in the AIS data by removing wrong N. Pelekis, and Y. Theodoridis. Online event measurements. In contrast to this work, they do not use lo- recognition from moving vessel trajectories. cation prediction for vessels that did not send an update for GeoInformatica, 21(2):389–427, 2017. a while, which is for example the case for satellite AIS. [8] Y. Shi, A. M. Hendawi, H. Fattah, and M. Ali. Rxspatial: Reactive spatial library for real-time location tracking and processing. In Proceedings of the 4 2016 International Conference on Management of https://storm.apache.org/, last accessed on 03/21/2017 Data, pages 2165–2168. ACM, 2016. 5 https://www.ibm.com/analytics/us/en/technology/ stream-computing/, last accessed on 03/21/2017 [9] F. Zhang, Y. Zheng, D. Xu, Z. Du, Y. Wang, R. Liu, 6 http://www.geomesa.org, last accessed on 03/17/2017 and X. Ye. Real-time spatial queries for moving objects 7 https://accumulo.apache.org/, last accessed on using storm topology. ISPRS International Journal of 03/21/2017 Geo-Information, 5(10), 2016.