=Paper=
{{Paper
|id=Vol-1812/JARCA16-paper-2
|storemode=property
|title=A Scalable Data Streaming Infrastructure for Smart Cities
|pdfUrl=https://ceur-ws.org/Vol-1812/JARCA16-paper-2.pdf
|volume=Vol-1812
|authors=Jesús Arias Fisteus,Luis Sánchez Fernández,Víctor Corcoba Magaña,Mario Muñoz Organero,Jorge Yago Fernández,Juan Antonio Álvarez-García
}}
==A Scalable Data Streaming Infrastructure for Smart Cities==
Jesús Arias Fisteus1, Luis Sánchez Fernández1, Víctor Córcoba Magaña1, Mario Muñoz Organero1, Jorge Yago Fernández2, Juan Antonio Álvarez-García2

1 Depto. de Ingeniería Telemática, Universidad Carlos III de Madrid, {jaf, luiss, vcorcoba, munozm}@it.uc3m.es
2 Depto. de Lenguajes y Sistemas Informáticos, Universidad de Sevilla, {jorgeyago, jaalvarez}@us.es

Abstract

Many of the services a smart city can provide to its citizens rely on the ability of its infrastructure to collect and process in real time vast amounts of continuous data produced by sensors deployed through the city. In this paper we present the server infrastructure we have designed in the context of the HERMES project to collect the data from sensors and aggregate it in streams for its use in services of the smart city.

1 Introduction

Many of the services a smart city can provide to its citizens rely on the ability of its infrastructure to collect and process in real time vast amounts of continuous data produced by sensors deployed through the city [PZCG14]. In this scenario, building an infrastructure that scales as the number of such sensors and their data rates increase is a challenging task. Grouping the data in streams is a common approach in this kind of scenario. A data stream can be defined as a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items [GO03]. Streams differ from stored data in several aspects: they cannot normally be stored in their entirety, and the order in which data is received cannot be controlled.

Real-time stream processing solutions are required to manage this kind of data stream. In fact, the generic platform for big data applications proposed in [VLM+13] assigns an important role to such a component. Building scalable stream processing solutions is far from trivial [CBB+03]. In this paper we propose a system for scalably managing streams of sensor data in the context of the HERMES (Healthy and Efficient Routes in Massive open-data basEd Smart cities) project [FAGAG+15], which aims at helping its users, citizens of a smart city, keep healthy habits. Other systems for health care in smart cities are reported in [SPC+14]. The main sources of data in HERMES are the citizens themselves, who contribute to the smart city by letting it track their physical activities, through activity bands or the SmartCitizen mobile application, and their driving, through the SmartDriver mobile application.

To get an idea of the amount of data this implies, let us focus on one of the applications. The SmartDriver application aims at reducing the stress levels and fuel consumption of its users, as well as improving traffic safety, by providing the user with real-time driving recommendations [CMMO15a, CMMO15b]. In order to do that, the application tracks its users while they drive and sends the data to the infrastructure as soon as it captures it, so that server-side services can perform real-time computations such as the detection of congested roads and stressful road sections. In return, the application receives useful feedback, e.g. static road information, a recommended driving speed and traffic alerts. In its current prototype, the application tracks the vehicle's movement as well as its driver's heart rate. It reports the vehicle's location every 10 seconds. In addition, it immediately reports abnormal situations such as high accelerations or decelerations, excessive speeds or abrupt increases in the driver's heart rate. More detailed data, such as second-by-second information about location, speed and heart rate, are buffered in order to reduce resource consumption, and sent to the infrastructure every time the driver completes a 500 meter road section.
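The reporting policy just described (location reports every 10 s, abnormal situations reported immediately, detailed data buffered per 500 m section) can be sketched as follows. This is an illustrative Python sketch of the policy only; the class, method and event names are our own, not taken from the actual SmartDriver code:

```python
# Sketch of the SmartDriver reporting policy described in the text:
# locations every 10 s, abnormal situations immediately, and detailed
# second-by-second samples buffered until a 500 m section is completed.
# All names are hypothetical; the real client is a mobile application.

SECTION_LENGTH_M = 500.0
LOCATION_PERIOD_S = 10.0

class ReportingPolicy:
    def __init__(self):
        self.buffer = []            # second-by-second samples
        self.section_meters = 0.0   # distance covered in current section
        self.last_location_report = -LOCATION_PERIOD_S

    def on_sample(self, t, meters_advanced, sample, abnormal=False):
        """Return the list of events to send to the infrastructure now."""
        events = []
        self.buffer.append(sample)
        self.section_meters += meters_advanced
        if abnormal:
            # Abnormal situations are reported immediately.
            events.append(("abnormal", sample))
        if t - self.last_location_report >= LOCATION_PERIOD_S:
            # Vehicle Location events are posted every 10 s.
            events.append(("vehicle_location", sample))
            self.last_location_report = t
        if self.section_meters >= SECTION_LENGTH_M:
            # Buffered detailed data is flushed once per 500 m section.
            events.append(("driving_section", list(self.buffer)))
            self.buffer.clear()
            self.section_meters = 0.0
        return events
```

Under this policy a driver always produces at least one event every 10 s, which is the basis of the load estimate below.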
Be- ing scalable stream processing solutions is far from triv- cause each driver produces at the least one data item ev- ial [CBB+ 03]. In this paper we propose a system for scal- ery 10s, only 10,000 drivers would suppose a load of more than 1,000 requests/s for the infrastructure that collects Copyright c by the paper’s authors. Copying permitted for private and the data. academic purposes. In: Zoe Falomir, Juan A. Ortega (eds.): Proceedings of JARCA 2016, The rest of the paper is organized as follows. Section 2 Almerı́a, Spain, June 2016, published at http://ceur-ws.org proposes a system architecture for the real time processing of streams in the context of a smart city. Section 3 presents HERMES servers that manage data persistence con- a case study for this architecture, based on the SmartDriver sume this stream in order to receive the data they have application of the HERMES project. Section 4 reports the to store. results of a performance of the system. Conclusions and future lines of work are presented in section 5. • Public stream: this stream is derived from the main stream. It is part of the public API HERMES provides to third-party applications. It transports aggregated, 2 The Data Streaming Infrastructure anonymized and semantically-annotated data that may The server-side infrastructure was developed on top of the be useful to those applications. Ztreamy middleware [AFFGSFFL14]. We have selected • Short-term location-based services: the streaming in- Ztreamy because of its flexibility, scalability and the sim- frastructure needs to perform some real-time compu- ple HTTP-based API it provides to data producers, which tations and keep some short-term data. For example, improves compatibility and simplifies the development of it needs to detect traffic incidents, retrieve the scores clients such as the SmartDriver mobile application. 
In of nearby drivers for the SmartDriver’s gamification addition, Ztreamy provides useful out-of-the-box features system, etc. This module serves the collectors and the such as stream aggregation, filtering and replication, as public stream server. In addition, it needs to use the well as a persistency subsystem that prevents the lose of long-term location-based services in order to get car- data items once they have been accepted by the infrastruc- tography data and speed recommendations based on ture, even in the case of temporal network disruptions or historical data. This information is needed as input failures of one or more components of the deployed sys- for some of the short-term services, and part of it is tem. As our experiments in [AFFGSFFL14] show, other also returned to the SmartDriver application. publish-subscribe systems for sensor data like DataTur- bine [FTS+ 09] would not provide the performance levels The other components of the architecture (mobile appli- we need in this scenario. The ZeroMQ middleware1 is cations, long-term storage and location-based services and more or less similar in terms of performance to Ztreamy, third-party applications) lay without the scope of this paper. but Ztreamy provides us with a much more convenient high Depending on the amount of simultaneous clients the level API and an HTTP-based interface. The more recent system needs to handle, this architecture can be deployed Apache Kafka [KNR+ 11] publish-subscribe system could on a single server or distributed across several ones. If dis- be an alternative to Ztreamy, but we have not yet studied ei- tributed, a good network link between them is advisable. ther its suitability for this scenario or its performance, and Ideally, all the servers should share the same local network leave it for future work. in order to reduce end-to-end delays and bandwidth limita- Figure 1 shows the system architecture we have de- tions. signed. 
It consists of the following main components:

• Data collectors: Ztreamy servers to which the SmartDriver and SmartCitizen mobile applications post their data through HTTP. These servers validate the data and orchestrate the interactions with other services needed to handle it. They are also responsible for responding to mobile applications with feedback data when required. Since most of the load of input data handling is supported by these data collectors, they are replicated behind an HTTP load balancer in order to increase the number of clients they are able to handle. We have chosen the well-known Nginx (https://www.nginx.com/, visited 2016-11-23) open-source HTTP server for this task.

• Main stream: data items received by the collectors are aggregated into the main stream, which is managed by a separate Ztreamy server.

• Storage stream: this stream filters out of the main stream the data items that do not need to be stored. The HERMES servers that manage data persistence consume this stream in order to receive the data they have to store.

• Public stream: this stream is derived from the main stream. It is part of the public API HERMES provides to third-party applications. It transports aggregated, anonymized and semantically-annotated data that may be useful to those applications.

• Short-term location-based services: the streaming infrastructure needs to perform some real-time computations and keep some short-term data. For example, it needs to detect traffic incidents, retrieve the scores of nearby drivers for SmartDriver's gamification system, etc. This module serves the collectors and the public stream server. In addition, it needs to use the long-term location-based services in order to get cartography data and speed recommendations based on historical data. This information is needed as input for some of the short-term services, and part of it is also returned to the SmartDriver application.

Figure 1: Data streaming infrastructure architecture.

The other components of the architecture (mobile applications, long-term storage and location-based services, and third-party applications) lie outside the scope of this paper.

Depending on the number of simultaneous clients the system needs to handle, this architecture can be deployed on a single server or distributed across several. If distributed, a good network link between the servers is advisable. Ideally, all the servers should share the same local network in order to reduce end-to-end delays and bandwidth limitations.

Additionally, because of the locality of the services the infrastructure provides, the system as a whole can easily be partitioned for different geographical areas, deploying a replica of the whole system for each area. This eases the scaling of the system as the number of users of its services grows.

3 The SmartDriver Case Study

In order to illustrate the internals of the system, let us focus on the SmartDriver mobile application. It tracks the driver and posts the following types of events:

• Vehicle Location: it contains a timestamp, the latitude and longitude where the vehicle is located, an estimation of the accuracy of that location, the instantaneous vehicle speed and the current driving score assigned to the driver by the gamification subsystem of the application. These events are posted every 10 s. They are used mainly for the real-time services.

• Driving Section: it contains more detailed information about a larger road section, including second-by-second location and speed, heart rate measurements and aggregated computations associated to this section (average and standard deviation of speed and heart rate, as well as statistics about speed variations). These events are posted for every 500 m the user drives. They are intended for storage, but can also be used in some real-time services.

• Abnormal situations: they are posted every time SmartDriver detects an abnormal situation (strong accelerations and decelerations, too high speeds, too high heart rates), immediately after its detection.

In order to reduce the load of the long-term location-based services with unnecessary requests due to stopped or very slow vehicles, collectors assume the type of road and speed limit did not change if the driver advanced less than 10 m since the last time they determined those values. The short-term services take similar measures to avoid some computations, such as retrieving or storing driver scores, in those situations.
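The 10 m advancement check above can be sketched with a great-circle distance between consecutive location reports. Note this is an assumption on our part: the paper does not specify how the collectors compute distances, so we use the standard haversine formula as an illustration:

```python
import math

ADVANCE_THRESHOLD_M = 10.0  # threshold described in the text

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def needs_road_lookup(prev, cur):
    """True when the driver advanced more than the 10 m threshold, so the
    road type and speed limit must be determined again."""
    if prev is None:
        return True  # first report from this driver
    return haversine_m(prev[0], prev[1], cur[0], cur[1]) > ADVANCE_THRESHOLD_M
```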
The current prototype of the short-term location-based services provides two main features:

• It tracks the latest location of each driver in order to detect the direction the driver follows on the road (both the current location and a previous location are needed), as well as to detect when the driver has advanced more than the 10 m threshold.

• It tracks the driving score and location of every driver in order to provide the nearby drivers' score service.

The first feature is implemented on top of a RAM-stored two-tier dictionary in which, every 30 s, the oldest dictionary is dropped and a new one created. This structure allows the system to keep just one location per driver and drop those drivers that have not contacted the service for more than 30 s.

The second feature is more complex because it requires performing spatial queries on a rectangle around the driver's current location. We have implemented it with a RAM-stored SQLite (https://www.sqlite.org/, visited 2016-06-01) database using an R-tree-based index. The system periodically drops data older than 1 hour because the gamification feature is based on recent data.

Because of their 10 s periodicity, the system uses the Vehicle Location posts to send feedback to the SmartDriver application. Collector servers are responsible for gathering the required information from the short-term and long-term location-based services and sending it back to the application in the body of their HTTP response. This feedback includes:

• Type of road and its speed limit (obtained from the long-term services).

• Recommended speed as computed by the speed recommendation service (obtained from the long-term services and possibly adapted to current road conditions by the short-term services).

• Traffic alerts in the vicinity (obtained from the short-term services).

• Driving scores assigned to nearby drivers by the gamification system (obtained from the short-term services).
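The RAM-stored two-tier dictionary used for the first feature can be sketched as follows. This is a minimal illustration with hypothetical names, reconstructed only from the two-sentence description above; the actual implementation may differ:

```python
import time

ROTATION_PERIOD_S = 30.0  # the 30 s rotation described in the text

class TwoTierLocationStore:
    """Keeps the latest location per driver and forgets drivers that have
    not contacted the service for more than one rotation period."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.current = {}    # locations received since the last rotation
        self.previous = {}   # locations from the previous rotation period
        self.last_rotation = clock()

    def _maybe_rotate(self):
        now = self.clock()
        while now - self.last_rotation >= ROTATION_PERIOD_S:
            # Drop the oldest tier; drivers present only in it are forgotten.
            self.previous = self.current
            self.current = {}
            self.last_rotation += ROTATION_PERIOD_S

    def update(self, driver_id, location):
        self._maybe_rotate()
        self.current[driver_id] = location

    def latest(self, driver_id):
        """Latest known location, or None if not seen for more than ~30 s."""
        self._maybe_rotate()
        if driver_id in self.current:
            return self.current[driver_id]
        return self.previous.get(driver_id)
```

Rotating whole dictionaries makes expiration O(1) per period instead of scanning per-driver timestamps, which fits the single-location-per-driver requirement stated above.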
4 Evaluation

The current prototype of the streaming server infrastructure was subjected to experiments with varied amounts of load in order to evaluate its performance. Because of the infeasibility of recruiting enough volunteers to simultaneously use the application up to the loads the system is able to handle, we developed a simulator that produces a synthetic load.

4.1 The Simulator

The simulator was designed to produce data and send it to the infrastructure in a way that, from the point of view of measuring performance, is equivalent to having a given number of actual users, all of them using the SmartDriver application and simultaneously driving a number of different paths in the same city. The following parameters can be configured in the simulator before starting a simulation:

• Number of simulated drivers: since each driver generates at least one Vehicle Location event every 10 s, the minimum number of requests per second the system needs to handle is r_min = n/10, where n is the number of drivers. Driving Section events make actual rates slightly higher, especially when drivers reach higher speeds. In order to introduce variability in the system, each driver is assigned some random parameters that influence her driving behavior (e.g. her inclination to drive fast or slow with respect to speed limits). In addition, not all drivers start at the same time: each driver starts randomly within one minute of starting the simulator.

• Paths: each simulated driver is assigned a path she will traverse during the simulation. Paths are based on the actual cartography of Seville, with random starting and end points in the city and its surroundings. Each path is created by choosing a pair of random start and end points within a configurable distance from the city center. The path itself is the optimum path for going in a private vehicle from the start to the end point, as returned by a geographic information system. The number of paths is configurable and drivers are uniformly assigned to those paths. Therefore, many drivers may follow the same path. Despite sharing a path, because their behavior and the instant they begin to drive are random, those drivers will not be synchronized, and therefore there will be enough randomness in the system.

Once the simulation starts, the simulator makes every driver advance on her path at a speed that depends on the randomly assigned characteristics of the driver and the speed limit of the current road, with a random bias. Acceleration and deceleration are also modeled by the simulator (e.g. at turns or when speed limits change). Drivers send the events they produce to the infrastructure through HTTP requests that are similar to those the actual SmartDriver application would send. Although all the requests produced by the simulator come from the same host, drivers in the simulator are prevented from sharing their underlying TCP connections with other drivers. This way the simulator produces a realistic traffic pattern, analogous to the actual pattern SmartDriver produces.

4.2 Experimental Setup

The experiments were run by deploying the streaming server infrastructure (load balancer, six collector instances, one main stream instance, one storage stream instance and one short-term location-based service instance) on a high-end server with 12 Intel Xeon E5-2430 2.5 GHz cores and 64 GB of RAM.

The simulator was deployed on a laptop computer, connected to the server through one intermediate IP router and a 100 Mbps connection. In order to accurately measure event delivery delays, the simulator and the server used the NTP service to synchronize their clocks.

4.3 Results

The combination of the number of clients and the event rate of each client determines the load the system needs to handle. Since in SmartDriver the event rate each client generates is approximately the same, the aggregated event rate arriving at the server infrastructure should grow linearly with the number of clients. Figure 2 shows that, as expected, the event rate at the main stream is proportional to the number of clients up to 4,000 clients. At that point, with approximately 28,000 events per minute, the infrastructure saturates and begins to reject events, and linearity is lost.

Figure 2: Evolution of event rate with respect to the number of clients.

We measured the performance of the server infrastructure for different loads in a series of experiments with a growing number of clients. The main performance indicators we measured were:

• CPU utilization: the CPU time used during a minute, divided by 60 s. This measurement was taken every minute. Its estimated mean and 95% confidence intervals were computed and reported in the plots. A component with a utilization close to 1 is at the limit of the load it can handle; a component with a utilization close to 0 is mostly free.

• Event admission delay: the time between the creation of an event at the client side (the simulator) and its admission at a front-end server and at the database feed stream. Larger delays may signal congestion in the server. Similarly to utilization, mean delays with 95% confidence intervals were estimated from the delays suffered by a random sample of the simulated events.

Figure 3 shows the overall utilization of the infrastructure as well as the individual utilization of each server component. The six collector servers average a higher utilization than the rest, thus being the bottleneck of the system. However, their utilization may be reduced by adding more collector instances to the pool, assuming that the server has enough cores. The next component in terms of utilization is the short-term location-based server, followed by the server that handles the main stream. The storage stream server and the HTTP balancer can handle much more load than the rest. Figure 4 shows clearly that utilization at the main stream grows linearly with its event rate.

Figure 3: Global and per-component CPU utilization.

Figure 4: Utilization versus event rate.

Finally, figure 5 shows the event delivery delay at two points of the server infrastructure: the collectors and the storage stream. As expected, delays increase with the load of the system, especially from 3,500 clients on, where the indicators show that the system is beginning to saturate. In our experimental setup the network distance between the simulator and the infrastructure is less than 1 ms. In more realistic scenarios that distance would be higher, although less than 0.5 s in most situations.

As a conclusion, the infrastructure we have presented in this paper is able to handle up to 4,000 simultaneous drivers from a single server, which represents approximately 28,000 new events every minute. At larger data rates collectors begin to reject some events due to saturation.
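As a sanity check (our own arithmetic, not an additional measurement from the paper), the reported saturation point is consistent with the per-driver rates given in section 4.1:

```python
# Back-of-envelope check of the reported saturation point: 4,000
# simulated drivers, each posting at least one Vehicle Location event
# every 10 s, i.e. r_min = n / 10 requests per second.
n_drivers = 4000
r_min = n_drivers / 10            # minimum requests per second
events_per_minute = r_min * 60    # minimum events per minute

# 24,000 events/min is the Vehicle Location baseline; Driving Section
# and abnormal-situation events account for the gap up to the roughly
# 28,000 events/min observed at the main stream.
print(r_min, events_per_minute)
```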
Figure 5: Event delivery delays at collectors and storage stream. Delays measure the amount of time since an event is created at the simulator until it is fully processed and accepted at the given server component.

5 Conclusions and Future Work

We deployed the first working prototype of this infrastructure more than two years ago. It is still working and has received frequent feature upgrades and bug fixes since then. During this period, the system has been continuously capturing data from our beta testers with no major issues. Although the core of the architecture is already implemented and deployed, some of its services are still work in progress. More specifically, the public stream and the speed recommendation and traffic incident detection services are not yet part of the current prototype.

According to our experiments, the architecture we propose provides a reasonable level of performance in the context of the HERMES project. The maximum number of simultaneous drivers a single server may handle is approximately 4,000, but the infrastructure can be scaled up by deploying it on more servers, especially the collector components. Distributing the short-term location services component is challenging because of the need for a shared spatial database, but techniques for efficiently partitioning such data and their processing already exist [AWV+13, WAV15].

Future work includes rebuilding this infrastructure on top of a big data framework such as Apache Kafka [KNR+11] to try to increase the number of clients that may be served from a single server.

Acknowledgements

Research reported in this paper was supported by the Spanish Economy Ministry through the "HERMES – Smart Driver" project (TIN2013-46801-C4-2-R) and the "HERMES – Smart Citizen" project (TIN2013-46801-C4-1-R).

References

[AFFGSFFL14] Jesus Arias Fisteus, Norberto Fernandez Garcia, Luis Sanchez Fernandez, and Damaris Fuentes-Lorenzo. Ztreamy: a middleware for publishing semantic streams on the web. Journal of Web Semantics, 25:16–23, 2014.

[AWV+13] Ablimit Aji, Fusheng Wang, Hoang Vo, Rubao Lee, Qiaoling Liu, Xiaodong Zhang, and Joel Saltz. Hadoop GIS: a high performance spatial data warehousing system over MapReduce. Proc. VLDB Endow., 6(11):1009–1020, August 2013.

[CBB+03] Mitch Cherniack, Hari Balakrishnan, Magdalena Balazinska, Don Carney, Uğur Çetintemel, Ying Xing, and Stan Zdonik. Scalable distributed stream processing. In CIDR, 2003.

[CMMO15a] V. Córcoba Magaña and M. Muñoz-Organero. Reducing stress on habitual journeys. In 2015 IEEE 5th International Conference on Consumer Electronics – Berlin, pages 153–157, 2015.

[CMMO15b] Víctor Córcoba Magaña and Mario Muñoz-Organero. Reducing stress and fuel consumption providing road information. Ambient Intelligence – Software and Applications, 376:23–31, 2015.

[FAGAG+15] Jorge Yago Fernández, Álvaro Arcos García, Juan Antonio Álvarez-García, Juan Antonio Ortega Ramírez, Jesús Torres, Jesús Arias Fisteus, Víctor Córcoba Magaña, Mario Muñoz Organero, and Luis Sánchez Fernández. Plataforma para gestión de información de ciudadanos de una smartcity [Platform for managing citizen information in a smart city]. In Actas de las XVII Jornadas de ARCA, pages 53–56, June 2015.

[FTS+09] T. Fountain, S. Tilak, P. Shin, P. Hubbard, and L. Freudinger. The open source DataTurbine initiative: streaming data middleware for environmental observing systems. In International Symposium on Remote Sensing of Environment, 2009.

[GO03] Lukasz Golab and M. Tamer Özsu. Issues in data stream management. SIGMOD Rec., 32:5–14, June 2003.

[KNR+11] Jay Kreps, Neha Narkhede, Jun Rao, et al. Kafka: a distributed messaging system for log processing. In NetDB, 2011.

[PZCG14] Charith Perera, Arkady Zaslavsky, Peter Christen, and Dimitrios Georgakopoulos. Sensing as a service model for smart cities supported by internet of things. Transactions on Emerging Telecommunications Technologies, 25(1):81–93, 2014.

[SPC+14] A. Solanas, C. Patsakis, M. Conti, I. S. Vlachos, V. Ramos, F. Falcone, O. Postolache, P. A. Perez-Martinez, R. Di Pietro, D. N. Perrea, and A. Martinez-Balleste. Smart health: a context-aware health paradigm within smart cities. IEEE Communications Magazine, 52(8):74–81, August 2014.

[VLM+13] I. Vilajosana, J. Llosa, B. Martinez, M. Domingo-Prieto, A. Angles, and X. Vilajosana. Bootstrapping smart cities through a self-sustainable model based on big data flows. IEEE Communications Magazine, 51(6):128–134, June 2013.

[WAV15] Fusheng Wang, Ablimit Aji, and Hoang Vo. High performance spatial queries for spatial big data: from medical imaging to GIS. SIGSPATIAL Special, 6(3):11–18, April 2015.