=Paper= {{Paper |id=Vol-1812/JARCA16-paper-2 |storemode=property |title=A Scalable Data Streaming Infrastructure for Smart Cities |pdfUrl=https://ceur-ws.org/Vol-1812/JARCA16-paper-2.pdf |volume=Vol-1812 |authors=Jesús Arias Fisteus,Luis Sánchez Fernández,Víctor Corcoba Magaña,Mario Muñoz Organero,Jorge Yago Fernández,Juan Antonio Álvarez-García }} ==A Scalable Data Streaming Infrastructure for Smart Cities== https://ceur-ws.org/Vol-1812/JARCA16-paper-2.pdf
        A Scalable Data Streaming Infrastructure for Smart Cities

 Jesús Arias Fisteus1 , Luis Sánchez Fernández1 , Vı́ctor Córcoba Magaña1 , Mario Muñoz Organero1 ,
                          Jorge Yago Fernández2 , Juan Antonio Álvarez-Garcı́a2
                   1
                     Depto. de Ingenierı́a Telemática, Universidad Carlos III de Madrid
                                {jaf, luiss, vcorcoba, munozm}@it.uc3m.es
                 2
                   Depto. de Lenguajes y Sistemas Informáticos, Universidad de Sevilla
                                        {jorgeyago, jaalvarez }@us.es



                                                                        ably managing streams of sensor data in the context of the
                                                                        HERMES (Healthy and Efficient Routes in Massive open-
                           Abstract                                     data basEd Smart cities) [FAGAG+ 15] project, which aims
                                                                        at helping its users, citizens of a smart city, keep healthy
     Many of the services a smart city can provide to                   habits. Other systems for health care in smart cities are re-
     its citizens rely on the ability of its infrastructure             ported in [SPC+ 14]. The main sources of data in HERMES
     to collect and process in real time vast amounts of                are the citizens themselves, which contribute to the smart
     continuous data that sensors deployed through the                  city by letting it track their physical activities through activ-
     city produce. In this paper we present the server                  ity bands or the SmartCitizen mobile application, and their
     infrastructure we have designed in the context of                  driving through the SmartDriver mobile application.
     the HERMES project to collect the data from sen-
     sors and aggregate it in streams for their use in                      In order to understand the amount of data it supposes,
     services of the smart city.                                        let us focus on one of the applications. The SmartDriver
                                                                        application aims at reducing the stress levels and fuel con-
                                                                        sumption of its users, as well as improving traffic safety,
1    Introduction                                                       by providing the user with real time driving recommenda-
Many of the services a smart city can provide to its citizens           tions [CMMO15a, CMMO15b]. In order to do that, the
rely on the ability of its infrastructure to collect and process        application tracks its users while they drive and sends the
in real time vast amounts of continuous data that sensors               data to the infrastructure as soon as it captures it, so that
deployed through the city produce [PZCG14]. In this sce-                server-side services can perform real time computations
nario, building an infrastructure that scales as the number             such as detection of congested roads and stressful road sec-
of such sensors and their data rates increase is a challenging          tions. The application should receive useful feedback back,
task. Grouping the data in streams is a common approach                 e.g. static road information, a recommended driving speed
for this kind of scenarios. A data stream can be defined                and traffic alerts. In its current prototype, the application
as a real-time, continuous, ordered (implicitly by arrival              track’s the vehicle’s movement as well as its driver’s heart
time or explicitly by timestamp) sequence of items [GO03].              rate. It reports the vehicle’s location every 10 seconds. In
Streams are different to stored data in several aspects: they           addition, it reports immediately abnormal situations such
cannot normally be stored in their entirety, and the order in           as high accelerations or decelerations, excessive speeds or
which data is received cannot be controlled.                            abrupt increases in the driver’s heart rate. More detailed
    Real time stream processing solutions are required to               data, such as second by second information about location,
manage this kind of data streams. In fact, the generic                  speed and heart rate, are buffered in order to reduce re-
platform for big data applications proposed in [VLM+ 13]                source consumption, and sent to the infrastructure every
assigns an important role to such a component. Build-                   time the driver completes a 500 meter road section. Be-
ing scalable stream processing solutions is far from triv-              cause each driver produces at the least one data item ev-
ial [CBB+ 03]. In this paper we propose a system for scal-              ery 10s, only 10,000 drivers would suppose a load of more
                                                                        than 1,000 requests/s for the infrastructure that collects
Copyright c by the paper’s authors. Copying permitted for private and   the data.
academic purposes.
In: Zoe Falomir, Juan A. Ortega (eds.): Proceedings of JARCA 2016,         The rest of the paper is organized as follows. Section 2
Almerı́a, Spain, June 2016, published at http://ceur-ws.org             proposes a system architecture for the real time processing
of streams in the context of a smart city. Section 3 presents          HERMES servers that manage data persistence con-
a case study for this architecture, based on the SmartDriver           sume this stream in order to receive the data they have
application of the HERMES project. Section 4 reports the               to store.
results of a performance of the system. Conclusions and
future lines of work are presented in section 5.                     • Public stream: this stream is derived from the main
                                                                       stream. It is part of the public API HERMES provides
                                                                       to third-party applications. It transports aggregated,
2     The Data Streaming Infrastructure                                anonymized and semantically-annotated data that may
The server-side infrastructure was developed on top of the             be useful to those applications.
Ztreamy middleware [AFFGSFFL14]. We have selected                    • Short-term location-based services: the streaming in-
Ztreamy because of its flexibility, scalability and the sim-           frastructure needs to perform some real-time compu-
ple HTTP-based API it provides to data producers, which                tations and keep some short-term data. For example,
improves compatibility and simplifies the development of               it needs to detect traffic incidents, retrieve the scores
clients such as the SmartDriver mobile application. In                 of nearby drivers for the SmartDriver’s gamification
addition, Ztreamy provides useful out-of-the-box features              system, etc. This module serves the collectors and the
such as stream aggregation, filtering and replication, as              public stream server. In addition, it needs to use the
well as a persistency subsystem that prevents the lose of              long-term location-based services in order to get car-
data items once they have been accepted by the infrastruc-             tography data and speed recommendations based on
ture, even in the case of temporal network disruptions or              historical data. This information is needed as input
failures of one or more components of the deployed sys-                for some of the short-term services, and part of it is
tem. As our experiments in [AFFGSFFL14] show, other                    also returned to the SmartDriver application.
publish-subscribe systems for sensor data like DataTur-
bine [FTS+ 09] would not provide the performance levels              The other components of the architecture (mobile appli-
we need in this scenario. The ZeroMQ middleware1 is              cations, long-term storage and location-based services and
more or less similar in terms of performance to Ztreamy,         third-party applications) lay without the scope of this paper.
but Ztreamy provides us with a much more convenient high             Depending on the amount of simultaneous clients the
level API and an HTTP-based interface. The more recent           system needs to handle, this architecture can be deployed
Apache Kafka [KNR+ 11] publish-subscribe system could            on a single server or distributed across several ones. If dis-
be an alternative to Ztreamy, but we have not yet studied ei-    tributed, a good network link between them is advisable.
ther its suitability for this scenario or its performance, and   Ideally, all the servers should share the same local network
leave it for future work.                                        in order to reduce end-to-end delays and bandwidth limita-
    Figure 1 shows the system architecture we have de-           tions.
signed. It consists of the following main components:                Additionally, because of the locality of the services the
                                                                 infrastructure provides, the system as a whole can be easily
    • Data collectors: Ztreamy servers to which the Smart-       partitioned for different geographical areas, thus deploying
      Driver and SmartCitizen mobile applications post           a replica of the whole system for each geographical area.
      their data through HTTP. These servers validate the        This eases the scaling of the system as the amount of users
      data and orchestrate the interactions with other ser-      of its services grows.
      vices needed to handle it. They are also responsible
      of responding mobile applications with feedback data       3     The SmartDriver Case Study
      when required. Since most of the load of input data
      handling is supported by these data collectors, they are   In order to illustrate the internals of the system, let us focus
      replicated behind an HTTP load balancer in order to        on the SmartDriver mobile application. It tracks the driver
      increase the number of clients they are able to handle.    and posts the following types of events:
      We have chosen the well-known Nginx2 open-source
                                                                     • Vehicle Location: it contains a timestamp, latitude and
      HTTP server for this task.
                                                                       longitude where the vehicle is located, an estimation
    • Main stream: data items received by the collectors are           of the accuracy of that location, the instantaneous ve-
      then aggregated into the main stream, which is man-              hicle speed and the current driving score assigned to
      aged by a separate Ztreamy server.                               the driver by the gamification subsystem of the appli-
                                                                       cation. These events are posted every 10s. They are
    • Storage stream: this stream filters the data items that          used mainly for the real-time services.
      don’t need to be stored out of the main stream. The            • Driving Section: it contains more detailed informa-
    1 http://zeromq.org/ (Visited 2016-06-01)                          tion about a larger road section, including second by
    2 https://www.nginx.com/ (Visited 2016-11-23)                      second location and speed, heart rate measurements
                                                                                     Short-term                                    Streaming
                                                                                   location-based                                    server
                                                                                      services                                   infrastructure
                                                         Short-term
                                                          user data




                                                                                   Collector #1




                                                              Load balancer
                      City infrastructure
                            sensors                                                Collector #2

                                                                                                         Main stream           Public stream




                                                                                    ...
                     SmartDriver App
                                STRONG
                         LIVE




                                                                                   Collector #N
                                                                                                        Storage stream
                                                STRONG
                                         LIVE




                     SmartCitizen App



                                                                                     Long-term                                                    Third-party
                                                                                   location-based                                                  services
                                                                                      services




                                                                                                                  Public API
                                                                              Cartography   User data

                                                                                                             Spatiotemporal
                                                                                                               database



                                    Figure 1: Data streaming infrastructure architecture
      and aggregated computations associated to this sec-          In order to reduce the load of the long-term location-
      tion (average and standard deviations of speed and       based services with unnecessary requests due to stopped or
      heart rate, as well as statistics about speed varia-     very slow vehicles, collectors assume the type of road and
      tions). These events are posted for every 500m the       speed limit did not change if the driver advanced less than
      user drives. They are intended for storage, but can      10m since the last time they determined those values. The
      also be used in some real time services.                 short-term services take similar measures to avoid some
                                                               computations such as retrieving or storing driver scores in
   • Abnormal situations: they are posted every time           those situations.
      SmartDriver detects an abnormal situation (strong ac-        The current prototype of the short-term location-based
      celerations and decelerations, too high speeds, too      services provides two main features:
      high heart rates), immediately after its detection.
                                                                  • It tracks the latest location of each driver in order to
   Because of their 10s periodicity, the system uses the Ve-         detect the way of the road the driver follows (both the
hicle Location posts to send feedback to the SmartDriver             current location and a previous location are needed) as
application. Collector servers are responsible of gathering          well as detecting when the driver has advanced more
the required information from the short-term and long-term           than the 10m threshold.
location-based services and sending it back to the applica-
tion in the body of their HTTP response. This feedback            • It tracks the driving score and location of every driver
includes:                                                            in order to provide the nearby drivers’ score service.

  • Type of road and its speed limit (to be obtained from                                                   The first feature is implemented on top of a RAM-stored
    the long-term services).                                                                            two-tier dictionary in which every 30s the oldest dictionary
                                                                                                        is dropped and a new one created. This structure allows the
  • Recommended speed as computed by the speed rec-                                                     system to keep just one location per driver and drop those
    ommendation service (to be obtained from the long-                                                  drivers that have not contacted the service for more than
    term services and possibly adapted to current road                                                  30s.
    conditions by the short-term services).                                                                 The second feature is more complex because it re-
                                                                                                        quires performing spatial queries on a rectangle around the
  • Traffic alerts in the vicinity (to be obtained from the                                             driver’s current location. We have implemented it with a
    short-term services).                                                                               RAM-stored SQLite3 database using an R-tree-based in-
                                                                                                        dex. The system periodically drops data older than 1 hour
  • Driving scores assigned to nearby drivers by the gam-                                               because the gamification feature bases on recent data.
    ification system (to be obtained from the short-term
    services).                                                                                            3 https://www.sqlite.org/ (Visited 2016-06-01)
4     Evaluation                                                                                              Main stream server
                                                                                 30000
The current prototype of the streaming server infrastructure
was subjected to experiments with varied amounts of load
in order to evaluate its performance. Because of the unfea-                      25000
sibility of recruiting enough volunteers to simultaneously
use the application up to the loads the system is able to
                                                                                 20000




                                                                  Events / min
handle, we developed a simulator that produces a synthetic
load.
                                                                                 15000
4.1    The Simulator
The simulator was designed to produce data and send it                           10000
to the infrastructure in a way that, from the point of view
of measuring performance, is equivalent to having a given
amount of actual users, all of them using the SmartDriver                                1000   1500   2000   2500 3000       3500   4000   4500
                                                                                                                 Clients
application and driving simultaneously a number of differ-
ent paths in the same city. The following parameters can be
configured in the simulator before starting a simulation:         Figure 2: Evolution of event rate with respect to the number
                                                                  of clients.
    • Number of simulated drivers: since each driver gener-       the events they produce to the infrastructure by sending
      ates at least one Vehicle Location event every 10s, the     HTTP requests that are similar to those the actual Smart-
      minimum number of requests per second the system            Driver application would send. Despite coming all the
      needs to handle is rmin = n/10, where n is the num-         requests produced by the simulator from the same host,
      ber of drivers. Data Section events make actual rates       drivers in the simulator are prevented from sharing their
      slightly higher, especially when drivers reach higher       underlying TCP connections with other drivers. This way
      speeds. In order to introduce variability on the sys-       the simulator will produce a realistic traffic pattern, analo-
      tem, each driver is assigned some random parameters         gous to the actual pattern SmartDriver produces.
      that influences her driver behavior (e.g. her inclina-
      tion to drive fast or slow with respect to speed limits).
      In addition, not all drivers start at the same time. Each   4.2              Experimental Setup
      driver starts randomly within one minute of starting        The experiments were run by deploying the streaming
      the simulator.                                              server infrastructure (load balancer, six collector instances,
    • Paths: each simulated driver is assigned a path she         one main stream instance, one storage stream instance and
      will traverse during the simulation. Paths are based on     one short-term location-based service instance) on a high-
      the actual cartography of Seville, with random start-       end server with 12 Intel Xeon E5-2430 2.5GHz cores and
      ing and end points in the city and its surroundings.        64 GB of RAM memory.
      Each path is created by choosing a pair of random start        The simulator was deployed on a laptop computer, con-
      and end points within a configurable distance from the      nected to the server through one intermediate IP router and
      city center. The path itself will be the optimum path       a 100Mbps connection. In order to accurately measure
      for going in a private vehicle from the start to the end    event delivery delays, simulator and server used the NTP
      point, as returned by a geographic information system.      service to synchronize their clocks.
      The number of paths is configurable and drivers are
      uniformly assigned to those paths. Therefore, many          4.3              Results
      drivers may follow the same path. Despite sharing a
      path, because their behavior and the instant they begin     The combination of number of clients and event rate of
      to drive are random, those drivers will not be synchro-     each client determines the load the system needs to handle.
      nized and therefore there will be enough randomness         Since in SmartDriver the event rate each client generates
      on the system.                                              is approximately the same, the aggregated event rate arriv-
                                                                  ing the server infrastructure should grow linearly with the
   Once the simulation starts, the simulator makes every          number of clients. Figure 2 shows that, as expected, the
driver advance on her path at a speed that depends on             event rate at the main stream is proportional to the number
the randomly assigned characteristics of the driver and the       of clients up to 4,000 clients. At that point, with approx-
speed limit of the current road, with a random bias. Accel-       imately 28,000 events per minute, the infrastructure satu-
eration and deceleration are also modeled by the simulator        rates and begins to reject events, and therefore linearity is
(e.g. at turns or when speed limits change). Drivers send         lost.
                             Short-term server      HTTP balancer                                                         Main stream server
                             Collectors             Storage stream server
                             Main stream server     All components

                                                                                                 0.30
              0.5

                                                                                                 0.25
              0.4




                                                                                   Utilization
                                                                                                 0.20
Utilization




              0.3

                                                                                                 0.15
              0.2

              0.1                                                                                0.10

                                                                                                           10000       15000         20000     25000
              0.0                                                                                                          Events / min
                    1000   1500    2000    2500 3000    3500     4000       4500
                                              Clients
                                                                                                        Figure 4: Utilization versus event rate.
       Figure 3: Global and per-component CPU utilization.                         feed stream. As expected, delays increase with the load
                                                                                   of the system, especially from 3,500 clients on, where the
   We’ve measured the performance of the server infras-                            indicators show that the system is beginning to saturate. In
tructure for different loads in a series of experiments with                       our experimental setup the network distance between the
a growing number of clients. The main performance indi-                            simulator and the infrastructure is less than 1ms. In more
cators we measured were:                                                           realistic scenarios, that distance would be higher, although
        • CPU utilization: amount of time of CPU used during a                     less than 0.5s in most of the situations.
          minute, divided by 60s. This measurement was taken                          As a conclusion, the infrastructure we have presented in
          every minute. Its estimated mean and 95% confidence                      this paper is able to handle up to 4,000 simultaneous drivers
          intervals were computed and reported in the plots. A                     from a single server, which represent approximately 28,000
          component with an utilization close to 1 is in the limit                 new events every minute. At larger data rates collectors
          of the load it can handle. A component with an uti-                      begin to reject some events due to saturation.
          lization close to 0 is mainly free.
                                                                                   5              Conclusions and Future Work
        • Event admission delay: amount of time between the
          creation of the event at the client side (the simula-                    We deployed the first working prototype of this infras-
          tor) and its admission at a front-end server and at the                  tructure more than two years ago. It is still working and
          database feed stream. Larger delays may signal con-                      has received frequent feature upgrades and bug fixes since
          gestion situations in the server. Similarly to utiliza-                  then. During this period, the system has been continu-
          tion, mean delays with 95% confidence intervals were                     ously capturing data from our beta testers with no major
          estimated from the delays suffered by a random sam-                      issues. Although the core of the architecture is already im-
          ple of the simulated events.                                             plemented and deployed, some of its services are still work
                                                                                   in progress. More specifically, the public stream and the
   Figure 3 shows the overall utilization of the infrastruc-                       speed recommendation and traffic incident detection ser-
ture as well as the individual utilization of each server com-                     vices are not yet part of the current prototype.
ponent. The six collector servers average a higher utiliza-                            According to our experiments, the architecture we pro-
tion than the rest, thus being the bottleneck of the sys-                          pose provides a reasonable level of performance in the
tem. However, their utilization may be reduced by adding                           context of the HERMES project. The maximum amount
some more collector instances to the pool, assuming that                           of simultaneous drivers a single server may handle is ap-
the server has cores enough. The next component in terms                           proximately 4,000, but the infrastructure can be scaled-
of utilization is the short-term location-based server, fol-                       up by deploying it into more servers, especially the col-
lowed by the server that handles the main stream. The stor-                        lectors components. Distributing the short-term location
age stream server and the HTTP balancer can handle much                            services component is challenging because of the need of
more load than the rest. Figure 4 shows clearly that utiliza-                      a shared spatial database, but techniques for efficiently
tion at the main stream grows linearly with its event rate.                        partitioning such data and their processing already ex-
   Finally, figure 5 shows the event delivery delay at two                         ist [AWV+ 13, WAV15].
points of the server infrastructure: collectors and storage                            Future work includes rebuilding this infrastructure
                              Storage stream server    Collectors                        journeys. In 2015 IEEE 5th International
                                                                                         Conference on Consumer Electronics –
                                                                                         Berlin, pages 153–157, 2015.
            6
                                                                           [CMMO15b]     Vı́ctor Córcoba Magaña and Mario
            5                                                                            Muñoz-Organero. Reducing stress and
                                                                                         fuel consumption providing road infor-
            4
                                                                                         mation. Ambient Intelligence – Software
Delay (s)




            3                                                                            and Applications, 376:23–31, 2015.

            2                                                              [FAGAG+ 15]   Jorge Yago Fernández, Álvaro Ar-
                                                                                         cos Garcı́a, Juan Antonio Álvarez-
            1                                                                            Garcı́a, Juan Antonio Ortega Ramı́rez,
                                                                                         Jesús Torres, Jesús Arias Fisteus, Vı́ctor
            0                                                                            Córcoba Magaña, Mario Muñoz Or-
                1000   1500    2000     2500 3000     3500   4000   4500                 ganero, and Luis Sánchez Fernández.
                                           Clients
                                                                                         Plataforma para gestión de información
                                                                                         de ciudadanos de una smartcity. In Ac-
Figure 5: Event delivery delays at collectors and storage                                tas de las XVII Jornadas de ARCA, pages
stream. Delays measure the amount of time since an event                                 53–56, Junio 2015.
is created at the simulator until it is fully processed and
accepted at the given server component.                                    [FTS+ 09]     T. Fountain, S. Tilak, P. Shin, P. Hub-
                                                                                         bard, and L. Freudinger.       The open
on top of a big data framework such as Apache                                            source dataturbine initiative: Streaming
Kafka [KNR+ 11] to try to increase the amount of clients                                 data middleware for environmental ob-
that may be served from a single server.                                                 serving systems. In International Sympo-
                                                                                         sium on Remote Sensing of Environment,
Acknowledgements                                                                         2009.
Research reported in this paper was supported by the Span-
ish Economy Ministry through the “HERMES – Smart                           [GO03]        Lukasz Golab and M. Tamer Özsu. Is-
Driver” project (TIN2013-46801-C4-2-R) and the “HER-                                     sues in data stream management. SIG-
MES – Smart Citizen” project (TIN2013-46801-C4-1-R).                                     MOD Rec., 32:5–14, June 2003.

                                                                           [KNR+ 11]     Jay Kreps, Neha Narkhede, Jun Rao,
References                                                                               et al. Kafka: A distributed messaging
[AFFGSFFL14] Jesus Arias Fisteus, Norberto Fernan-                                       system for log processing. NetDB, 2011.
             dez Garcia, Luis Sanchez Fernandez,
             and Damaris Fuentes-Lorenzo. Ztreamy:                         [PZCG14]      Charith Perera, Arkady Zaslavsky, Peter
             a middleware for publishing semantic                                        Christen, and Dimitrios Georgakopoulos.
             streams on the web. Journal of Web Se-                                      Sensing as a service model for smart
             mantics, 25:16–23, 2014.                                                    cities supported by internet of things.
                                                                                         Transactions on Emerging Telecommuni-
[AWV+ 13]                     Ablimit Aji, Fusheng Wang, Hoang Vo,                       cations Technologies, 25(1):81–93, 2014.
                              Rubao Lee, Qiaoling Liu, Xiaodong
                              Zhang, and Joel Saltz. Hadoop gis: A         [SPC+ 14]     A. Solanas, C. Patsakis, M. Conti, I. S.
                              high performance spatial data warehous-                    Vlachos, V. Ramos, F. Falcone, O. Posto-
                              ing system over mapreduce. Proc. VLDB                      lache, P. A. Perez-martinez, R. D. Pietro,
                              Endow., 6(11):1009–1020, August 2013.                      D. N. Perrea, and A. Martinez-Balleste.
                                                                                         Smart health: A context-aware health
[CBB+ 03]                     Mitch Cherniack, Hari Balakrishnan,                        paradigm within smart cities.       IEEE
                              Magdalena Balazinska, Don Carney, Uur                      Communications Magazine, 52(8):74–
                              etintemel, Ying Xing, and Stan Zdonik.                     81, Aug 2014.
                              Scalable distributed stream processing. In
                              In CIDR, 2003.                               [VLM+ 13]     I. Vilajosana, J. Llosa, B. Martinez,
                                                                                         M. Domingo-Prieto, A. Angles, and
[CMMO15a]                     V. Córcoba Magaña and M. Muñoz-                         X. Vilajosana. Bootstrapping smart cities
                              Organero. Reducing stress on habitual                      through a self-sustainable model based
          on big data flows. IEEE Communications
          Magazine, 51(6):128–134, June 2013.
[WAV15]   Fusheng Wang, Ablimit Aji, and Hoang
          Vo. High performance spatial queries for
          spatial big data: From medical imaging
          to gis. SIGSPATIAL Special, 6(3):11–18,
          April 2015.