=Paper= {{Paper |id=Vol-2203/108 |storemode=property |title=Real-time Monitoring of Hungarian Highway Traffic from Cell Phone Network Data |pdfUrl=https://ceur-ws.org/Vol-2203/108.pdf |volume=Vol-2203 |authors=Andrea Galloni,Balazs Horvath,Tomas Horvath |dblpUrl=https://dblp.org/rec/conf/itat/GalloniHH18 }} ==Real-time Monitoring of Hungarian Highway Traffic from Cell Phone Network Data== https://ceur-ws.org/Vol-2203/108.pdf
S. Krajči (ed.): ITAT 2018 Proceedings, pp. 108–115
CEUR Workshop Proceedings Vol. 2203, ISSN 1613-0073, c 2018 Andrea Galloni, Balázs Horváth, and Tomáš Horváth



             Real-time Monitoring of Hungarian Highway Traffic from Cell Phone
                                       Network Data

                                         Andrea Galloni, Balázs Horváth, and Tomáš Horváth

                                          Department of Data Science and Data Technologies,
                                                       Faculty of Informatics,
                                            ELTE – Eötvös Loránd University in Budapest
                                 {andrea.galloni,balazs.horvath,tomas.horvath}@inf.elte.hu,
                                                    http://t-labs.elte.hu/

      Abstract: A lightweight model for real-time monitoring          ysis of the mobile telecommunication infrastructure, the
      of the load of Hungarian highway traffic is presented in        core model can be adapted to different scenarios and dif-
      the paper. The input data of the model are cell phone net-      ferent data logs such as tower cell crowdedness or Internet
      work event records provided by Magyar Telekom Nyrt.,            backbones nodes activity monitoring.
      the major Hungarian telecommunication company. The
      output is a classification of the level of crowdedness of       1.1 Related Work
      the Hungarian highways inferred from the activity level of
      the mobile telecommunication infrastructure. While pro-         Nowadays we are witnessing to a constant increasing
      cessing, a data-stream is flowing through a chain of sim-       speed of networks, furthermore the capability to store and
      ple but efficient data structures. For computing anomalies      process conspicuous amount of data can be performed at
      against the usual behavior of the traffic at given segments     affordable prices. This new scenario enables telecom op-
      of the highway, so-called break-points, known from the          erators to store and process big quantities of logs triggered
      SAX representation of time-series, are utilized which re-       by a countless number of events. Along the past years,
      quire cheap computation. The model is implemented as            the research community proposed several models regard-
      a server application able to feed a client web-based visu-      ing the possibility to infer or predict information regard-
      alization application implemented for demonstration pur-        ing the status of the mobility infrastructure analyzing the
      poses. The experiments, performed on anonymized data            mobile telecommunication event logs. Furthermore, the
      covering one month of cell phone records, show that the         evolution of new telecommunication technology standards
      presented model is computationally cheap, it efficiently        such as 5G will bring more efficient and accurate localiza-
      runs even on low-end hardware such that Raspberry Pi.           tion techniques leading to more precise analysis and esti-
                                                                      mations [7].
      Keywords: Anomaly Detection, Mobile Data Analytics,                In [6], authors present quite a complex model able to
      Visualization                                                   estimate the traffic flow making use of anonymized tem-
                                                                      poral series of cell handover logs, building state diagrams
      1    Introduction                                               and using Markov Models in order to detect car accidents.
                                                                         On the other hand, in [8], authors propose a framework
      A method resulting from an industrial research project is       mining several heterogeneous data sources making use of
      presented in this paper. The presented method has been de-      the MapReduce programming-model (more precisely us-
      veloped for a specific and well-defined use case: inferring     ing Apache Hadoop) in order to process big amounts of
      the level of the mobility traffic load on Hungarian high-       data in a reasonable amount of time involving high-end
      ways from cell phone network data. The data have been           hardware and clusters of machines. The authors of this
      provided by the industrial partner, Magyar Telekom Nyrt.,       contribution are able to estimate the traffic volume and the
      the major Hungarian telecommunication company, a sub-           speed of the traffic flow.
      sidiary of Deutsche Telekom AG.                                    In [9], the authors provide a full overview of methodolo-
         Experimental outcomes are positive and promising as          gies providing a possible list of necessary steps in order to
      the underlying core model is computationally light and          infer traffic information such that i) location data collec-
      simple. The data structures used and the overall system         tion, ii) terminal classification in order to determine which
      architectural-design could be, possibly, exploited for other    mobile terminals are located on the road and which means
      applications or use cases. More specifically, the presented     of transport they are in, while iii) map matching phase in
      model can be used where there is the need to detect anoma-      order to link the extracted location data with the mobility
      lies in time series given a set of nodes and the logs pro-      infrastructure, iv) the route determination process used to
      viding quantitative information describing the activity of      determine the path of the vehicles while the last step is to
      such nodes over time. Even if the developed framework is        perform v) the estimation of the traffic state.
      specifically build to infer information regarding the Hun-         In order to estimate traffic flows the use of origin/des-
      garian highways mobility infrastructure through the anal-       tination matrices have been tried out in [10] and [11],
Real-time Monitoring of Hungarian Highway Traffic from Cell Phone Network Data                                                                        109

      however, as pointed out in [6], these solutions might                          Call Detail Record (CDR) data concern the activity logs
      be computationally expensive from the technical point                          of each user interacting with the network on a daily ba-
      of view and hard to scale when the size of the user set                        sis. A CDR is produced by a telephone equipment that
      tends to grow. On the other hand, from the law regulation                      documents the details of a call or other telecommunica-
      perspective these solutions might see limitations on                           tions events (e.g. notifications, short message service or
      real deployment scenarios due to privacy concerns and                          signaling protocols) that involves the telecom provider in-
      strict regulations, especially the ones applying within the                    frastructure.
      European Union.                                                                   The dataset is composed by several files accounting a
                                                                                     size of 500GB. The entire information contained within
         Within this contribution, a minimalistic approach to                        the dataset covers a range of 31 days, more precisely be-
      traffic load detection is introduced exploiting just events                    tween 15th September 2016 and 15th October 2016, where
      triggered by the active utilization of the User Equipments                     all the unique identifiers referencing to the users have been
      (Calls, SMS and Mobile Data Usage) without involving                           anonymized on a daily basis. The overall number of logs
      logs related to lower level signalling protocols such as                       is around 200 million records per day.
      cell handover event logs. The aim of this research was                            In order to work with just the useful data all the un-
      to discover at which extent and precision is possible to in-                   necessary information contained in the dataset such as the
      fer reliable traffic analysis with minimalistic datasets and                   nature of event logs and other information regarding cus-
      minimal computing costs. A server application able to                          tomer related data (e.g. phone and events identifiers) have
      feed a web-based visualization application has been im-                        to be discarded by the framework.
      plemented for demonstration purposes. Experiments pro-                            At the end of this process, the CDR files contains three
      vided on real but anonymized data covering one month of                        kind of attributes as presented in Table 1, namely, the
      cell phone records show that the the presented model is                        Unique User Identifier (UUID) re-anonymized on a daily
      promising and is able to efficiently run even on low-end                       basis (to prevent tracking of user movement across more
      hardware.                                                                      days), the date-time information related to the log event
         The rest of this paper is organized as follows: Section                     and the Tower Identifier (TID) providing information from
      2 gives a description of the available data and the proce-                     which cell-tower the event has been triggered.
      dure utilized to match telecom data with geographical-map
      data. In Section 3, a detailed description of the system ar-
                                                                                     Cell Reference (CR) data provide informations about the
      chitecture and the core estimation model are provided. In
                                                                                     mobile radio-towers and their positioning within the Hun-
      the Section 4 the process of traffic load classification is
                                                                                     garian territory. As illustrated in the Table 2, CR data
      described. The following Section 5 contains experimen-
                                                                                     contain three attributes, namely, the Tower Identifier (TID)
      tal results and measurements. Finally, in Section 6 some
                                                                                     which connects the CDR data with the CR data, the Lat-
      conclusions and plans for future work are provided.
                                                                                     itude and the Longitude regarding the given tower. Due
                                                                                     to how the industrial partner gathered and anonymized
      2    Data Sources and Framework                                                the data, process on which the authors were not involved,
           Initialization                                                            some records contained within the CDR files hold UUIDs
                                                                                     with NULL value or in some cases hold inconsistent TIDs.
      An important part of the presented framework is its initial-                   Namely some TIDs contained in the CDR logs do not
      ization to the specific use case, such that highway traffic                    match any of the TIDs in the CR data. The UIIDs incon-
      load monitoring, in our case, what consists of understand-                     sistencies are uniformly spread over the locations while
      ing the data and selecting the towers to be considered rel-                    for the inconsistent TIDs is not possible to draw any con-
      evant by the framework.                                                        clusion about the geographical regions affected. In case of
                                                                                     those inconsistent logs, it is not possible to get the subject
                                                                                     performing the action or the location of the event. For this
      2.1 Telecom Data                                                               reason all the affected records, which are around 30% of
                                                                                     all the records, are affected and have to be handled (dis-
      The industrial partner has granted access to a dataset1 con-
                                                                                     carded) by the framework during the on-line computation.
      taining several .csv (Comma Separated Values) files or-
      ganized in a daily basis containing two main kinds of in-
      formation such that                                                            2.2 Geographic Maps Data
                                                                                     In order to get information about Hungarian highways the
                                                                                     research relied on OpenStreetMap2 (OSM) data. The open
                                                                                     project makes available data about roads, trails, cafes, rail-
           1 Due to the signed non-disclosure agreement between the academic         way stations and other basic map features from all around
      and the industrial partner of the project, each sample of the data, e.g. the   the world.
      Tables 1 and 2, presented in this paper are synthetic, i.e. contain fictive
      information.                                                                      2 https://www.openstreetmap.org/
110                                                                                     Andrea Galloni, Balázs Horváth, and Tomáš Horváth

                                         UUID                date-time               TID
                                     6776554S3449       2016-09-15 13:30:41    10B32F03E10CB
                                     777865354435       2016-09-15 00:27:50    10DA0232324BC
                                     677655453449       2016-09-15 17:44:08    00443344DFFEA

             Table 1: A synthetic CDR data sample with fictive UUID, date-time and TID data, for illustration purpose .


                 TID             Latitude      Longitude
             6776554S3449       46.291311      17.366325
             777865354435       46.282342      17.357244
             677655453449       46.366745      17.364112

      Table 2: A synthetic CR data sample with fictive TID, Lat-
      itude and Longitude data for illustration purpose.


         First of all information about the Hungarian country
      borders have been extracted, then the data regarding only
      highways has been kept obtaining a .json file containing        Figure 1: Density of cell towers for Budapest Urban and
      informations about each highway divided in several seg-         Eastern Rural Area, an illustrative example.
      ments, where each segment contains information describ-
      ing a small section of the highway (e.g.: name, type, speed
      limit) and its location. The length of each highway’s sec-
      tion depends on the topography of the area, the density of
      the population and the radio-technology of the cell tow-
      ers of the operator. The outcome of this phase, i.e. the
      detected borders, can be observed in the Figure 6.


      Detecting Relevant Towers In order to monitor the high-         Figure 2: Relevant towers identification and towers com-
      ways infrastructure traffic and exclude irrelevant informa-     petence attribution, an illustrative example.
      tion from the data model an additional filtering and select-
      ing step has been performed. After this phase, only the
      relevant towers that have a strict correlation with the high-   as shown in Figure 2, each highway section is linked to a
      ways infrastructure have been kept.                             specific cell tower while all the towers not related with the
         The density of the cell tower placement and its spatial      highways will not be considered by the framework while
      characteristics represent a crucial issue in terms of space-    processing.
      resolution within the developed monitoring system. Pro-
      vided the mostly flat characteristics of the Hungarian land-
      scape, is possible to assume that the cell towers displace-     3   The System Architecture
      ment is mostly not conditioned by the topographic proper-
      ties of the surrounding areas but rather follows the density    The framework has its roots in several components, i.e. the
      of the population over the whole territory. This character-     following data structures (called dictionaries, according to
      istic of the cell towers placement is due to several factors    Python notation) and logical units.
      such as scalability, laws of physics and signal processing
      theory. Figure 1 illustrates the density of the city of Bu-     3.1 Data Structures
      dapest and its east country side area from which it is pos-
      sible to observe that in rural areas cell towers are placed     Relevant Towers Dictionary (RTD) RTD is one of the
      close to the highway in order to provide the radio-signal to    main data structure on which most of the others are based
      travelers.                                                      on. In fact the RTD represents an HashMap having as keys
         In order to detect cell towers which lead to the crowded-    the identifiers (TID, see Table 2) of all the relevant towers
      ness status of that specific section of the highway, an ad-     and as values the geographical coordinates of given towers
      hoc algorithm have been developed based on a QuadTree           as strings. RTD is the result of the module for detection of
      [1] data structure. The generated tree contains all the co-     relevant towers, described above. This dictionary remains
      ordinates of the cell towers that are listed in the CR data.    constant in the framework and changes only if there are
      Then, for every highway section the closest tower cell has      changes in geographical locations of cell towers such that
      been found querying the tree structure. After this process,     a new tower is placed near the highways, for example.
Real-time Monitoring of Hungarian Highway Traffic from Cell Phone Network Data                                                           111

      User at Relevant Towers Dictionary (URTD) URTD is                  cific highway segment. It uses as an evaluation model de-
      a HashMap and it has as keys all the UUIDs (Unique User            scribed in the next section where the number of classes is
      IDs, see Table 1) of the users whose previous logs were            determined by a parameter.
      triggered by relevant towers, namely, those logs who had              The EM module is triggered every time a time frame
      their TID appearing in the RTD keys, and, as values the            (a parameter of the framework) expires. At this point it
      TID of the tower the given user has been related to for the        is time to evaluate the status of the whole system. At this
      last time. At the startup of the system, this data structure is    stage the EM iterates over all the TIDCD, computes all the
      empty and at the beginning of a new day, due to the UUID           means and standard deviations using the past data neces-
      re-anonymization on a daily basis, it is reinitialized.            sary to evaluate the actual status of all the relevant TIDs
                                                                         and then finally classifies every segment of a given high-
                                                                         way into one of the predefined classes. For example, in
      Tower ID Counter Dictionary (TIDCD) TIDCD is a
                                                                         case of 10 classes, the class 5-6 corresponds to normal
      HashMap and it has as keys (TIDs) all the relevant towers
                                                                         traffic, 7-8 to higher while 9-10 to very high traffic, 3-4
      while it has as values counters representing the number of
                                                                         to lower and 1-2 to very low traffic on a given segment of
      users whose last log have been related to that specific TID.
                                                                         the highway.

      3.2 The Status Maintainer Module (SMM)
                                                                         3.4 The Notification Delegate Module (NDM)
      As soon as a log event is triggered, the SMM is delegated
                                                                         The NDM is the part of the framework responsible to con-
      to keep track of the last location of the user (attaching to
                                                                         stantly collect the result of the EM and update the clients
      him/her the proper TID in the URTD) and increases or de-
                                                                         about the highways status sending all the needed informa-
      creases the counter of the proper TID within the TIDCD
                                                                         tions as a json payload. While this process takes place,
      according to the situation. This module acts as a supervi-
                                                                         an ID conversion is performed. In fact, the back-end and
      sor moving the subscribers from a tower to another, keep-
                                                                         the front-end of the framework, due to security reasons,
      ing the model up to date with the information provided by
                                                                         do not share the same internal IDs for representing the
      the event logs. SMM is delegated to interpret and take de-
                                                                         TIDs. Once finished, the control is passed to the Noti-
      cisions based on the information provided from the logs
                                                                         fication Delegate Module responsible for communicating
      with the following functionality:
                                                                         with the client (described below).
         As soon as the application is started, both the URTD and
      the TIDCD Hash Maps are empty. As the first incoming
      log having a TID present in the RTD key-set is processed,          4       Classification of the Traffic Load
      the SMM will fill the URTD with the UUID as key and the
      TID as value, then, it will initialize the counter of the spe-     In order to classify the load of a segment of highway, one
      cific TID key within the TIDCD to 1. If a second log from          should consider the past data as a reference for the eval-
      the same user will come but, this time, with a different and       uation of the new entries. Given the topographic features
      relevant TID then the counter of the old TID (the last “lo-        of the Hungarian landscape it is expected that the activity
      cation” of the user) within the TIDCD will be decreased            of cell towers close to the highway have a low static noise
      by one unit and suddenly the corresponding URTD’s value            (in [6] authors highlights that systems relying on cell tele-
      will be updated to the actual TID inferred from the log            com logs for traffic estimations within urban areas could
      querying the RTD and, finally, the corresponding TIDCD             suffer high loss of precision due to event logs triggered by
      value (for the actual TID the user is connected to) will be        non-traveling users). In fact, in this specific case the ma-
      increased. In the last case, if the subscriber’s handset will      jority of the event logs are generated mostly by travelers
      trigger a log which is not related to a relevant tower, the        and it is easily possible to observe that the number of trav-
      counter of the old TID within the TIDCD will be decreased          elers in a given time-range tend to be similar for different
      by one and the key within the URTD corresponding to the            week-days.
      specific UUID will be removed (the user has left the set              Provided the latter observation, it is possible to build a
      of towers delegated to monitor the highways mobility sta-          model such that in order to classify the load of a specific
      tus). All the event logs (records) containing a non-relevant       road segment in a precise time-range of the day (e.g.: be-
      TID (user is not on a highway) and UUID not stored in              tween 12:00 and 12:15) compares the value to be classified
      URTD (user was not on a highway before) are immedi-                against the values of load within the same time range of the
      ately discarded. Figure 3 provides an illustration of the          previous days. Here, a sort of seasonality of the traffic has
      SMM logic.                                                         to be considered, e.g. there is more heavier load during the
                                                                         rush hours while less traffic at nights.
      3.3 The Evaluator Module (EM)                                         An other, interesting phenomenon is that sometimes a
                                                                         traffic anomaly can become normal with time. For ex-
      The EM is the core module of the framework, which is               ample, consider a longer construction work on a highway
      responsible for classifying the status of the load of a spe-       causing the close of some parts (e.g. lanes) of the road. In
112                                                                                         Andrea Galloni, Balázs Horváth, and Tomáš Horváth




                                       Figure 3: The logic diagram of the Status Maintainer Module


      the time of its appearance, since it is sudden for the traf-        they do not need further computation but may be deter-
      fic, it is considered an anomaly and results in traffic jams.       mined by simply looking them up in a statistical table, as
      However, with time, the traffic normalizes such that peo-           illustrated in the table 3 containing the breakpoints divid-
      ple get used to it (e.g. start using alternative routes) and        ing a Gaussian distribution in an arbitrary number (from 3
      the notion of heavy load changes.                                   to 10) of regions.
         All of the above observations lead to the straightforward           The final key-concept of the proposed model is that it
      use of time-series for representing the given problem of            does not represent the data in the past with a SAX repre-
      traffic load monitoring and classification.                         sentation, however the system classifies a new entry using
                                                                          the breakpoints generated making use of the data recorded
                                                                          in past in a SAX-like fashion just performing a lookup on
      4.1   Utilizing Breakpoints                                         a small table.
                                                                             An efficient way to detect anomalies in time series is
      The proposed model for classification of traffic load in var-
                                                                          that it is enough to compare the breakpoint determined for
      ious highway segments utilizes well-known concepts in
                                                                          the actual time frame to the breakpoints determined for
      Symbolic Aggregate Approximation (SAX) of time series
                                                                          the corresponding time frame(s) in the past, for example,
      [2, 3]. SAX makes use of Piecewise Aggregate Approx-
                                                                          the same time of the day before or the same time of the
      imation (PAA), a computationally very cheap method [4]
                                                                          same day a week before, etc. Depending on the number
      since it operates with arithmetic mean and standard devi-
                                                                          of classes to which the framework should classify the new
      ation, both very cheap operations.
                                                                          entry in the dataset, the right column of the statistical table
         SAX allows a time series of arbitrary length n to be re-
                                                                          has to be implemented in the system and then looked up.
      duced to a string of arbitrary length w, (w  n) with the
                                                                          In order to give a wider freedom of tuning, this value has
      alphabet size a > 2 (the number of letters used to represent
                                                                          been kept as a parameter of the framework.
      the time-series using SAX). In this process the data is di-
      vided into w equal sized “frames”. The mean value of each
      frame is calculated and a vector of these values becomes a          Historical Data Representation The data from the past
      reduced representation. For further details, refer to [2, 3].       are stored in an efficient-to-load binary format making use
         Assuming that the distribution of the values over the            of Python’s Pickle module. Files are stored in a daily for-
      time-series follows a normal distribution, it is possible           mat in the form of a HashMap of arrays having as keys the
      to subdivide the time-series into so called breakpoints B.          TIDs. Once loaded, the files are reshaped in MxN matri-
      Breakpoints are a sorted list of numbers B = (β1 , · · · , βa−1 )   ces, one for each TID, where M is the number of time-
      such that the area under a N(0, 1) Gaussian curve from β1           frames and N represents the number of days to consider in
      to βi+1 = 1/a where β0 and βa are defined as −∞ and +∞,             the past. All the tests in this research were performed set-
      respectively. The advantage of utilizing breakpoints is that        ting the time frames at 15 minutes and considering the last
Real-time Monitoring of Hungarian Highway Traffic from Cell Phone Network Data                                                            113

                                                                          a
                         βi           3         4          5          6           7         8          9        10
                         β1       -0.43     -0.67      -0.84      -0.97       -1.07     -1.15      -1.22     -1.28
                         β2        0.43         0      -0.25      -0.43       -0.57     -0.67      -0.76     -0.84
                         β3                  0.67       0.25          0       -0.18     -0.32      -0.43     -0.52
                         β4                             0.84       0.43        0.18         0      -0.14     -0.25
                         β5                                        0.97        0.57      0.43       0.14         0
                         β6                                                    1.07      0.67       0.43      0.25
                         β7                                                              1.15       0.76      0.52
                         β8                                                                         1.22      0.84
                         β9                                                                                   1.28

                                   Table 3: Statistical table for determining the values of breakpoints


      15 days in the past. This representation have been chosen           of time) over the validation data with traffic classification
      because once the system is running in real-time with a con-         level equal to 9 or 10. With this setup eleven out of four-
      tinuous data-stream then it is easy to shift the matrix data        teen (79%) congestions have been successfully detected.
      and recompute all the means and standard deviations effi-           The time series have been quantized with time windows
      ciently. This shifting and re-computation operation would           of 15 minutes, thus during the testing phase the classifica-
      be performed once per day at a given time (e.g. midnight)           tion takes place each fifteen minutes.
      exactly when the system is subject of a reinitialization or            It is interesting to note that the majority of the anoma-
      when the log re-anonymization process is performed.                 lies are detected earlier than the data provided by utinform.
                                                                          This can be a proof that the proposed solution is able to
                                                                          spot congestions when these are shorter than three kilome-
      5   Experiments
                                                                          ters. The outcome is similar for what concerns the end of
                                                                          the congestions, in this case the proposed solution tends to
      Given the lack of open systems providing precise infor-
                                                                          detect the end of an anomaly later than the validation data
      mation about the traffic flows over time we validated the
                                                                          set. Three congestions out of fourteen have not been de-
      system over specific traffic congestions. The goal of the
                                                                          tected, however this can be due to at least two reasons: the
      experiment is to validate the method looking for a correla-
                                                                          telecom operator (because of industrial secrecy concerns)
      tion between the highways status and the telecommuni-
                                                                          did not provide to us any detail regarding the range of an-
      cation infrastructure. In order to validate the results of
                                                                          tennas or their direction, thus, there might be a chance of
      the developed system authors asked the collaboration of
                                                                          inaccuracies along the phase of matching the geographi-
      utinform3 , a department of Magyar Közút Nonprofit Zrt.
                                                                          cal maps with the towers’ positions in order to define the
      whose responsibility is to collect information and moni-
                                                                          competence of each tower w.r.t. the highways segments.
      toring the roads traffic. They gather information from het-
                                                                          Another reason could be due to data inconsistencies. In
      erogeneous data sources: the employees of the company
                                                                          fact, as mentioned in Section 2, one third of the data have
      monitoring the highways, the county directorates and the
                                                                          been dropped.
      engineering departments employees. Utinform, in order to
      cross-validate the crowd sourced data sets, is also coop-              Figure 4 represents the classification value over time
      erating with institutions such as police, disaster manage-          slices of fifteen minutes for the highway segment suffer-
      ment, and public transport which have their own feed to             ing for the congestion described in Table 4 on row three.
      post their information to the system. Furthermore, the sys-         The red (in black and white print: the dark) vertical line
      tem has a crowd source interface, where people can submit           represents the time slot when the anomaly is detected on
      experienced traffic anomalies. The goal of utinform is to           the validation set. Here, a congestion is defined when the
      provide fast and validated traffic information to the drivers       classification values are equal or greater than 9. On the
      in whole Hungary.                                                   other end, Figure 5 represents the number of handsets (an
         The data received include information regarding traffic          estimation of the number of travelers) involved in the con-
      congestions on all the Hungarian highways. The dataset              gestion.
      contained data from October the 1st to October the 14th,               The application framework have been completely devel-
      2016 regarding congestions with a length of at least 3 kilo-        oped in Python 3.6 and it have been demonstrated to be
      meters. Table 4 shows the validation data and the results           very efficient. Although the system is designed to work
      of the proposed framework.                                          on-line receiving a stream of data in real-time, in order
         Setting the number of classes to 10. We define a cor-            to measure the performances of the model, we decided to
      rect detection when there is at least 50% overlap (in terms         exclude the streaming and the networking module, finally
                                                                          we tested the model in off-line mode. One log per time
          3 http://utinform.hu/                                           is read and suddenly processed. With this approach we
114                                                                                   Andrea Galloni, Balázs Horváth, and Tomáš Horváth

                            Cong. Start Time     Cong. End Time      Det. Start Time     Det. End Time
                                 11:00               11:40                10:30               12:15
                                 06:35               07:45                07:15               07:45
                                 09:30               22:10                09:00               00:00
                                 17:55               09:05                 none               none
                                 06:52               07:30                07:00               08:15
                                 14:50               16:50                16:45               17:30
                                 08:40               18:10                08:15               18:45
                                 14:00               17:10                14:45               19:25
                                 10:00               11:20                08:45               13:45
                                 17:23               20:15                16:15               20:00
                                 07:20               08:50                07:30               10:30
                                 02:55               04:40                 none               none
                                 15:00               16:00                15:15               16:15
                                 06:45               07:25                 none               none

      Table 4: Experimental Results indicating congestion (Cong.) start and end times as well as detection (Det.) start and end
      times.




      Figure 4: Classification values over time slices and the      Figure 5: Number of handsets involved in a congestion and
      time of a congestion (red/dark line), an illustrative exam-   the time of the congestion (red/dark line), an illustrative
      ple.                                                          example.


      maintained the architectural design of the framework and      is a web based application, it uses HTML and JavaScript,
      no-delay straming process has emulated while at the same      which communications with the Python back end through
      time avoiding networking related delays. Running the ap-      Web Sockets with .json files. At the initialization stage,
      plication on an Intel i7 6700K with 16GB of DDR4 RAM,         first the front end asks the back end for mapping of tower
      on average, manages to process the logs for an entire day     IDs and highway sections, as it was mentioned before at
      within less than 10 minutes with a peak of RAM con-           the geographic data section. After the initial step, the back
      sumption of 32MB. Furthermore, the application frame-         end is sending messages which the front end processes and
      work have been deployed on a Raspberry Pi 3 where             shows on the map. These messages are json files, con-
      the running time needed to process an entire set of logs      taining a list of objects with three attributes: ID, value,
      representing one day is in around 1 hour, in average.         anomaly. The ID stands for the tower IDs and the value is
                                                                    an int from 0 to 9 showing the traffic load on that segment,
      5.1   Visualization                                           and the anomaly flag is giving information that this value
                                                                    is the expected for the given time window or it is deviating
      First the communication interface need to be mentioned        from the usual traffic load on that area. In order to keep the
      between the back end and the front end part. The front end    low cost functionality of the system, the visualization had
Real-time Monitoring of Hungarian Highway Traffic from Cell Phone Network Data                                                               115

      to be carefully built as well. An open-source JavaScript           Acknowledgements
      library, LeafletJS4 was used to place layers on the par-
      ticular highway segments. This library has relatively low          Authors would like to thank Magyar Telekom Nyrt. and
      cost of handling layers and as it is expected from an ap-          utinform.hu. The research has been supported by the
      plication where the traffic load is constantly changing this       European Union, co-financed by the European Social
      was a critical feature. Each layer is a colored visualization      Fund EFOP-3.6.3-VEKOP-16-2017-00001 and the project
      of the value given to the particular highway section, an ex-       “Open City services” funded by Magyar Telekom Nyrt.
      ample can be seen on the Figure 6, where the spectrum is
      from blue to red, representing the low to high traffic load.       References
                                                                         [1] R. Finkel, J.L. Bentley (1974). Quad Trees: A Data Structure
                                                                             for Retrieval on Composite Keys. Acta Informatica 4 (1), 1–
                                                                             9.
                                                                         [2] Lin, J., Keogh, E., Lonardi, S. and Chiu, B., 2003. A
                                                                             symbolic representation of time series, with implications
                                                                             for streaming algorithms. In Proceedings of the 8th ACM
                                                                             SIGMOD workshop on Research issues in data mining and
                                                                             knowledge discovery (pp. 2-11). ACM.
                                                                         [3] Lin, J., Keogh, E., Wei, L. and Lonardi, S., 2007. Experienc-
                                                                             ing SAX: A Novel Symbolic Representation of Time Series.
                                                                             Data Mining and knowledge discovery, 15(2), pp.107-144.
                                                                         [4] Zhang, Y. and Glass, J., 2011. A piecewise aggregate ap-
                                                                             proximation lower-bound estimate for posteriorgram-based
      Figure 6: The traffic load visualized on the highway seg-              dynamic time warping. 12th Conference of the International
      ments, an illustrative example.                                        Speech Communication Association, pp.1909-1912.
                                                                         [5] Senin, P. and Malinchik, S., 2013, December. Sax-vsm: In-
                                                                             terpretable time series classification using sax and vector
                                                                             space model. In Data Mining (ICDM), 2013 IEEE 13th In-
      6 Conclusions                                                          ternational Conference on (pp. 1175-1180). IEEE.
      A lightweight traffic monitoring and traffic jam detec-            [6] Milani, A., Gentili, E. and Poggioni, V., 2009. Cellular Flow
                                                                             in Mobility Networks. IEEE Intelligent Informatics Bulletin,
      tion framework has been presented in this paper based on
                                                                             10(1), pp.17-23.
      HashMap data structures and methods for breakpoint de-
                                                                         [7] Hakkarainen, A., Werner, J., Costa, M., Leppanen, K. and
      tection well-known in time series classification. The used
                                                                             Valkama, M., 2015, September. High-efficiency device lo-
      concepts require cheap computation and, basically, mini-               calization in 5G ultra-dense networks: Prospects and en-
      mal tuning phase opposite to the case of tuning the hyper-             abling technologies. In Vehicular Technology Conference
      parameters of machine learning algorithms. The few pa-                 (VTC Fall), 2015 IEEE 82nd (pp. 1-5). IEEE.
      rameters of the framework such that the time window or             [8] R. Khokale, A. Ghate (2017). Data Mining for Traffic Pre-
      time frames as well as the number of breakpoints can be                diction and Analysis using Big Data. International Journal of
      set according to an available domain knowledge or user                 Engineering Trends and Technology 48 (3).
      expertise. However, to avoid false positives, it is recom-         [9] Gundlegard, D. and Karlsson, J.M., Road Traffic Estima-
      mended to tune the presented framework before imple-                   tion using Cellular Network Signaling in Intelligent Trans-
      menting it into a production environment.                              portation Systems, 2009. In: Wireless technologies in In-
         The SAX representation had already been used as a tool              telligent Transportation Systems. Editors: Ming-Tuo Zhou,
      for time series classification. However, the proposed dis-             Yan Zhang and L. T. Yan. ISBN: 978-1-60741-588-6 2009
      cretization procedure is unique in that it uses an interme-            pp. 361-392. Nova Science Publishers.
      diate representation between the raw time series and the           [10] White, J. and Wells, I., 2002. Extracting origin destina-
      symbolic strings. Furthermore the aim is not to classify               tion information from mobile phone data. In 11th Interna-
      full time series as in [5] but rather to classify new incom-           tional Conference on Road Transport Information and Con-
      ing single values.                                                     trol, London, 2002, pp. 30–34
         The presented framework was tested using real data.             [11] Wideberg, J.P., Caceres, N. and Benitez, F.G., 2006. Deriv-
      Due to the sensitive nature of the project the presented               ing Traffic Data from a Cellular Network. In Procedings of
                                                                             the 13th ITS World Congress, London, 2006.
      framework was developed within, the authors cannot dis-
      close the source code nor the data used in experiments, in
      this time. Experiments show that the proposed framework
      is promising and worth further development and adapta-
      tion to other use-case scenarios.
         4 http://leafletjs.com