S. Krajči (ed.): ITAT 2018 Proceedings, pp. 108–115 CEUR Workshop Proceedings Vol. 2203, ISSN 1613-0073, c 2018 Andrea Galloni, Balázs Horváth, and Tomáš Horváth Real-time Monitoring of Hungarian Highway Traffic from Cell Phone Network Data Andrea Galloni, Balázs Horváth, and Tomáš Horváth Department of Data Science and Data Technologies, Faculty of Informatics, ELTE – Eötvös Loránd University in Budapest {andrea.galloni,balazs.horvath,tomas.horvath}@inf.elte.hu, http://t-labs.elte.hu/ Abstract: A lightweight model for real-time monitoring ysis of the mobile telecommunication infrastructure, the of the load of Hungarian highway traffic is presented in core model can be adapted to different scenarios and dif- the paper. The input data of the model are cell phone net- ferent data logs such as tower cell crowdedness or Internet work event records provided by Magyar Telekom Nyrt., backbones nodes activity monitoring. the major Hungarian telecommunication company. The output is a classification of the level of crowdedness of 1.1 Related Work the Hungarian highways inferred from the activity level of the mobile telecommunication infrastructure. While pro- Nowadays we are witnessing to a constant increasing cessing, a data-stream is flowing through a chain of sim- speed of networks, furthermore the capability to store and ple but efficient data structures. For computing anomalies process conspicuous amount of data can be performed at against the usual behavior of the traffic at given segments affordable prices. This new scenario enables telecom op- of the highway, so-called break-points, known from the erators to store and process big quantities of logs triggered SAX representation of time-series, are utilized which re- by a countless number of events. Along the past years, quire cheap computation. The model is implemented as the research community proposed several models regard- a server application able to feed a client web-based visu- ing the possibility to infer or predict information regard- alization application implemented for demonstration pur- ing the status of the mobility infrastructure analyzing the poses. The experiments, performed on anonymized data mobile telecommunication event logs. Furthermore, the covering one month of cell phone records, show that the evolution of new telecommunication technology standards presented model is computationally cheap, it efficiently such as 5G will bring more efficient and accurate localiza- runs even on low-end hardware such that Raspberry Pi. tion techniques leading to more precise analysis and esti- mations [7]. Keywords: Anomaly Detection, Mobile Data Analytics, In [6], authors present quite a complex model able to Visualization estimate the traffic flow making use of anonymized tem- poral series of cell handover logs, building state diagrams 1 Introduction and using Markov Models in order to detect car accidents. On the other hand, in [8], authors propose a framework A method resulting from an industrial research project is mining several heterogeneous data sources making use of presented in this paper. The presented method has been de- the MapReduce programming-model (more precisely us- veloped for a specific and well-defined use case: inferring ing Apache Hadoop) in order to process big amounts of the level of the mobility traffic load on Hungarian high- data in a reasonable amount of time involving high-end ways from cell phone network data. The data have been hardware and clusters of machines. The authors of this provided by the industrial partner, Magyar Telekom Nyrt., contribution are able to estimate the traffic volume and the the major Hungarian telecommunication company, a sub- speed of the traffic flow. sidiary of Deutsche Telekom AG. In [9], the authors provide a full overview of methodolo- Experimental outcomes are positive and promising as gies providing a possible list of necessary steps in order to the underlying core model is computationally light and infer traffic information such that i) location data collec- simple. The data structures used and the overall system tion, ii) terminal classification in order to determine which architectural-design could be, possibly, exploited for other mobile terminals are located on the road and which means applications or use cases. More specifically, the presented of transport they are in, while iii) map matching phase in model can be used where there is the need to detect anoma- order to link the extracted location data with the mobility lies in time series given a set of nodes and the logs pro- infrastructure, iv) the route determination process used to viding quantitative information describing the activity of determine the path of the vehicles while the last step is to such nodes over time. Even if the developed framework is perform v) the estimation of the traffic state. specifically build to infer information regarding the Hun- In order to estimate traffic flows the use of origin/des- garian highways mobility infrastructure through the anal- tination matrices have been tried out in [10] and [11], Real-time Monitoring of Hungarian Highway Traffic from Cell Phone Network Data 109 however, as pointed out in [6], these solutions might Call Detail Record (CDR) data concern the activity logs be computationally expensive from the technical point of each user interacting with the network on a daily ba- of view and hard to scale when the size of the user set sis. A CDR is produced by a telephone equipment that tends to grow. On the other hand, from the law regulation documents the details of a call or other telecommunica- perspective these solutions might see limitations on tions events (e.g. notifications, short message service or real deployment scenarios due to privacy concerns and signaling protocols) that involves the telecom provider in- strict regulations, especially the ones applying within the frastructure. European Union. The dataset is composed by several files accounting a size of 500GB. The entire information contained within Within this contribution, a minimalistic approach to the dataset covers a range of 31 days, more precisely be- traffic load detection is introduced exploiting just events tween 15th September 2016 and 15th October 2016, where triggered by the active utilization of the User Equipments all the unique identifiers referencing to the users have been (Calls, SMS and Mobile Data Usage) without involving anonymized on a daily basis. The overall number of logs logs related to lower level signalling protocols such as is around 200 million records per day. cell handover event logs. The aim of this research was In order to work with just the useful data all the un- to discover at which extent and precision is possible to in- necessary information contained in the dataset such as the fer reliable traffic analysis with minimalistic datasets and nature of event logs and other information regarding cus- minimal computing costs. A server application able to tomer related data (e.g. phone and events identifiers) have feed a web-based visualization application has been im- to be discarded by the framework. plemented for demonstration purposes. Experiments pro- At the end of this process, the CDR files contains three vided on real but anonymized data covering one month of kind of attributes as presented in Table 1, namely, the cell phone records show that the the presented model is Unique User Identifier (UUID) re-anonymized on a daily promising and is able to efficiently run even on low-end basis (to prevent tracking of user movement across more hardware. days), the date-time information related to the log event The rest of this paper is organized as follows: Section and the Tower Identifier (TID) providing information from 2 gives a description of the available data and the proce- which cell-tower the event has been triggered. dure utilized to match telecom data with geographical-map data. In Section 3, a detailed description of the system ar- Cell Reference (CR) data provide informations about the chitecture and the core estimation model are provided. In mobile radio-towers and their positioning within the Hun- the Section 4 the process of traffic load classification is garian territory. As illustrated in the Table 2, CR data described. The following Section 5 contains experimen- contain three attributes, namely, the Tower Identifier (TID) tal results and measurements. Finally, in Section 6 some which connects the CDR data with the CR data, the Lat- conclusions and plans for future work are provided. itude and the Longitude regarding the given tower. Due to how the industrial partner gathered and anonymized 2 Data Sources and Framework the data, process on which the authors were not involved, Initialization some records contained within the CDR files hold UUIDs with NULL value or in some cases hold inconsistent TIDs. An important part of the presented framework is its initial- Namely some TIDs contained in the CDR logs do not ization to the specific use case, such that highway traffic match any of the TIDs in the CR data. The UIIDs incon- load monitoring, in our case, what consists of understand- sistencies are uniformly spread over the locations while ing the data and selecting the towers to be considered rel- for the inconsistent TIDs is not possible to draw any con- evant by the framework. clusion about the geographical regions affected. In case of those inconsistent logs, it is not possible to get the subject performing the action or the location of the event. For this 2.1 Telecom Data reason all the affected records, which are around 30% of all the records, are affected and have to be handled (dis- The industrial partner has granted access to a dataset1 con- carded) by the framework during the on-line computation. taining several .csv (Comma Separated Values) files or- ganized in a daily basis containing two main kinds of in- formation such that 2.2 Geographic Maps Data In order to get information about Hungarian highways the research relied on OpenStreetMap2 (OSM) data. The open project makes available data about roads, trails, cafes, rail- 1 Due to the signed non-disclosure agreement between the academic way stations and other basic map features from all around and the industrial partner of the project, each sample of the data, e.g. the the world. Tables 1 and 2, presented in this paper are synthetic, i.e. contain fictive information. 2 https://www.openstreetmap.org/ 110 Andrea Galloni, Balázs Horváth, and Tomáš Horváth UUID date-time TID 6776554S3449 2016-09-15 13:30:41 10B32F03E10CB 777865354435 2016-09-15 00:27:50 10DA0232324BC 677655453449 2016-09-15 17:44:08 00443344DFFEA Table 1: A synthetic CDR data sample with fictive UUID, date-time and TID data, for illustration purpose . TID Latitude Longitude 6776554S3449 46.291311 17.366325 777865354435 46.282342 17.357244 677655453449 46.366745 17.364112 Table 2: A synthetic CR data sample with fictive TID, Lat- itude and Longitude data for illustration purpose. First of all information about the Hungarian country borders have been extracted, then the data regarding only highways has been kept obtaining a .json file containing Figure 1: Density of cell towers for Budapest Urban and informations about each highway divided in several seg- Eastern Rural Area, an illustrative example. ments, where each segment contains information describ- ing a small section of the highway (e.g.: name, type, speed limit) and its location. The length of each highway’s sec- tion depends on the topography of the area, the density of the population and the radio-technology of the cell tow- ers of the operator. The outcome of this phase, i.e. the detected borders, can be observed in the Figure 6. Detecting Relevant Towers In order to monitor the high- Figure 2: Relevant towers identification and towers com- ways infrastructure traffic and exclude irrelevant informa- petence attribution, an illustrative example. tion from the data model an additional filtering and select- ing step has been performed. After this phase, only the relevant towers that have a strict correlation with the high- as shown in Figure 2, each highway section is linked to a ways infrastructure have been kept. specific cell tower while all the towers not related with the The density of the cell tower placement and its spatial highways will not be considered by the framework while characteristics represent a crucial issue in terms of space- processing. resolution within the developed monitoring system. Pro- vided the mostly flat characteristics of the Hungarian land- scape, is possible to assume that the cell towers displace- 3 The System Architecture ment is mostly not conditioned by the topographic proper- ties of the surrounding areas but rather follows the density The framework has its roots in several components, i.e. the of the population over the whole territory. This character- following data structures (called dictionaries, according to istic of the cell towers placement is due to several factors Python notation) and logical units. such as scalability, laws of physics and signal processing theory. Figure 1 illustrates the density of the city of Bu- 3.1 Data Structures dapest and its east country side area from which it is pos- sible to observe that in rural areas cell towers are placed Relevant Towers Dictionary (RTD) RTD is one of the close to the highway in order to provide the radio-signal to main data structure on which most of the others are based travelers. on. In fact the RTD represents an HashMap having as keys In order to detect cell towers which lead to the crowded- the identifiers (TID, see Table 2) of all the relevant towers ness status of that specific section of the highway, an ad- and as values the geographical coordinates of given towers hoc algorithm have been developed based on a QuadTree as strings. RTD is the result of the module for detection of [1] data structure. The generated tree contains all the co- relevant towers, described above. This dictionary remains ordinates of the cell towers that are listed in the CR data. constant in the framework and changes only if there are Then, for every highway section the closest tower cell has changes in geographical locations of cell towers such that been found querying the tree structure. After this process, a new tower is placed near the highways, for example. Real-time Monitoring of Hungarian Highway Traffic from Cell Phone Network Data 111 User at Relevant Towers Dictionary (URTD) URTD is cific highway segment. It uses as an evaluation model de- a HashMap and it has as keys all the UUIDs (Unique User scribed in the next section where the number of classes is IDs, see Table 1) of the users whose previous logs were determined by a parameter. triggered by relevant towers, namely, those logs who had The EM module is triggered every time a time frame their TID appearing in the RTD keys, and, as values the (a parameter of the framework) expires. At this point it TID of the tower the given user has been related to for the is time to evaluate the status of the whole system. At this last time. At the startup of the system, this data structure is stage the EM iterates over all the TIDCD, computes all the empty and at the beginning of a new day, due to the UUID means and standard deviations using the past data neces- re-anonymization on a daily basis, it is reinitialized. sary to evaluate the actual status of all the relevant TIDs and then finally classifies every segment of a given high- way into one of the predefined classes. For example, in Tower ID Counter Dictionary (TIDCD) TIDCD is a case of 10 classes, the class 5-6 corresponds to normal HashMap and it has as keys (TIDs) all the relevant towers traffic, 7-8 to higher while 9-10 to very high traffic, 3-4 while it has as values counters representing the number of to lower and 1-2 to very low traffic on a given segment of users whose last log have been related to that specific TID. the highway. 3.2 The Status Maintainer Module (SMM) 3.4 The Notification Delegate Module (NDM) As soon as a log event is triggered, the SMM is delegated The NDM is the part of the framework responsible to con- to keep track of the last location of the user (attaching to stantly collect the result of the EM and update the clients him/her the proper TID in the URTD) and increases or de- about the highways status sending all the needed informa- creases the counter of the proper TID within the TIDCD tions as a json payload. While this process takes place, according to the situation. This module acts as a supervi- an ID conversion is performed. In fact, the back-end and sor moving the subscribers from a tower to another, keep- the front-end of the framework, due to security reasons, ing the model up to date with the information provided by do not share the same internal IDs for representing the the event logs. SMM is delegated to interpret and take de- TIDs. Once finished, the control is passed to the Noti- cisions based on the information provided from the logs fication Delegate Module responsible for communicating with the following functionality: with the client (described below). As soon as the application is started, both the URTD and the TIDCD Hash Maps are empty. As the first incoming log having a TID present in the RTD key-set is processed, 4 Classification of the Traffic Load the SMM will fill the URTD with the UUID as key and the TID as value, then, it will initialize the counter of the spe- In order to classify the load of a segment of highway, one cific TID key within the TIDCD to 1. If a second log from should consider the past data as a reference for the eval- the same user will come but, this time, with a different and uation of the new entries. Given the topographic features relevant TID then the counter of the old TID (the last “lo- of the Hungarian landscape it is expected that the activity cation” of the user) within the TIDCD will be decreased of cell towers close to the highway have a low static noise by one unit and suddenly the corresponding URTD’s value (in [6] authors highlights that systems relying on cell tele- will be updated to the actual TID inferred from the log com logs for traffic estimations within urban areas could querying the RTD and, finally, the corresponding TIDCD suffer high loss of precision due to event logs triggered by value (for the actual TID the user is connected to) will be non-traveling users). In fact, in this specific case the ma- increased. In the last case, if the subscriber’s handset will jority of the event logs are generated mostly by travelers trigger a log which is not related to a relevant tower, the and it is easily possible to observe that the number of trav- counter of the old TID within the TIDCD will be decreased elers in a given time-range tend to be similar for different by one and the key within the URTD corresponding to the week-days. specific UUID will be removed (the user has left the set Provided the latter observation, it is possible to build a of towers delegated to monitor the highways mobility sta- model such that in order to classify the load of a specific tus). All the event logs (records) containing a non-relevant road segment in a precise time-range of the day (e.g.: be- TID (user is not on a highway) and UUID not stored in tween 12:00 and 12:15) compares the value to be classified URTD (user was not on a highway before) are immedi- against the values of load within the same time range of the ately discarded. Figure 3 provides an illustration of the previous days. Here, a sort of seasonality of the traffic has SMM logic. to be considered, e.g. there is more heavier load during the rush hours while less traffic at nights. 3.3 The Evaluator Module (EM) An other, interesting phenomenon is that sometimes a traffic anomaly can become normal with time. For ex- The EM is the core module of the framework, which is ample, consider a longer construction work on a highway responsible for classifying the status of the load of a spe- causing the close of some parts (e.g. lanes) of the road. In 112 Andrea Galloni, Balázs Horváth, and Tomáš Horváth Figure 3: The logic diagram of the Status Maintainer Module the time of its appearance, since it is sudden for the traf- they do not need further computation but may be deter- fic, it is considered an anomaly and results in traffic jams. mined by simply looking them up in a statistical table, as However, with time, the traffic normalizes such that peo- illustrated in the table 3 containing the breakpoints divid- ple get used to it (e.g. start using alternative routes) and ing a Gaussian distribution in an arbitrary number (from 3 the notion of heavy load changes. to 10) of regions. All of the above observations lead to the straightforward The final key-concept of the proposed model is that it use of time-series for representing the given problem of does not represent the data in the past with a SAX repre- traffic load monitoring and classification. sentation, however the system classifies a new entry using the breakpoints generated making use of the data recorded in past in a SAX-like fashion just performing a lookup on 4.1 Utilizing Breakpoints a small table. An efficient way to detect anomalies in time series is The proposed model for classification of traffic load in var- that it is enough to compare the breakpoint determined for ious highway segments utilizes well-known concepts in the actual time frame to the breakpoints determined for Symbolic Aggregate Approximation (SAX) of time series the corresponding time frame(s) in the past, for example, [2, 3]. SAX makes use of Piecewise Aggregate Approx- the same time of the day before or the same time of the imation (PAA), a computationally very cheap method [4] same day a week before, etc. Depending on the number since it operates with arithmetic mean and standard devi- of classes to which the framework should classify the new ation, both very cheap operations. entry in the dataset, the right column of the statistical table SAX allows a time series of arbitrary length n to be re- has to be implemented in the system and then looked up. duced to a string of arbitrary length w, (w  n) with the In order to give a wider freedom of tuning, this value has alphabet size a > 2 (the number of letters used to represent been kept as a parameter of the framework. the time-series using SAX). In this process the data is di- vided into w equal sized “frames”. The mean value of each frame is calculated and a vector of these values becomes a Historical Data Representation The data from the past reduced representation. For further details, refer to [2, 3]. are stored in an efficient-to-load binary format making use Assuming that the distribution of the values over the of Python’s Pickle module. Files are stored in a daily for- time-series follows a normal distribution, it is possible mat in the form of a HashMap of arrays having as keys the to subdivide the time-series into so called breakpoints B. TIDs. Once loaded, the files are reshaped in MxN matri- Breakpoints are a sorted list of numbers B = (β1 , · · · , βa−1 ) ces, one for each TID, where M is the number of time- such that the area under a N(0, 1) Gaussian curve from β1 frames and N represents the number of days to consider in to βi+1 = 1/a where β0 and βa are defined as −∞ and +∞, the past. All the tests in this research were performed set- respectively. The advantage of utilizing breakpoints is that ting the time frames at 15 minutes and considering the last Real-time Monitoring of Hungarian Highway Traffic from Cell Phone Network Data 113 a βi 3 4 5 6 7 8 9 10 β1 -0.43 -0.67 -0.84 -0.97 -1.07 -1.15 -1.22 -1.28 β2 0.43 0 -0.25 -0.43 -0.57 -0.67 -0.76 -0.84 β3 0.67 0.25 0 -0.18 -0.32 -0.43 -0.52 β4 0.84 0.43 0.18 0 -0.14 -0.25 β5 0.97 0.57 0.43 0.14 0 β6 1.07 0.67 0.43 0.25 β7 1.15 0.76 0.52 β8 1.22 0.84 β9 1.28 Table 3: Statistical table for determining the values of breakpoints 15 days in the past. This representation have been chosen of time) over the validation data with traffic classification because once the system is running in real-time with a con- level equal to 9 or 10. With this setup eleven out of four- tinuous data-stream then it is easy to shift the matrix data teen (79%) congestions have been successfully detected. and recompute all the means and standard deviations effi- The time series have been quantized with time windows ciently. This shifting and re-computation operation would of 15 minutes, thus during the testing phase the classifica- be performed once per day at a given time (e.g. midnight) tion takes place each fifteen minutes. exactly when the system is subject of a reinitialization or It is interesting to note that the majority of the anoma- when the log re-anonymization process is performed. lies are detected earlier than the data provided by utinform. This can be a proof that the proposed solution is able to spot congestions when these are shorter than three kilome- 5 Experiments ters. The outcome is similar for what concerns the end of the congestions, in this case the proposed solution tends to Given the lack of open systems providing precise infor- detect the end of an anomaly later than the validation data mation about the traffic flows over time we validated the set. Three congestions out of fourteen have not been de- system over specific traffic congestions. The goal of the tected, however this can be due to at least two reasons: the experiment is to validate the method looking for a correla- telecom operator (because of industrial secrecy concerns) tion between the highways status and the telecommuni- did not provide to us any detail regarding the range of an- cation infrastructure. In order to validate the results of tennas or their direction, thus, there might be a chance of the developed system authors asked the collaboration of inaccuracies along the phase of matching the geographi- utinform3 , a department of Magyar Közút Nonprofit Zrt. cal maps with the towers’ positions in order to define the whose responsibility is to collect information and moni- competence of each tower w.r.t. the highways segments. toring the roads traffic. They gather information from het- Another reason could be due to data inconsistencies. In erogeneous data sources: the employees of the company fact, as mentioned in Section 2, one third of the data have monitoring the highways, the county directorates and the been dropped. engineering departments employees. Utinform, in order to cross-validate the crowd sourced data sets, is also coop- Figure 4 represents the classification value over time erating with institutions such as police, disaster manage- slices of fifteen minutes for the highway segment suffer- ment, and public transport which have their own feed to ing for the congestion described in Table 4 on row three. post their information to the system. Furthermore, the sys- The red (in black and white print: the dark) vertical line tem has a crowd source interface, where people can submit represents the time slot when the anomaly is detected on experienced traffic anomalies. The goal of utinform is to the validation set. Here, a congestion is defined when the provide fast and validated traffic information to the drivers classification values are equal or greater than 9. On the in whole Hungary. other end, Figure 5 represents the number of handsets (an The data received include information regarding traffic estimation of the number of travelers) involved in the con- congestions on all the Hungarian highways. The dataset gestion. contained data from October the 1st to October the 14th, The application framework have been completely devel- 2016 regarding congestions with a length of at least 3 kilo- oped in Python 3.6 and it have been demonstrated to be meters. Table 4 shows the validation data and the results very efficient. Although the system is designed to work of the proposed framework. on-line receiving a stream of data in real-time, in order Setting the number of classes to 10. We define a cor- to measure the performances of the model, we decided to rect detection when there is at least 50% overlap (in terms exclude the streaming and the networking module, finally we tested the model in off-line mode. One log per time 3 http://utinform.hu/ is read and suddenly processed. With this approach we 114 Andrea Galloni, Balázs Horváth, and Tomáš Horváth Cong. Start Time Cong. End Time Det. Start Time Det. End Time 11:00 11:40 10:30 12:15 06:35 07:45 07:15 07:45 09:30 22:10 09:00 00:00 17:55 09:05 none none 06:52 07:30 07:00 08:15 14:50 16:50 16:45 17:30 08:40 18:10 08:15 18:45 14:00 17:10 14:45 19:25 10:00 11:20 08:45 13:45 17:23 20:15 16:15 20:00 07:20 08:50 07:30 10:30 02:55 04:40 none none 15:00 16:00 15:15 16:15 06:45 07:25 none none Table 4: Experimental Results indicating congestion (Cong.) start and end times as well as detection (Det.) start and end times. Figure 4: Classification values over time slices and the Figure 5: Number of handsets involved in a congestion and time of a congestion (red/dark line), an illustrative exam- the time of the congestion (red/dark line), an illustrative ple. example. maintained the architectural design of the framework and is a web based application, it uses HTML and JavaScript, no-delay straming process has emulated while at the same which communications with the Python back end through time avoiding networking related delays. Running the ap- Web Sockets with .json files. At the initialization stage, plication on an Intel i7 6700K with 16GB of DDR4 RAM, first the front end asks the back end for mapping of tower on average, manages to process the logs for an entire day IDs and highway sections, as it was mentioned before at within less than 10 minutes with a peak of RAM con- the geographic data section. After the initial step, the back sumption of 32MB. Furthermore, the application frame- end is sending messages which the front end processes and work have been deployed on a Raspberry Pi 3 where shows on the map. These messages are json files, con- the running time needed to process an entire set of logs taining a list of objects with three attributes: ID, value, representing one day is in around 1 hour, in average. anomaly. The ID stands for the tower IDs and the value is an int from 0 to 9 showing the traffic load on that segment, 5.1 Visualization and the anomaly flag is giving information that this value is the expected for the given time window or it is deviating First the communication interface need to be mentioned from the usual traffic load on that area. In order to keep the between the back end and the front end part. The front end low cost functionality of the system, the visualization had Real-time Monitoring of Hungarian Highway Traffic from Cell Phone Network Data 115 to be carefully built as well. An open-source JavaScript Acknowledgements library, LeafletJS4 was used to place layers on the par- ticular highway segments. This library has relatively low Authors would like to thank Magyar Telekom Nyrt. and cost of handling layers and as it is expected from an ap- utinform.hu. The research has been supported by the plication where the traffic load is constantly changing this European Union, co-financed by the European Social was a critical feature. Each layer is a colored visualization Fund EFOP-3.6.3-VEKOP-16-2017-00001 and the project of the value given to the particular highway section, an ex- “Open City services” funded by Magyar Telekom Nyrt. ample can be seen on the Figure 6, where the spectrum is from blue to red, representing the low to high traffic load. References [1] R. Finkel, J.L. Bentley (1974). Quad Trees: A Data Structure for Retrieval on Composite Keys. Acta Informatica 4 (1), 1– 9. [2] Lin, J., Keogh, E., Lonardi, S. and Chiu, B., 2003. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery (pp. 2-11). ACM. [3] Lin, J., Keogh, E., Wei, L. and Lonardi, S., 2007. Experienc- ing SAX: A Novel Symbolic Representation of Time Series. Data Mining and knowledge discovery, 15(2), pp.107-144. [4] Zhang, Y. and Glass, J., 2011. A piecewise aggregate ap- proximation lower-bound estimate for posteriorgram-based Figure 6: The traffic load visualized on the highway seg- dynamic time warping. 12th Conference of the International ments, an illustrative example. Speech Communication Association, pp.1909-1912. [5] Senin, P. and Malinchik, S., 2013, December. Sax-vsm: In- terpretable time series classification using sax and vector space model. In Data Mining (ICDM), 2013 IEEE 13th In- 6 Conclusions ternational Conference on (pp. 1175-1180). IEEE. A lightweight traffic monitoring and traffic jam detec- [6] Milani, A., Gentili, E. and Poggioni, V., 2009. Cellular Flow in Mobility Networks. IEEE Intelligent Informatics Bulletin, tion framework has been presented in this paper based on 10(1), pp.17-23. HashMap data structures and methods for breakpoint de- [7] Hakkarainen, A., Werner, J., Costa, M., Leppanen, K. and tection well-known in time series classification. The used Valkama, M., 2015, September. High-efficiency device lo- concepts require cheap computation and, basically, mini- calization in 5G ultra-dense networks: Prospects and en- mal tuning phase opposite to the case of tuning the hyper- abling technologies. In Vehicular Technology Conference parameters of machine learning algorithms. The few pa- (VTC Fall), 2015 IEEE 82nd (pp. 1-5). IEEE. rameters of the framework such that the time window or [8] R. Khokale, A. Ghate (2017). Data Mining for Traffic Pre- time frames as well as the number of breakpoints can be diction and Analysis using Big Data. International Journal of set according to an available domain knowledge or user Engineering Trends and Technology 48 (3). expertise. However, to avoid false positives, it is recom- [9] Gundlegard, D. and Karlsson, J.M., Road Traffic Estima- mended to tune the presented framework before imple- tion using Cellular Network Signaling in Intelligent Trans- menting it into a production environment. portation Systems, 2009. In: Wireless technologies in In- The SAX representation had already been used as a tool telligent Transportation Systems. Editors: Ming-Tuo Zhou, for time series classification. However, the proposed dis- Yan Zhang and L. T. Yan. ISBN: 978-1-60741-588-6 2009 cretization procedure is unique in that it uses an interme- pp. 361-392. Nova Science Publishers. diate representation between the raw time series and the [10] White, J. and Wells, I., 2002. Extracting origin destina- symbolic strings. Furthermore the aim is not to classify tion information from mobile phone data. In 11th Interna- full time series as in [5] but rather to classify new incom- tional Conference on Road Transport Information and Con- ing single values. trol, London, 2002, pp. 30–34 The presented framework was tested using real data. [11] Wideberg, J.P., Caceres, N. and Benitez, F.G., 2006. Deriv- Due to the sensitive nature of the project the presented ing Traffic Data from a Cellular Network. In Procedings of the 13th ITS World Congress, London, 2006. framework was developed within, the authors cannot dis- close the source code nor the data used in experiments, in this time. Experiments show that the proposed framework is promising and worth further development and adapta- tion to other use-case scenarios. 4 http://leafletjs.com