=Paper=
{{Paper
|id=Vol-2203/108
|storemode=property
|title=Real-time Monitoring of Hungarian Highway Traffic from Cell Phone Network Data
|pdfUrl=https://ceur-ws.org/Vol-2203/108.pdf
|volume=Vol-2203
|authors=Andrea Galloni,Balazs Horvath,Tomas Horvath
|dblpUrl=https://dblp.org/rec/conf/itat/GalloniHH18
}}
==Real-time Monitoring of Hungarian Highway Traffic from Cell Phone Network Data==
S. Krajči (ed.): ITAT 2018 Proceedings, pp. 108–115
CEUR Workshop Proceedings Vol. 2203, ISSN 1613-0073, c 2018 Andrea Galloni, Balázs Horváth, and Tomáš Horváth
Real-time Monitoring of Hungarian Highway Traffic from Cell Phone
Network Data
Andrea Galloni, Balázs Horváth, and Tomáš Horváth
Department of Data Science and Data Technologies,
Faculty of Informatics,
ELTE – Eötvös Loránd University in Budapest
{andrea.galloni,balazs.horvath,tomas.horvath}@inf.elte.hu,
http://t-labs.elte.hu/
Abstract: A lightweight model for real-time monitoring ysis of the mobile telecommunication infrastructure, the
of the load of Hungarian highway traffic is presented in core model can be adapted to different scenarios and dif-
the paper. The input data of the model are cell phone net- ferent data logs such as tower cell crowdedness or Internet
work event records provided by Magyar Telekom Nyrt., backbones nodes activity monitoring.
the major Hungarian telecommunication company. The
output is a classification of the level of crowdedness of 1.1 Related Work
the Hungarian highways inferred from the activity level of
the mobile telecommunication infrastructure. While pro- Nowadays we are witnessing to a constant increasing
cessing, a data-stream is flowing through a chain of sim- speed of networks, furthermore the capability to store and
ple but efficient data structures. For computing anomalies process conspicuous amount of data can be performed at
against the usual behavior of the traffic at given segments affordable prices. This new scenario enables telecom op-
of the highway, so-called break-points, known from the erators to store and process big quantities of logs triggered
SAX representation of time-series, are utilized which re- by a countless number of events. Along the past years,
quire cheap computation. The model is implemented as the research community proposed several models regard-
a server application able to feed a client web-based visu- ing the possibility to infer or predict information regard-
alization application implemented for demonstration pur- ing the status of the mobility infrastructure analyzing the
poses. The experiments, performed on anonymized data mobile telecommunication event logs. Furthermore, the
covering one month of cell phone records, show that the evolution of new telecommunication technology standards
presented model is computationally cheap, it efficiently such as 5G will bring more efficient and accurate localiza-
runs even on low-end hardware such that Raspberry Pi. tion techniques leading to more precise analysis and esti-
mations [7].
Keywords: Anomaly Detection, Mobile Data Analytics, In [6], authors present quite a complex model able to
Visualization estimate the traffic flow making use of anonymized tem-
poral series of cell handover logs, building state diagrams
1 Introduction and using Markov Models in order to detect car accidents.
On the other hand, in [8], authors propose a framework
A method resulting from an industrial research project is mining several heterogeneous data sources making use of
presented in this paper. The presented method has been de- the MapReduce programming-model (more precisely us-
veloped for a specific and well-defined use case: inferring ing Apache Hadoop) in order to process big amounts of
the level of the mobility traffic load on Hungarian high- data in a reasonable amount of time involving high-end
ways from cell phone network data. The data have been hardware and clusters of machines. The authors of this
provided by the industrial partner, Magyar Telekom Nyrt., contribution are able to estimate the traffic volume and the
the major Hungarian telecommunication company, a sub- speed of the traffic flow.
sidiary of Deutsche Telekom AG. In [9], the authors provide a full overview of methodolo-
Experimental outcomes are positive and promising as gies providing a possible list of necessary steps in order to
the underlying core model is computationally light and infer traffic information such that i) location data collec-
simple. The data structures used and the overall system tion, ii) terminal classification in order to determine which
architectural-design could be, possibly, exploited for other mobile terminals are located on the road and which means
applications or use cases. More specifically, the presented of transport they are in, while iii) map matching phase in
model can be used where there is the need to detect anoma- order to link the extracted location data with the mobility
lies in time series given a set of nodes and the logs pro- infrastructure, iv) the route determination process used to
viding quantitative information describing the activity of determine the path of the vehicles while the last step is to
such nodes over time. Even if the developed framework is perform v) the estimation of the traffic state.
specifically build to infer information regarding the Hun- In order to estimate traffic flows the use of origin/des-
garian highways mobility infrastructure through the anal- tination matrices have been tried out in [10] and [11],
Real-time Monitoring of Hungarian Highway Traffic from Cell Phone Network Data 109
however, as pointed out in [6], these solutions might Call Detail Record (CDR) data concern the activity logs
be computationally expensive from the technical point of each user interacting with the network on a daily ba-
of view and hard to scale when the size of the user set sis. A CDR is produced by a telephone equipment that
tends to grow. On the other hand, from the law regulation documents the details of a call or other telecommunica-
perspective these solutions might see limitations on tions events (e.g. notifications, short message service or
real deployment scenarios due to privacy concerns and signaling protocols) that involves the telecom provider in-
strict regulations, especially the ones applying within the frastructure.
European Union. The dataset is composed by several files accounting a
size of 500GB. The entire information contained within
Within this contribution, a minimalistic approach to the dataset covers a range of 31 days, more precisely be-
traffic load detection is introduced exploiting just events tween 15th September 2016 and 15th October 2016, where
triggered by the active utilization of the User Equipments all the unique identifiers referencing to the users have been
(Calls, SMS and Mobile Data Usage) without involving anonymized on a daily basis. The overall number of logs
logs related to lower level signalling protocols such as is around 200 million records per day.
cell handover event logs. The aim of this research was In order to work with just the useful data all the un-
to discover at which extent and precision is possible to in- necessary information contained in the dataset such as the
fer reliable traffic analysis with minimalistic datasets and nature of event logs and other information regarding cus-
minimal computing costs. A server application able to tomer related data (e.g. phone and events identifiers) have
feed a web-based visualization application has been im- to be discarded by the framework.
plemented for demonstration purposes. Experiments pro- At the end of this process, the CDR files contains three
vided on real but anonymized data covering one month of kind of attributes as presented in Table 1, namely, the
cell phone records show that the the presented model is Unique User Identifier (UUID) re-anonymized on a daily
promising and is able to efficiently run even on low-end basis (to prevent tracking of user movement across more
hardware. days), the date-time information related to the log event
The rest of this paper is organized as follows: Section and the Tower Identifier (TID) providing information from
2 gives a description of the available data and the proce- which cell-tower the event has been triggered.
dure utilized to match telecom data with geographical-map
data. In Section 3, a detailed description of the system ar-
Cell Reference (CR) data provide informations about the
chitecture and the core estimation model are provided. In
mobile radio-towers and their positioning within the Hun-
the Section 4 the process of traffic load classification is
garian territory. As illustrated in the Table 2, CR data
described. The following Section 5 contains experimen-
contain three attributes, namely, the Tower Identifier (TID)
tal results and measurements. Finally, in Section 6 some
which connects the CDR data with the CR data, the Lat-
conclusions and plans for future work are provided.
itude and the Longitude regarding the given tower. Due
to how the industrial partner gathered and anonymized
2 Data Sources and Framework the data, process on which the authors were not involved,
Initialization some records contained within the CDR files hold UUIDs
with NULL value or in some cases hold inconsistent TIDs.
An important part of the presented framework is its initial- Namely some TIDs contained in the CDR logs do not
ization to the specific use case, such that highway traffic match any of the TIDs in the CR data. The UIIDs incon-
load monitoring, in our case, what consists of understand- sistencies are uniformly spread over the locations while
ing the data and selecting the towers to be considered rel- for the inconsistent TIDs is not possible to draw any con-
evant by the framework. clusion about the geographical regions affected. In case of
those inconsistent logs, it is not possible to get the subject
performing the action or the location of the event. For this
2.1 Telecom Data reason all the affected records, which are around 30% of
all the records, are affected and have to be handled (dis-
The industrial partner has granted access to a dataset1 con-
carded) by the framework during the on-line computation.
taining several .csv (Comma Separated Values) files or-
ganized in a daily basis containing two main kinds of in-
formation such that 2.2 Geographic Maps Data
In order to get information about Hungarian highways the
research relied on OpenStreetMap2 (OSM) data. The open
project makes available data about roads, trails, cafes, rail-
1 Due to the signed non-disclosure agreement between the academic way stations and other basic map features from all around
and the industrial partner of the project, each sample of the data, e.g. the the world.
Tables 1 and 2, presented in this paper are synthetic, i.e. contain fictive
information. 2 https://www.openstreetmap.org/
110 Andrea Galloni, Balázs Horváth, and Tomáš Horváth
UUID date-time TID
6776554S3449 2016-09-15 13:30:41 10B32F03E10CB
777865354435 2016-09-15 00:27:50 10DA0232324BC
677655453449 2016-09-15 17:44:08 00443344DFFEA
Table 1: A synthetic CDR data sample with fictive UUID, date-time and TID data, for illustration purpose .
TID Latitude Longitude
6776554S3449 46.291311 17.366325
777865354435 46.282342 17.357244
677655453449 46.366745 17.364112
Table 2: A synthetic CR data sample with fictive TID, Lat-
itude and Longitude data for illustration purpose.
First of all information about the Hungarian country
borders have been extracted, then the data regarding only
highways has been kept obtaining a .json file containing Figure 1: Density of cell towers for Budapest Urban and
informations about each highway divided in several seg- Eastern Rural Area, an illustrative example.
ments, where each segment contains information describ-
ing a small section of the highway (e.g.: name, type, speed
limit) and its location. The length of each highway’s sec-
tion depends on the topography of the area, the density of
the population and the radio-technology of the cell tow-
ers of the operator. The outcome of this phase, i.e. the
detected borders, can be observed in the Figure 6.
Detecting Relevant Towers In order to monitor the high- Figure 2: Relevant towers identification and towers com-
ways infrastructure traffic and exclude irrelevant informa- petence attribution, an illustrative example.
tion from the data model an additional filtering and select-
ing step has been performed. After this phase, only the
relevant towers that have a strict correlation with the high- as shown in Figure 2, each highway section is linked to a
ways infrastructure have been kept. specific cell tower while all the towers not related with the
The density of the cell tower placement and its spatial highways will not be considered by the framework while
characteristics represent a crucial issue in terms of space- processing.
resolution within the developed monitoring system. Pro-
vided the mostly flat characteristics of the Hungarian land-
scape, is possible to assume that the cell towers displace- 3 The System Architecture
ment is mostly not conditioned by the topographic proper-
ties of the surrounding areas but rather follows the density The framework has its roots in several components, i.e. the
of the population over the whole territory. This character- following data structures (called dictionaries, according to
istic of the cell towers placement is due to several factors Python notation) and logical units.
such as scalability, laws of physics and signal processing
theory. Figure 1 illustrates the density of the city of Bu- 3.1 Data Structures
dapest and its east country side area from which it is pos-
sible to observe that in rural areas cell towers are placed Relevant Towers Dictionary (RTD) RTD is one of the
close to the highway in order to provide the radio-signal to main data structure on which most of the others are based
travelers. on. In fact the RTD represents an HashMap having as keys
In order to detect cell towers which lead to the crowded- the identifiers (TID, see Table 2) of all the relevant towers
ness status of that specific section of the highway, an ad- and as values the geographical coordinates of given towers
hoc algorithm have been developed based on a QuadTree as strings. RTD is the result of the module for detection of
[1] data structure. The generated tree contains all the co- relevant towers, described above. This dictionary remains
ordinates of the cell towers that are listed in the CR data. constant in the framework and changes only if there are
Then, for every highway section the closest tower cell has changes in geographical locations of cell towers such that
been found querying the tree structure. After this process, a new tower is placed near the highways, for example.
Real-time Monitoring of Hungarian Highway Traffic from Cell Phone Network Data 111
User at Relevant Towers Dictionary (URTD) URTD is cific highway segment. It uses as an evaluation model de-
a HashMap and it has as keys all the UUIDs (Unique User scribed in the next section where the number of classes is
IDs, see Table 1) of the users whose previous logs were determined by a parameter.
triggered by relevant towers, namely, those logs who had The EM module is triggered every time a time frame
their TID appearing in the RTD keys, and, as values the (a parameter of the framework) expires. At this point it
TID of the tower the given user has been related to for the is time to evaluate the status of the whole system. At this
last time. At the startup of the system, this data structure is stage the EM iterates over all the TIDCD, computes all the
empty and at the beginning of a new day, due to the UUID means and standard deviations using the past data neces-
re-anonymization on a daily basis, it is reinitialized. sary to evaluate the actual status of all the relevant TIDs
and then finally classifies every segment of a given high-
way into one of the predefined classes. For example, in
Tower ID Counter Dictionary (TIDCD) TIDCD is a
case of 10 classes, the class 5-6 corresponds to normal
HashMap and it has as keys (TIDs) all the relevant towers
traffic, 7-8 to higher while 9-10 to very high traffic, 3-4
while it has as values counters representing the number of
to lower and 1-2 to very low traffic on a given segment of
users whose last log have been related to that specific TID.
the highway.
3.2 The Status Maintainer Module (SMM)
3.4 The Notification Delegate Module (NDM)
As soon as a log event is triggered, the SMM is delegated
The NDM is the part of the framework responsible to con-
to keep track of the last location of the user (attaching to
stantly collect the result of the EM and update the clients
him/her the proper TID in the URTD) and increases or de-
about the highways status sending all the needed informa-
creases the counter of the proper TID within the TIDCD
tions as a json payload. While this process takes place,
according to the situation. This module acts as a supervi-
an ID conversion is performed. In fact, the back-end and
sor moving the subscribers from a tower to another, keep-
the front-end of the framework, due to security reasons,
ing the model up to date with the information provided by
do not share the same internal IDs for representing the
the event logs. SMM is delegated to interpret and take de-
TIDs. Once finished, the control is passed to the Noti-
cisions based on the information provided from the logs
fication Delegate Module responsible for communicating
with the following functionality:
with the client (described below).
As soon as the application is started, both the URTD and
the TIDCD Hash Maps are empty. As the first incoming
log having a TID present in the RTD key-set is processed, 4 Classification of the Traffic Load
the SMM will fill the URTD with the UUID as key and the
TID as value, then, it will initialize the counter of the spe- In order to classify the load of a segment of highway, one
cific TID key within the TIDCD to 1. If a second log from should consider the past data as a reference for the eval-
the same user will come but, this time, with a different and uation of the new entries. Given the topographic features
relevant TID then the counter of the old TID (the last “lo- of the Hungarian landscape it is expected that the activity
cation” of the user) within the TIDCD will be decreased of cell towers close to the highway have a low static noise
by one unit and suddenly the corresponding URTD’s value (in [6] authors highlights that systems relying on cell tele-
will be updated to the actual TID inferred from the log com logs for traffic estimations within urban areas could
querying the RTD and, finally, the corresponding TIDCD suffer high loss of precision due to event logs triggered by
value (for the actual TID the user is connected to) will be non-traveling users). In fact, in this specific case the ma-
increased. In the last case, if the subscriber’s handset will jority of the event logs are generated mostly by travelers
trigger a log which is not related to a relevant tower, the and it is easily possible to observe that the number of trav-
counter of the old TID within the TIDCD will be decreased elers in a given time-range tend to be similar for different
by one and the key within the URTD corresponding to the week-days.
specific UUID will be removed (the user has left the set Provided the latter observation, it is possible to build a
of towers delegated to monitor the highways mobility sta- model such that in order to classify the load of a specific
tus). All the event logs (records) containing a non-relevant road segment in a precise time-range of the day (e.g.: be-
TID (user is not on a highway) and UUID not stored in tween 12:00 and 12:15) compares the value to be classified
URTD (user was not on a highway before) are immedi- against the values of load within the same time range of the
ately discarded. Figure 3 provides an illustration of the previous days. Here, a sort of seasonality of the traffic has
SMM logic. to be considered, e.g. there is more heavier load during the
rush hours while less traffic at nights.
3.3 The Evaluator Module (EM) An other, interesting phenomenon is that sometimes a
traffic anomaly can become normal with time. For ex-
The EM is the core module of the framework, which is ample, consider a longer construction work on a highway
responsible for classifying the status of the load of a spe- causing the close of some parts (e.g. lanes) of the road. In
112 Andrea Galloni, Balázs Horváth, and Tomáš Horváth
Figure 3: The logic diagram of the Status Maintainer Module
the time of its appearance, since it is sudden for the traf- they do not need further computation but may be deter-
fic, it is considered an anomaly and results in traffic jams. mined by simply looking them up in a statistical table, as
However, with time, the traffic normalizes such that peo- illustrated in the table 3 containing the breakpoints divid-
ple get used to it (e.g. start using alternative routes) and ing a Gaussian distribution in an arbitrary number (from 3
the notion of heavy load changes. to 10) of regions.
All of the above observations lead to the straightforward The final key-concept of the proposed model is that it
use of time-series for representing the given problem of does not represent the data in the past with a SAX repre-
traffic load monitoring and classification. sentation, however the system classifies a new entry using
the breakpoints generated making use of the data recorded
in past in a SAX-like fashion just performing a lookup on
4.1 Utilizing Breakpoints a small table.
An efficient way to detect anomalies in time series is
The proposed model for classification of traffic load in var-
that it is enough to compare the breakpoint determined for
ious highway segments utilizes well-known concepts in
the actual time frame to the breakpoints determined for
Symbolic Aggregate Approximation (SAX) of time series
the corresponding time frame(s) in the past, for example,
[2, 3]. SAX makes use of Piecewise Aggregate Approx-
the same time of the day before or the same time of the
imation (PAA), a computationally very cheap method [4]
same day a week before, etc. Depending on the number
since it operates with arithmetic mean and standard devi-
of classes to which the framework should classify the new
ation, both very cheap operations.
entry in the dataset, the right column of the statistical table
SAX allows a time series of arbitrary length n to be re-
has to be implemented in the system and then looked up.
duced to a string of arbitrary length w, (w n) with the
In order to give a wider freedom of tuning, this value has
alphabet size a > 2 (the number of letters used to represent
been kept as a parameter of the framework.
the time-series using SAX). In this process the data is di-
vided into w equal sized “frames”. The mean value of each
frame is calculated and a vector of these values becomes a Historical Data Representation The data from the past
reduced representation. For further details, refer to [2, 3]. are stored in an efficient-to-load binary format making use
Assuming that the distribution of the values over the of Python’s Pickle module. Files are stored in a daily for-
time-series follows a normal distribution, it is possible mat in the form of a HashMap of arrays having as keys the
to subdivide the time-series into so called breakpoints B. TIDs. Once loaded, the files are reshaped in MxN matri-
Breakpoints are a sorted list of numbers B = (β1 , · · · , βa−1 ) ces, one for each TID, where M is the number of time-
such that the area under a N(0, 1) Gaussian curve from β1 frames and N represents the number of days to consider in
to βi+1 = 1/a where β0 and βa are defined as −∞ and +∞, the past. All the tests in this research were performed set-
respectively. The advantage of utilizing breakpoints is that ting the time frames at 15 minutes and considering the last
Real-time Monitoring of Hungarian Highway Traffic from Cell Phone Network Data 113
a
βi 3 4 5 6 7 8 9 10
β1 -0.43 -0.67 -0.84 -0.97 -1.07 -1.15 -1.22 -1.28
β2 0.43 0 -0.25 -0.43 -0.57 -0.67 -0.76 -0.84
β3 0.67 0.25 0 -0.18 -0.32 -0.43 -0.52
β4 0.84 0.43 0.18 0 -0.14 -0.25
β5 0.97 0.57 0.43 0.14 0
β6 1.07 0.67 0.43 0.25
β7 1.15 0.76 0.52
β8 1.22 0.84
β9 1.28
Table 3: Statistical table for determining the values of breakpoints
15 days in the past. This representation have been chosen of time) over the validation data with traffic classification
because once the system is running in real-time with a con- level equal to 9 or 10. With this setup eleven out of four-
tinuous data-stream then it is easy to shift the matrix data teen (79%) congestions have been successfully detected.
and recompute all the means and standard deviations effi- The time series have been quantized with time windows
ciently. This shifting and re-computation operation would of 15 minutes, thus during the testing phase the classifica-
be performed once per day at a given time (e.g. midnight) tion takes place each fifteen minutes.
exactly when the system is subject of a reinitialization or It is interesting to note that the majority of the anoma-
when the log re-anonymization process is performed. lies are detected earlier than the data provided by utinform.
This can be a proof that the proposed solution is able to
spot congestions when these are shorter than three kilome-
5 Experiments
ters. The outcome is similar for what concerns the end of
the congestions, in this case the proposed solution tends to
Given the lack of open systems providing precise infor-
detect the end of an anomaly later than the validation data
mation about the traffic flows over time we validated the
set. Three congestions out of fourteen have not been de-
system over specific traffic congestions. The goal of the
tected, however this can be due to at least two reasons: the
experiment is to validate the method looking for a correla-
telecom operator (because of industrial secrecy concerns)
tion between the highways status and the telecommuni-
did not provide to us any detail regarding the range of an-
cation infrastructure. In order to validate the results of
tennas or their direction, thus, there might be a chance of
the developed system authors asked the collaboration of
inaccuracies along the phase of matching the geographi-
utinform3 , a department of Magyar Közút Nonprofit Zrt.
cal maps with the towers’ positions in order to define the
whose responsibility is to collect information and moni-
competence of each tower w.r.t. the highways segments.
toring the roads traffic. They gather information from het-
Another reason could be due to data inconsistencies. In
erogeneous data sources: the employees of the company
fact, as mentioned in Section 2, one third of the data have
monitoring the highways, the county directorates and the
been dropped.
engineering departments employees. Utinform, in order to
cross-validate the crowd sourced data sets, is also coop- Figure 4 represents the classification value over time
erating with institutions such as police, disaster manage- slices of fifteen minutes for the highway segment suffer-
ment, and public transport which have their own feed to ing for the congestion described in Table 4 on row three.
post their information to the system. Furthermore, the sys- The red (in black and white print: the dark) vertical line
tem has a crowd source interface, where people can submit represents the time slot when the anomaly is detected on
experienced traffic anomalies. The goal of utinform is to the validation set. Here, a congestion is defined when the
provide fast and validated traffic information to the drivers classification values are equal or greater than 9. On the
in whole Hungary. other end, Figure 5 represents the number of handsets (an
The data received include information regarding traffic estimation of the number of travelers) involved in the con-
congestions on all the Hungarian highways. The dataset gestion.
contained data from October the 1st to October the 14th, The application framework have been completely devel-
2016 regarding congestions with a length of at least 3 kilo- oped in Python 3.6 and it have been demonstrated to be
meters. Table 4 shows the validation data and the results very efficient. Although the system is designed to work
of the proposed framework. on-line receiving a stream of data in real-time, in order
Setting the number of classes to 10. We define a cor- to measure the performances of the model, we decided to
rect detection when there is at least 50% overlap (in terms exclude the streaming and the networking module, finally
we tested the model in off-line mode. One log per time
3 http://utinform.hu/ is read and suddenly processed. With this approach we
114 Andrea Galloni, Balázs Horváth, and Tomáš Horváth
Cong. Start Time Cong. End Time Det. Start Time Det. End Time
11:00 11:40 10:30 12:15
06:35 07:45 07:15 07:45
09:30 22:10 09:00 00:00
17:55 09:05 none none
06:52 07:30 07:00 08:15
14:50 16:50 16:45 17:30
08:40 18:10 08:15 18:45
14:00 17:10 14:45 19:25
10:00 11:20 08:45 13:45
17:23 20:15 16:15 20:00
07:20 08:50 07:30 10:30
02:55 04:40 none none
15:00 16:00 15:15 16:15
06:45 07:25 none none
Table 4: Experimental Results indicating congestion (Cong.) start and end times as well as detection (Det.) start and end
times.
Figure 4: Classification values over time slices and the Figure 5: Number of handsets involved in a congestion and
time of a congestion (red/dark line), an illustrative exam- the time of the congestion (red/dark line), an illustrative
ple. example.
maintained the architectural design of the framework and is a web based application, it uses HTML and JavaScript,
no-delay straming process has emulated while at the same which communications with the Python back end through
time avoiding networking related delays. Running the ap- Web Sockets with .json files. At the initialization stage,
plication on an Intel i7 6700K with 16GB of DDR4 RAM, first the front end asks the back end for mapping of tower
on average, manages to process the logs for an entire day IDs and highway sections, as it was mentioned before at
within less than 10 minutes with a peak of RAM con- the geographic data section. After the initial step, the back
sumption of 32MB. Furthermore, the application frame- end is sending messages which the front end processes and
work have been deployed on a Raspberry Pi 3 where shows on the map. These messages are json files, con-
the running time needed to process an entire set of logs taining a list of objects with three attributes: ID, value,
representing one day is in around 1 hour, in average. anomaly. The ID stands for the tower IDs and the value is
an int from 0 to 9 showing the traffic load on that segment,
5.1 Visualization and the anomaly flag is giving information that this value
is the expected for the given time window or it is deviating
First the communication interface need to be mentioned from the usual traffic load on that area. In order to keep the
between the back end and the front end part. The front end low cost functionality of the system, the visualization had
Real-time Monitoring of Hungarian Highway Traffic from Cell Phone Network Data 115
to be carefully built as well. An open-source JavaScript Acknowledgements
library, LeafletJS4 was used to place layers on the par-
ticular highway segments. This library has relatively low Authors would like to thank Magyar Telekom Nyrt. and
cost of handling layers and as it is expected from an ap- utinform.hu. The research has been supported by the
plication where the traffic load is constantly changing this European Union, co-financed by the European Social
was a critical feature. Each layer is a colored visualization Fund EFOP-3.6.3-VEKOP-16-2017-00001 and the project
of the value given to the particular highway section, an ex- “Open City services” funded by Magyar Telekom Nyrt.
ample can be seen on the Figure 6, where the spectrum is
from blue to red, representing the low to high traffic load. References
[1] R. Finkel, J.L. Bentley (1974). Quad Trees: A Data Structure
for Retrieval on Composite Keys. Acta Informatica 4 (1), 1–
9.
[2] Lin, J., Keogh, E., Lonardi, S. and Chiu, B., 2003. A
symbolic representation of time series, with implications
for streaming algorithms. In Proceedings of the 8th ACM
SIGMOD workshop on Research issues in data mining and
knowledge discovery (pp. 2-11). ACM.
[3] Lin, J., Keogh, E., Wei, L. and Lonardi, S., 2007. Experienc-
ing SAX: A Novel Symbolic Representation of Time Series.
Data Mining and knowledge discovery, 15(2), pp.107-144.
[4] Zhang, Y. and Glass, J., 2011. A piecewise aggregate ap-
proximation lower-bound estimate for posteriorgram-based
Figure 6: The traffic load visualized on the highway seg- dynamic time warping. 12th Conference of the International
ments, an illustrative example. Speech Communication Association, pp.1909-1912.
[5] Senin, P. and Malinchik, S., 2013, December. Sax-vsm: In-
terpretable time series classification using sax and vector
space model. In Data Mining (ICDM), 2013 IEEE 13th In-
6 Conclusions ternational Conference on (pp. 1175-1180). IEEE.
A lightweight traffic monitoring and traffic jam detec- [6] Milani, A., Gentili, E. and Poggioni, V., 2009. Cellular Flow
in Mobility Networks. IEEE Intelligent Informatics Bulletin,
tion framework has been presented in this paper based on
10(1), pp.17-23.
HashMap data structures and methods for breakpoint de-
[7] Hakkarainen, A., Werner, J., Costa, M., Leppanen, K. and
tection well-known in time series classification. The used
Valkama, M., 2015, September. High-efficiency device lo-
concepts require cheap computation and, basically, mini- calization in 5G ultra-dense networks: Prospects and en-
mal tuning phase opposite to the case of tuning the hyper- abling technologies. In Vehicular Technology Conference
parameters of machine learning algorithms. The few pa- (VTC Fall), 2015 IEEE 82nd (pp. 1-5). IEEE.
rameters of the framework such that the time window or [8] R. Khokale, A. Ghate (2017). Data Mining for Traffic Pre-
time frames as well as the number of breakpoints can be diction and Analysis using Big Data. International Journal of
set according to an available domain knowledge or user Engineering Trends and Technology 48 (3).
expertise. However, to avoid false positives, it is recom- [9] Gundlegard, D. and Karlsson, J.M., Road Traffic Estima-
mended to tune the presented framework before imple- tion using Cellular Network Signaling in Intelligent Trans-
menting it into a production environment. portation Systems, 2009. In: Wireless technologies in In-
The SAX representation had already been used as a tool telligent Transportation Systems. Editors: Ming-Tuo Zhou,
for time series classification. However, the proposed dis- Yan Zhang and L. T. Yan. ISBN: 978-1-60741-588-6 2009
cretization procedure is unique in that it uses an interme- pp. 361-392. Nova Science Publishers.
diate representation between the raw time series and the [10] White, J. and Wells, I., 2002. Extracting origin destina-
symbolic strings. Furthermore the aim is not to classify tion information from mobile phone data. In 11th Interna-
full time series as in [5] but rather to classify new incom- tional Conference on Road Transport Information and Con-
ing single values. trol, London, 2002, pp. 30–34
The presented framework was tested using real data. [11] Wideberg, J.P., Caceres, N. and Benitez, F.G., 2006. Deriv-
Due to the sensitive nature of the project the presented ing Traffic Data from a Cellular Network. In Procedings of
the 13th ITS World Congress, London, 2006.
framework was developed within, the authors cannot dis-
close the source code nor the data used in experiments, in
this time. Experiments show that the proposed framework
is promising and worth further development and adapta-
tion to other use-case scenarios.
4 http://leafletjs.com