Towards detection of faulty traffic sensors in real-time


Nikolas Zygouras, Nikolaos Panagiotou,
Ioannis Katakis, Dimitrios Gunopulos                                {NZYGOURAS , N . PANAGIOTOU , KATAK , DG}@ DI . UOA . GR
University of Athens, Athens, Greece
Nikos Zacheilas, Ioannis Boutsis, Vana Kalogeraki                                 {ZACHEILAS , MPOUTSIS , VANA}@ AUEB . GR
Athens University of Economics and Business, Athens, Greece


                          Abstract                                   Automatic identification of anomalies in streaming data is
                                                                     an emerging field of research due to the large number of
     Detecting traffic events using the sensor network               applications (intrusion detection, event identification, etc).
     infrastructure is an important service in urban en-             Many algorithms that utilize machine learning and time se-
     vironments that enables the authorities to han-                 ries analysis techniques have been successfully applied for
     dle traffic incidents. However, irregular mea-                  the detection of unexpected events during the last years (Yi
     surements in such settings can derive either from               et al., 2000). These methods offer high quality results and
     faulty sensors or from unpredictable events. In                 are able to perform on massive data streams in real-time.
     this paper, we propose an efficient solution to                 An interesting use-case is the automatic analysis of traffic
     resolve in real-time the source of such irregu-                 data generated by Smart Cities infrastructures. Human per-
     lar readings by examining the correlation and the               sonnel are unable to monitor and efficiently identify prob-
     consistency among neighbor sensors and exploit-                 lems on these data. The utilization of anomaly detection
     ing the wisdom of the crowd. Our experimental                   techniques would provide great assistance to traffic opera-
     evaluation illustrates the efficiency and practical-            tors as it would enable the automatic real-time identifica-
     ity of our approach.                                            tion of traffic issues.
                                                                     Recently, Crowdsourcing has emerged as an attractive
1. Introduction                                                      paradigm to exploit the intelligence of ubiquitous human
                                                                     crowd (citizens) to extract useful information. Traditional
Sensor network infrastructures have been widely used for             Crowdsourcing systems such as AMT1 , CrowdFlower2 ,
traffic management in smart cities to provide important ser-         etc., constitute marketplaces for human intelligence tasks
vices for the benefit of pedestrians, cyclists, motorists and        (HITs), that allow a requester to define a task, which is per-
public transport. Such services are typically provided by            formed by other human workers in exchange for a reward.
analyzing data provided by heterogeneous static and mo-              For example, mobile human workers with different char-
bile sensors. This enables the implementation of numerous            acteristics can be queried for geo-located tasks to extract
applications like proposing alternative routes, altering traf-       real-time information without needing an expensive infras-
fic lights, etc.                                                     tructure(Boutsis & Kalogeraki, 2014).
The most common type of sensor which is utilized in such             In this paper we develop an efficient approach that identi-
environments is the SCATS sensor. They are static sensors            fies faulty readings from traffic sensors by examining the
embedded at the city roads providing rich, real-time infor-          correlations among them and by taking advantage of the
mation such as traffic flow measurements based on vehicles           ubiquitous citizens through Crowdsourcing. We summa-
that cross a specific segment. Despite their utility in many         rize our contributions below:
traffic applications, SCATS sensors can be faulty. Thus,
one fundamental challenge in these settings is how to ef-
                                                                         • We present an efficient approach that identifies
ficiently distinguish between irregular and faulty measure-
                                                                           anomalous sensors and uses Crowdsourcing to resolve
ments before taking any unnecessary actions.
                                                                           whether irregular measurements are due to faulty sen-
Proceedings of the 2 nd International Workshop on Mining Urban
                                                                           sors or irregular traffic.
Data, Lille, France, 2015. Copyright c 2015 for this paper by its        1
                                                                             http://www.mturk.com/
authors. Copying permitted for private and academic purposes.            2
                                                                             http://www.crowdflower.com/


                                                                    8j
                                                         ICML Workshop

                                                                   LiveDrive radio where users can report traffic, and (iv)
                                                                   pedestrian counters.

                                                                   2.2. System model
                                                                   In this section we provide our system model for the data
                                                                   sensors that we examine, namely the SCATS sensors and
                                                                   Crowdsourcing.
                                                                   SCATS Sensors. SCATS (Sydney Coordinated Adaptive
                                                                   Traffic System) is an innovative computerized traffic man-
                                                                   agement system developed by Roads and Maritime Ser-
                                                                   vices (RMS) Australia. SCATS sensors are fixed mag-
Figure 1. The SCATS sensors locations at Dublin’s city centre      netic sensors deployed on intersections to measure the traf-
                                                                   fic flow and the degree of saturation of roads’ lanes. In
  • We tackle the problem of automatically detecting               Dublin city, each SCATS sensor produces and transmits a
    anomalous SCATS sensors with three methods: (i)                new record every minute. Each record contains information
    Pearson’s correlation, (ii) cross-correlation and (iii)        related to the timestamp t of the measurement, the sensor’s
    multivariate ARIMA model. The proposed methods                 ID i and finally the degree of saturation and traffic flow
    have to tackle the task efficiently in real-time.              measurements. In the provided dataset there are approxi-
                                                                   mately 300 SCATS controlled intersections and 1000 dif-
  • We develop our approach using the Lambda-                      ferent SCATS sensors throughout the road network. The
    Architecture which combines a batch processing                 GPS locations of the SCATS sensors are presented in Fig-
    framework (i.e. Hadoop3 ) and a distributed stream             ure 1. Degree of saturation measures how much a road’s
    processing system (i.e. Storm4 ) for efficiently pro-          lane is utilized, while traffic flow measures the vehicles’
    cessing both historical and real-time data.                    volume divided by the highest volume that has been mea-
                                                                   sured in a sliding window of a week5 . In this work we
  • We develop a Crowdsourcing system used to extract              decided to monitor the degree of saturation value, noted as
    answers from the human crowd based on the MapRe-               s, as it more reliable and informative than the traffic flow.
    duce paradigm.                                                 The degree of saturation of a particular SCATS sensor with
                                                                   ID i at the timestamp t is noted si,t .
  • We provide an experimental evaluation, which illus-
    trates that our approach is practical and can effectively      Crowdsourcing. Our crowdsourcing system comprises a
    identify irregular measurements in real-time.                  set of human workers denoted as wj which are able to re-
                                                                   ceive task assignments. Tasks are being inserted to the sys-
                                                                   tem by an authority, such as the DCC. Each task tk is as-
2. Problem Description and System Model                            sociated with a number of attributes as < idk , latitudek ,
2.1. Smart City                                                    longitudek , rewardk , descriptionk >. Hence, every
                                                                   task posses a unique identifier (idk ), the geographical co-
Smart cities exploit digital sensor devices that can be either     ordinates of the location that the task involves (latitudek ,
embedded at the city infrastructure or they can be mobile          longitudek ), the corresponding reward (rewardk ) for ex-
(e.g., smartphones) in order to provide services for their cit-    ecuting the task and a task description that describes the in-
izens that enhance their well-being. Such services may re-         formation that needs to be provided by the human worker.
late to traffic management, housekeeping information, etc.         An example of such a task description is: “Is there traffic
In this paper we focus on Dublin, a smart city that uti-           in O’Connell Street? Yes/No”. Finally each response pro-
lizes sensors for supervising and managing road traffic            vided by a worker is captured with a record by our system
(Kinane et al., 2014). In Dublin the traffic is controlled         using the worker and the task identifiers, coupled with the
by the Dublin City Council (DCC), which is responsible             response as follows: < wj , idk , responsejk >.
to develop, maintain and manage the city road network.
To achieve that they exploit several heterogeneous data            3. Architecture
sources that include: (i) SCATS sensors which are em-
bedded on the road and monitor real-time traffic density,          In Figure 2 we display our system architecture which
(ii) GPS traces from sensors embedded on buses, (iii) the          consists of the following components: (i) a Distributed
                                                                       5
   3
     https://hadoop.apache.org/                                       http://dublinked.com/datastore/datasets/
   4
     https://github.com/nathanmarz/storm                           dataset-274.php


                                                                  89
                                                                                  ICML Workshop

                          Distributed Database           MapReduce Job                    and low response latencies (McCreadie et al., 2013). Fur-
                                                                                          thermore, we decided to use Storm due to its scalability fea-
   Processing
     Batch


                                                                                          tures that we also exploit in our previous work (Zygouras


                                                       Map


                                                                         Reduce
                                                                                          et al., 2015). Storm users can change the parallelism of
                                                                                          the processing components to adapt to possibly workload
                                                                                          bursts.
   Processing


                                                                                          Finally, for the analysis of the historical sensor data we
     Stream


                       Input Source Preprocessing
                                        rocessing              CrowdSourcing
                                                               CrowdSou
                                                                   dSourcing              used the most commonly used open-source implementation
                                                    Analysis                              of the MapReduce programming model, Hadoop. We ex-
                                                                                          ecute periodical (i.e. at the end of each day) Hadoop jobs
       Applications


                                                                                          for computing the basic metrics required by our proposed
          Users


                                                                                          techniques, described in more detail in Section 4. Our jobs
                                                                                          retrieve historical data from a distributed database, more
                                                                                          specifically MongoDB9 . We decided to use MongoDB in-
                                                                                          stead of the Hadoop Distributed Filesystem (HDFS), as we
                                Figure 2. System Architecture                             want to have fast access to the data from the DSPS compo-
                                                                                          nent of our architecture, for computing and storing short-
                                                                                          term statistics in real-time.
Stream Processing System (DSPS), (ii) a batch processing
framework, (iii) a distributed database system, and (iv) the
Crowdsourcing component that consists of the users’ mo-                                   4. Methodology
bile devices. Our architecture is an instance of the Lambda-
                                                                                          The goal of this work is to monitor the streaming traf-
Architecture6 as we exploit the fast processing offered by
                                                                                          fic data and automatically pose Crowdsourcing tasks when
DSPS and the fault-tolerance and parallelism provided by
                                                                                          anomalous sensors are identified. In order to identify
current batch processing frameworks.
                                                                                          anomalous sensors we propose three different outlier tests
Incoming SCATS-sensor data are forwarded to a stream                                      that examine whether the SCATS sensors behave differ-
processing graph. These data are pre-processed and stored                                 ently from their normal behavior. These outlier tests
in the Distributed Database (i.e. Preprocessing component                                 are based on the following statistical measurements: (i)
in Figure 2) for further processing by the batch processing                               Pearson’s Correlation (ii) Cross-Correlation and (iii) the
component. We analyze the reported metrics via the Anal-                                  ARIMA Model. The normal behavior for each sensor is
ysis component which examines if one of the sensors de-                                   calculated offline using the historical data. These meth-
viates significantly from its neighbors so it could possibly                              ods are implemented using the Lambda architecture and
be a faulty sensor. This component uses both the current                                  Crowdsourcing tasks are assigned to users when anoma-
conditions and historical data for identifying such condi-                                lous SCATS sensors are identified.
tions. In case that one such sensor is detected, the Analysis
component informs the Crowdsourcing component about                                       4.1. Identifying Anomalous Sensors
this situation. The latter is responsible to send the appro-
priate Crowdsourcing tasks that will enable us to detect if                               In this section we describe the three statistical measure-
the sensor is a faulty-one. Finally, the batch processing                                 ments that are used and we explain how these are utilized
component periodically computes new statistics about the                                  to detect anomalous SCATS sensors. Initially we applied a
historical sensor data.                                                                   simple statistic measurement named Pearson’s correlation
                                                                                          that identifies the correlation between pairs of SCATS sen-
There are multiple DSPSs which support low latency pro-                                   sors. Then we used an extension of the first method, named
cessing in real-time. Some of these systems are Apache                                    cross-correlation, to identify how many lags we should
Storm, Spark Streaming7 and TUD-Streams (Bockermann                                       shift backward a sensor’s values to maximize its pairwise
& Blom, 2012). We used Storm as the DSPS that will                                        correlation with another adjacent sensor. The first two ap-
perform the real-time processing of incoming sensor data.                                 proaches use two well known measures in time series anal-
Storm is one of the most commonly used DSPS, and is sup-                                  ysis. The disadvantage is that they check pairs of sensors
ported by major companies such as Twitter8 . It has been                                  and not the group of sensors as a whole. For this reason we
successfully applied for processing high volume of data in                                applied a third approach that can be thought as a multivari-
different application domains, achieving high throughput                                  ate ARIMA model which deals with the aforementioned
   6
     lambda-architecture.net
                                                                                          problem and is faster than the other approaches.
   7
     https://spark.apache.org                                                                9
                                                                                                 http://www.mongodb.org/
   8
     http://twitter.com


                                                                                        88
                                                       ICML Workshop

4.1.1. P EARSON ’ S C ORRELATION                                  In order to identify anomalies with cross-correlation we
                                                                  followed a similar approach to the one utilizing the Pear-
The Pearson’s correlation coefficient is a well known statis-
                                                                  son’s correlation measure, described before. The main dif-
tic that measures the linear relationship of two variables X
                                                                  ference is that we identified, using historical data, the lag
and Y . It takes values in [−1, 1], where 1 means that the
                                                                  dmax that maximized the correlation between two SCATS
variables are positively correlated, −1 stands for negative
                                                                  sensors X and Y . In the streaming analysis in order to cal-
correlation and 0 for no correlation between X and Y . The
                                                                  culate the cross-correlation between the sensors we shifted
Pearson’s correlation, noted ρX,Y , is calculated by dividing
                                                                  dmax lags backward the Y and we calculated its correla-
the covariance of X and Y with the product of the standard
                                                                  tion with X. Finally, we measured how much the streaming
deviations of X and Y (see Equations 1 and 2).
                                                                  cross-correlation deviates from the offline calculated cross-
                             cov(X, Y )                           correlation between X and Y using the optimal lag value
                    ρX,Y =                                (1)     dmax .
                               σX σY
          cov(X, Y ) = E[(X − µX )(Y − µY )]              (2)     4.1.3. M ULTIVARIATE ARIMA M ODEL
In our scenario we calculated the pair-wise correlation be-       A common strategy to detect outliers in multivariate time
tween all SCATS sensors X and Y whose spatial distance            series (Yi et al., 2000) is to build a regression model for
does not exceed a predefined threshold. This restriction          each time series and evaluate whether the actual values vary
creates a sparse correlation matrix that contains non-zero        significantly from the predictions. The model receives as
elements when SCATS sensors are spatially adjacent. We            input the previous L degree of saturation measurements for
calculate the sparse correlation matrix from the historical       a particular sensor with ID = 0 and the sensor’s N nearest
data. Then, utilizing the streaming data that arrive contin-      SCATS sensors {si,j : i ∈ [0, N ], j ∈ [0, L], i, j ∈ Z}.
uously in our system we periodically calculate the stream-        The goal of the model is to make the best prediction for
ing correlation of the adjacent SCATS sensors. We note as                             ˆ . The model is presented in detail in
                                                                  s0,t , denoted as s0,t
noisy sensors’ pairs those that their streaming correlations      Equation 5. This model can be thought as a multivariate
disagree significantly with the correlations calculated from      ARIMA model, as multiple sensors are used in order to
the historical data. If a particular sensor disagrees signif-     make the predictions.
icantly with the majority of his neighbors then a crowd-
sourcing task is posed.                                                ˆ =φ0,1 s0,t−1 + · · · + φ0,L s0,t−L +
                                                                      s0,t
                                                                           φ1,0 s1,t + φ1,1 s1,t−1 + · · · + φ1,L s1,t−L +
4.1.2. C ROSS -C ORRELATION                                                                                                                  (5)
                                                                           ...
Cross-correlation is a statistical measure of similarity be-                φN,0 sN,t + φN,1 sN,t−1 + · · · + φN,L sN,t−L
tween two variables X and Y as a function of the lag of one
relative to the other. More specifically cross-correlation be-    In the training phase we use the historical degree of sat-
tween X and Y is calculated by shifting forward or back-          uration values in order to calculate the coefficients Φ of
ward Y and calculating its correlation coefficient with X.        Equation 5. In order to solve this problem we created the
Cross-correlation with lag d, noted ρX,Y (d), is calculated       matrix A and vector b containing the input data (degree of
as seen in Equation 3. The numerator of the equation calcu-       saturation values) and the target values respectively. The Φ
lates the covariance of X and Y shifted d time bins back-         parameters are the values that optimally solve Equation 6.
ward. Finally the denominator is the product of the stan-         The solution of this system is given with the pseudo-inverse
dard deviations of X and the lagged Y .                           transformation of the input presented in Equation 7. The
                     P                                            key property of this approach, in contrast to the two pre-
                         [x(i) − µX )(y(i + d) − µY )]            viously described techniques, is that it monitors the differ-
  ρX,Y (d) = pP i                     pP
                     i (x(i) − µX )
                                    2
                                         i (y(i + d) − µY )
                                                            2     ent sensors together as a whole. The Pearson’s correlation
                                                           (3)    and the cross-correlation approaches investigated only pair-
                                                                  wise correlation between SCATS sensors, ignoring poten-
A traffic anomaly at a particular location, in a road net-
                                                                  tially useful information. On the other hand, the ARIMA-
work, may require some time in order to be propagated to
                                                                  based method aims at exploiting this information.
the adjacent sensors. This observation motivates us to con-
sider the cross-correlation between adjacent SCATS sen-                  Φ = [φ0,1        ...    φ0,L    ...     φ2,0     ...     φ2,L ]
sors. More specifically we calculated the dmax that maxi-
mized the correlation between two adjacent sensors X and
                                                                                                                                                   
                                                                      s0,t−1       ...          s0,t−L    ...      sN,t         ...        sN,t−L
Y (see Equation 4).                                                 s0,t−2        ...      s0,t−L−1      ...     sN,t−1        ...    sN,t−L−1 
                                                                  A=
                                                                     ...          ..           ..        ..         ..         ..         ..   
                                                                                      .          .           .        .            .        .
                                                                                                                                                
                dmax = arg max(ρX,Y (d))                  (4)
                             d                                         s0,L        ...         s0,0       ...      sN,L         ...      sN,0


                                                                 8e
                                                       ICML Workshop
                                               ⊤
                 b = [s0,t    ...   s0,L+1 ]


                             b = AΦ                       (6)

                    Φ̂ = (AT A)−1 AT b                    (7)

In order to integrate this approach we split the historical
data in training and test set. Initially we calculated of-
fline, using the training set, the Φ̂ parameters. These pa-
rameters are the coefficients regarding the sensor’s previ-
ous measurements and its adjacent sensors’ past measure-
ments. Then we calculated how well the data fitted to these       Figure 3. Crowdsourcing Application (a) Main Application, (b)
models computing for each sensor its Mean Absolute Er-            Push Notification, (c) Map Task
ror (MAE). Finally, in order to identify anomalous SCATS
sensors while monitoring the streaming data we compute
                                                                  Reduce tasks are responsible for computing the metrics de-
for each sensor its MAE at a particular time window. We
                                                                  scribed in Section 4.1 and store the results in MongoDB.
label a sensor as ‘anomalous’ if its streaming MAE notice-
ably differs from its MAE measured using the testing set.         In the stream processing component, we implemented a
                                                                  Storm topology (see Figure 2) that processes the real-time
4.2. Implementation                                               sensor data. We exploit the parallelism offered by Storm by
                                                                  having multiple instances of our Analysis component, run-
Our system calculates the correlation among adjacent              ning in parallel, in order to decrease latency. The topology
SCATS sensors. This is achieved by adding the SCATS               pre-processes the incoming data and stores them in Mon-
sensors’ GPS locations in a k-d tree data structure during        goDB. Also the pre-processed data are sent to the compo-
system initialization and calculating the k nearest SCATS         nent that invokes the three different techniques. Again we
sensors for each sensor. Furthermore, we developed our            partition the data based on the offline mapping, to guaran-
system using the Lambda architecture. So we should en-            tee that all neighboring data will be processed by the same
sure that the required data are transmitted to the appropriate    component’s instance. Detected events are forwarded to the
cluster nodes. Thus, we created a mapping of each SCATS           Crowdsourcing component that is responsible to inform the
sensor ID to one or more cluster nodes. This guarantees           users that will help us detect if the sensor is faulty.
that each computing node contains all the required data for
a sensor’s adjacent sensors.
                                                                  4.3. Crowdsourcing System
We define three parameters that help us configure the com-
                                                                  Misco. Our Crowdsourcing system has been developed us-
ponents of our system. The first one is job periodicity
                                                                  ing the Misco framework (Dou et al., 2011; 2010; Kakan-
and defines when the batch jobs should re-execute (e.g.
                                                                  tousis et al., 2012), which is based on the MapReduce
each day, every week). The other two control the
                                                                  paradigm and tailored for mobile devices to provide an ex-
stream processing computations. More specifically, the
                                                                  tensible and efficient way to develop distributed applica-
stream threshold parameter defines how often we should
                                                                  tions.
re-compute the examined metrics (e.g. every ten minutes),
while time window defines the sliding time window (e.g.           Our Crowdsourcing system is structured using (i) a Mas-
the previous hour) that will be used for keeping the past         ter Server that keeps track of the tasks tk submitted when
sensor data necessary for the computations.                       anomalies are detected from SCATS sensors, assigns them
                                                                  to human workers wj and returns the responses to the sys-
As we described in Section 3, we periodically invoke
                                                                  tem, and (ii) the Workers who are the human contributors
Hadoop jobs that compute the different metrics we ex-
                                                                  that process the crowdsourcing tasks. Each Worker is re-
plained in Section 4.1. Map tasks read the pre-processed
                                                                  sponsible to process queries and return the results to the
sensor data from the MongoDB, and send them to the re-
                                                                  server. These tasks are executed by workers through their
duce tasks. We partition the data based on the SCATS sen-
                                                                  personal smartphone devices or tablets.
sor ID to cluster node mapping. The idea is that neigh-
boring sensors should always end up on the same reduce            Task assignment. Suppose that we need to exploit Crowd-
task in order to appropriately compute the examined met-          sourcing to determine the source of an event using a task
rics. Each sensor may belong to more than one nodes in            tj . We describe the step-by-step sequence followed so as to
such cases we send the tuple multiple times (i.e. equal to        process the task and return the results. In the implementa-
the number of nodes it is part of) to avoid information loss.     tion described below we considered Android-based devices


                                                                 8d
                                                         ICML Workshop

and thus we have utilized the Android SDK10 .                                    Parameter                 Value
                                                                                 job periodicity          24 hours
For every task tj , the Master Server spans the task to a set
                                                                                 stream threshold        10 minutes
of map tasks that need to be forwarded to the human work-
                                                                                 time window               1 hour
ers wk . Since these tasks are geo-located only the workers
that reside close to the specific selection need to be selected                Table 1. Basic Configuration Parameters
by the Master Server to provide information. However, in
order to avoid tracking the users we follow a different pol-
icy. We forward the task to all the workers and the tasks are
locally filtered at the mobile devices if their location is far     to the same LAN and their clocks were synchronized us-
from the location of the task tj .                                  ing the NTP protocol. The frameworks we used were the
                                                                    following: Storm 0.8.2, Esper 5.1 and MongoDB 2.6.5.
We use Push Notifications services to initiate the communi-
cation with the human workers, to be able to send the Map           In Table 1, you can see the values of the basic configuration
task to the users without being restricted by their connec-         parameters described in Section 4.2. For the experiments,
tion (WiFi, 3G, etc). Such services exist in all major mobile       we used SCATS data from the period of April and May
operator systems and allow users to register for message            of 2014. The distance threshold used for the neighbour-
delivery when they are online through a connection server.          ing sensors computation was set to 250 meters. Data from
                                                                    April were used in order to calculate the historical corre-
In order to be able to receive map tasks, each user first           lations, cross-correlations as well as the ARIMA models.
needs to login to our system so that the Master server will         On the other hand, data from May were used for different
be aware of the user. At the same time the user also regis-         experimental runs (see below).
ters in the push notification service to retrieve its unique id.
During normal operation the Crowdsourcing applications              For the Pearson Correlation method we have stored the his-
runs in the background (Figure 3a).                                 torical correlations of the neighbour-pairs in the MongoDB
                                                                    component. 7116 neighbour pairs were identified under the
When the Master Server retrieves a new task tj from the             distance threshold from a set of 900 SCATS sensors. The
requester, it delivers a push notification to the user devices      correlation value ranged from almost perfect correlation,
with the task, through the Push Notification service. Once          for sensors of the same junction under different lane, to no
the device receives the notification it examines whether the        correlation at all for more distant sensors. Negative corre-
user current location is close to the location of the task so       lation values between nearby sensors were also observed.
as to alert the user (Figure 3b). Next, if the user selects the     This could be explained by the opposite direction of the
notification on his mobile device the Crowdsourcing appli-          lane the sensors are responsible for. In Figure 4, the cor-
cation is triggered and the task will be displayed in the user      relation matrix for a set of 30 nearby sensors is presented.
screen to process the task (Figure 3c).                             As expected, clusters are formed by adjacent sensors that
Finally, the responses for each map task are forwarded to           are highly correlated. Thus, it is reasonable to argue that
the Master Server that initiates the reduce phase to aggre-         when the expected correlation is not observed there might
gate the answers. The reduce phase is performed through             be a problem with the sensor. For the Cross-Correlation
Majority Voting. Hence the Master Server identifies the re-         method apart from storing the correlation value itself we
sponse responsejk for task tk with the maximum amount               have also stored the time lag that maximizes the pair-wise
of answers from all users wj and forwards the response that         sensor correlation. The time lag range was set to a maxi-
represents the cause of the event to the system.                    mum of 10 minutes since the sensors are quite close to each
                                                                    other and larger time lags are unlikely to significantly favor
Crowd Feedback. The response retrieved by the crowd-                the correlation value. In addition, the larger the time lag
sourcing component enables the system to determine                  range is, the more computationally demanding the method
whether the irregular readings derive from an unexpected            will be. As it was expected, in most cases the highest cross-
event (e.g., roadworks) or if the sensor is indeed faulty           correlation was observed with no time lag at all, since most
when most of the workers answer “None of the above”.                sensor pairs are responsible for different lanes of the same
                                                                    highway junction. However, for more distant sensor-pairs
5. Evaluation                                                       responsible for different highway junctions, small time lags
                                                                    gave a boost on their correlation value. One way to under-
We have evaluated our proposals on our local cluster con-           stand this is because vehicles require a short time to reach
sisting of 4 VMs. Each VM had two CPU processors at-                consecutive junctions. In addition, this behaviour could be
tached and 3, 096 MB of RAM. All VMs were connected                 also explained by the operation of traffic lights that trans-
  10
       Android platform: http://www.android.com/                    fer the traffic from junction to junction on fixed time inter-
                                                                    vals. Figure 5 depicts the distribution of the optimal time


                                                                   83
                                                                                                                             ICML Workshop

                                                                                                                                                                                                                    QQ Plot
                              Correlation                                                                             3500                                                                120
                       30                                 1
                                                                                                                      3000                                                                100


                                                                                          #Times Maximized CorrCoef
                                                          0.8
                       25
                                                                                                                      2500                                                                 80


                                                                                                                                                                       Predicted Values
                                                          0.6
                       20                                                                                             2000
Sensor ID


                                                          0.4                                                                                                                              60

                       15                                 0.2                                                         1500                                                                 40

                       10                                 0                                                           1000                                                                 20
                                                          −0.2
                        5                                                                                             500                                                                   0
                                                          −0.4
                                                                                                                        0                                                                 −20
                              10      20         30                                                                      0                  5              10                                0             50        100      150
                               Sensor ID                                                                                                   lag                                                                  Actual Values


Figure 4. The correlation matrix of a set of                                              Figure 5. Distribution of the optimal time lag                               Figure 6. The predicted and the actual de-
30 neighbors                                                                              value over all the neighbour-pairs                                           gree of saturation values over the validation
                                                                                                                                                                       dataset

                                    Model Fit                                                                                 Model Fit
                       120
                                         Measured Value                                                                            Measured Value                                                          Pearson
                       100               Forecast                                       200                                        Forecast                                                                 (105)
Degree of Saturation


                                                                 Degree of Saturation


                        80
                                                                                        150                                                                                                                 10
                        60
                                                                                        100                                                                                                           82
                        40                                                                                                                                                                                            0
                                                                                                                                                                                                            13
                        20                                                               50
                                                                                                                                                                                                 37             3         161
                         0
                                                                                          0
                                                                                                                                                                Cross Correlation                                               ARIMA
                       −20
                          0    50       100      150                                          0                                  500              1000                (135)                                                      (177)
                                    Time (min)                                                                                Time (min)

Figure 7. On the left there is a sensor whose values agree well with our forecast. On                                                                      Figure 8. Venn diagram for the three proposed
the right there is a noisy sensor whose values diverge significantly from the predicted                                                                    methods
and it is considered as faulty


lag value over a sample of neighbor pairs.                                                                                                 time, it is likely that it is faulty. These sensors are flagged
                                                                                                                                           by our system for further manual evaluation or inspection
In terms of the multivariate ARIMA method, a different
                                                                                                                                           from the traffic operators.
model was fitted to each sensor using as features all the
neighbor sensors. The performance of all the models was                                                                                    The three methods were compared in terms of the number
aggregated and measured in terms of Mean Absolute Error                                                                                    of faulty sensors they identify. In addition, since the Corre-
(MAE), Root Mean Squared Error(RMSE) and Correlation                                                                                       lation and the ARIMA approaches focus at a very different
Coefficient (CC) using a validation dataset. The results fol-                                                                              aspect of the same problem we measured the overlap be-
low on Table 2 with a low MAE value of 16.18 indicating                                                                                    tween their results. Figure 8 displays the Venn diagram of
a decent fit.                                                                                                                              the results obtained over the period of one day during May
                                                                                                                                           of 2014. As it was expected, the results of Pearson Corre-
Figure 6 shows the forecasting performance of the ARIMA
                                                                                                                                           lation and Cross-Correlation are highly overlapping since
model on the validation dataset and Figure 7 gives an ex-
                                                                                                                                           for many sensors the optimal time lag is zero. On the other
ample on forecasting two different sensors. Sensors such
                                                                                                                                           hand, the ARIMA method identified different sensors as er-
as the one presented in Figure 7 (left) will be considered as
                                                                                                                                           roneous suggesting that the methods are complementary to
non-faulty since the deviation between the observed mea-
                                                                                                                                           each other. Interestingly enough, 13 sensors were identified
surements and the expectation is not significant. On the
                                                                                                                                           as erroneous from all methods indicating that sensors op-
other hand, sensors such as the one in Figure 7 (right),
                                                                                                                                           erate in an unexpected way in many settings and are more
given that it reports maximum values for a long period of
                                                                                                                                           likely to be faulty.


                                                                                                                                      8N
                                                       ICML Workshop

                      Metric    Result                            work exchanges among neighboring nodes. (Fried et al.,
                      CC        0.68                              2015) proposed a Bayesian approach to model time series
                      MAE       16.8                              of counts, using Metropolis-Hastings algorithm in order to
                      RMSE      23.18                             estimated the parameters of the model.

Table 2. The measurements that indicate the performance of the
multivariate ARIMA model                                          7. Conclusions
                                                                  In this paper we presented an efficient approach for re-
                                                                  solving whether irregular sensor measurements are due to
6. Related Work                                                   faulty sensors or unexpected traffic. Our approach exploits
Traffic monitoring has been a field of great interest in the      sensors’ past measurements and the crowd’s wisdom for
scientific community (Biem et al., 2010), (Patroumpas &           decision making. We implemented our proposals using the
Sellis, 2012). These works detect unusual events based on         Lambda-Architecture for processing real-time and histori-
pre-defined rules so any updates to the traffic conditions        cal data, and an Android application for extracting answers
overtime is not taken into account. In contrast, our pro-         from the human crowd. We applied three different outlier
posal exploits historical data for updating the expected sen-     detection techniques that identified complementary set of
sor correlations and detects events only when the real-time       faulty sensors. Finally, our detailed experimental evalua-
conditions deviate significantly from the expected. Authors       tion indicates that our approach can effectively resolve the
in (Ma et al., 2013) propose a novel city transportation ap-      source of irregular measurements in real-time.
plication that enables sharing of taxi rides in a large city.
Their goal was to develop an application that is beneficial       Acknowledgments
for both the citizens and the taxi drivers.
                                                                  This research has been co-financed by the European Union
There has been significant work in traffic monitoring in the      (European Social Fund ESF) and Greek national funds
use-case of Dublin. (Artikis et al., 2014) proposed a traffic     through the Operational Program Education and Lifelong
management system, based on heterogeneous data, which             Learning of the National Strategic Reference Framework
used Crowdsourcing in order to resolve conflicting sensors        (NSRF) - Research Funding Program: Thalis-DISFER,
reports. (Zygouras et al., 2015) focused on monitoring the        Thalis-CompGeom, Aristeia-MMD Investing in knowl-
traffic conditions of the city by considering the metrics re-     edge society through the European Social Fund, the FP7
ported from sensors mounted on top of public buses. While         INSIGHT project and the ERC IDEAS NGHCS project.
(Liebig et al., 2014a) and (Liebig et al., 2014b) perform in-
dividual trip planning that considers future traffic hazards
in routing. Furthermore, their approach estimates the ex-
                                                                  References
pected traffic flow in areas with low sensor coverage.            Artikis, A., Weidlich, M., Schnitzler, F., Boutsis, I., Liebig,
                                                                    T., Piatkowski, N., Bockermann, C., Morik, K., Kaloger-
Anomaly detection methods have been widely applied for
                                                                    aki, V., Marecek, J., Gal, A., Mannor, S., Kinane, D., and
mining data streams including techniques such as data
                                                                    Gunopulos, D. Heterogeneous Stream Processing and
clustering (Guo et al., 2009), principal component analy-
                                                                    Crowdsourcing for Urban Traffic Management. in Proc.
sis (PCA) (Lakhina et al., 2004), wavelet transform (No-
                                                                    17th International Conference on Extending Database
vakov et al., 2013) and many others. Some detection meth-
                                                                    Technology (EDBT), Athens, Greece, March 24-28, pp.
ods follow a time series analysis perspective and focus on
                                                                    712-723, 2014.
forecasting methods such as ARIMA (Zare Moayedi &
Masnadi-Shirazi, 2008; Fujimaki et al.). ARIMA mod-               Biem, Alain, Bouillet, Eric, Feng, Hanhua, Ranganathan,
els are a wide family of analysis and forecasting models            Anand, Riabov, Anton, Verscheure, Olivier, Koutsopou-
that are used widely in forecasting urban traffic time series       los, Haris, and Moran, Carlos. IBM Infosphere Streams
data (Lee & Fambro, 1999; Williams et al., 1998). This              for Scalable, Real-time, Intelligent Transportation Ser-
makes ARIMA models suitable for our scenario. (Nien-                vices. Proceedings of the 2010 ACM SIGMOD Interna-
nattrakul et al., 2010), used distance-based outlier detec-         tional Conference on Management of data, 2010.
tion techniques, reducing the size of the original database,
in order to efficiently identify outliers in massive stream-      Bockermann, C. and Blom, H. The streams framework.
ing datasets. (Schettlinger et al., 2010) proposed an on-           Technical Report 5, TU Dortmund University, 2012.
line time series filter, using repeated median regression,
which is able to smooth the data and keep intact the sig-         Boutsis, Ioannis and Kalogeraki, Vana. On task assignment
nal’s trend. (Branch et al., 2013) developed a distributed          for real-time reliable crowdsourcing. In ICDCS, pp. 1–
and in-network model in order to detect outliers on net-            10, Madrid, Spain, June 2014.


                                                                 ey
                                                       ICML Workshop

Branch, JoelW., Giannella, Chris, Szymanski, Boleslaw,            Lee, Sangsoo and Fambro, Daniel B. Application of sub-
  Wolff, Ran, and Kargupta, Hillol. In-network outlier              set autoregressive integrated moving average model for
  detection in wireless sensor networks. Knowledge and              short-term freeway traffic volume forecasting. Trans-
  Information Systems, 34(1):23–54, 2013.        ISSN               portation Research Record: Journal of the Transporta-
  0219-1377.        doi:    10.1007/s10115-011-0474-5.              tion Research Board, 1678(1):179–188, 1999.
  URL              http://dx.doi.org/10.1007/
  s10115-011-0474-5.                                              Liebig, Thomas, Piatkowski, Nico, Bockermann, Christian,
                                                                    and Morik, Katharina. Predictive trip planning-smart
Dou, Adam, Kalogeraki, Vana, Gunopulos, Dimitrios,                  routing in smart cities. In EDBT/ICDT Workshops, pp.
  Mielikainen, Taneli, and Tuulos, Ville H. Misco: a                331–338, 2014a.
  mapreduce framework for mobile systems. In PETRA,
  June 2010.                                                      Liebig, Thomas, Piatkowski, Nico, Bockermann, Christian,
                                                                    and Morik, Katharina. Route planning with real-time
Dou, Adam Ji, Kalogeraki, Vana, Gunopulos, Dimitrios,               traffic predictions. In Proceedings of the 16th LWA Work-
  Mielikinen, Taneli, Tuulos, Ville, Foley, Sean, and Yu,           shops: KDML, IR and FGWM, Aachen, Germany, pp.
  Curtis. Data clustering on a network of mobile smart-             83–94, 2014b.
  phones. In SAINT, pp. 118–127, Munich, Germany, July
  2011.                                                           Ma, Shuo, Zheng, Yu, and Wolfson, Ouri. T-Share: A
                                                                   Large-Scale Dynamic Taxi Ridesharing Service. ICDE,
Fried, Roland, Agueusop, Inoncent, Bornkamp, Bjrn,
                                                                   2013.
  Fokianos, Konstantinos, Fruth, Jana, and Ickstadt,
  Katja.   Retrospective bayesian outlier detection in            McCreadie, Richard, Macdonald, Craig, Ounis, Iadh, Os-
  ingarch series.   Statistics and Computing, 25(2):               borne, Miles, and Petrovic, Sasa. Scalable Distributed
  365–374, 2015. ISSN 0960-3174. doi: 10.1007/                     Event Detection for Twitter. BigData Conference: 543-
  s11222-013-9437-x. URL http://dx.doi.org/                        549, 2013.
  10.1007/s11222-013-9437-x.
                                                                  Niennattrakul, V., Keogh, E., and Ratanamahatana, C.A.
Fujimaki, Ryohei, Yairi, Takehisa, and Machida, Kazuo.              Data editing techniques to allow the application of
  An anomaly detection method for spacecraft using rel-             distance-based outlier detection to streams. In Data Min-
  evance vector. In Learning, The Ninth Pacific-Asia                ing (ICDM), 2010 IEEE 10th International Conference
  Conference on Knowledge Discovery and Data Mining                 on, pp. 947–952, Dec 2010. doi: 10.1109/ICDM.2010.
  (PAKDD, pp. 785–790. Springer.                                    56.
Guo, Feng, Yang, Yingzhen, and Duan, Lian. Anomaly
                                                                  Novakov, Stevan, Lung, Chung-Horng, Lambadaris, Ioan-
  detection by clustering in the network. In Compu-
                                                                    nis, and Seddigh, Nabil. Studies in applying pca and
  tational Intelligence and Software Engineering, 2009.
                                                                    wavelet algorithms for network traffic anomaly detec-
  CiSE 2009. International Conference on, pp. 1–4. IEEE,
                                                                    tion. In High Performance Switching and Routing
  2009.
                                                                    (HPSR), 2013 IEEE 14th International Conference on,
Kakantousis, Theofilos, Boutsis, Ioannis, Kalogeraki,               pp. 185–190. IEEE, 2013.
  Vana, Gunopulos, Dimitrios, Gasparis, Giorgos, and
  Dou, Adam. Misco: A system for data analysis applica-           Patroumpas, Kostas and Sellis, Timos. Event Process-
  tions on networks of smartphones using mapreduce. In              ing and Real-time Monitoring over Streaming Traffic
  MDM, pp. 356–359, Bengaluru, India, July 2012. IEEE.              Data. Web and Wireless Geographical Information Sys-
                                                                    tems Lecture Notes in Computer Science Volume 7236,
Kinane, D., Schnitzler, F., Mannor, S., Liebig, T., Morik,          pp 116-133, 2012.
  K., Marecek, J., Gorman, B., Zygouras, N., Katakis, Y.,
  Kalogeraki, V., and Gunopulos, D. Intelligent synthe-           Schettlinger, K., Fried, R., and Gather, U. Real-time sig-
  sis and real-time response using massive streaming of             nal processing by adaptive repeated median filters. In-
  heterogeneous data (insight) and its anticipated effect on        ternational Journal of Adaptive Control and Signal Pro-
  intelligent transport systems (its) in dublin city, ireland.      cessing, 24(5):346–362, 2010. ISSN 1099-1115. doi:
  In ITS, Dresden, Germany, November 2014.                          10.1002/acs.1105. URL http://dx.doi.org/10.
                                                                    1002/acs.1105.
Lakhina, Anukool, Crovella, Mark, and Diot, Christiphe.
  Characterization of network-wide anomalies in traffic           Williams, Billy M, Durvasula, Priya K, and Brown, Don-
  flows. In Proceedings of the 4th ACM SIGCOMM con-                ald E. Urban freeway traffic flow prediction: applica-
  ference on Internet measurement, pp. 201–206. ACM,               tion of seasonal autoregressive integrated moving av-
  2004.                                                            erage and exponential smoothing models. Transporta-


                                                                 eR
                                                    ICML Workshop

  tion Research Record: Journal of the Transportation Re-
  search Board, 1644(1):132–141, 1998.

Yi, B.-K., Sidiropoulos, N.D., Johnson, T., Jagadish, H.V.,
  Faloutsos, C., and Biliris, A. Online data mining for co-
  evolving time sequences. In Data Engineering, 2000.
  Proceedings. 16th International Conference on, pp. 13–
  22, 2000.
Zare Moayedi, H. and Masnadi-Shirazi, M.A. Arima model
  for network traffic prediction and anomaly detection.
  In Information Technology, 2008. ITSim 2008. Interna-
  tional Symposium on, volume 4, pp. 1–6, Aug 2008.
Zygouras, Nikolas, Zacheilas, Nikos, Kalogeraki, Vana,
  Kinane, Dermot, and Gunopulos, Dimitrios. Insights
  on a Scalable and Dynamic Traffic Management System.
  EDBT, 2015.


                                                              ek