Towards detection of faulty traffic sensors in real-time Nikolas Zygouras, Nikolaos Panagiotou, Ioannis Katakis, Dimitrios Gunopulos {NZYGOURAS , N . PANAGIOTOU , KATAK , DG}@ DI . UOA . GR University of Athens, Athens, Greece Nikos Zacheilas, Ioannis Boutsis, Vana Kalogeraki {ZACHEILAS , MPOUTSIS , VANA}@ AUEB . GR Athens University of Economics and Business, Athens, Greece Abstract Automatic identification of anomalies in streaming data is an emerging field of research due to the large number of Detecting traffic events using the sensor network applications (intrusion detection, event identification, etc). infrastructure is an important service in urban en- Many algorithms that utilize machine learning and time se- vironments that enables the authorities to han- ries analysis techniques have been successfully applied for dle traffic incidents. However, irregular mea- the detection of unexpected events during the last years (Yi surements in such settings can derive either from et al., 2000). These methods offer high quality results and faulty sensors or from unpredictable events. In are able to perform on massive data streams in real-time. this paper, we propose an efficient solution to An interesting use-case is the automatic analysis of traffic resolve in real-time the source of such irregu- data generated by Smart Cities infrastructures. Human per- lar readings by examining the correlation and the sonnel are unable to monitor and efficiently identify prob- consistency among neighbor sensors and exploit- lems on these data. The utilization of anomaly detection ing the wisdom of the crowd. Our experimental techniques would provide great assistance to traffic opera- evaluation illustrates the efficiency and practical- tors as it would enable the automatic real-time identifica- ity of our approach. tion of traffic issues. Recently, Crowdsourcing has emerged as an attractive 1. Introduction paradigm to exploit the intelligence of ubiquitous human crowd (citizens) to extract useful information. Traditional Sensor network infrastructures have been widely used for Crowdsourcing systems such as AMT1 , CrowdFlower2 , traffic management in smart cities to provide important ser- etc., constitute marketplaces for human intelligence tasks vices for the benefit of pedestrians, cyclists, motorists and (HITs), that allow a requester to define a task, which is per- public transport. Such services are typically provided by formed by other human workers in exchange for a reward. analyzing data provided by heterogeneous static and mo- For example, mobile human workers with different char- bile sensors. This enables the implementation of numerous acteristics can be queried for geo-located tasks to extract applications like proposing alternative routes, altering traf- real-time information without needing an expensive infras- fic lights, etc. tructure(Boutsis & Kalogeraki, 2014). The most common type of sensor which is utilized in such In this paper we develop an efficient approach that identi- environments is the SCATS sensor. They are static sensors fies faulty readings from traffic sensors by examining the embedded at the city roads providing rich, real-time infor- correlations among them and by taking advantage of the mation such as traffic flow measurements based on vehicles ubiquitous citizens through Crowdsourcing. We summa- that cross a specific segment. Despite their utility in many rize our contributions below: traffic applications, SCATS sensors can be faulty. Thus, one fundamental challenge in these settings is how to ef- • We present an efficient approach that identifies ficiently distinguish between irregular and faulty measure- anomalous sensors and uses Crowdsourcing to resolve ments before taking any unnecessary actions. whether irregular measurements are due to faulty sen- Proceedings of the 2 nd International Workshop on Mining Urban sors or irregular traffic. Data, Lille, France, 2015. Copyright c 2015 for this paper by its 1 http://www.mturk.com/ authors. Copying permitted for private and academic purposes. 2 http://www.crowdflower.com/ 8j ICML Workshop LiveDrive radio where users can report traffic, and (iv) pedestrian counters. 2.2. System model In this section we provide our system model for the data sensors that we examine, namely the SCATS sensors and Crowdsourcing. SCATS Sensors. SCATS (Sydney Coordinated Adaptive Traffic System) is an innovative computerized traffic man- agement system developed by Roads and Maritime Ser- vices (RMS) Australia. SCATS sensors are fixed mag- Figure 1. The SCATS sensors locations at Dublin’s city centre netic sensors deployed on intersections to measure the traf- fic flow and the degree of saturation of roads’ lanes. In • We tackle the problem of automatically detecting Dublin city, each SCATS sensor produces and transmits a anomalous SCATS sensors with three methods: (i) new record every minute. Each record contains information Pearson’s correlation, (ii) cross-correlation and (iii) related to the timestamp t of the measurement, the sensor’s multivariate ARIMA model. The proposed methods ID i and finally the degree of saturation and traffic flow have to tackle the task efficiently in real-time. measurements. In the provided dataset there are approxi- mately 300 SCATS controlled intersections and 1000 dif- • We develop our approach using the Lambda- ferent SCATS sensors throughout the road network. The Architecture which combines a batch processing GPS locations of the SCATS sensors are presented in Fig- framework (i.e. Hadoop3 ) and a distributed stream ure 1. Degree of saturation measures how much a road’s processing system (i.e. Storm4 ) for efficiently pro- lane is utilized, while traffic flow measures the vehicles’ cessing both historical and real-time data. volume divided by the highest volume that has been mea- sured in a sliding window of a week5 . In this work we • We develop a Crowdsourcing system used to extract decided to monitor the degree of saturation value, noted as answers from the human crowd based on the MapRe- s, as it more reliable and informative than the traffic flow. duce paradigm. The degree of saturation of a particular SCATS sensor with ID i at the timestamp t is noted si,t . • We provide an experimental evaluation, which illus- trates that our approach is practical and can effectively Crowdsourcing. Our crowdsourcing system comprises a identify irregular measurements in real-time. set of human workers denoted as wj which are able to re- ceive task assignments. Tasks are being inserted to the sys- tem by an authority, such as the DCC. Each task tk is as- 2. Problem Description and System Model sociated with a number of attributes as < idk , latitudek , 2.1. Smart City longitudek , rewardk , descriptionk >. Hence, every task posses a unique identifier (idk ), the geographical co- Smart cities exploit digital sensor devices that can be either ordinates of the location that the task involves (latitudek , embedded at the city infrastructure or they can be mobile longitudek ), the corresponding reward (rewardk ) for ex- (e.g., smartphones) in order to provide services for their cit- ecuting the task and a task description that describes the in- izens that enhance their well-being. Such services may re- formation that needs to be provided by the human worker. late to traffic management, housekeeping information, etc. An example of such a task description is: “Is there traffic In this paper we focus on Dublin, a smart city that uti- in O’Connell Street? Yes/No”. Finally each response pro- lizes sensors for supervising and managing road traffic vided by a worker is captured with a record by our system (Kinane et al., 2014). In Dublin the traffic is controlled using the worker and the task identifiers, coupled with the by the Dublin City Council (DCC), which is responsible response as follows: < wj , idk , responsejk >. to develop, maintain and manage the city road network. To achieve that they exploit several heterogeneous data 3. Architecture sources that include: (i) SCATS sensors which are em- bedded on the road and monitor real-time traffic density, In Figure 2 we display our system architecture which (ii) GPS traces from sensors embedded on buses, (iii) the consists of the following components: (i) a Distributed 5 3 https://hadoop.apache.org/ http://dublinked.com/datastore/datasets/ 4 https://github.com/nathanmarz/storm dataset-274.php 89 ICML Workshop Distributed Database MapReduce Job and low response latencies (McCreadie et al., 2013). Fur- thermore, we decided to use Storm due to its scalability fea- Processing Batch tures that we also exploit in our previous work (Zygouras Map Reduce et al., 2015). Storm users can change the parallelism of the processing components to adapt to possibly workload bursts. Processing Finally, for the analysis of the historical sensor data we Stream Input Source Preprocessing rocessing CrowdSourcing CrowdSou dSourcing used the most commonly used open-source implementation Analysis of the MapReduce programming model, Hadoop. We ex- ecute periodical (i.e. at the end of each day) Hadoop jobs Application’s for computing the basic metrics required by our proposed Users techniques, described in more detail in Section 4. Our jobs retrieve historical data from a distributed database, more specifically MongoDB9 . We decided to use MongoDB in- stead of the Hadoop Distributed Filesystem (HDFS), as we Figure 2. System Architecture want to have fast access to the data from the DSPS compo- nent of our architecture, for computing and storing short- term statistics in real-time. Stream Processing System (DSPS), (ii) a batch processing framework, (iii) a distributed database system, and (iv) the Crowdsourcing component that consists of the users’ mo- 4. Methodology bile devices. Our architecture is an instance of the Lambda- The goal of this work is to monitor the streaming traf- Architecture6 as we exploit the fast processing offered by fic data and automatically pose Crowdsourcing tasks when DSPS and the fault-tolerance and parallelism provided by anomalous sensors are identified. In order to identify current batch processing frameworks. anomalous sensors we propose three different outlier tests Incoming SCATS-sensor data are forwarded to a stream that examine whether the SCATS sensors behave differ- processing graph. These data are pre-processed and stored ently from their normal behavior. These outlier tests in the Distributed Database (i.e. Preprocessing component are based on the following statistical measurements: (i) in Figure 2) for further processing by the batch processing Pearson’s Correlation (ii) Cross-Correlation and (iii) the component. We analyze the reported metrics via the Anal- ARIMA Model. The normal behavior for each sensor is ysis component which examines if one of the sensors de- calculated offline using the historical data. These meth- viates significantly from its neighbors so it could possibly ods are implemented using the Lambda architecture and be a faulty sensor. This component uses both the current Crowdsourcing tasks are assigned to users when anoma- conditions and historical data for identifying such condi- lous SCATS sensors are identified. tions. In case that one such sensor is detected, the Analysis component informs the Crowdsourcing component about 4.1. Identifying Anomalous Sensors this situation. The latter is responsible to send the appro- priate Crowdsourcing tasks that will enable us to detect if In this section we describe the three statistical measure- the sensor is a faulty-one. Finally, the batch processing ments that are used and we explain how these are utilized component periodically computes new statistics about the to detect anomalous SCATS sensors. Initially we applied a historical sensor data. simple statistic measurement named Pearson’s correlation that identifies the correlation between pairs of SCATS sen- There are multiple DSPSs which support low latency pro- sors. Then we used an extension of the first method, named cessing in real-time. Some of these systems are Apache cross-correlation, to identify how many lags we should Storm, Spark Streaming7 and TUD-Streams (Bockermann shift backward a sensor’s values to maximize its pairwise & Blom, 2012). We used Storm as the DSPS that will correlation with another adjacent sensor. The first two ap- perform the real-time processing of incoming sensor data. proaches use two well known measures in time series anal- Storm is one of the most commonly used DSPS, and is sup- ysis. The disadvantage is that they check pairs of sensors ported by major companies such as Twitter8 . It has been and not the group of sensors as a whole. For this reason we successfully applied for processing high volume of data in applied a third approach that can be thought as a multivari- different application domains, achieving high throughput ate ARIMA model which deals with the aforementioned 6 lambda-architecture.net problem and is faster than the other approaches. 7 https://spark.apache.org 9 http://www.mongodb.org/ 8 http://twitter.com 88 ICML Workshop 4.1.1. P EARSON ’ S C ORRELATION In order to identify anomalies with cross-correlation we followed a similar approach to the one utilizing the Pear- The Pearson’s correlation coefficient is a well known statis- son’s correlation measure, described before. The main dif- tic that measures the linear relationship of two variables X ference is that we identified, using historical data, the lag and Y . It takes values in [−1, 1], where 1 means that the dmax that maximized the correlation between two SCATS variables are positively correlated, −1 stands for negative sensors X and Y . In the streaming analysis in order to cal- correlation and 0 for no correlation between X and Y . The culate the cross-correlation between the sensors we shifted Pearson’s correlation, noted ρX,Y , is calculated by dividing dmax lags backward the Y and we calculated its correla- the covariance of X and Y with the product of the standard tion with X. Finally, we measured how much the streaming deviations of X and Y (see Equations 1 and 2). cross-correlation deviates from the offline calculated cross- cov(X, Y ) correlation between X and Y using the optimal lag value ρX,Y = (1) dmax . σX σY cov(X, Y ) = E[(X − µX )(Y − µY )] (2) 4.1.3. M ULTIVARIATE ARIMA M ODEL In our scenario we calculated the pair-wise correlation be- A common strategy to detect outliers in multivariate time tween all SCATS sensors X and Y whose spatial distance series (Yi et al., 2000) is to build a regression model for does not exceed a predefined threshold. This restriction each time series and evaluate whether the actual values vary creates a sparse correlation matrix that contains non-zero significantly from the predictions. The model receives as elements when SCATS sensors are spatially adjacent. We input the previous L degree of saturation measurements for calculate the sparse correlation matrix from the historical a particular sensor with ID = 0 and the sensor’s N nearest data. Then, utilizing the streaming data that arrive contin- SCATS sensors {si,j : i ∈ [0, N ], j ∈ [0, L], i, j ∈ Z}. uously in our system we periodically calculate the stream- The goal of the model is to make the best prediction for ing correlation of the adjacent SCATS sensors. We note as ˆ . The model is presented in detail in s0,t , denoted as s0,t noisy sensors’ pairs those that their streaming correlations Equation 5. This model can be thought as a multivariate disagree significantly with the correlations calculated from ARIMA model, as multiple sensors are used in order to the historical data. If a particular sensor disagrees signif- make the predictions. icantly with the majority of his neighbors then a crowd- sourcing task is posed. ˆ =φ0,1 s0,t−1 + · · · + φ0,L s0,t−L + s0,t φ1,0 s1,t + φ1,1 s1,t−1 + · · · + φ1,L s1,t−L + 4.1.2. C ROSS -C ORRELATION (5) ... Cross-correlation is a statistical measure of similarity be- φN,0 sN,t + φN,1 sN,t−1 + · · · + φN,L sN,t−L tween two variables X and Y as a function of the lag of one relative to the other. More specifically cross-correlation be- In the training phase we use the historical degree of sat- tween X and Y is calculated by shifting forward or back- uration values in order to calculate the coefficients Φ of ward Y and calculating its correlation coefficient with X. Equation 5. In order to solve this problem we created the Cross-correlation with lag d, noted ρX,Y (d), is calculated matrix A and vector b containing the input data (degree of as seen in Equation 3. The numerator of the equation calcu- saturation values) and the target values respectively. The Φ lates the covariance of X and Y shifted d time bins back- parameters are the values that optimally solve Equation 6. ward. Finally the denominator is the product of the stan- The solution of this system is given with the pseudo-inverse dard deviations of X and the lagged Y . transformation of the input presented in Equation 7. The P key property of this approach, in contrast to the two pre- [x(i) − µX )(y(i + d) − µY )] viously described techniques, is that it monitors the differ- ρX,Y (d) = pP i pP i (x(i) − µX ) 2 i (y(i + d) − µY ) 2 ent sensors together as a whole. The Pearson’s correlation (3) and the cross-correlation approaches investigated only pair- wise correlation between SCATS sensors, ignoring poten- A traffic anomaly at a particular location, in a road net- tially useful information. On the other hand, the ARIMA- work, may require some time in order to be propagated to based method aims at exploiting this information. the adjacent sensors. This observation motivates us to con- sider the cross-correlation between adjacent SCATS sen- Φ = [φ0,1 ... φ0,L ... φ2,0 ... φ2,L ] sors. More specifically we calculated the dmax that maxi- mized the correlation between two adjacent sensors X and   s0,t−1 ... s0,t−L ... sN,t ... sN,t−L Y (see Equation 4). s0,t−2 ... s0,t−L−1 ... sN,t−1 ... sN,t−L−1  A=  ... .. .. .. .. .. ..  . . . . . .  dmax = arg max(ρX,Y (d)) (4) d s0,L ... s0,0 ... sN,L ... sN,0 8e ICML Workshop ⊤ b = [s0,t ... s0,L+1 ] b = AΦ (6) Φ̂ = (AT A)−1 AT b (7) In order to integrate this approach we split the historical data in training and test set. Initially we calculated of- fline, using the training set, the Φ̂ parameters. These pa- rameters are the coefficients regarding the sensor’s previ- ous measurements and its adjacent sensors’ past measure- ments. Then we calculated how well the data fitted to these Figure 3. Crowdsourcing Application (a) Main Application, (b) models computing for each sensor its Mean Absolute Er- Push Notification, (c) Map Task ror (MAE). Finally, in order to identify anomalous SCATS sensors while monitoring the streaming data we compute Reduce tasks are responsible for computing the metrics de- for each sensor its MAE at a particular time window. We scribed in Section 4.1 and store the results in MongoDB. label a sensor as ‘anomalous’ if its streaming MAE notice- ably differs from its MAE measured using the testing set. In the stream processing component, we implemented a Storm topology (see Figure 2) that processes the real-time 4.2. Implementation sensor data. We exploit the parallelism offered by Storm by having multiple instances of our Analysis component, run- Our system calculates the correlation among adjacent ning in parallel, in order to decrease latency. The topology SCATS sensors. This is achieved by adding the SCATS pre-processes the incoming data and stores them in Mon- sensors’ GPS locations in a k-d tree data structure during goDB. Also the pre-processed data are sent to the compo- system initialization and calculating the k nearest SCATS nent that invokes the three different techniques. Again we sensors for each sensor. Furthermore, we developed our partition the data based on the offline mapping, to guaran- system using the Lambda architecture. So we should en- tee that all neighboring data will be processed by the same sure that the required data are transmitted to the appropriate component’s instance. Detected events are forwarded to the cluster nodes. Thus, we created a mapping of each SCATS Crowdsourcing component that is responsible to inform the sensor ID to one or more cluster nodes. This guarantees users that will help us detect if the sensor is faulty. that each computing node contains all the required data for a sensor’s adjacent sensors. 4.3. Crowdsourcing System We define three parameters that help us configure the com- Misco. Our Crowdsourcing system has been developed us- ponents of our system. The first one is job periodicity ing the Misco framework (Dou et al., 2011; 2010; Kakan- and defines when the batch jobs should re-execute (e.g. tousis et al., 2012), which is based on the MapReduce each day, every week). The other two control the paradigm and tailored for mobile devices to provide an ex- stream processing computations. More specifically, the tensible and efficient way to develop distributed applica- stream threshold parameter defines how often we should tions. re-compute the examined metrics (e.g. every ten minutes), while time window defines the sliding time window (e.g. Our Crowdsourcing system is structured using (i) a Mas- the previous hour) that will be used for keeping the past ter Server that keeps track of the tasks tk submitted when sensor data necessary for the computations. anomalies are detected from SCATS sensors, assigns them to human workers wj and returns the responses to the sys- As we described in Section 3, we periodically invoke tem, and (ii) the Workers who are the human contributors Hadoop jobs that compute the different metrics we ex- that process the crowdsourcing tasks. Each Worker is re- plained in Section 4.1. Map tasks read the pre-processed sponsible to process queries and return the results to the sensor data from the MongoDB, and send them to the re- server. These tasks are executed by workers through their duce tasks. We partition the data based on the SCATS sen- personal smartphone devices or tablets. sor ID to cluster node mapping. The idea is that neigh- boring sensors should always end up on the same reduce Task assignment. Suppose that we need to exploit Crowd- task in order to appropriately compute the examined met- sourcing to determine the source of an event using a task rics. Each sensor may belong to more than one nodes in tj . We describe the step-by-step sequence followed so as to such cases we send the tuple multiple times (i.e. equal to process the task and return the results. In the implementa- the number of nodes it is part of) to avoid information loss. tion described below we considered Android-based devices 8d ICML Workshop and thus we have utilized the Android SDK10 . Parameter Value job periodicity 24 hours For every task tj , the Master Server spans the task to a set stream threshold 10 minutes of map tasks that need to be forwarded to the human work- time window 1 hour ers wk . Since these tasks are geo-located only the workers that reside close to the specific selection need to be selected Table 1. Basic Configuration Parameters by the Master Server to provide information. However, in order to avoid tracking the users we follow a different pol- icy. We forward the task to all the workers and the tasks are locally filtered at the mobile devices if their location is far to the same LAN and their clocks were synchronized us- from the location of the task tj . ing the NTP protocol. The frameworks we used were the following: Storm 0.8.2, Esper 5.1 and MongoDB 2.6.5. We use Push Notifications services to initiate the communi- cation with the human workers, to be able to send the Map In Table 1, you can see the values of the basic configuration task to the users without being restricted by their connec- parameters described in Section 4.2. For the experiments, tion (WiFi, 3G, etc). Such services exist in all major mobile we used SCATS data from the period of April and May operator systems and allow users to register for message of 2014. The distance threshold used for the neighbour- delivery when they are online through a connection server. ing sensors computation was set to 250 meters. Data from April were used in order to calculate the historical corre- In order to be able to receive map tasks, each user first lations, cross-correlations as well as the ARIMA models. needs to login to our system so that the Master server will On the other hand, data from May were used for different be aware of the user. At the same time the user also regis- experimental runs (see below). ters in the push notification service to retrieve its unique id. During normal operation the Crowdsourcing applications For the Pearson Correlation method we have stored the his- runs in the background (Figure 3a). torical correlations of the neighbour-pairs in the MongoDB component. 7116 neighbour pairs were identified under the When the Master Server retrieves a new task tj from the distance threshold from a set of 900 SCATS sensors. The requester, it delivers a push notification to the user devices correlation value ranged from almost perfect correlation, with the task, through the Push Notification service. Once for sensors of the same junction under different lane, to no the device receives the notification it examines whether the correlation at all for more distant sensors. Negative corre- user current location is close to the location of the task so lation values between nearby sensors were also observed. as to alert the user (Figure 3b). Next, if the user selects the This could be explained by the opposite direction of the notification on his mobile device the Crowdsourcing appli- lane the sensors are responsible for. In Figure 4, the cor- cation is triggered and the task will be displayed in the user relation matrix for a set of 30 nearby sensors is presented. screen to process the task (Figure 3c). As expected, clusters are formed by adjacent sensors that Finally, the responses for each map task are forwarded to are highly correlated. Thus, it is reasonable to argue that the Master Server that initiates the reduce phase to aggre- when the expected correlation is not observed there might gate the answers. The reduce phase is performed through be a problem with the sensor. For the Cross-Correlation Majority Voting. Hence the Master Server identifies the re- method apart from storing the correlation value itself we sponse responsejk for task tk with the maximum amount have also stored the time lag that maximizes the pair-wise of answers from all users wj and forwards the response that sensor correlation. The time lag range was set to a maxi- represents the cause of the event to the system. mum of 10 minutes since the sensors are quite close to each other and larger time lags are unlikely to significantly favor Crowd Feedback. The response retrieved by the crowd- the correlation value. In addition, the larger the time lag sourcing component enables the system to determine range is, the more computationally demanding the method whether the irregular readings derive from an unexpected will be. As it was expected, in most cases the highest cross- event (e.g., roadworks) or if the sensor is indeed faulty correlation was observed with no time lag at all, since most when most of the workers answer “None of the above”. sensor pairs are responsible for different lanes of the same highway junction. However, for more distant sensor-pairs 5. Evaluation responsible for different highway junctions, small time lags gave a boost on their correlation value. One way to under- We have evaluated our proposals on our local cluster con- stand this is because vehicles require a short time to reach sisting of 4 VMs. Each VM had two CPU processors at- consecutive junctions. In addition, this behaviour could be tached and 3, 096 MB of RAM. All VMs were connected also explained by the operation of traffic lights that trans- 10 Android platform: http://www.android.com/ fer the traffic from junction to junction on fixed time inter- vals. Figure 5 depicts the distribution of the optimal time 83 ICML Workshop QQ Plot Correlation 3500 120 30 1 3000 100 #Times Maximized CorrCoef 0.8 25 2500 80 Predicted Values 0.6 20 2000 Sensor ID 0.4 60 15 0.2 1500 40 10 0 1000 20 −0.2 5 500 0 −0.4 0 −20 10 20 30 0 5 10 0 50 100 150 Sensor ID lag Actual Values Figure 4. The correlation matrix of a set of Figure 5. Distribution of the optimal time lag Figure 6. The predicted and the actual de- 30 neighbors value over all the neighbour-pairs gree of saturation values over the validation dataset Model Fit Model Fit 120 Measured Value Measured Value Pearson 100 Forecast 200 Forecast (105) Degree of Saturation Degree of Saturation 80 150 10 60 100 82 40 0 13 20 50 37 3 161 0 0 Cross Correlation ARIMA −20 0 50 100 150 0 500 1000 (135) (177) Time (min) Time (min) Figure 7. On the left there is a sensor whose values agree well with our forecast. On Figure 8. Venn diagram for the three proposed the right there is a noisy sensor whose values diverge significantly from the predicted methods and it is considered as faulty lag value over a sample of neighbor pairs. time, it is likely that it is faulty. These sensors are flagged by our system for further manual evaluation or inspection In terms of the multivariate ARIMA method, a different from the traffic operators. model was fitted to each sensor using as features all the neighbor sensors. The performance of all the models was The three methods were compared in terms of the number aggregated and measured in terms of Mean Absolute Error of faulty sensors they identify. In addition, since the Corre- (MAE), Root Mean Squared Error(RMSE) and Correlation lation and the ARIMA approaches focus at a very different Coefficient (CC) using a validation dataset. The results fol- aspect of the same problem we measured the overlap be- low on Table 2 with a low MAE value of 16.18 indicating tween their results. Figure 8 displays the Venn diagram of a decent fit. the results obtained over the period of one day during May of 2014. As it was expected, the results of Pearson Corre- Figure 6 shows the forecasting performance of the ARIMA lation and Cross-Correlation are highly overlapping since model on the validation dataset and Figure 7 gives an ex- for many sensors the optimal time lag is zero. On the other ample on forecasting two different sensors. Sensors such hand, the ARIMA method identified different sensors as er- as the one presented in Figure 7 (left) will be considered as roneous suggesting that the methods are complementary to non-faulty since the deviation between the observed mea- each other. Interestingly enough, 13 sensors were identified surements and the expectation is not significant. On the as erroneous from all methods indicating that sensors op- other hand, sensors such as the one in Figure 7 (right), erate in an unexpected way in many settings and are more given that it reports maximum values for a long period of likely to be faulty. 8N ICML Workshop Metric Result work exchanges among neighboring nodes. (Fried et al., CC 0.68 2015) proposed a Bayesian approach to model time series MAE 16.8 of counts, using Metropolis-Hastings algorithm in order to RMSE 23.18 estimated the parameters of the model. Table 2. The measurements that indicate the performance of the multivariate ARIMA model 7. Conclusions In this paper we presented an efficient approach for re- solving whether irregular sensor measurements are due to 6. Related Work faulty sensors or unexpected traffic. Our approach exploits Traffic monitoring has been a field of great interest in the sensors’ past measurements and the crowd’s wisdom for scientific community (Biem et al., 2010), (Patroumpas & decision making. We implemented our proposals using the Sellis, 2012). These works detect unusual events based on Lambda-Architecture for processing real-time and histori- pre-defined rules so any updates to the traffic conditions cal data, and an Android application for extracting answers overtime is not taken into account. In contrast, our pro- from the human crowd. We applied three different outlier posal exploits historical data for updating the expected sen- detection techniques that identified complementary set of sor correlations and detects events only when the real-time faulty sensors. Finally, our detailed experimental evalua- conditions deviate significantly from the expected. Authors tion indicates that our approach can effectively resolve the in (Ma et al., 2013) propose a novel city transportation ap- source of irregular measurements in real-time. plication that enables sharing of taxi rides in a large city. Their goal was to develop an application that is beneficial Acknowledgments for both the citizens and the taxi drivers. This research has been co-financed by the European Union There has been significant work in traffic monitoring in the (European Social Fund ESF) and Greek national funds use-case of Dublin. (Artikis et al., 2014) proposed a traffic through the Operational Program Education and Lifelong management system, based on heterogeneous data, which Learning of the National Strategic Reference Framework used Crowdsourcing in order to resolve conflicting sensors (NSRF) - Research Funding Program: Thalis-DISFER, reports. (Zygouras et al., 2015) focused on monitoring the Thalis-CompGeom, Aristeia-MMD Investing in knowl- traffic conditions of the city by considering the metrics re- edge society through the European Social Fund, the FP7 ported from sensors mounted on top of public buses. While INSIGHT project and the ERC IDEAS NGHCS project. (Liebig et al., 2014a) and (Liebig et al., 2014b) perform in- dividual trip planning that considers future traffic hazards in routing. Furthermore, their approach estimates the ex- References pected traffic flow in areas with low sensor coverage. Artikis, A., Weidlich, M., Schnitzler, F., Boutsis, I., Liebig, T., Piatkowski, N., Bockermann, C., Morik, K., Kaloger- Anomaly detection methods have been widely applied for aki, V., Marecek, J., Gal, A., Mannor, S., Kinane, D., and mining data streams including techniques such as data Gunopulos, D. Heterogeneous Stream Processing and clustering (Guo et al., 2009), principal component analy- Crowdsourcing for Urban Traffic Management. in Proc. sis (PCA) (Lakhina et al., 2004), wavelet transform (No- 17th International Conference on Extending Database vakov et al., 2013) and many others. Some detection meth- Technology (EDBT), Athens, Greece, March 24-28, pp. ods follow a time series analysis perspective and focus on 712-723, 2014. forecasting methods such as ARIMA (Zare Moayedi & Masnadi-Shirazi, 2008; Fujimaki et al.). ARIMA mod- Biem, Alain, Bouillet, Eric, Feng, Hanhua, Ranganathan, els are a wide family of analysis and forecasting models Anand, Riabov, Anton, Verscheure, Olivier, Koutsopou- that are used widely in forecasting urban traffic time series los, Haris, and Moran, Carlos. IBM Infosphere Streams data (Lee & Fambro, 1999; Williams et al., 1998). This for Scalable, Real-time, Intelligent Transportation Ser- makes ARIMA models suitable for our scenario. (Nien- vices. Proceedings of the 2010 ACM SIGMOD Interna- nattrakul et al., 2010), used distance-based outlier detec- tional Conference on Management of data, 2010. tion techniques, reducing the size of the original database, in order to efficiently identify outliers in massive stream- Bockermann, C. and Blom, H. The streams framework. ing datasets. (Schettlinger et al., 2010) proposed an on- Technical Report 5, TU Dortmund University, 2012. line time series filter, using repeated median regression, which is able to smooth the data and keep intact the sig- Boutsis, Ioannis and Kalogeraki, Vana. On task assignment nal’s trend. (Branch et al., 2013) developed a distributed for real-time reliable crowdsourcing. In ICDCS, pp. 1– and in-network model in order to detect outliers on net- 10, Madrid, Spain, June 2014. ey ICML Workshop Branch, JoelW., Giannella, Chris, Szymanski, Boleslaw, Lee, Sangsoo and Fambro, Daniel B. Application of sub- Wolff, Ran, and Kargupta, Hillol. In-network outlier set autoregressive integrated moving average model for detection in wireless sensor networks. Knowledge and short-term freeway traffic volume forecasting. Trans- Information Systems, 34(1):23–54, 2013. ISSN portation Research Record: Journal of the Transporta- 0219-1377. doi: 10.1007/s10115-011-0474-5. tion Research Board, 1678(1):179–188, 1999. URL http://dx.doi.org/10.1007/ s10115-011-0474-5. Liebig, Thomas, Piatkowski, Nico, Bockermann, Christian, and Morik, Katharina. Predictive trip planning-smart Dou, Adam, Kalogeraki, Vana, Gunopulos, Dimitrios, routing in smart cities. In EDBT/ICDT Workshops, pp. Mielikainen, Taneli, and Tuulos, Ville H. Misco: a 331–338, 2014a. mapreduce framework for mobile systems. In PETRA, June 2010. Liebig, Thomas, Piatkowski, Nico, Bockermann, Christian, and Morik, Katharina. Route planning with real-time Dou, Adam Ji, Kalogeraki, Vana, Gunopulos, Dimitrios, traffic predictions. In Proceedings of the 16th LWA Work- Mielikinen, Taneli, Tuulos, Ville, Foley, Sean, and Yu, shops: KDML, IR and FGWM, Aachen, Germany, pp. Curtis. Data clustering on a network of mobile smart- 83–94, 2014b. phones. In SAINT, pp. 118–127, Munich, Germany, July 2011. Ma, Shuo, Zheng, Yu, and Wolfson, Ouri. T-Share: A Large-Scale Dynamic Taxi Ridesharing Service. ICDE, Fried, Roland, Agueusop, Inoncent, Bornkamp, Bjrn, 2013. Fokianos, Konstantinos, Fruth, Jana, and Ickstadt, Katja. Retrospective bayesian outlier detection in McCreadie, Richard, Macdonald, Craig, Ounis, Iadh, Os- ingarch series. Statistics and Computing, 25(2): borne, Miles, and Petrovic, Sasa. Scalable Distributed 365–374, 2015. ISSN 0960-3174. doi: 10.1007/ Event Detection for Twitter. BigData Conference: 543- s11222-013-9437-x. URL http://dx.doi.org/ 549, 2013. 10.1007/s11222-013-9437-x. Niennattrakul, V., Keogh, E., and Ratanamahatana, C.A. Fujimaki, Ryohei, Yairi, Takehisa, and Machida, Kazuo. Data editing techniques to allow the application of An anomaly detection method for spacecraft using rel- distance-based outlier detection to streams. In Data Min- evance vector. In Learning, The Ninth Pacific-Asia ing (ICDM), 2010 IEEE 10th International Conference Conference on Knowledge Discovery and Data Mining on, pp. 947–952, Dec 2010. doi: 10.1109/ICDM.2010. (PAKDD, pp. 785–790. Springer. 56. Guo, Feng, Yang, Yingzhen, and Duan, Lian. Anomaly Novakov, Stevan, Lung, Chung-Horng, Lambadaris, Ioan- detection by clustering in the network. In Compu- nis, and Seddigh, Nabil. Studies in applying pca and tational Intelligence and Software Engineering, 2009. wavelet algorithms for network traffic anomaly detec- CiSE 2009. International Conference on, pp. 1–4. IEEE, tion. In High Performance Switching and Routing 2009. (HPSR), 2013 IEEE 14th International Conference on, Kakantousis, Theofilos, Boutsis, Ioannis, Kalogeraki, pp. 185–190. IEEE, 2013. Vana, Gunopulos, Dimitrios, Gasparis, Giorgos, and Dou, Adam. Misco: A system for data analysis applica- Patroumpas, Kostas and Sellis, Timos. Event Process- tions on networks of smartphones using mapreduce. In ing and Real-time Monitoring over Streaming Traffic MDM, pp. 356–359, Bengaluru, India, July 2012. IEEE. Data. Web and Wireless Geographical Information Sys- tems Lecture Notes in Computer Science Volume 7236, Kinane, D., Schnitzler, F., Mannor, S., Liebig, T., Morik, pp 116-133, 2012. K., Marecek, J., Gorman, B., Zygouras, N., Katakis, Y., Kalogeraki, V., and Gunopulos, D. Intelligent synthe- Schettlinger, K., Fried, R., and Gather, U. Real-time sig- sis and real-time response using massive streaming of nal processing by adaptive repeated median filters. In- heterogeneous data (insight) and its anticipated effect on ternational Journal of Adaptive Control and Signal Pro- intelligent transport systems (its) in dublin city, ireland. cessing, 24(5):346–362, 2010. ISSN 1099-1115. doi: In ITS, Dresden, Germany, November 2014. 10.1002/acs.1105. URL http://dx.doi.org/10. 1002/acs.1105. Lakhina, Anukool, Crovella, Mark, and Diot, Christiphe. Characterization of network-wide anomalies in traffic Williams, Billy M, Durvasula, Priya K, and Brown, Don- flows. In Proceedings of the 4th ACM SIGCOMM con- ald E. Urban freeway traffic flow prediction: applica- ference on Internet measurement, pp. 201–206. ACM, tion of seasonal autoregressive integrated moving av- 2004. erage and exponential smoothing models. Transporta- eR ICML Workshop tion Research Record: Journal of the Transportation Re- search Board, 1644(1):132–141, 1998. Yi, B.-K., Sidiropoulos, N.D., Johnson, T., Jagadish, H.V., Faloutsos, C., and Biliris, A. Online data mining for co- evolving time sequences. In Data Engineering, 2000. Proceedings. 16th International Conference on, pp. 13– 22, 2000. Zare Moayedi, H. and Masnadi-Shirazi, M.A. Arima model for network traffic prediction and anomaly detection. In Information Technology, 2008. ITSim 2008. Interna- tional Symposium on, volume 4, pp. 1–6, Aug 2008. Zygouras, Nikolas, Zacheilas, Nikos, Kalogeraki, Vana, Kinane, Dermot, and Gunopulos, Dimitrios. Insights on a Scalable and Dynamic Traffic Management System. EDBT, 2015. ek