
SMART: Simple Monitoring enterprise Activities by RFID Tags

Fabrizio Angiulli

fangiulli@deis.unical.it 0

Elio Masciari

masciari@icar.cnr.it 0

0 DEIS-UNICAL, Via P. Bucci, 87036 Rende (CS), Italy
1 Institute of Italian National Research Council

Datastreams are potentially infinite sources of data that flow continuously while monitoring a physical phenomenon, such as temperature levels, or human activities, such as clickstreams, telephone call records, and so on. In recent years, RFID technology has led to the generation of huge streams of data. Moreover, RFID-based systems allow the effective management of items tagged with RFID tags, especially for supply chain management or object tracking. In this paper we introduce SMART (Simple Monitoring enterprise Activities by RFID Tags), a system based on outlier template definition for detecting anomalies in RFID streams. We describe SMART's features and its application to a real-life scenario that shows the effectiveness of the proposed method for enterprise management.

In this paper we focus on monitoring Radio Frequency Identification (RFID) data streams, since RFID-based systems are emerging as key components of systems devoted to complex activities such as object tracking and supply chain management. RFID tags are sometimes referred to as electronic bar codes. Indeed, RFID tags emit a signal that contains basic identification information about a product. Such tags can be used to track a product from manufacturing through distribution and then on to retailers. These features of RFID tags open new perspectives for both hardware and data management. In fact, RFID is going to create many new data management needs. In more detail, RFID applications will generate a large amount of so-called "thin" data, i.e., data pertaining to time and location. In addition to providing insight into shipment and other supply chain process efficiencies, such data provide valuable information for determining product seasonality and other trends, resulting in key information for company management. Moreover, companies are exploring more advanced uses for RFID. For instance, tire manufacturers plan to embed RFID chips in tires to determine tire deterioration. Many pharmaceutical companies are embedding RFID chips in drug containers to better track highly controlled drugs and avert their theft. Airlines are considering RFID-enabling key onboard parts and supplies to optimize aircraft maintenance and airport gate preparation turnaround time.

Such a wide variety of systems for monitoring data streams could benefit from a suitable technique for detecting anomalies in the data flows being analyzed. As a motivating example, consider a company that would like to monitor the mean time its goods stay on the aisles. Items are tagged with RFID tags, so the reader continuously produces readings that report the electronic product code of the item being scanned, its location, and a timestamp. This information can be used, for example, to signal that an item has stayed too long on the shelf, since it is repeatedly scanned in the same position. It could be the case that the package is damaged and consequently customers tend to avoid purchasing it. An item that exhibits such behavior deserves further investigation. This problem is relevant to such a large number of application scenarios that it is impossible to define an absolute notion of anomaly (in the following we refer to anomalies as outliers). In this paper we propose a framework for dealing with the outlier detection problem in massive datastreams generated in a network environment for object tracking and management. The main idea is to provide users with a simple but rather powerful framework for defining the notion of outlier for almost all application scenarios at a higher level of abstraction, separating the specification of the data being investigated from the specific outlier characterization.

2 Preliminaries

An RFID system consists of three components: the tag, the reader, and the application that uses RFID data. Tags consist of an antenna and a silicon chip encapsulated in glass or plastic. RFID readers (or receivers) are composed of a radio frequency module, a control unit, and an antenna used to query electronic tags via radio frequency (RF) communication. They also include an interface that communicates with an application (e.g., the check-out counter in a store). Readers can be hand-held or mounted in specific locations to ensure that they are able to read the tags as they pass through a query zone, that is, the area within which a reader can read a tag. The query zones are the locations that must be monitored for application purposes. To explain the typical features of an RFID application, we consider a typical supply chain scenario.

The chain from the farm to the customer has many stages. At each stage goods are typically delivered to the next stage, but in some cases a stage can be missing. The following three cases may occur: 1) the goods' lifecycle begins at a given point (i.e., a production stage; goods are tagged there and then move through the chain), and thus the reader in that zone registers only departures of goods; we refer to this reader as a source reader; 2) goods are scanned by the reader both when they arrive and when they leave the aisle; in this case we refer to the reader as an intermediate reader; 3) goods are scanned and the tag is killed; we refer to such a reader as a destination reader.

An RFID stream is (basically) composed of an ordered set of $n$ sources (i.e., tag readers) located at different positions, denoted by $\{r_1, \ldots, r_n\}$, producing $n$ independent streams of data representing tag readings. Each RFID stream can be viewed as a sequence of triplets $\langle id_r, epc, \tau_s \rangle$, where: 1) $id_r \in \{1, \ldots, n\}$ is the tag reader identifier (observe that it implicitly carries information about the spatial location of the reader); 2) $epc$ is the product code read by the source identified by $id_r$; and 3) $\tau_s$ is a timestamp, i.e., a value that indicates the time when the reading $epc$ was produced by the source $id_r$.
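As an illustration only, the triplet model can be sketched in Python as follows. The names `Reading` and `merge_streams` are hypothetical and not part of SMART; the sketch assumes that each reader's stream is already ordered by timestamp.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Reading(NamedTuple):
    """One RFID reading: the triplet (id_r, epc, tau_s) described in the text."""
    reader_id: int   # id_r: identifies the reader (and, implicitly, its location)
    epc: str         # electronic product code of the scanned item
    timestamp: int   # tau_s: time at which the reading was produced


def merge_streams(*streams: Iterable[Reading]) -> Iterator[Reading]:
    """Merge the per-reader streams into one sequence ordered by timestamp."""
    return heapq.merge(*streams, key=lambda reading: reading.timestamp)


# Two independent reader streams (illustrative values only).
reader_1 = [Reading(1, "p1", 1), Reading(1, "p1", 2)]
reader_2 = [Reading(2, "p1", 3), Reading(2, "p2", 4)]
merged = list(merge_streams(reader_2, reader_1))
```

The merged view is only a convenience for inspection; in the model above the $n$ streams remain independent.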

An outlier is an observation that differs so markedly from other observations as to raise the suspicion that it was generated by a different mechanism [4]. There exist several approaches to the identification of outliers, namely, statistical-based [2], distance-based [3], density-based [5], and MDEF-based [6]. The problem has been tackled from different viewpoints and in different scenarios, such as static datasets, dynamic datasets, and very large datasets [1]. In our application scenario we deal with massive datastreams, which can be viewed as a kind of very large dynamic dataset. Based on the notion of RFID stream introduced so far, it is easy to see that each RFID reading generated by an RFID tag could be an outlier either because 1) the (product) features obtained from the epc, such as price, weight, height, and so on, greatly differ from those of the other readings, or 2) the latency time that the tagged item spent in a given location deviates significantly from an expected value.

In our system we assume either a distance-based or a statistical-based outlier function in order to catch both sources of anomaly. Since we are interested in the problem formalization, we disregard here the actual implementation of the outlier function. More formally, given a set of objects $S$, a positive integer $k$, and a positive real number $R$, an object $o \in S$ is a $DB(k, R)$-outlier, or a distance-based outlier with respect to parameters $k$ and $R$, if fewer than $k$ objects in $S$ lie within distance $R$ from $o$. This kind of function is exploited when searching for outliers based on their product features. To deal with deviations in time features we resort to a statistical-based outlier function. We point out that a formal analysis of the possible outlier detection methods is outside the scope of this paper; we mention here the main approaches used in the literature because our system implementation allows any stream-oriented implementation of an outlier function to be used. This gives our system high flexibility in dealing with a wide range of application scenarios.

3 Statement of the Problem

In our model, epc is the identifier associated with a single unit being scanned (this may be a pallet or a single item, depending on the level of granularity chosen for tagging the goods being monitored).

This basic schema is simple enough to be used in a data stream environment. However, since more information is needed about the outliers being detected, we can access additional information by using some auxiliary tables maintained at a Master site, as shown in Figure 2. In more detail, the Master maintains an intermediate local warehouse of RFID data that stores information about items, items' movements, product categories, and locations, and that is exploited to provide details about RFID data upon user request. The information about items' movements is stored in the relation ItemMovement, and the information about product categories and locations is stored in the relations Product and Locations, respectively. These relations represent, respectively, the Product and the Location hierarchy. Relation EPCProducts maintains the association between epcs and product categories, that is, every epc is associated with a tuple at the most specific level of the Product hierarchy. Finally, RFID readers constitute the most specific level of the Location hierarchy.

ItemMovements contains tuples of the form $\langle epc, DL \rangle$, where $epc$ has the usual meaning and $DL$ is a string built as follows: each time an epc is read for the first time at a node $N_i$, a trigger fires and $DL$ is updated by appending the node identifier.

In the following we define a framework that integrates DSMS technologies with an outlier detection framework in order to effectively manage outliers in RFID datastreams. In particular, we exploit the following features: a) the definition of a template for specifying outlier queries on datastreams, which can be implemented on top of a DSMS by mapping the template into a suitable set of continuous queries expressed in an ESL-like continuous query language [7]; b) the template needs to be powerful enough to model all the interesting surveillance scenarios. In this respect, it should allow the definition of four components, namely: 1) the kind of objects ($O$) to be monitored (e.g., RFID data concerning dairy products); 2) the reference population $P$ (needed because of the infinite nature of the datastream), which depends on the application context (e.g., a subset of the items belonging to the dairy products category); 3) the attributes ($A$) of the population used for singling out anomalies (e.g., the time spent at a given location); 4) the outlier definition by means of a suitable function $F(P, A, O) \to \{0, 1\}$ (e.g., deviation from the average time spent at a given location by an item); c) a mapping function that, for a given template and DSMS schema, resolves the template into a set of continuous outlier queries to be issued on the datastream being monitored.
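The four template components can be pictured with a minimal Python sketch. This is not the SMART template language: `OutlierTemplate`, `exceeds_twice_mean`, the dictionary layout of the items, and the "twice the mean" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass
class OutlierTemplate:
    object_filter: Callable[[dict], bool]      # O: which objects are monitored
    population_filter: Callable[[dict], bool]  # selects the reference population P
    attribute: str                             # A: attribute used to single out anomalies
    outlier_function: Callable[[Sequence[dict], str, dict], int]  # F(P, A, O) -> {0, 1}

    def evaluate(self, items: Sequence[dict], obj: dict) -> int:
        """Return 1 if obj is monitored and F flags it against P, else 0."""
        if not self.object_filter(obj):
            return 0
        population = [item for item in items if self.population_filter(item)]
        return self.outlier_function(population, self.attribute, obj)


def exceeds_twice_mean(population: Sequence[dict], attribute: str, obj: dict) -> int:
    """Hypothetical F: flag values larger than twice the population average."""
    average = mean(item[attribute] for item in population)
    return int(obj[attribute] > 2 * average)


def is_dairy(item: dict) -> bool:
    return item["category"] == "dairy"


items = [{"category": "dairy", "time_at_node": t} for t in (4, 5, 6)]
template = OutlierTemplate(is_dairy, is_dairy, "time_at_node", exceeds_twice_mean)
late = {"category": "dairy", "time_at_node": 20}
on_time = {"category": "dairy", "time_at_node": 6}
frozen = {"category": "frozen", "time_at_node": 20}
```

Keeping $F$ as a pluggable callable mirrors the separation between the specification of the data and the outlier characterization.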

The basic intuition behind the template definition is that we want to run an aggregate function that is raised by the Master (a central node collecting the queries and the aggregate statistics along with the sample populations) and then instantiated on a subset of the nodes in the network. An incoming stream is processed at each node where the template has been activated by the Master, which issues the request for monitoring the stream. Once a possible outlier is detected, it is signaled to the Master. The Master maintains management information about the network and some additional information about the items using two auxiliary tables, OutlierMovement and NewTrend. The OutlierMovement table stores information about the outlying objects, in particular their identifiers and the paths traveled so far, as explained above for ItemMovements. The NewTrend table stores information about objects that are not outliers but instead represent a new phenomenon in the data. It contains tuples of the form $\langle epc, N, \tau_a, \tau_l \rangle$, where $N$ is a node, and $\tau_a$ and $\tau_l$ are, respectively, the arrival time and the time interval spent at node $N$ by the epc. The latter table is important because it is intended to deal with the concept drift that could affect the data. Indeed, when items are marked as unusual but are not anomalies, as in the case of varied selling rates, they are recorded for later use in the outlier definition. In particular, once the new trend has been consolidated, new statistics for the node where the objects appeared are computed at the Master level and then forwarded to the pertaining node in order to update the parameters of its population.

As mentioned above, candidate outliers are signaled at the node level but are managed by the Master. In more detail, when a possible outlier is signaled by a given node, the Master stores it in the OutlierMovement table along with its path if it is recognized as an anomaly, or in the NewTrend table if the signaled item could be the symptom of a new trend in the data. To summarize, given a signaled object $o$, two cases may occur: 1) $o$ is an outlier, and it is stored in the OutlierMovement table; 2) $o$ represents a new trend in the data distribution, so it should not be considered an outlier, and we store it in the NewTrend table. To better illustrate the problem, we define three possible scenarios on a toy example.

Example 1. Consider a container (whose epc is $p_1$) containing dangerous material that has to be delivered through check points $c_1, c_2, c_3$ in the given order, and consider the following sequence of readings: $Seq_A = \{(p_1, c_1, 1), (p_1, c_1, 2), (p_1, c_2, 3), (p_1, c_2, 4), (p_1, c_2, 5), (p_1, c_2, 6), (p_1, c_2, 7), (p_1, c_2, 8), (p_1, c_2, 9), (p_1, c_2, 10), (p_1, c_2, 11), (p_1, c_2, 12)\}$. Sequence A corresponds to the case in which the pallet tag is read repeatedly at check point $c_2$. This sequence may occur because: i) the pallet (or its content) is damaged, so it can no longer be shipped until some recovery operation has been performed, or ii) the shipment has been delayed. Depending on which interpretation is correct, different recovery actions need to be performed. To take this problem into account, in our prototype implementation we maintain appropriate statistics on latency time at each node for signaling possible outliers. Once the object has been forwarded to the Master, a second check is performed in order to store it either in the OutlierMovement or in the NewTrend table. In particular, it could happen that, due to a new shipping policy, additional checks have to be performed on dangerous material. This will obviously cause a delay in shipping operations, and thus the tuple has to be stored in the NewTrend table.
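A minimal Python sketch of the latency check on Sequence A follows. It is not the prototype's code: the helper names, the node statistics (mean 2, variance 1), and the three-standard-deviation threshold are assumptions, and the time spent at a node is taken to be the gap between the first and the last reading there, assuming each item visits a node only once.

```python
from math import sqrt


def dwell_times(readings):
    """Map (epc, node) to last_timestamp - first_timestamp at that node."""
    first, last = {}, {}
    for epc, node, timestamp in readings:
        key = (epc, node)
        first.setdefault(key, timestamp)  # keep the first reading only
        last[key] = timestamp             # overwrite with the latest reading
    return {key: last[key] - first[key] for key in first}


def is_latency_outlier(dwell, mean, variance, threshold=3.0):
    """Flag a dwell time more than `threshold` standard deviations above the mean."""
    std = sqrt(variance)
    if std == 0:
        return dwell > mean
    return (dwell - mean) / std > threshold


# Sequence A from Example 1: two readings at c1, then ten readings at c2.
seq_a = [("p1", "c1", 1), ("p1", "c1", 2)] + [("p1", "c2", t) for t in range(3, 13)]
dwell = dwell_times(seq_a)

# Hypothetical node statistics for c2: mean latency 2, variance 1.
flagged = is_latency_outlier(dwell[("p1", "c2")], mean=2.0, variance=1.0)
```

Under these assumed statistics the item is only signaled as a candidate; deciding between OutlierMovement and NewTrend remains the Master's second check.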

Consider now a different sequence of readings: $Seq_B = \{(p_1, c_1, 1), (p_1, c_1, 2), (p_1, c_1, 3), (p_1, c_1, 4), (p_1, c_3, 5), (p_1, c_3, 6), (p_1, c_3, 7), (p_1, c_3, 8), (p_1, c_3, 9), (p_1, c_3, 10), (p_1, c_3, 11), (p_1, c_3, 12)\}$. Sequence B corresponds to a more interesting scenario: the pallet tag is read at check point $c_1$, is not read at check point $c_2$, but is read at check point $c_3$. Again, two main explanations could be considered: i) the original routing has been changed to improve the shipment, or ii) someone changed the route for fraudulent reasons (e.g., in order to steal the content or to modify it). Suppose that in this case the shipping plan has not been changed. This means that we are dealing with an outlier, so we store it in the OutlierMovement table along with its path.

Finally, consider the following sequence of readings regarding products $p_1, p_2, p_3$, which are frozen foods, and product $p_4$, which is a perishable, all generated at a freezer warehouse $c$: $Seq_C = \{(p_1, c, 1), (p_2, c, 2), (p_3, c, 3), (p_4, c, 4), (p_1, c, 5), (p_2, c, 6), (p_3, c, 7), (p_4, c, 8), (p_1, c, 9), (p_2, c, 10), (p_3, c, 11), (p_4, c, 12)\}$. Clearly, $p_4$ is an outlier for that node of the supply chain, and this can be easily recognized using a distance-based outlier function, since its expiry date greatly deviates from the expiry dates of the other goods.
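The Sequence C scenario can be sketched with the $DB(k, R)$ definition from the Preliminaries. The expiry values (days until expiry), the parameters $k = 2$ and $R = 60$, and the function name are assumptions for illustration; the sketch also assumes that an object is not counted among its own neighbours.

```python
def is_db_outlier(obj, population, k, radius, distance):
    """DB(k, R) test: True if fewer than k other objects lie within `radius` of obj."""
    neighbours = sum(
        1
        for other in population
        if other is not obj and distance(obj, other) <= radius
    )
    return neighbours < k


# Hypothetical days-until-expiry for the four products at the freezer warehouse.
products = [("p1", 365), ("p2", 350), ("p3", 380), ("p4", 7)]


def expiry_gap(a, b):
    """Distance between two products: absolute difference of their expiry values."""
    return abs(a[1] - b[1])


outliers = [
    epc
    for (epc, days) in products
    if is_db_outlier((epc, days), [p for p in products if p[0] != epc], 2, 60, expiry_gap)
]
```

This naive check scans the whole population for every object; a stream-oriented implementation would use a more efficient method such as those cited in [1] and [3].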

The Template in Short

In this section we describe the functionalities and syntax of the Template introduced so far. A Template is an aggregate function that takes a stream as input. Since the physical stream could contain several attributes, as explained in the previous sections, we allow selection and projection operations on the physical stream. As will become clear in the next section, we use a syntax similar to ESL with some specific additional features pertaining to our application scenario. This filtering step is intended to feed the reference population $P$. In particular, when an object is selected at a given node, it is included in the reference population for that node by an Initialize operation, and it persists in the reference population until a Remove operation is invoked (this can be seen as an Initialize operation on the tuples exiting the node being monitored).

We recall that an RFID-tagged object is scanned multiple times at a given node $N$, and that when the reader no longer detects the RFID tag, no reading is generated. The first time an object is read, a Validate trigger fires and sends the information to the Master, which then updates the ItemMovement table. In response to a Validate trigger, the Master checks the item path, that is, it tests whether the shipping constraints have been met so far by verifying that the path traveled so far by the current item is correct. This check can be performed by the following operations: 1) select the path stored in ItemMovement for that kind of item; 2) add the current node to the path; 3) check the resulting path against the actual path stored in an auxiliary table DeliveryPlans, which stores all the delivery plans (we refer to this check as DELIVERY CHECK). This step is crucial for signaling path anomalies since, as shown in our toy examples, that source of anomaly arises at this stage. If the item is not validated, the Master stores the item information in order to resolve the conflict. In particular, it could be the case that delivery plans are changing (we refer to this check as NEW PATH CHECK), in which case the information is stored in the NewTrend table for future analysis; otherwise it is stored in the OutlierMovement table. To better understand this behavior, consider $Seq_B$ in Example 1. When the item is detected for the first time at node $c_3$, the Validate trigger fires and the path traveled so far by that object is retrieved, obtaining $path = c_1$. The current node is added, updating the path to $path = c_1.c_3$, but when it is checked against the actual path stored in DeliveryPlans an anomaly is signaled, since the path was supposed to be $c_1.c_2.c_3$. In this case the item is stored in the OutlierMovement table and the Master signals for a recovery action. The check works analogously for $Seq_A$, as explained in Example 1.
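The Validate steps can be sketched in Python as below. This is an illustrative reading of the description, not the SMART implementation: the function signature, the dictionaries standing in for ItemMovement and DeliveryPlans (keyed here by epc for simplicity), and the `announced_paths` set used for the NEW PATH CHECK are all assumptions.

```python
def validate(epc, node, item_movement, delivery_plans, announced_paths=frozenset()):
    """Handle the first reading of `epc` at `node` and return the check outcome."""
    # Steps 1 and 2: select the path travelled so far and append the current node.
    travelled = item_movement.get(epc, "")
    path = f"{travelled}.{node}" if travelled else node
    item_movement[epc] = path

    # Step 3, DELIVERY CHECK: the travelled path must be a prefix of the plan.
    steps = path.split(".")
    planned = delivery_plans[epc].split(".")
    if planned[:len(steps)] == steps:
        return "validated"

    # NEW PATH CHECK (assumed form): a deviation matching an announced plan
    # change goes to NewTrend, anything else goes to OutlierMovement.
    return "new_trend" if path in announced_paths else "outlier"


# Sequence B from Example 1: p1 is first read at c1, then at c3 (c2 is skipped).
item_movement = {}
delivery_plans = {"p1": "c1.c2.c3"}
at_c1 = validate("p1", "c1", item_movement, delivery_plans)
at_c3 = validate("p1", "c3", item_movement, delivery_plans)
```

Comparing lists of node identifiers rather than raw strings avoids accepting, for instance, a plan through a node named c20 as a match for a path through c2.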

When an epc has been validated, it is added to the reference population for that node ($P_N$); it then stays at the node and is continuously scanned. It may happen that, during its stay at a given node, an epc cannot be read due to a temporary field problem. We should distinguish this malfunction from the "normal" behavior that arises when an item is moved for shipping or (in the case of destination nodes) when it has been sold. To deal with this, we provide a trigger Forget that fires when an object has not been read for a (context-dependent) number of reading cycles (we refer to this check in the following as TIMESTAMP CHECK). We point out that this operation is not lossy, since at each node we maintain (updated) statistics on items. When Forget runs, it removes the "old" item from the actual population and updates the node statistics.

The node statistics that we take into account for outlier detection (hereafter referred to as $model_N$, where $N$ is the node they refer to) are: the number of items grouped by product category (count), the average time spent at the node by items belonging to a given category (m), the variance for items belonging to a given category (v), the maximum time spent at the current node by items belonging to a given category (maxt), and the minimum time spent at the current node by items belonging to a given category (mint). By means of the reference population $P_N$ and the node statistics $model_N$, the chosen outlier function checks for anomalies. In particular, we can search for two kinds of anomalies: 1) item-based anomalies, i.e., anomalies regarding the item features, in which case we run a distance-based outlier detection function; 2) time-based anomalies, i.e., anomalies regarding arrival time or latency time, in which case we run a statistical-based outlier detection function.

In this section we formalize the syntax for template definition. For basic stream operations we refer to an ESL-like syntax [7]. We point out that, even though in this paper we focus on RFID data and on the outlier detection task, the framework is rather general and could be exploited in several application domains and for other tasks, such as the evaluation of aggregate queries.
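Returning to the Forget trigger and the node statistics $model_N$ described earlier in this section, their interplay can be sketched in Python as follows. This is not the SMART implementation: the names `CategoryStats` and `end_of_cycle`, the dictionary layout of the population, and the `max_missed` parameter (the context-dependent number of reading cycles) are assumptions, and the variance is computed incrementally with Welford's method as one possible choice.

```python
from dataclasses import dataclass


@dataclass
class CategoryStats:
    """Per-category statistics kept at a node: count, m, v, maxt, mint."""
    count: int = 0
    m: float = 0.0        # running mean of the time spent at the node
    sq_dev: float = 0.0   # running sum of squared deviations from the mean
    maxt: float = float("-inf")
    mint: float = float("inf")

    def update(self, time_spent):
        """Fold one completed stay into the statistics (Welford's method)."""
        self.count += 1
        delta = time_spent - self.m
        self.m += delta / self.count
        self.sq_dev += delta * (time_spent - self.m)
        self.maxt = max(self.maxt, time_spent)
        self.mint = min(self.mint, time_spent)

    @property
    def v(self):
        """Population variance of the time spent at the node."""
        return self.sq_dev / self.count if self.count else 0.0


def end_of_cycle(population, seen, now, stats, max_missed):
    """TIMESTAMP CHECK after one reading cycle; Forget items missed too often."""
    for epc in list(population):
        item = population[epc]
        if epc in seen:
            item["last_seen"], item["missed"] = now, 0
            continue
        item["missed"] += 1
        if item["missed"] > max_missed:  # Forget fires for this item
            del population[epc]
            category = stats.setdefault(item["category"], CategoryStats())
            category.update(item["last_seen"] - item["arrival"])


population = {
    "p1": {"category": "dairy", "arrival": 1, "last_seen": 1, "missed": 0},
    "p2": {"category": "dairy", "arrival": 1, "last_seen": 1, "missed": 0},
}
stats = {}
cycles = [(2, {"p1", "p2"}), (3, {"p1"}), (4, {"p2"}), (5, {"p2"}), (6, {"p2"})]
for now, seen in cycles:
    end_of_cycle(population, seen, now, stats, max_missed=2)
```

In this run p2 misses a single cycle and stays in the population, while p1 is forgotten after three consecutive missed cycles and its stay is folded into the dairy statistics.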

The first step is to create the stream related to the nodes being monitored. Once the streams have been created at each node, the Template definition has to be provided.

The aggregate function can be any available SQL function applied to the reference population, as shown in Fig. 5, where Return and Next have the same interpretation as in SQL and <Type> can be any SQL aggregate function. An empty TERMINATE clause refers to a non-blocking version of the aggregate.

Once the template has been defined, it must be instantiated on the nodes being monitored. In particular, the triggers Validate and Forget are activated at each node. As mentioned above, they continuously update the reference population and the node and Master statistics. The syntax of these triggers is shown in Figure 6.

We point out again that the Validate trigger has the important side effect of signaling path outliers. We also point out that the definition presented above is flexible: if users need a different outlier definition, they simply need to add it to our system as a plug-in.

CREATE STREAM <name>
ORDER BY <attribute>
SOURCE <systemnode>

DEFINE OUTLIER TEMPLATE <name>
ON STREAM <streamname>
REFERENCE POPULATION (<definepopulation>)
MONITORING (<target>)
USING <outlierfunction>

<definepopulation> ::= INSERT INTO <PopulationName>
                       SELECT <attributelist>
                       FROM <streamname>
                       WHERE <conditions>
<target> ::= <attributelist> | <aggregatefunction>
<outlierfunction> ::= <distancebased> | <statisticalbased>

<Function Name> <Type>(Next Real) : Real
<Table Name> (<attribute list>);
{ INSERT INTO <Table Name> VALUES (Next, 1); }
{ UPDATE <Population Name> SET <update condition>;
  SELECT <output attribute> FROM <Table Name> }

1. Angiulli and Fassetti. DOLPHIN: An Efficient Algorithm for Mining Distance-Based Outliers in Very Large Datasets. TKDD, 8(1), 2009.

2. Barnett and Lewis. Outliers in Statistical Data. Wiley and Sons, 1994.

3. E. Knorr, R. Ng, and Tucakov. Distance-based outliers: Algorithms and applications. VLDB J., 8(3-4): 237-253, 2000.

4. Hawkins. Identification of Outliers. Monographs on Applied Probability and Statistics. Chapman and Hall, 1980.

5. M.M. Breunig, H. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of the International Conference on Management of Data (SIGMOD00).

6. S. Papadimitriou, Kitagawa, Gibbons, and Faloutsos. LOCI: Fast outlier detection using the local correlation integral. In Proceedings of the International Conference on Data Engineering (ICDE), pages 315-326, 2003.

7. Y. Bai, H. Thakkar, H. Wang, R.C. Luo, and C. Zaniolo. An introduction to the Expressive Stream Language (ESL). Tech. Report.