Deriving Semantic Sensor Metadata from Raw Measurements Jean-Paul Calbimonte1 , Zhixian Yan2 , Hoyoung Jeung3 , Oscar Corcho1 , and Karl Aberer2 1 OEG, Facultad de Informática,Universidad Politécnica de Madrid, Spain jp.calbimonte@upm.es,ocorcho@fi.upm.es 2 LSIR, Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland zhixian.yan@epfl.ch,karl.aberer@epfl.ch 3 SAP Research, Brisbane, Australia hoyoung.jeung@sap.com Abstract. Sensor network deployments have become a primary source of big data about the real world that surrounds us, measuring a wide range of physical properties in real time. With such large amounts of heterogeneous data, a key challenge is to describe and annotate sensor data with high-level metadata, using and extending models, for instance with ontologies. However, to automate this task there is a need for en- riching the sensor metadata using the actual observed measurements and extracting useful meta-information from them. This paper proposes a novel approach of characterization and extrac- tion of semantic metadata through the analysis of sensor data raw ob- servations. This approach consists in using approximations to represent the raw sensor measurements, based on distributions of the observa- tion slopes, building a classification scheme to automatically infer sensor metadata like the type of observed property, integrating the semantic analysis results with existing sensor networks metadata. 1 Introduction Ubiquitous sensor networks are a primary source of observations from the phys- ical world, from environmental measuring stations, participatory or citizen sens- ing, to various sensor applications in traffic, media and health monitoring. Pub- lishing sensor networks data on the web has the potential of increasing public awareness and involvement on these different domains at a massive scale [1]. Cheap sensing devices can be easily configured and deployed, plugged to sensor data platforms such as Cosm1 for exploitation, storage and querying. The increasing availability of sensor data in the web introduces higher het- erogeneity, which makes it more difficult for potential users to make sense out of these data sources and be able to identify which ones are useful for their applications. An example of this scenario is the Swiss Experiment2 project, a 1 Cosm, formerly Pachube https://cosm.com/ 2 Swiss Experiment: http://www.swiss-experiment.ch/ platform that enables real-time publishing environmental data on the web, from a large-scale federation of sensor networks, mainly in the Swiss Alps. The pub- lished data is heterogeneous as it comes from different geographical locations, with different time spans (e.g. observations collected during 1 year, 3 months, etc.), as well as varying sampling rates (e.g. per minute, per 10 minutes). More- over, the metadata for these sensor types is not always complete and coherent. As an example, to indicate that a sensor measures temperature (i.e. the observed property), different sensors use various tag names, like “temperature”, “temp”, “t”, “msptemperature”, “tp”, etc. Although the data is available for anyone to use, these noisy descriptions are not understandable enough and do not provide semantic information about what this data is about. In less-controlled scenarios than the Swiss Experiment, the problems of het- erogeneity are even more noticeable. For instance in the Cosm web platform, users tag their sensor data as means of metadata, identifying which types of measurements they are publishing. Projects like the Air Quality Egg3 , aiming at promoting air-quality participatory sensing, enable almost any citizen to pub- lish measurements at web-scale. However, the user-provided metadata is often incomplete. In many cases these tags are misleading or they are not provided at all, making it very hard for other users to query or make use of this data. To overcome this problem, establishing explicit semantics on the metadata has been proposed in previous works, using sensor ontologies [2]. When using these ontologies, sometimes it is needed to manually map the semantic informa- tion from the sources to the new metadata model [3], which is a cumbersome and error-prone task. In this paper we propose a novel approach of semantic sensor analysis that infers semantic properties such as the type of observed property, using the raw sensor observations as input. The main contributions of this paper are the following: – We propose a novel method for representing time series as distributions that represent the slopes of a linear approximation of the initial numeric sensor measurements. – Based on the statistics of the observation slopes, we infer the type of observed property of the sensor measurements. We use a classification method that exploits the similarity of the slopes distributions. – We provide a mechanism for enriching the sensor metadata, based on the SSN Ontology [2], with the metadata inferred from the observation slopes. – We build a self-contained evaluation system linking raw sensor measure- ments to high-level semantics, and validate our method using two real-life environmental sensor datasets, from the Swiss Experiment and AEMET4 (the Spanish meteorological office). The remainder of this paper is organized as follows: Section 2 describes the global approach proposed for semantic analysis of sensor data. Section 3 studies 3 AirQuality Egg http://airqualityegg.wikispaces.com/ 4 Agencia Estatal de Metereologı́a: http://www.aemet.es the sensor data representation using slopes, whereas Section 4 focuses on build- ing classification algorithm for inferring observed property types and integrating them to the sensor metadata. In Section 5, we experimentally evaluate our ap- proach. Section 6 summarizes existing related work. Finally, Section 7 includes concluding remarks and points to future works. 2 From Raw Measurements to Semantic Metadata Sensor data is typically represented as time series, describing the evolution over time of a certain observed property. Raw sensor data without any metadata that describes it, has limited use as it is hard to discover, integrate or interpret. While in controlled environments the sensor metadata can be reasonably well managed and controlled by the data owners, in the context of the sensor web, where any citizen is able to produce and publish data, it becomes a more diffi- cult task. While semantic metadata has been shown to be effective for managing large sensor metadata repositories, current proposals require expensive manual curation and tagging (see Section 6). However, these approaches do not look into the data values, from which we can derive some of these metadata properties using analysis and mining techniques. We describe in Figure 1 our architecture for deriving semantic metadata from sensor data measurements. The approach includes characterizing sensor time series and extracting their observed property types to enrich sensor metadata, and consists of four main layers: – At the sensor deployment layer, sensor nodes provide initial measurements in terms of real-time numerical values, e.g. temperature, humidity, etc. – In the semantic sensor analysis layer, we first represent the sensor data stream using linear approximations and calculate the observation slopes. Based on the sensor slopes, we are able to compute similarity between sensor data series, detecting the observed property types through classification, and performing detection of these types with partial information. – A semantic representation of the analysis component is integrated into the semantic metadata. Using the SSN Ontology as a basis, and combined with domain specific ontologies, this enriched metadata is made available for fur- ther processing or querying. – In the application layer, users can build tools and visualizations to query such sensor data and receive results that include the new metadata computed by the analysis layer. The deployment layer is usually built using sensor or stream data manage- ment systems. These systems centralize the data captured by the devices and provide storage, query interfaces and streaming operators. As for the seman- tic metadata, we built upon previous work on semantic management of sensor networks [3], centered on the use of the SSN Ontology, coupled with domain ontologies and vocabularies for quantities and units of measurements. For the analysis of the time series, we propose a representation based on the slopes of a linear approximation of the data, as described in Section 3. Then these repre- sentations can be used to compare and find similarities among new and existing time series, classifying them according to the detected observed property type, etc. As a result, we are able to complete and query the sensor metadata, as detailed in Section 4. Fig. 1: Semantic Sensor Analysis Architecture 3 Sensor Data Representation with Slopes In environmental time series, similar patterns can be observed periodically over time. These patterns can be characteristic to a type of sensor data, and therefore help to recognize it. If we represent a time series using a linear representation, such as the one in Figure 2(b), the patterns of the data can be associated to the angles of the linear segments or its corresponding slope. For instance, a steep slope indicates a sudden increase of the measured property. The intuition is that if these slopes are repetitive over time, we can build slope distributions that can be representative of a type of time series. Using slopes makes it possible to find similarities between time series that not necessarily have the same value ranges but similar behavior, e.g such as the air temperature in two different locations. (a) Linear approximation (b) Constructing the convex hulls and segments Fig. 2: Piecewise linear representations 3.1 Piecewise Linear Representation We can use linear segments to approximate a time series (Piecewise Linear Rep- resentation, PLR), and analyze the trends by observing the angles that the segments form. For instance in Figure 2(a), we use 2 segments to represent the original 10 data points. Notice that the number of points for a segment can be variable (adaptive approximations). We used the algorithm of [4] for the con- struction of piecewise linear histograms. Consider we have a time series of n data points X = x1 , x2 , ..., xn , and we want to fit it in m << n segments. The algorithm maintains a set B of buckets bi = hi , begi , endi , li , ri , hi , where hi is a convex hull of data points, and (begi , li ), (endi , ri ) are the coordinates of the segment that best fits the convex hull (the segment that bisects the thinnest bounding rectangle of hi [4]). The slope of bi can be calculated as slope(bi ) = endrii−l −begi . The algorithm adds i elements to B from X, until there are no buckets available, and then it starts to merge those adjacent buckets bi and bi+1 that combined produce the smallest increase in total error. Merging is reduced to a convex hull merge of hi and hi+1 . The algorithm iterates until all elements of X have been placed in a bucket. The resulting set of segments of each bucket bi is the linear approximation of X. For instance in Figure 2(b), the convex hull hi encloses 8 data points and its minimum rectangle is bisected by the thick black segment defined by the points (begi , li ), (endi , ri ). This is the linear representation for these 8 points. During the computation of the linear representation, if merging hi with the next hull hi+1 reduces the approximation error, they will form a new single hull with its own bisecting segment. Once we apply this PLR algorithm we have the time series represented as line segments, each with a distinctive slope. 3.2 Slope Distributions To build the slope distributions, we first compute a linear approximation of the time series, using the algorithm described in Section 3.1. It is possible to create linear approximations of different accuracy, depending on the number of seg- ments per unit of time. For instance for a time series of 30 days, if we use 4 segments per day, their slopes will reflect coarse-grained changes in the data during each day. Time series of originally different sampling times, can be repre- sented using the same segment/day rate, in order to be comparable. Obviously, if the original sampling interval is greater than the number of segments/day, the representation with that rate is not possible. Once the linear representation is built, we can compute the slopes and ana- lyze them. The slope or gradient space, bounded in the [∞, −∞] interval for the possible angles [ π2 , − π2 ], can be divided in sectors, each represented with a sym- bol αj from an alphabet A and we can assign each segment to its corresponding symbol. We propose using the segment representation discussed in the previ- ous section, to compute slope symbolizations, which characterize a time series as a sequence S of symbols si from an alphabet A that correspond to a type of slope. In this way, we characterize a time series by the type of variations present in the sensor data, regardless of the data values. For example if we divide the angle space in 4 sectors (labeled a, b, c, d), at intervals of π4 , we can match each segment slope with one symbol. For instance in Figure 3 we have 4 segments, whose symbolic representation is adac, by matching each slope with a symbol. Fig. 3: Slopes symbolization. The angle space in this example is divided in 4 sectors, each of π4 . According to which division the segments falls in, it is assigned a symbol. Having this symbolic representation of the slopes, it is possible to compare them to check if two series have similar slope patterns. One simple way to do so, is to generate symbol distributions, or histograms that count how many symbols of each type exist in a time series. So a distribution of a sequence S can be defined as a set DS of elements dαj = | {si ∈ S, si = alphaj } |, for all symbols in A. For the previous example, it would be a vector 2, 0, 1, 1, which can be normalized by the total elapsed time, so that we can compare series encompassing different time spans. A simple distance measure is the euclidean distance, qP defined for two n 2 distributions DS1 , DS2 of length n as: deucl (DS1 , DS2 ) = i (dS1 i − dS2 i ) 3.3 Choosing the angle divisions Although we can arbitrarily choose how to divide the angle space (e.g. 4 sectors of π4 as in the previous example), the actual angles may be more concentrated in some intervals than others. For instance time series with highly changing angles such as wind speed, may have steeper gradients than a more stable series. Taking into account this fact, we propose to analyze the training data sets to determine an angle division that better represents the actual distribution of angles in the training set. Using this distribution information, we can divide the angle space in divisions that hold the same number of angles of the training data. 4 Deriving Semantic Metadata After establishing how the data is segmented and symbolized, we can use the symbol distributions for data analysis tasks to help understanding the semantics of the data. Given a time series, if it does not contain appropriate metadata, the potential user of this data can use already analyzed time series and compare the new one with them. We show how this can be done using our symbolization and a simple classification scheme, even with a partial subset of a time series. 4.1 Semantic Descriptions A semantic description of an observation is a collection of statements that in- cludes the observed property (e.g. humidity, pressure), feature of interest (e.g. the air at some location), unit of measurement, among others. For instance, using the vocabulary of the SSN Ontology [2], we describe a wind speed observation in Listing 1. The observation, identified as swissex:WindSpeedObservation1, has been observed by sensor swissex:SensorWind1 and reported a value of 6.245. The sensor observed property type cf-property: wind speed (speed of the wind feature) is defined in a domain specific vocabulary (in this case the Climate and Forecast vocabulary defined by the W3C SSN-XG group5 ). Additional metadata about this observation are omitted for brevity. swissex : W i n d S p e e d O b s e r v a t i o n 1 rdf : type ssn : Observation ; ssn : f e a t u r e O f I n t e r e s t cf - feature : wind ; ssn : o b s e r v e d P r o p e r t y cf - property : wind_speed ; ssn : o b s e r v a t i o n R e s u l t [ rdf : type ssn : SensorOutput ; ssn : hasValue [ qudt : numericValue " 6.245 " ^^ xsd : double ]]; ssn : observedBy swissex : SensorWind1 ; Listing 1: Wind Speed observation in rdf according to the ssn ontology Concretely, the cf-property:wind speed property indicates that this is an observation of wind speed, and it has further semantic information in the Climate & Forecast ontology, as seen in Listing 2. It states that it is a property of the wind (cf-feature:wind) and is a property of the more general speed quantity (qu:speed). In order to extract this information, the type of observed property from an unannotated dataset, we propose the classification scheme in the next subsection. The goal is basically to identify the ssn:observedProperty for a time series. cf - property : wind_speed rdf : type dim : V el oc it y Or Sp e ed ; rdfs : label " wind speed " ; ssn : isPropertyOf cf - feature : wind ; qu : propertyType qu : scalar ; qu : g e n e r a l Q u a n t i t y K i n d qu : speed . Listing 2: Wind Speed property according to the Climate and Forecast vocabulary 4.2 Data Classification Given two sets of time series, a training set already annotated according to the type of data that is captured, and an unannotated test set, we are interested in finding the observed property for the second set. Assume we have a collection D of symbol distributions D1 , ..., Di , ..., Dn as a training set, each of them cor- responding to a time series tsi , already classified with a type observed property (e.g. “wind speed”). The classification task consists in finding the best property for time series tstest in the test set. We can use a simple k-nearest neighbor scheme, which has been successfully used for time series classification [5,6]. First, the time series tstest is segmented 5 C&F vocabulary: http://purl.oclc.org/NET/ssnx/cf/cf-property and symbolized. Then, we generate a symbol distribution Dtest , as described in Section 3.2, which can be compared iteratively with each of the distributions Di in D. From the k distributions closer to Dtest , we select the observed property of the majority. 4.3 Using Partial Data Subsets This classification technique may use all the complete time series for computing the symbolization and the slope distribution. However, for types of data with recurring patterns such as the ones present in environmental and meteorological data, using a smaller subset of data can be enough to extract the feature that help detecting the type of observed property. In that case for the construction of the linear representation of the data, we simply choose a subset of the original data: X = x1 , x2 , ..., xn , with a different n0 such that n0 < n. 4.4 Querying using the Analysis Results After executing the classification, we can use the extracted information to com- plete the sensor metadata, that is then available for querying. In Listing 4 we show a simple sparql query that asks for sensors that measure air temperature. SELECT ? sensor WHERE { ? sensor a ssn : Sensor ; ssn : observes cf - property : ai r_ t em pe ra t ur e .} Listing 3: Query all sensors that measure air temperature The streams produced by sensors can be seen as streaming datasets, whose metadata can also be queried. The stream, identified by a URI, can be seen as an unbounded dataset of observations, some of which are actually used to compute the slope symbolizations and classification described above. The ob- served properties obtained for the sensor (e.g. cf-property:air temperature) are therefore the observed properties of the stream observations. We can also query for more general types of data, for instance, the generic temperature prop- erty. In Listing 4 we ask for all stream URIs of sensors that measure some type of temperature. SELECT ? stream ? o b s e r v e d P r o p e r t y WHERE { ? sensor a ssn : Sensor ; ssn : observes ? o b s e r ve d P r o p e r t y . ? stream ssn : isProducedBy ? sensor . ? o b s e r v e d P r o p e r t y qu : g e n e r a l Q u a n t i t y K i n d qu : temperature .} Listing 4: Query all streams of sensors that measure air temperature Furthermore, we can expose the similarity measurements computed between the time series, so that users can also query this information. As an example, in Listing 5 we use the Similarity Ontology6 (sim) to represent the computed distance between two series, using our slope representation. Then we can query, for instance the top 5 series similar to a given time series. 6 The Similarity Ontology: http://purl.org/ontology/similarity/ swissex : slopeSim1_2 a sim : Similarity ; sim : subject swissex : timeseries1 ; sim : object swissex : timeseries2 ; sim : weight 0.32; sim : method swissex : S l o p e D i s t r i b u t i o n D i s t a n c e . Listing 5: Slope distribution similarity between two time series This type of queries allows users not only to use the final results of a classifi- cation task, but also to query more detailed information including the precision of the computations. This information can be used to validate this metadata or provide insight about the analysis process and the relationship of a sensor stream with other streams. In the case of the early detection of the observed property of a time series, the user may be interested in knowing, for example, how many days of data are typically used for classifying those sensors that measure wind speed 6. SELECT ? sensor ? dur WHERE { ? sensor a ssn : Sensor ; ssn : observes cf - property : wind_speed . ? timeseries ssn : isProducedBy ? sensor . ? timeseries swissex : duration [ qu : numer icalVal ue ? dur ].} Listing 6: Query the number of data days used for classifying wind speed sensors 5 Experimentation The main goal of these experiments is to show that the proposed sensor data representation using slopes can be used to characterize sensor data and extract sensor metadata corresponding to the types of observed properties. First we show how the classification behaves with two real life data sets, in terms of precision. Next, we are interested in experimenting with smaller subsets of data samples, and observing how the classification behaves with less data, as we know there are repeating data patterns. Finally, we compare our approach with a classification using the widely used SAX symbolic representation of the data [5]. To validate the classification approach presented in Section 4.2, we imple- mented and applied it to two different datasets in the environmental domain: one from the Swiss Experiment7 and another form AEMET. The data is hetero- geneous as it comes from different geographical locations, some have different time spans (e.g. observations collected during 1 year, 3 months, etc), others have different sampling rates.Also the number of sensors per observation type varies (e.g. 78 for temperature, only 4 for snow height). Due to the conditions of the deployments, some of them experimental and others deployed in harsh environments, this dataset contains a considerable amount of noise in the data. The AEMET dataset consists of sensor data from 100 weather stations man- aged by the Spanish meteorological office. The data is heterogeneous, coming from stations all over Spain, and was originally collected in intervals of 10 min- utes. It contains, in general, less noise and anomalies than the Swiss Experiment dataset, as it comes from stations daily used for meteorological forecasts. 7 The dataset is available at: http://lsirpeople.epfl.ch/qvhnguye/benchmark/ 5.1 Classification in Swiss Experiment and AEMET The goal of our first experiment consists in evaluating the effectiveness of the classification in terms of precision and recall. The classifier is expected to assign the correct label (the type of observed property, e.g. “humidity”) to time series from a test set. The classifier uses a training set of time series and the evaluation criteria is computed in terms on the number of true positives (tp), false positives tp tp (f p) and false negatives (f n): precision (p = tp+f p ), and recall (r = tp+tn ). Swiss Experiment The heterogeneity of the Swiss Experiment dataset required applying different parameters for the linear approximation step. Some time series had very short sampling time intervals (e.g. every 2 seconds for pressure, for at most two days), while others had very long ones (e.g. every half-an-hour for several months). Hence, the approximations were very different in these cases (hundreds of segments per day for short intervals, and only a few per day for long ones).We applied a 5-fold cross validation scheme to divide our dataset in training and test set, and then apply the nearest neighbor algorithm. We present the confusion matrix in Table 4, for k = 5. Fig. 4: Swiss Experiment confusion matrix, k=5. Column header abbrevia- tions: ra:radiation, mo:moisture, te:temperature, wd:wind direction, ws:wind speed, hu:humidity, ly:lysimeter, pr:pressure, co:CO2 , sh: snow height, vo:voltage We can observe that the effectiveness of the classification varies among the different types of data. The nearest neighbor scheme is also biased as the dataset is highly unbalanced. Since we have comparatively much more samples of tem- perature or wind speed, than for pressure or snow height, these last are less likely find nearest neighbors of the same class. For instance for lysimeter and snow height, almost no series are correctly identified, as we have a very small number of series. Nevertheless, in the cases of pressure or CO2 the precision is good regardless of the low number of series. This is a special case, since these series have very different slope distributions, and also, have very short sampling interval. Since their resolution is much smaller (e.g. every 2 seconds) than most of the other series in the dataset, their comparison throws very large distances that are quickly discarded. In cases where the total number of time series was very small (e.g. only 4 for snow height), the approach is clearly not effective. It requires a larger training set to have an acceptable precision. Also, when the series are very irregular (sometimes due to noise and false non-curated data in the original dataset), they logically fail to be correctly classified. AEMET For the AEMET dataset, we followed the same approach as with the Swiss-Experiment. However, for the AEMET data, we had a larger number of time series for every type of data, thus avoiding the problem of lack of training data encountered in the previous tests. Moreover, the dataset sampling interval is the same, making it easier to compare their slope distributions. We applied the classification scheme with a 10-fold cross validation for this dataset. We provide the confusion matrix for k = 5 in Table 5. Fig. 5: AEMET confusion matrix, k=5. Column header abbreviations: st:soil temper- ature, ba:battery, te:air temperature, wd:wind direction, ws:wind speed, hu:humidity, wsx: wind speed (max), pr:pressure, wdx: wind direction (max), pre:precipitation. We can notice that in this case the approach achieves better precision, as ex- pected, since we avoided the problems of sampling times and unbalanced types (the number of series per each type is similar or the same). However, it can be observed that there are important false positives at some specific spots. For in- stance the number of soil temperature series falsely identified as air temperature is very high. This is in fact an expected result, since both are specializations of the more general type temperature. Hence, both share patterns in the time series, that are reflected in the slope distributions that are compared during the classification process. The same situation can be seen between wind speed and wind speed (max), and for wind direction and wind direction (max). It is also interesting to see that if we consider the “unification” of simi- lar types of data (e.g. wind speed and maximum wind speed ), the precision is much higher (Figure 6). This suggests that the slope distributions are useful for identifying similar data, because they have very similar slope distributions. This is an expected behavior, for instance for wind speed and wind speed (max), which are measurements of the same type of data. In order to discern between Fig. 6: Precision in AEMET, not differencing the specific types wind speed (max) and wind direction (max). small differences like these, other characteristics of the data have to be taken into account. In these cases where two types of observations are similar, we can use a higher level definition of observed property. For instance, in the Climate and Forecast vocabulary, the specific properties cf-property:air temperature and cf-property:soil temperature both have qu:temperature as its general quantity kind. 5.2 Classification with Partial Information In this experiment we aim at showing how the classification precision varies when using smaller subsets of the test data. As we discussed in Section 4.3, for our environmental and meteorological datasets, recurrent slope patterns in the data can be representative enough to compute the slope distribution, and make it possible to classify the data. We have tested the classification reducing the number of days-of-data used for computation. In Figure 7(a) and Figure 7(b) we plot the precision for the AEMET and Swiss Experiment dataset series, for dif- ferent subsets of the data (expressed in terms of the number of days of measured data). In total we have around 200 days of observations, but we can see that for some types of data we require much less and obtain similar precision in the classification. This is the case especially with series that include very repetitive patterns on a daily basis, but not for others that have a more unpredictable behavior such as wind speed. In this case we see that it needs more days-of-data than other types to increase the precision. (a) AEMET (b) Swiss Experiment Fig. 7: Classification precision, for different partial datasets, in terms of the days of data used. 5.3 Comparison with SAX Classification The goal of this experiment is to compare our approach with a classification based on the widely used SAX representation of time series [5]. The compari- son is based on the precision using both approaches. By classifying with SAX we can verify how well our method behaves in comparison to a well established technique. The SAX approach also produces a symbolization of the time series, although the angles and slopes are not taken into account, as it uses a PAA ap- proximation. We applied the same classification method used for our slope-based representation. We show the classification precision for the Swiss Experiment and AEMET datasets in Figure 8(a) and Figure 8(b) respectively. (a) Swiss Experiment (b) AEMET Fig. 8: Classification precision with SAX and the Slope representation. As it can be seen, the classification throws similar results for both methods, with small differences in AEMET, and slightly better for the slope-based ap- proach in the Swiss experiment dataset. Using the slopes distributions shows to be helpful at differencing time series with similar values but very different angles. In the case of AEMET, the measured values are already enough to discern be- tween two different types of observation, and hence the results are not improved by the slope distribution. While the SAX representation has been exploited in other ways, for example by considering substrings of a fixed size, instead of only one symbol, this experiment shows that our approach is also able to extract features that help characterizing a type of time series, and enabling its seman- tic identification. A classification technique throws different results depending on the type of data. Further amendments could be plugged to the classifica- tion scheme, but they risk to be too specific to the characteristics of certain datatypes, and such methods are outside of the scope of this work. 6 Related Work Previous works on time series classification and mining, have studied different approaches for summarizing and exploiting sensor raw data, and have been com- plemented with semantic representations for sensor data management. Data Approximations High level representations reduce the dimensionality of time series data, in order to reduce the complexity of indexing and comparison algorithms, using different techniques. These include piecewise constant and lin- ear approximations (e.g. PAA[7], APCA[8], PLR[9]) that use constant and linear segments respectively, to represent the original time series. Generally simple to compute, either in batch mode and online using sliding window algorithms, these methods offer accurate approximations of the original data. These representa- tions have been widely used for tasks including similarity search, fuzzy queries, dynamic time warping, clustering and classification [9]. While these approximations reduce dimensionality, some approaches intro- duce a further step that consists in the symbolization of the time series. These techniques, such as SAX [5], have shown to be space and time efficient for in- dexing, classification and clustering, and also for additional tasks such as motif discovery and visualization [10]. These symbolizations can be used to compute distance measures that help in classification and clustering tasks [5]. Other works have considered also the slopes of linear approximations such as the STS dis- tance [11] for clustering time series. SAX symbolization has also been used for sensor events detection [12] and for creating high-level perception abstractions from the raw sensor data, by matching SAX patterns with low-level thematic abstractions [13]. Time Series Classification Particularly, for the task of classification, differ- ent techniques such as decision trees, neural networks and bayesian classifiers have been used [6]. Classification approaches usually fall into the following three categories: distance-based, feature-based and model-based[6]. Simple distance measures such as euclidean, are very limited because they only consider one- to-one matches in the time axis. Distance measures with more elastic matching for the time axis, such as Dynamic Time Warping (DTW), have been proved effective for similarity matching [14]. These have been coupled with k-nearest neighbor (k-NN) classifiers, proving an effective combination for a number of time series classification problems [15,16]. These techniques have space and time computation limitations in some scenarios, and offer little explanation on why a series belongs to a particular class [17]. Feature-based approaches try to find properties that are representative of a type of series, in order to classify them. Most of these approaches use a high level representation e.g. symbolization or discretization methods, before extracting the features[6] while others work ex- tracting representative subsequences (e.g. shapelets [17]). Semantic Sensor Representations The task of modeling sensor data and metadata with ontologies has been addressed by the semantic web research com- munity in recent years. Early ontology proposals for describing wireless sensors have been reviewed in [18]. However, the focus of most of these approaches was on sensor meta information, while the description of observations was generally overlooked. Besides some of these approaches lack ontology design best practices of reuse and alignment with standards an reference ontologies. Others, including the OntoSensor ontology [19], use the concepts defined in the OGC SensorML8 standard as a basis. More recent proposals like [20] and [21], also consider the OGC Observations and Measurements (O&M) standard9 to represent observa- tions captured by sensor networks. Recently, through the W3C SSN-XG group, the semantic web and sensor network communities have made an effort to provide a domain independent ontology, generic enough to adapt to different use-cases, and compatible with the OGC standards at the sensor and observation levels. The result, the ssn ontology [2], is based on the stimulus-sensor-observation design pattern [22] and the OGC standards. 8 SensorML. http://www.opengeospatial.org/standards/sensorml 9 OGC O&M: http://www.opengeospatial.org/standards/om 7 Conclusions and Future Work We have described an approach for identifying the type of data from sensor data sources, using a symbolic representation of the time series slopes. We have shown how this representation can be used for enriching semantic sensor metadata. We have shown specific use cases of time series data classification, providing similarity measures, and metadata aggregation that can be queried in terms of high-level standard ontologies. Finally, we evaluated our approach with real-life datasets of the Swiss-Experiment project and AEMET. We have shown through experimentation that this representation can be useful for balanced datasets, as the classification gets biased when there are small numbers of samples in the training set, for a particular type of data. Moreover, our results show that this representation can help grouping data of the same type, despite geographical locations, since it is based on the distribution of slopes of a linear approximation. Therefore, it can identify similarities of related types of data: e.g. air temperature and soil temperature. We have compared our characterization of sensor data with a competitive approach, and showed that for the chosen environmental datasets it effectively enables the extraction of semantic metadata. The proposed approach, however, was evaluated within the same dataset, and in the future we will study its applicability in an inter-dataset classification. This framework could be used in the future for other tasks such as clustering, or for identifying simple patterns in streams of sensor data. Moreover, complex sym- bolizations consisting of sequences of slopes could be considered, which would represent more complete patterns that can be exploited. Also, we can consider building a more complex representation that includes not only the slopes infor- mation but also the value ranges, and even tags and labels provided the data publishers. This may enable a more complete and accurate extraction of meta- data that enriches the growing Semantic Sensor Web. As a final future path, we may consider applying online execution of these techniques for real-time analysis. References 1. Sheth, A., Henson, C., Sahoo, S.: Semantic sensor web. IEEE Internet Computing 12(4) (2008) 78–83 2. Compton, M., Barnaghi, P., Bermudez, L., Garcı́a-Castro, R., Corcho, O., Cox, S., Graybeal, J., Hauswirth, M., Henson, C., Herzog, A., Huang, V., Janowicz, K., Kelsey, W.D., Phuoc, D.L., Lefort, L., Leggieri, M., Neuhaus, H., Nikolov, A., Page, K., Passant, A., Sheth, A., Taylor, K.: The SSN ontology of the W3C semantic sensor network incubator group. Journal of Web Semantics (In press) (2012) 3. Calbimonte, J.P., Jeung, H., Corcho, O., Aberer, K.: Semantic sensor data search in a large-scale federated sensor network. In: Proc. 4th International Workshop on Semantic Sensor Networks. (2011) 14–29 4. Buragohain, C., Shrivastava, N., Suri, S.: Space efficient streaming algorithms for the maximum error histogram. In: Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on, Ieee (2007) 1026–1035 5. Lin, J., Keogh, E.J., Wei, L., Lonardi, S.: Experiencing sax: a novel symbolic representation of time series. Data Min. Knowl. Discov. 15(2) (2007) 107–144 6. Xing, Z., Pei, J., Keogh, E.J.: A brief survey on sequence classification. SIGKDD Explorations 12(1) (2010) 40–48 7. Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. Knowledge and information Systems 3(3) (2001) 263–286 8. Chakrabarti, K., Keogh, E., Mehrotra, S., Pazzani, M.: Locally adaptive dimen- sionality reduction for indexing large time series databases. ACM Transactions on Database Systems (TODS) 27(2) (2002) 188–228 9. Keogh, E., Chu, S., Hart, D., Pazzani, M.: Segmenting time series: A survey and novel approach. Data mining in time series databases 57 (2004) 10. Kasetty, S., Stafford, C., Walker, G., Wang, X., Keogh, E.: Real-time classification of streaming sensor data. In: Tools with Artificial Intelligence, 2008. ICTAI’08. 20th IEEE International Conference on. Volume 1., IEEE (2008) 149–156 11. Möller-Levet, C., Klawonn, F., Cho, K., Wolkenhauer, O.: Fuzzy clustering of short time-series and unevenly distributed sampling points. Advances in Intelligent Data Analysis V (2003) 330–340 12. Zoumboulakis, M., Roussos, G.: Escalation: Complex event detection in wireless sensor networks. In: Proceedings of the 2nd European conference on Smart sensing and context, Springer-Verlag (2007) 270 – 285 13. Payam Barnaghi, Frieder Ganz, C.H., Sheth, A.: Computing perception from sensor data. In: Proceedings of the 2012 IEEE Sensors Conference (to appear). (2012) 14. Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., Keogh, E.J.: Querying and mining of time series data: experimental comparison of representations and dis- tance measures. PVLDB 1(2) (2008) 1542–1552 15. Xi, X., Keogh, E., Shelton, C., Wei, L., Ratanamahatana, C.: Fast time series classification using numerosity reduction. In: Proceedings of the 23rd international conference on Machine learning ICML 06. Volume 150., ACM Press (2006) 1033– 1040 16. Geurts, P.: Pattern extraction for time series classification. Principles of Data Mining and Knowledge Discovery (2001) 115–127 17. Ye, L., Keogh, E.: Time series shapelets: a new primitive for data mining. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM (2009) 947–956 18. Compton, M., Henson, C., Lefort, L., Neuhaus, H., Sheth, A.: A survey of the se- mantic specification of sensors. In: Proc. 2nd International Workshop on Semantic Sensor Networks. (2009) 17 19. Russomanno, D., Kothari, C., Thomas, O.: Sensor ontologies: from shallow to deep models. In: Proc. 37th Southeastern Symposium on System Theory. (2005) 107–112 20. Barnaghi, P., Meissner, S., Presser, M., Moessner, K.: Sense and sensability: Se- mantic data modelling for sensor networks. In: Proceedings of the ICT Mobile Summit. (2009) 21. Compton, M., Neuhaus, H., Taylor, K., Tran, K.: Reasoning about sensors and compositions. In: SSN. (2009) 22. Janowicz, K., Compton, M.: The Stimulus-Sensor-Observation Ontology Design Pattern and its Integration into the Semantic Sensor Network Ontology. In: SSN. (2010) 7–11