=Paper= {{Paper |id=Vol-2245/mde4iot_paper_2 |storemode=property |title=Domain Model-Based Data Stream Validation for Internet of Things Applications |pdfUrl=https://ceur-ws.org/Vol-2245/mde4iot_paper_2.pdf |volume=Vol-2245 |authors=Simon Pizonka,Timo Kehrer,Matthias Weidlich |dblpUrl=https://dblp.org/rec/conf/models/PizonkaKW18 }} ==Domain Model-Based Data Stream Validation for Internet of Things Applications== https://ceur-ws.org/Vol-2245/mde4iot_paper_2.pdf
                       Domain Model-Based Data Stream Validation
                          for Internet of Things Applications
                Simon Pizonka                                    Timo Kehrer                                 Matthias Weidlich
       Humboldt-Universität zu Berlin                 Humboldt-Universität zu Berlin                    Humboldt-Universität zu Berlin
             Berlin, Germany                                 Berlin, Germany                                   Berlin, Germany
        simon.pizonka@hu-berlin.de                 timo.kehrer@informatik.hu-berlin.de                  matthias.weidlich@hu-berlin.de

ABSTRACT
The Internet of Things (IoT) has become ubiquitous, connecting an ever-increasing number of devices, many of which are online 24/7 and send data continuously. The quality of these data plays a pivotal role for many IoT applications, which demands continuous monitoring and validation of streaming data in order to spot and react to potential errors. Yet, implementing such validation facilities requires a deep understanding of the processed data. IoT developers are often burdened with technical details such as the data structure and format, which is not only tedious but also prone to errors.

In this paper, we advocate a model-based approach to this problem, deriving validation facilities from models written in the Vorto modeling language, an emerging domain-specific modeling language for declaratively describing basic characteristics of IoT devices. We evaluate our approach and prototypical implementation using the so-called Intel Lab Data as experimental subject. While the experiment showcases the feasibility of our approach, we also identify limitations to be addressed in future work to fully realize our vision of domain model-based data validation for IoT applications.

KEYWORDS
Model-driven Engineering, Internet of Things, sensor devices, stream processing, data validation

1    INTRODUCTION
The Internet of Things (IoT) has become reality and is constantly growing. There are several forecasts of how many IoT devices we will have in the future. One frequently cited source is the analyst company Gartner Inc., which expects around 20.4 billion IoT devices by 2020¹. All these devices will be connected to the internet; many of them will be online 24/7 and will send continuous streams of data which need to be processed and stored. The quality of these data plays a pivotal role for many IoT applications, which demands continuous monitoring and validation of streaming data in order to spot and react to potential errors.

To date, data validation facilities are often implemented ad hoc and manually. This is a tedious and error-prone task, which requires not only expert knowledge in the respective application domain but also a deep understanding of various technical details, such as the format and structure of the processed data, how to plug the validation routines into a suitable stream processing framework, etc. Approaches towards more automated data validation have started to emerge but are still in their infancy. One common idea is to detect anomalies in data streams by using a learning approach where a statistical reference model is created based on historical data representing normal behavior, e.g., as presented in [21].

In this paper, we propose a complementary approach to data stream validation for IoT applications, in which validation rules are derived from pre-defined domain models which are interpreted in a stream processing framework at run-time. Following basic principles of Model-Driven Engineering (MDE) [2], our goal is to specify device properties in a high-level and platform-independent fashion, while the validation itself is achieved fully automatically, without manual development effort on the technical level. We present a reference implementation of our approach, referred to as VortoFlow, in which IoT device information is modeled using the Vorto DSL², an emerging domain-specific modeling language for declaratively describing basic characteristics of IoT devices, and these device models are interpreted within Apache Beam³, serving as an abstraction layer over a set of widely used stream processing frameworks. We evaluate our approach and prototypical implementation using the so-called Intel Lab Data [13] as experimental subject.

While the experiment showcases the feasibility of our approach, we also identify limitations which need to be addressed in future work in order to fully realize our vision of domain model-based data validation for IoT applications. However, we believe that the automated derivation of data validation facilities from domain models is another logical step in leveraging MDE principles for the development of IoT applications [3, 4, 14].

The remainder of the paper is structured as follows. Section 2 introduces a running example which motivates our approach, an overview of which is presented in Section 3 and whose applicability is evaluated in Section 4. Related work is discussed in Section 5, before we conclude and outline future work in Section 6.

¹ http://www.gartner.com/newsroom/id/3598917
² https://www.eclipse.org/vorto/
³ https://beam.apache.org/

2    MOTIVATING EXAMPLE
With the IoT, many objects of our daily life become connected and controllable via the internet. Here, we pick a kitchen blender as our running example. Traditionally, a kitchen blender has a physical interface, comprising a rotary knob to turn the device on and off and to control its speed, and additional buttons to enable advanced features, e.g., for crushing ice cubes or preparing smoothies. Now, imagine there is a mobile application to monitor the kitchen blender. This application can show, e.g., whether the blender is active, the rotation speed, and which of the advanced features are enabled.

The kitchen blender periodically sends messages comprising various meta-data (e.g., a timestamp) as well as information about its
     MDE4IoT’18, October 2018, Copenhagen, Denmark                                               Simon Pizonka, Timo Kehrer, and Matthias Weidlich

current state (activity, rotation speed, etc.). Messages are transmitted in a device-specific message format, in our example a string encoding in which single values are separated by a semicolon. Each value represents a dedicated part of the message, depending on its position. A recurring problem is to convert such a native message format into some other data structure, e.g., a structured JSON object which is better suited for further processing in the cloud. Here, we use Apache Beam as a software abstraction layer over a concrete stream processing engine. Our exemplary data transformation of incoming messages may be plugged into Apache Beam by providing a so-called user-defined function, a Java implementation of which is shown in Listing 1. As we can see in lines 8 to 10, dedicated data values may be accessed via their fixed position within the semicolon-separated message string, while the type-specific parsing of values is delegated to built-in Java functions. If parsing of an input value fails, the parsing exception is caught and the message is marked as invalid (lines 19 to 21). Otherwise, a simple JSON object representing the message is constructed in lines 12 to 15.

 1   @ProcessElement
 2   public void processElement(ProcessContext c) {
 3     String entry = "";
 4     try {
 5       entry = c.element();
 6       String[] elms = entry.split(";");
 7       // parse values
 8       int rotations = Integer.parseInt(elms[0]);
 9       long runTime = Long.parseLong(elms[1]);
10       Date dateTime = dateTimeFormat.parse(elms[2]);
11       // new data structure
12       ObjectNode root = mapper.createObjectNode();
13       root.put("rotations", rotations);
14       root.put("runtime", runTime);
15       root.put("datetime", dateTimeFmt.format(dateTime));
16       // output
17       c.output(dataOK, mapper.writeValueAsString(root));
18
19     } catch (Exception e) {
20       LOG.error("Processing failed for: " + entry, e);
21       c.output(dataError, entry);
22     }
23   }
Listing 1: Apache Beam user-defined function implementing a message data conversion.

In addition to the purely syntactic validation of the message string, we would now like to progress towards a more semantic data validation by incorporating domain knowledge. For example, from the data sheet of the kitchen blender, we know that it has a maximum rotation speed of 12,000 revolutions per minute, which means that the value range for the rotation speed is between 0 and 12,000. Listing 2 shows the code we need to add to validate the value range of the rotation property. The code snippet is to be inserted after parsing the input values and before creating the JSON object. Such additional code is required for each property whose value range is to be validated, and it is typically handwritten. Similar checks may be added for other properties of the blender.

    if (rotations < 0) {
        throw new MinConstraintViolation("Rotations < 0");
    } else if (rotations > 12000) {
        throw new MaxConstraintViolation("Rotations > 12000");
    }
Listing 2: Implementation of additional check routines validating the value range of the rotation property.

As we can see, even for our small example, developers of IoT applications are typically confronted with multiple technical details such as message protocols, data formats, etc. Moreover, a lot of repetitive yet very schematic code needs to be produced in order to implement data validation facilities such as the rather simple checks used in our running example. Finally, the hand-crafted validation routines are highly technology-specific and cannot be easily transferred to other platforms and frameworks.

3    APPROACH AND PROTOTYPICAL IMPLEMENTATION
In this section, we present our approach and prototypical implementation for combining a domain model with a stream processing system in order to validate data streams in IoT applications. Specifically, as illustrated in Section 3.1, we use the Vorto DSL to declaratively describe the capabilities of IoT devices, which includes the platform-independent specification of message structures and further data integrity constraints such as the measurement range of a sensor device. Such a model can be used in a stream processing system to validate incoming data streams. To easily adapt to multiple stream processing systems, we use Apache Beam as an abstraction layer over several standard stream processing engines. An overview of our integration with Vorto, referred to as VortoFlow, is presented in Section 3.2.

3.1    Device Information Modeling in Vorto
The Vorto project, which serves as a basis for our approach and prototypical implementation, aims at achieving interoperability among IoT device manufacturers, platform providers and application developers through the generation of platform adapters (a.k.a. stubs) from domain models. To this end, Vorto provides a high-level domain-specific modeling language, the Vorto DSL, to describe the functionality and characteristics of IoT devices in terms of so-called Information Models. An information model contains one or multiple Function Blocks. These function blocks are structured into five Sections. The Configuration section defines readable and writable properties to configure a device, while the Status, Fault and Events sections define readable properties that describe the device's current status, fault states, and publishable messages, respectively. Properties are typed, and a type may be a primitive type or a complex type. The latter can contain further complex types, primitive types and enumerations. Finally, the Operations section defines operations that can be invoked on the device from, e.g., external applications.

Listing 3 shows a function block describing the kitchen blender of our running example. In the events section (lines 14 to 20), we declaratively describe the structure of a message called speed which is periodically published by the device. It contains the same properties (rotations, runTime and dateTime) as used in
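For comparison with the declarative model, the hand-written checks of Section 2 can be combined into one self-contained routine. The following is only a sketch under stated assumptions: it runs outside Apache Beam, the class name `ManualBlenderCheck` and its `validate` method are hypothetical, and the ISO-8601 timestamp layout is assumed because the device's actual timestamp format is not specified. It splits on semicolons as in Listing 1 and applies the value range of Listing 2.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical stand-alone variant of Listings 1 and 2 (no Apache Beam involved).
final class ManualBlenderCheck {

    // Assumed timestamp layout; the device's real format is not given in the text.
    static final DateTimeFormatter TIMESTAMP = DateTimeFormatter.ISO_LOCAL_DATE_TIME;

    /** Returns null for a valid message, otherwise a description of the first problem. */
    static String validate(String entry) {
        String[] elms = entry.split(";");
        if (elms.length != 3) {
            return "expected 3 fields, got " + elms.length;
        }
        int rotations;
        try {
            // Syntactic validation: every field must parse as its expected type.
            rotations = Integer.parseInt(elms[0].trim());
            Long.parseLong(elms[1].trim());                  // runTime
            LocalDateTime.parse(elms[2].trim(), TIMESTAMP);  // dateTime
        } catch (NumberFormatException | DateTimeParseException e) {
            return "parse error: " + e.getMessage();
        }
        // Semantic validation: value range taken from the blender's data sheet.
        if (rotations < 0) {
            return "Rotations < 0";
        }
        if (rotations > 12000) {
            return "Rotations > 12000";
        }
        return null;
    }
}
```

Every further constrained property would need another pair of hand-written comparisons in this routine, which is the repetitive effort that a declarative model is meant to remove.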

 1   namespace de.hu_berlin.blender
 2   version 1.0.0
 3   displayname "Blender Function Block"
 4   functionblock Blender {
 5       configuration {
 6           mandatory firmwareVersion as int
 7       }
 8       status {
 9           mandatory speed as int
10           mandatory powerOn as boolean
11           optional iceCrushActive as boolean
12           optional smoothieActive as boolean
13       }
14       events {
15           speed {
16               mandatory rotations as int <MIN 0, MAX 12000>
17               mandatory runTime as long
18               optional dateTime as dateTime
19           }
20       }
21       operations {
22           mandatory updateFirmware()
23       }
24   }
Listing 3: Vorto information model describing the characteristics of a kitchen blender.

our manual implementation in Section 2. However, note that we can now use the MIN and MAX constraints of the Vorto DSL to define the value range of the rotations property.

3.2    Data Stream Validation through Model Interpretation in Apache Beam
Figure 1 illustrates how a Vorto model can be used in an IoT scenario. Specific code generators, collectively referred to as Vorto Generator in Figure 1, enable the generation of platform adapters supporting communication and message exchange between components on different platforms. Receiving the measurements and readings from sensor devices, the platform adapter transforms the incoming data into a format which the IoT platform can process. In our prototypical implementation, device-specific messages are converted into a structured JSON object, which is passed to the IoT platform running in the cloud. The platform receives the incoming data stream and forwards it for data validation which, in our case, takes place in Apache Beam on some concrete stream processing engine. The validation itself is performed in a fully automated way by the Data Validation component contributed by VortoFlow. The validation rules which are to be executed on the data stream are obtained from the domain model. If the validation fails, the message is marked and equipped with details about the validation error.

[Figure 1 diagram: the Model describes the capabilities and restrictions (e.g., min, max) of the IoT Device; the Vorto Generator uses the Model to generate code and generates the Platform Adapter; the IoT Device (Sensors, Platform Adapter) sends data to the IoT Platform in the Cloud; the Data validation component uses the Model to validate the data.]
Figure 1: Using VortoFlow in an IoT pipeline.

Technically, the realization of VortoFlow is based on the following design decisions. First, instead of generating data validation components from domain models, we choose an interpretative approach in which a generic data validation component interprets the domain model at run-time. This enables a flexible deployment process when the domain model changes. Second, this generic data validation component is implemented as a Java library which can be included in an Apache Beam project. The idea is that, besides model validation, further processing steps can be included in the final project. This is resource-efficient because the messages are already loaded. Finally, the current processing function in VortoFlow is stateless, and thus can be included without much effort.

The implementation of the generic data validation component is rather straightforward. To date, VortoFlow supports syntactical conformance checking w.r.t. the message structure defined by the domain model, as well as checking value ranges constrained by lower and upper bounds such as the one used in our running example. Furthermore, due to the stateless functioning of VortoFlow, only a single message is processed at a time. Please note that, as a positive side effect of this simplicity, VortoFlow can be operated in stream and batch mode. While the classical use case is to process a stream of incoming real-time data and to give instant feedback, VortoFlow also supports the validation of existing data. This can be helpful for multiple reasons. First, data that already exists can be validated and a Vorto model can be created afterwards. Second, it allows the user to re-evaluate data if the model has changed.

4      EVALUATION
We evaluate the applicability of our approach and prototypical implementation with respect to two research questions:
    • RQ.1 (Error Detection): Is it possible in principle to find errors in real-world IoT streaming data using our model-based validation approach?
    • RQ.2 (Scalability): Does the validation by model interpretation scale up to realistic IoT applications, which process streaming data of high volume and veracity?

4.1        Experimental Subject and Setup
Intel Lab Data. In the Intel Berkeley Research Lab, 54 Mica2Dot Mote⁴ boards equipped with weather boards were deployed and operated from February 28 to April 5, 2004, measuring the temperature, humidity and light through environment sensors [13]. The collected dataset contains several obvious errors, which makes

⁴ https://www.eol.ucar.edu/isf/facilities/isa/internal/CrossBow/DataSheets/mica2dot.pdf
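The interpretative design described in Section 3.2 can be pictured with a small, self-contained sketch. It is not the VortoFlow API: the class `InterpretedValidator`, its nested `Constraint` type and the map-based message representation are hypothetical names chosen for illustration. The domain model is held as plain data, and one generic, stateless routine checks a single message against it, covering mandatory properties and lower and upper bounds.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of model interpretation; not taken from VortoFlow.
final class InterpretedValidator {

    /** One property of the domain model; min and max are null when unconstrained. */
    static final class Constraint {
        final boolean mandatory;
        final Double min;
        final Double max;

        Constraint(boolean mandatory, Double min, Double max) {
            this.mandatory = mandatory;
            this.min = min;
            this.max = max;
        }
    }

    private final Map<String, Constraint> model;

    InterpretedValidator(Map<String, Constraint> model) {
        this.model = model;
    }

    /** Stateless check of a single message; returns every violation found. */
    List<String> validate(Map<String, ?> message) {
        List<String> violations = new ArrayList<>();
        for (Map.Entry<String, Constraint> entry : model.entrySet()) {
            String name = entry.getKey();
            Constraint c = entry.getValue();
            Object value = message.get(name);
            if (value == null) {
                if (c.mandatory) {
                    violations.add("missing mandatory property: " + name);
                }
                continue;
            }
            if (c.min == null && c.max == null) {
                continue; // no value range declared for this property
            }
            if (!(value instanceof Number)) {
                violations.add("not a number: " + name);
                continue;
            }
            double v = ((Number) value).doubleValue();
            if (c.min != null && v < c.min) {
                violations.add("MIN violated: " + name);
            }
            if (c.max != null && v > c.max) {
                violations.add("MAX violated: " + name);
            }
        }
        return violations;
    }

    /** The blender's speed message with the value range stated in Section 2. */
    static InterpretedValidator blenderSpeed() {
        Map<String, Constraint> m = new LinkedHashMap<>();
        m.put("rotations", new Constraint(true, 0.0, 12000.0));
        m.put("runTime", new Constraint(true, null, null));
        m.put("dateTime", new Constraint(false, null, null));
        return new InterpretedValidator(m);
    }
}
```

Because the rules live in the map rather than in the code, a changed model only requires a different map to be loaded. This is the deployment flexibility that motivates interpretation over code generation.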

    functionblock MICA2DOTWeatherSensor {
      status {
        // yyyy-mm-dd
        mandatory date as string
        // hh:mm:ss.xxxxxx
        mandatory time as string
        mandatory epoch as int
        mandatory moteid as int
        mandatory temperature as float <MIN -40, MAX 123.8>
        mandatory humidity as float <MIN 0, MAX 100>
        mandatory light as float
        mandatory voltage as float
      }
    }
Listing 4: Information model: Mica2Dot weather board.

it an ideal experimental subject for our study. To validate the Intel dataset with VortoFlow, we developed a domain model of the weather board using the Vorto DSL, and a test program processing the dataset in Apache Beam.

Domain Model. The domain model of the weather board is shown in Listing 4; its properties are described in the function block's status section. Here, we used domain knowledge such as the provided sensor data sheets [15] to derive the respective boundaries. For example, the temperature and humidity sensors have a measurement range from -40°C to 123.8°C and 0% to 100%, respectively.

Test Program. The test program comprises the processing pipeline shown in Figure 2. First, the Intel Lab dataset is loaded as a ZIP file from a Google Cloud Storage Bucket⁵. The file is extracted into a CSV file which is processed line by line; each line represents a message which is to be validated. To this end, each line of the CSV input is transformed into a JSON object which is compatible with our Vorto domain model of the weather board. The JSON object is passed to the generic validation function of VortoFlow and validated w.r.t. the constraints defined by the domain model. All messages which contain an error are written to a text file on a Google Cloud Storage Bucket. The experiments are run on Google Cloud Dataflow⁶ using the latest version of the Apache Beam SDK (2.4.0) for Java.

[Figure 2 diagram: Load → Transform → Validate → Write]
Figure 2: Apache Beam pipeline for processing the Intel dataset used as experimental subject.

⁵ https://cloud.google.com/storage/docs/json_api/v1/buckets
⁶ https://cloud.google.com/dataflow/

4.2     Results
RQ.1 (Error Detection). On the one hand, when validating the dataset, multiple violations of the humidity constraints were detected. Figure 3 shows an example of such a violation. Here, starting on 26 March 2004 at 00:30:05, the humidity dropped below zero; the respective values are marked by the dotted line. These are values which violate the MIN constraint of the humidity property defined by our domain model.

On the other hand, some errors passed the validation undetected. For instance, when considering the graph in Figure 4 depicting temperature values recorded by one of the temperature sensors, it is obvious that there is something wrong with the data. However, the exceptional increase in the temperature was not spotted as an error by VortoFlow since all values are still in the valid range of [−40, 123.8] as defined by the weather board model.

Nonetheless, the first example shows that the general approach works and that errors can be detected in principle, which lets us formulate a positive answer for RQ.1.

RQ.2 (Scalability). Table 1 lists the execution times of three independent runs of the pipeline shown in Figure 2 for processing the Intel dataset. For each step, the wall-clock time is given along with the average over all runs. The wall-clock time represents the approximate time taken from initialization to termination. There are multiple reasons why the results vary from run to run. The read and write tasks require the system to access the network. Here, the

[Figure 3 plot: "humidity - moteid 1"; y-axis: % (0 to 50); x-axis: Time (2004-03-02 to 2004-03-30); series: ok, error]
Figure 3: Errors in humidity readings (Mote with id 1).

[Figure 4 plot: "temperature - moteid 1"; y-axis: °C (20 to 120); x-axis: Time (2004-03-02 to 2004-03-30); series: ok]
Figure 4: Temperature readings (Mote with id 1).
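The Transform and Validate steps of Figure 2, and the two RQ.1 observations, can be illustrated with a plain Java sketch that does not depend on Apache Beam or VortoFlow. Several points are assumptions: the class `WeatherBoardCheck` is hypothetical, input fields are assumed to be separated by whitespace and to follow the property order of Listing 4, and only the two value ranges stated in the text (temperature and humidity) are checked.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for the Transform and Validate steps; not VortoFlow code.
final class WeatherBoardCheck {

    // Property names in the order of Listing 4.
    static final String[] FIELDS = {
        "date", "time", "epoch", "moteid", "temperature", "humidity", "light", "voltage"
    };

    /** Transform: split one input line into named values (whitespace delimiter assumed). */
    static Map<String, String> transform(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != FIELDS.length) {
            throw new IllegalArgumentException(
                "expected " + FIELDS.length + " fields, got " + parts.length);
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < FIELDS.length; i++) {
            record.put(FIELDS[i], parts[i]);
        }
        return record;
    }

    /** Validate: range checks for the two sensors whose ranges are given in the text. */
    static List<String> validate(Map<String, String> record) {
        List<String> errors = new ArrayList<>();
        checkRange(record, "temperature", -40.0, 123.8, errors);
        checkRange(record, "humidity", 0.0, 100.0, errors);
        return errors;
    }

    private static void checkRange(Map<String, String> record, String name,
                                   double min, double max, List<String> errors) {
        String raw = record.get(name);
        if (raw == null) {
            errors.add(name + ": missing");
            return;
        }
        double value;
        try {
            value = Double.parseDouble(raw);
        } catch (NumberFormatException e) {
            errors.add(name + ": not a number");
            return;
        }
        if (value < min) {
            errors.add(name + ": below MIN " + min);
        } else if (value > max) {
            errors.add(name + ": above MAX " + max);
        }
    }
}
```

With made-up readings (not taken from the dataset), a line such as `2004-03-26 00:30:05.0 100 1 21.5 -3.9 45.0 2.3` yields one humidity violation, whereas a reading of 120.0 °C produces no error because it lies inside [−40, 123.8]. This mirrors the limitation discussed for Figure 4: a pure range check cannot flag implausible values that are still within the sensor's measurement range.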


         Read        Transform   Validate          Write
Run 1    19 sec.     26 sec.     1 min. 31 sec.    12 sec.
Run 2    17 sec.     24 sec.     1 min. 14 sec.    10 sec.
Run 3    14 sec.     26 sec.     1 min. 22 sec.    10 sec.
Avg.     ~17 sec.    ~25 sec.    ~1 min. 22 sec.   ~11 sec.
Table 1: Execution times of three independent runs of processing the
Intel Lab dataset with VortoFlow running on Google Cloud Dataflow.

available bandwidth may vary. Furthermore, the processing of the data
requires memory and CPU time, which may be affected by the fact that
the hardware is potentially shared with other users.
   Although VortoFlow is not optimized for performance at the moment,
the experiment with the Intel dataset shows that the validation can be
done in a reasonable time. As expected, the validation step takes most
of the time, around 1 min 22 sec on average, about three times as long
as reading and writing the dataset combined, which we consider to be
acceptable. The dataset contains 2,313,682 elements, which means that,
per run, around 28,216 messages were validated per second. Thus, RQ.2
can be answered positively as well.

4.3    Discussion
Using the Vorto DSL, it was possible to create a simple yet concise
domain model for the considered domain of our experimental subject.
This model, in turn, could be used in VortoFlow to detect elements that
violate the constraints defined by the domain model. Using a
model-driven approach saved us from writing plenty of repetitive code
compared to a manual implementation of the same data validation
facilities.
   However, as indicated by the second example, checking the range of
values can only be seen as a first indication of errors. Not
surprisingly, not all errors contained in the Intel Lab dataset could
be detected using VortoFlow. Therefore, the expressiveness of the Vorto
DSL needs to be extended by further kinds of constraints, which then
need to be checked by the generic validation component. Classical data
description languages are a starting point for inspiration.
JSON-Schema, for instance, has many more features to validate a JSON
document compared to the Vorto DSL [20]. Moreover, to address the
detection of data errors, outliers and anomalies over time, like the
exceptional increase of the temperature value shown in Figure 4, the
current stateless processing of single messages is no longer
appropriate.
   From a technical point of view, there is much room for improvement
w.r.t. optimizing the performance of our prototypical implementation.
The internal structure is not optimized for quick access to all
property values. For example, each time a validation of a REGEX
constraint is executed, the regular expression is recompiled. A better
approach would be to cache the compiled expressions.

5     RELATED WORK
In this section, we review related work from two different
perspectives. First, in Section 5.1, we will have a look at approaches
leveraging MDE for the development of IoT applications, before
Section 5.2 gives an overview of the state-of-the-art in the field of
data stream validation.

5.1     Model-Driven Engineering for the IoT
Both industry and academia have recognized the need for research on a
consolidated set of best practices that will guide developers through
the manifold challenges of software engineering for the IoT [11].
Model-driven Engineering has been mentioned as one of the key paradigms
that bear the potential to tackle these challenges.
   Among the predominant challenges addressed by adopting MDE
principles are distribution and heterogeneity in the IoT. An example of
this is the ThingML (Internet of Things Modeling Language) approach
[8, 14]. It supports the modeling of IoT applications from different
viewpoints (from the architectural level to the behavior of individual
devices) through a modeling language which combines well-established
visual modeling constructs (such as state charts and component
diagrams) and an imperative yet platform-independent action language.
The generation of platform-specific code and adapters is supported
through a set of readily available yet customizable code generators for
popular programming languages and open IoT platforms (e.g., Arduino,
Raspberry Pi, Intel Edison). As mentioned, the Vorto project follows a
similar motivation and goal. The Vorto DSL has been used, e.g., to
specify manufacturer-independent abstraction layers describing the
functions and properties of vehicles on different levels of granularity
[12, 19]. We selected Vorto as a technological basis for our work since
it is actively developed, maintained and continuously evolved (cf.
commit logs on GitHub⁷). Moreover, Vorto is supported as an integral
part of the Bosch IoT Suite⁸ and based on the widely used Eclipse
Modeling⁹ technology stack.
7 https://github.com/eclipse/vorto
8 https://www.bosch-iot-suite.com
9 https://www.eclipse.org/modeling
   Besides heterogeneity and distribution, other values supported by
MDE principles such as separation of concerns for collaborative
development, automation for enabling self-adaptation at run-time, or
reusability of development artifacts have been addressed, e.g., in [4].
More recently, the same group of authors has put a specific focus on
the engineering of mission-critical IoT systems [3]. These systems
expose further challenges w.r.t. dependability requirements such as
reliability, safety and security, which may be tackled by exploiting
models for the sake of verification.
   A domain-specific MDE framework that targets IoT-based manufacturing
systems in an Industry 4.0 context has been presented in [17].
Following other approaches to MDE in this domain (see, e.g., the
research roadmap presented in [18]), the methodology exploits the UML
profiling mechanism [9] to tailor a set of popular UML diagrams towards
the specific needs of manufacturing engineers.
   However, none of the existing approaches to leveraging MDE for the
development of IoT applications exploits domain models for the
automated derivation of data stream validation facilities.

5.2     Data Stream Validation
Aiming at scalability of stream validation, it has been suggested to
rely on concepts of data stream processing [5]. In that case,
languages for data stream processing enable the formalization of
validation requirements using a well-defined set of streaming
operators, including stateless ones such as filters and
transformations, as well as stateful operators, e.g., to detect
sequential patterns. Data stream management systems then enable the
distributed execution of these operators in a compute cluster [6].
   The application of these concepts has been illustrated in SVALI
(Stream VALIdator) [21], a system that supports two data stream
validation modes: In a model-and-validate mode, users directly
formalize validation requirements as a function over streaming data,
which is then continuously evaluated. In a learn-and-validate mode, a
statistical reference model is learned from samples of normal behavior,
which is then used as a basis for validation. Either way, validation
requirements are defined on the technical level, not connected to
conceptual models of the application domain.
   In a broader context, a plethora of techniques for the detection of
anomalies in data streams has been presented in recent years. They have
in common that they assess the characteristics of a stream to detect
data that deviate significantly from expected values and, hence, can be
thought of as a continuous variant of traditional outlier detection.
Common techniques for anomaly detection in data streams are
distance-based [1, 7, 10]. Here, a stream element is considered
abnormal if it is far from a pre-defined number of neighboring stream
elements according to some distance function. Moreover, anomaly
detection may also exploit the ideas of density-based clustering to
flag abnormal stream elements [16] or be based on the angles of data
elements in a high-dimensional value space [22]. However, all such
techniques characterize anomalies by means of a mathematical model over
streaming data and are, therefore, completely disconnected from domain
models that describe data sources and the context of a specific IoT
application.

6    CONCLUSION
In this paper, we demonstrated how MDE principles can be employed in
the development of IoT applications. Specifically, we focused on the
question of how to validate data streams emitted by IoT sources through
a model-driven approach. We proposed VortoFlow, which builds upon the
Vorto DSL for the specification of IoT devices. It enables users to
capture validity requirements in terms of value ranges as part of an
information model. This model then serves as the basis for online
validation of data streams: A generic data validation component,
prototypically realized in Apache Beam, interprets the model at
run-time and flags invalid data accordingly. We demonstrated the
general feasibility and applicability of VortoFlow using the case of a
weather board.
   In order to fully exploit the potential of model-driven validation
of data streams, we intend to extend VortoFlow to support the
specification of more expressive validity requirements, along several
dimensions. First, the temporal context of data stream elements may be
worth considering, e.g., by validating a sliding average of data stream
values over a 1-minute window. Second, information models are specified
per device, whereas the Vorto DSL currently does not support the
specification of relations between the models of different devices.
Enabling the definition of such relations, however, would be useful to
capture validity requirements in terms of causal relations of data
produced by different devices (e.g., activation of an electric device
should be correlated with load measurements at a smart meter). Third,
constraint languages commonly adopted in MDE, such as OCL, provide
another angle to increase the expressiveness of information models
w.r.t. validity requirements.

REFERENCES
 [1] Fabrizio Angiulli and Fabio Fassetti. 2010. Distance-based outlier
     queries in data streams: the novel task and algorithms. Data Min.
     Knowl. Discov. 20, 2 (2010), 290–324.
 [2] Marco Brambilla, Jordi Cabot, and Manuel Wimmer. 2012. Model-driven
     software engineering in practice. Synthesis Lectures on Software
     Engineering 1, 1 (2012), 1–182.
 [3] Federico Ciccozzi, Ivica Crnkovic, Davide Di Ruscio, Ivano
     Malavolta, Patrizio Pelliccione, and Romina Spalazzese. 2017.
     Model-driven engineering for mission-critical IoT systems. IEEE
     Software 34, 1 (2017), 46–53.
 [4] Federico Ciccozzi and Romina Spalazzese. 2016. MDE4IoT: supporting
     the internet of things with model-driven engineering. In
     International Symposium on Intelligent and Distributed Computing.
     Springer, 67–76.
 [5] Gianpaolo Cugola and Alessandro Margara. 2012. Processing flows of
     information: From data stream to complex event processing. ACM
     Comput. Surv. 44, 3 (2012), 15:1–15:62.
 [6] Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi. 2016. Data
     Stream Management: Processing High-Speed Data Streams. Springer.
 [7] Dimitrios Georgiadis, Maria Kontaki, Anastasios Gounaris,
     Apostolos N. Papadopoulos, Kostas Tsichlas, and Yannis
     Manolopoulos. 2013. Continuous outlier detection in data streams:
     an extensible framework and state-of-the-art algorithms. In
     Proceedings of the Intl. Conference on Management of Data.
     1061–1064.
 [8] Nicolas Harrand, Franck Fleurey, Brice Morin, and Knut Eilif Husa.
     2016. ThingML: a language and code generation framework for
     heterogeneous targets. In Proceedings of the ACM/IEEE 19th
     International Conference on Model Driven Engineering Languages and
     Systems. ACM, 125–135.
 [9] Timo Kehrer, Michaela Rindt, Pit Pietsch, and Udo Kelter. 2013.
     Generating Edit Operations for Profiled UML Models. In ME@MoDELS
     (CEUR Workshop Proceedings), Vol. 1090. CEUR-WS.org, 30–39.
[10] Maria Kontaki, Anastasios Gounaris, Apostolos N. Papadopoulos,
     Kostas Tsichlas, and Yannis Manolopoulos. 2011. Continuous
     monitoring of distance-based outliers over data streams. In
     Proceedings of the 27th International Conference on Data
     Engineering. 135–146.
[11] Xabier Larrucea, Annie Combelles, John Favaro, and Kunal Taneja.
     2017. Software engineering for the internet of things. IEEE
     Software 34, 1 (2017), 24–28.
[12] Jeroen Laverman, Dennis Grewe, Olaf Weinmann, Marco Wagner, and
     Sebastian Schildt. 2016. Integrating Vehicular Data into Smart Home
     IoT Systems Using Eclipse Vorto. In IEEE 84th Vehicular Technology
     Conference. 1–5.
[13] Samuel Madden et al. 2004. Intel Lab Data.
     http://db.csail.mit.edu/labdata/labdata.html
[14] Brice Morin, Nicolas Harrand, and Franck Fleurey. 2017. Model-based
     software engineering to tame the IoT jungle. IEEE Software 34, 1
     (2017), 30–36.
[15] Sensirion Inc. 2011. Datasheet SHT1x (SHT10, SHT11, SHT15) Humidity
     and Temperature Sensor IC.
     https://www.sensirion.com/fileadmin/user_upload/customers/sensirion/Dokumente/0_Datasheets/Humidity/Sensirion_Humidity_Sensors_SHT1x_Datasheet.pdf
[16] Sharmila Subramaniam, Themis Palpanas, Dimitris Papadopoulos, Vana
     Kalogeraki, and Dimitrios Gunopulos. 2006. Online Outlier Detection
     in Sensor Data Using Non-Parametric Models. In Proceedings of the
     32nd International Conference on Very Large Data Bases. 187–198.
[17] Kleanthis Thramboulidis and Foivos Christoulakis. 2016. UML4IoT: A
     UML-based approach to exploit IoT in cyber-physical manufacturing
     systems. Computers in Industry 82 (2016), 259–272.
[18] Birgit Vogel-Heuser, Stefan Feldmann, Jens Folmer, Jan Ladiges,
     Alexander Fay, Sascha Lity, Matthias Tichy, Matthias Kowal, Ina
     Schaefer, Christopher Haubeck, et al. 2015. Selected challenges of
     software evolution for automated production systems. In 13th IEEE
     International Conference on Industrial Informatics (INDIN). IEEE,
     314–321.
[19] Marco Wagner, Jeroen Laverman, Dennis Grewe, and Sebastian Schildt.
     2016. Introducing a harmonized and generic cross-platform interface
     between a Vehicle and the Cloud. In 17th IEEE International
     Symposium on A World of Wireless, Mobile and Multimedia Networks.
     1–6.
[20] Austin Wright, Henry Andrews, and Geraint Luff. 2018. JSON Schema
     Validation: A Vocabulary for Structural Validation of JSON. Working
     Draft. IETF Secretariat.
     https://tools.ietf.org/html/draft-handrews-json-schema-validation-01
[21] Cheng Xu, Daniel Wedlund, Martin Helgoson, and Tore Risch. 2013.
     Model-based validation of streaming data: (industry article). In
     The 7th ACM International Conference on Distributed Event-Based
     Systems. 107–114.
[22] Hao Ye, Hiroyuki Kitagawa, and Jun Xiao. 2015. Continuous
     Angle-based Outlier Detection on High-dimensional Data Streams. In
     Proceedings of the 19th International Database Engineering &
     Applications Symposium. 162–167.