Tracking sensitive and untrustworthy data in IoT *

Tracking sensitive and untrustworthy data in IoT * ChiaraBodei chiara@di.unipi.it LetterioGalletta galletta@di.unipi.it Dipartimento di Informatica Università di Pisa Partially supported Università di Pisa Tracking sensitive and untrustworthy data in IoT * 41A2DFF85C419A389A115A44B3FBDDEA GROBID - A machine learning software for extracting information from scholarly documents

The Internet of Things (IoT) produces and processes large amounts of data. Among these data, some must be protected and others must be carefully handled because they come from untrusted sources. Taint analysis techniques can be used to for marking data and for monitoring their propagation at run time, so to determine how they influence the rest of the computation. Starting from the specification language IoT-LySa, we propose a Control Flow Analysis for statically predicting how tainted data spread across an IoT system and for checking whether those computations considered security critical are not affected by tainted data.

Introduction

We are living the Internet of Things (IoT) revolution, where the things we use every day have computational capabilities and are always connected to the Internet. These devices are equipped with sensors, produce huge amounts of data, store them on the cloud or use them to affect the surrounding environment through actuators.

The IoT paradigm introduces new pressing security challenges. On the one hand, a large part of the collected data is sensitive and must be protected against attackers. However, these data need to be shared and processed to be useful. For instance, a fitness app should store and manipulate data about the heart rate or the expended calories to be helpful for users. On the other hand, an attacker can easily intercept communications or manipulate sensors to alter data (tampering). Thus, an IoT system must be resistant against "bad" data that may come from malicious sources or that may be the result of tampering with "good" sensors. For instance, deciding to stop an industrial plant in dependence of data coming from tampered sensors can have severe consequences.

Usually, formal methods provide designers with tools to support the development of systems and to reason about their properties. We follow this line of research by presenting preliminary results about using static analysis to study simple security properties of IoT systems, i.e. networks of nodes, where each node interacts with the environment through sensors and actuators.

Technically, our starting point is the formal specification language IoT-LySa, a process calculus recently proposed for IoT systems [4,3]. IoT-LySa may help designers to adopt a Security by Design development model. Indeed, designers can model the structure of the system and how its components (smart objects) interact with each other through the IoT-LySa primitives. Furthermore, they can reason about the system correctness and robustness by using the Control Flow Analysis (CFA) of IoT-LySa. This analysis safely approximates the system behaviour, by statically predicting how data from sensors may spread across the system and how objects may interact with each other. Technically, it "mimics" the evolution of the system, by using abstract values in place of concrete ones and by modelling the consequences of each possible action. Designers can detect possible security flaws through this "abstract simulation" and intervene as early as possible during the design phase.

Here, we propose a variant of this CFA for static taint analysis. Taint analysis predicts how information flows from specific data sources to the computations that use these data [11,10]. The basic idea is to classify data sources in tainted and untainted and to mark as tainted or untainted the derived data, through suitable propagation rules.

In order to detect possible confidentiality or privacy leaks, we mark as tainted data coming from sensitive sensors and check whether they are protected when in transit. Otherwise, we mark as tamperable data coming from sensors or memory locations that can be tampered and check which computations they affect, thus addressing integrity issues. Furthermore, we require designers (1) to identify some parts of the code as critical (critical points), e.g. a conditional statement whose result may trigger a risky actuation; and (2) to provide a set of functions that untaint data. In the case of sensitive data these functions remove privacy-critical information, e.g. by blurring people faces in surveillance videos. In the case of tamperable data these functions perform sanitisation. By exploiting the result of our analysis, we check whether (i) a variable is untrustworthy, because affected by possibly tampered data; (ii) sensitive data may be leaked; and (iii) computations in critical points depend on tamperable data. As a consequence, analysis results may help designers in making educated decisions, on the exposure of their data, as well as on their trustfulness and robustness. As far as the last point is concerned, they can indeed statically evaluate the impact of a certain amount of tamperable data, e.g. by answering to question like: how many sensors can be compromised without dramatically impacting on the overall behaviour of the system? Note that tamperable data may also produce, besides security issues, quality ones, such as unexpected behaviours.

Since the CFA computes an over-approximation of the behaviour of a system, if the analysis does not predict any dependence or any leak, we can be sure that at run time it will never occur. If instead the CFA does, there is only the possibility of the violation. Nevertheless, it can be worthwhile to track where the tainted values may come from and decide the points that deserve to be monitored at run time in order to prevent violations.

We assume that the processes running inside a node are protected against tampering. This can be obtained by resorting to standard mechanisms for integrity. However, we assume that an attacker can tamper with every sensor or memory location considered tamperable. When we focus on confidentiality, we assume instead that communication is not protected against eavesdropping, but that the attacker does not intervene in communications. Dealing with an attacker able to inject forged messages can be managed by following [2]. Finally, we assume to use perfect cryptography and to have keys, fixed once and for all, exchanged in a secure manner at deployment time, as it is often the case, e.g. ZigBee [12].

The paper is organised as follows. In Section 2 we introduce a motivating example, which is also instrumental in giving an overall picture of our methodology. In Section 3 we briefly recall the process calculus IoT-LySa as introduced in [4,3]. Then, we define the CFA for static taint analysis in Section 4, and we also show how to check the security properties described above. The Appendix includes part of the technical details of our formalism.

A storehouse with perishable food

In this section we illustrate our methodology through a scenario similar to the one of [5]. Consider an IoT system for keeping the temperature under control inside a storehouse storing perishable food (see Figure 1). The temperature must be regulated depending on the quantity and the kind of the food. The system also monitors how long the food is kept inside the Figure 1: The organisation of nodes in the IoT system of a storehouse with perishable food. storehouse to avoid passing the expiry-date. In the storehouse there are four nodes: N 1 , N 2 , N 3 and N 4 . The specification of the node N 1 in IoT-LySa is as follows:

N 1 = 1 ∶ [Σ 1 ∥ P c ∥ (S 0 ∥ S 1 ∥ S 2 ∥ S 3 )]

where

S i = µh.i ∶= v.τ.h i ∈ [0, 3] P c = µh.z 0 ∶= 0.z 1 ∶= 1.z 2 ∶= 2.z 3 ∶= 3.⟨⟨avg(z 0 , z 1 , z 2 , z 3 )⟩⟩ ▷ { 3 }. th 1 − th 3 ≤ avg(z 0 , z 1 , z 2 , z 3 ) ≤ th 2 + th 3 ? th 1 ≤ avg(z 0 , z 1 , z 2 , z 3 ) ≤ th 2 ? h ∶ ⟨j, start⟩. h ∶ ⟨⟨alarm⟩⟩ ▷ { 3 }.h

The node N 1 , uniquely identified in the system by the label 1 , is made of P c , a software control process that specifies the logic of the node, and four temperature sensors S i , one for each corner of the storehouse (a rectangular room). Each sensor S i periodically senses the environment and sends the temperature through a wireless communication to P c . In IoT-LySa each sensor inside a node is identified by an index i and communicates the value v read from the environment by storing it in the corresponding reserved location i of the shared store Σ 1 . In IoT-LySa intranode communications are abstractly modelled through operations over the shared store. The action τ denotes internal actions, which we are not interested in modelling, e.g. noise reduction of sensed data; the construct µh. implements the iterative behaviour of the sensor. The control process P c reads temperatures from sensors S i and stores them to local variables z i through the assignments z i ∶= i (with i identifier of the sensor S i ); then, it computes the average through the function avg and sends it to the node N 3 with label 3 through the primitive ⟨⟨ ⟩⟩ ▷ . Then, P c checks if this average is within the range [th 1 , th 2 ] (th 1 and th 2 are the threshold variables and the primitive ? ∶ is a conditional). If this is not the case and the temperature is out of range of a value lower than th 3 (tolerance threshold), the actuator j is started through the message ⟨j, start⟩ to accordingly turn on/off the cooling system. If the difference is too high (greater than th 3 ), the control unit sends instead an alarm to the node N 3 . The node N 2 with label 2 does the stocktaking and sends it to the node N 3 . We assume that each wood box containing food is equipped with a RFID read by the reader R 0 (0 is the index of the sensor) of N 2 when the box enters the storehouse. The specification of N 2 is:

N 2 = 2 ∶ [Σ 2 ∥ µh.x ∶= 0.db ∶= update(db, x).τ.h ∥ µh.⟨⟨db⟩⟩ ▷ { 3 }.h ∥ R 0 ]

The node is made of two processes. The first reads a value from R 0 (defined as the sensors S i above) and updates the stocktaking db that the second one periodically sends to N 3 .

The node N 3 with label 3 works as the system controller and as gateway towards the Cloud (i.e. the node N 4 with label 4 ). It receives the stocktaking and the temperature, and checks whether the temperature is acceptable for the quantity and the kind of the stored food. If this is not the case it drives N 1 to take the proper actions and raises an alarm through the Cloud. The specification of the process P k of N 3 performing these checks is as follows:

P k = µh.temp / ∈ validRange(db) ? ⟨⟨alarm⟩⟩ ▷ { 4 }.⟨⟨validRange(db)⟩⟩ ▷ { 1 }.h ∶ h

where validRange, given the current content of the storehouse, returns a range of admissible temperatures to be maintained inside the room.

Since an attacker may manipulate data from sensors and also the sensors themselves, we consider sensors as tamperable data sources, and we tag with the taint label for tamperable data the locations of Σ 1 reserved for the sensors S i of N 1 . Similarly, we tag the location of Σ 2 reserved for R 0 in N 2 . As discussed above, N 3 controls the system based on data received by others nodes. Thus, we mark the code checking if the temperature is right (the guard of the conditional of P k ) as a critical point.

However, our CFA can detect that our system is not robust against data manipulations. Indeed, an attacker by manipulating the sensors S i and R 0 can interfere or drive the decision of N 3 and may damage the system, because the computations in the critical point directly depend on these data. This is because the nodes N 1 and N 2 simply forward their data to N 3 without performing any sanitisation. In particular, the analysis propagates the tag and detects that local variables z i inside N 1 are also tamperable, and so it is the computed average that is sent to N 3 . The same holds for the stocktaking computed by N 1 .

To solve this problem we need to modify our design to perform data sanitisation. The CFA can help us during this process because we can try different approaches and test them through the analysis (what if reasoning). For instance, assume that the attacker can tamper with only one sensor, e.g. S 0 . A possible approach to deal with this fact is to modify the behaviour of N 1 by observing that sensors on the same side of the room should perceive the same temperature with a difference that can be at most a given value . The same happens when we consider the difference between two consecutive not manipulated samples: this difference is at most a given value δ. In this way we may compare data and discover possible manipulations and discard possibly tampered data. These checks can be implemented in N 1 specification through a function adjust that returns the temperature read from a sensor, adjusted taking into account the previous samples and temperature of the adjacent sensor. In the specification of P c the assignment to the variable z 0 would become z 0 = adjust(0, 3, s 0 ) where s 0 contains the previous samples of the sensor S 0 . The same happens for the other variables. Thus, the function adjust sanitises the input and does not propagate taint information, i.e. it never returns a value with tag . By using the CFA we can be sure that N 1 sends no tainted data to the node N 3 .

The calculus IoT-LySa

Here, we briefly review the process calculus IoT-LySa [4,3]. Differently from other process calculus approches to IoT, e.g. [8,9], the focus of IoT-LySa is on the development of a design framework that includes a static semantics to support verification techniques and tools for certifying properties of IoT applications.

Systems in IoT-LySa have a two-level structure and consist of a finite number of nodes (things), each of which hosts a store for internal communication and a finite number of control processes (representing the software), sensors and actuators. We assume that each sensor (actuator) in a node with label is uniquely identified by an index i ∈ I (j ∈ J , resp). Data are represented by terms. Labels a, a ′ , a i , ..., ranged over by A, identify the occurrences of terms. They are used in the analysis and do not affect the dynamic semantics. Formally, the syntax of labelled expression and the one of unlabelled terms are as follows.

E ∋ E ∶∶= labelled terms M ∋ M, N ∶∶= terms M a labelled term v value (v ∈ V) with a ∈ A i sensor location (i ∈ I ) x {E 1 , ⋯, E r } k 0 encryption with key k 0 ∈ K f (E 1 , ⋯, E r ) function on data

We assume as given a finite set K of secret keys owned by nodes, previously exchanged in a secure way, as it is often the case [12]. The encryption function {E 1 , ⋯, E r } k 0 returns the result of encrypting values E i for i ∈ [1, r] under the shared key k 0 . The term f (E 1 , ⋯, E r ) is the application of function f to r arguments; we assume given a set of primitive functions, typically for aggregating or comparing values. We assume the sets V, I , J , K be pairwise disjoint.

The syntax of systems of nodes and of its components is as follows.

N ∋ N ∶∶= systems of nodes B ∋ B ∶∶= node components 0 empty system Σ node store ∶ [B] single node ( ∈ L) P process N 1 | N 2 par. composition S sensor (label i ∈ I ) A actuator (label j ∈ J ) B ∥ B par. composition A node ∶ [B]

is uniquely identified by a label ∈ L that may represent further characterising information (e.g. node location). Sets of nodes are described through the (associative and commutative) operator | for parallel composition. The 0 denotes a system with no nodes. Inside a node ∶ [B] there is a finite set of components described by the parallel operator ∥.

We impose that there is a single store Σ ∶ X ∪ I → V, where X , V are the sets of variables and of values (integers, booleans, ...), resp. The store is essentially an array whose indexes are variables and sensors identifiers i ∈ I (no need of α-conversions). We assume that store accesses are atomic, e.g. through CAS instructions [7]. The other node components are control processes P , and sensors S (less than #(I )), and actuators A (less than #(J )) the actions of which are in Act. The syntax of processes is as follows. The prefix ⟨⟨E 1 , ⋯, E r ⟩⟩ ▷ L implements a simple form of multi-party communication: the tuple obtained by evaluating E 1 , . . . , E r is asynchronously sent to the nodes with labels in L that are "compatible" (according, among other attributes, to a proximity-based notion). The input prefix (E 1 ,⋯, E j ; x j+1 ,⋯, x r ) receives a r-tuple, provided that its first j elements match the corresponding input ones, and then assigns the variables (after ";") to the received values. Otherwise, the r-tuple is not accepted. A process repeats its behaviour, when defined through the tail iteration construct µh.P (h is the iteration variable). The process decrypt E as {E 1 , ⋯, E j ; x j+1 , ⋯, x r } k 0 in P tries to decrypt the result of the expression E with the shared key k 0 ∈ K. Also in this case we use the pattern matching. If the pattern matching succeeds, the process continues as P and the variables x j+1 , . . . , x r are suitably assigned. Sensors and actuators have the form: A sensor can perform an internal action τ or store the value v, gathered from the environment, into its store location i. An actuator can perform an internal action τ or execute one of its action γ, possibly received from its controlling process. Both sensors and actuators can iterate.

P ∶∶= 0 inactive process ⟨⟨E 1 , ⋯, E r ⟩⟩ ▷ L. P asynchronous multi-output L ⊆ L (E 1 , ⋯, E j ; x j+1 , ⋯, x r ). P input (with matching) decrypt E as {E 1 , ⋯, E j ; x j+1 , ⋯, x r } kS ∶∶=

For simplicity, here we neither provide an explicit operation to read data from the environment, nor to describe the impact of actuator actions on the environment. The semantics is based on a standard structural congruence and a two-level reduction relation → defined as the least relation on nodes and its components, where we assume the standard denotational interpretation [[E]] Σ for evaluating terms. The complete semantic is in Appendix. The reader not interested in the technical details can safely skip it without compromising the comprehension of the rest of the paper. As examples of semantic rules, we show the rules (Ev-out) and (Multi-com) that drive asynchronous multi-communications.

(Ev-out) ⋀ r i=1 v i = [[E i ]] Σ Σ ∥ ⟨⟨E 1 , ⋯, E r ⟩⟩ ▷ L. P ∥ B → Σ ∥ ⟨⟨v 1 , ⋯, v r ⟩⟩ ▷ L.0 ∥ P ∥ B (Multi-com) 2 ∈ L ∧ Comp( 1 , 2 ) ∧ ⋀ j i=1 v i = [[E i ]] Σ 2 1 ∶ [⟨⟨v 1 , ⋯, v r ⟩⟩ ▷ L. 0 ∥ B 1 ] | 2 ∶ [Σ 2 ∥ (E 1 , ⋯, E j ; x a j+1 j+1 , ⋯, x a r r ).Q ∥ B 2 ] → 1 ∶ [⟨⟨v 1 , ⋯, v r ⟩⟩ ▷ L \ { 2 }. 0 ∥ B 1 ] | 2 ∶ [Σ 2 {v j+1 /x j+1 , ⋯, v r /x r } ∥ Q ∥ B 2 ]

In the first rule, to send a message ⟨⟨v 1 , ..., v r ⟩⟩ obtained by the evaluation of ⟨⟨E 1 , ..., E r ⟩⟩, a node with label spawns a new process, running in parallel with the continuation P ; this new process offers the evaluated tuple to all the receivers with labels in L. In the second rule, the message coming from 1 is received by a node labelled 2 , provided that: (i) 2 belongs to the set L of possible receivers, (ii) the two nodes satisfy a compatibility predicate Comp (e.g. when they are in the same transmission range), and (iii) that the first j values match with the evaluations of the first j terms in the input. Moreover, the label 2 is removed by the set of receivers L of the tuple. The spawned process terminates when all the receivers have received the message (L = ∅).

Control flow analysis

Here we present a CFA for static taint analysis to track the propagation of sensitive data and data coming from possibly tampered sensors or variables. This CFA follows the same schema of the one in [4,3] for IoT-LySa. However, here we use different abstract values and propagation rules. Intuitively, abstract values "symbolically" represent runtime data so as to encode whether these data are tainted or not. Furthermore, we define functions over abstract values encoding how information about the taint is propagated to derived data. Finally, we show how to use the CFA results to check whether data from tamperable sources affect variables and computations considered security critical. We also check whether sensitive data are leaked.

Taint information To formally introduce our abstract values and operators through which we can combine and manipulate them, we first define the set of taint labels B (ranged over by b, b 1 , ...) whose elements (the colours in the pdf should help the intuition) are:

◇ untainted sensitive tamperable sensitive & tamperable

The idea is that these labels mark our abstract values with information about their sources. b that denote all the terms with a depth greater than a given threshold d. During the analysis, to cut these values, we will use the function ⌊−⌋ d , defined as expected (see Appendix). Note that once given the set of encryption functions occurring in a node N , the abstract values are finitely many.

We assume that designers provide the analysis with a classification of the data sources, i.e. the set of sensitive sensors S , and the set of the tamperable sources (sensors and variables) T , for each node with label .

⊗ ◇ ◇ ◇ ⊗ ◇ L L L ◇ ◇ L L L L L ′ L ′ L∩L ′ L∩L ′ L∩L ′ L ′ L ′ L∩L ′ L∩L ′ L∩L ′ L ′ L ′ L∩L ′ L∩L ′ L∩L ′

Table 1: The operator ⊗, version 1 (on the left) and version 2 (on the right) Definition 4.1 (Data classification). Given the set of sensitive sensors S , and the set of the tamperable sensors and variables T , the taint assignment function τ is defined as follows:

τ (y, ) = ⎧ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ if y ∈ S if y ∈ T if y ∈ S ∩ T ◇ o.w. where y ∈ I ∪ X τ (v, ) = ◇

The above function classifies only the data source; to propagate the taint information we resort to the taint combination operator ⊗ ∶ B × B → B defined in Table 1, on the left part. Note that ⊗ works as a join operator making our abstract values a lattice similar to the 4-point lattice LL,LH,HL,HH, used in the information flow literature, which combines confidentiality and integrity. This operator naturally extends to abstract values:

b ⊗ v = v b ′ ↓ 1 where v ∈ V and b ′ = b ⊗ v↓ 2 ; and to sets of abstract values: b ⊗ V = {b ⊗ v|v ∈ V ⊆ V}.

Now we need to specify how taint information propagate with aggregation functions. Below we define two simple propagation policies for function applications and encryptions. In the case of function application, the idea is that a single tainted argument turns to tainted the result of the application. Encryption protects its sensitive data, but preserves tamperable taint information. This means that encryption as a structure is marked with label ◇, unless one of its component is marked with or ; in these cases, encryption is labelled with .

Definition 4.2 (Taint propagation policies). Given the combination operator ⊗ ∶ B × B → B, the taint resulting by the application of

• a function f is F τ (f, v1 , . . . , vr ) = ⊗(v 1↓ 2 , . . . , vr↓ 2 ) • an encryption function is Enc τ (v 1 , . . . , vr ) = ◇ if ∀i.v i↓ 2 ∈ {◇, } o.w.

It is obviously possible to consider different policies for propagation, e.g. we could consider a set of functions that produce sensitive data independently from the taint information of its arguments. Moreover, we could deal with anonymisation functions that process their arguments to remove sensitive information. Consider e.g. blurring the faces of people in surveillance videos to protect their privacy. A policy for these functions is similar to the one for the encryption: the resulting taint label is ◇, if no argument is also tamperable, otherwise is .

Finally, we could extend the whole presented schema to keep also the source nodes of tainted information. To do that, we could introduce in our abstract values also the labels of nodes from which the taint information derives. To deal with this new information, the taint assignment function τ returns pairs (b, L), written b L (b when L = { }), our combinator operator becomes the one shown on the right part in Table 1, and the encryption operator takes as argument and uses it to annotate the resulting taint label.

CFA validation and correctness

We now have all the ingredients to define our CFA to approximate communications and data stored and exchanged and, in particular, the taint contents of messages and values. We specify our analysis in a logical form through a set of inference rules expressing the validity of the analysis results. The analysis result is a triple ( Σ, κ, Θ) (a pair ( Σ, Θ) when analysing a term), called estimate for N (for E), where Σ, κ, and Θ are the following abstract domains:

• the union Σ = ⋃ ∈L Σ of the super-sets Σ ∶ X ∪ I → 2 V of abstract values that may possibly be associated to a given location in I or a given variable in X ,

• a super-set κ ∶ L → L × ⋃ k i=1

Vi of the messages that may be received by the node , and

• a super-set Θ ∶ L → A → 2 V of

the taint information of the actual values computed by

each labelled term M a in a given node , at run time.

Once a proposed estimate is available, its correctness has to be validated. This requires that it satisfies the judgements defined on the syntax of nodes, node components and terms. They are defined by a set of clauses, fully presented in Appendix (see Tables 3 and 4). Here, we show some examples. The judgements for labelled terms have the form ( Σ, Θ) ⊧ M a . For each term M a occurring in the node , the corresponding judgement requires that Θ( )(a) includes all the abstract values v associated to M a . Consider, e.g. the following clause for the variable x a .

τ (x, ) ⊗ Σ (x) ⊆ Θ( )(a) ( Σ, Θ) ⊧ x a

In this case, an estimate is valid if Θ( )(a) includes the abstract values resulting from the combination (through the operator ⊗) of the taint information associated to the variable x (via τ (x, )), and the abstract values bound to x in Σ . This combination allows us to propagate the tamperable taint if x belongs to the set of tamperable variables.

The judgements for nodes have the form ( Σ, κ, Θ) ⊧ N . The rule for a single node ∶ [B] requires that its internal components B are analysed with judgements ( Σ, κ, Θ) ⊧ B. As examples of clauses, we consider the clauses for communication.

⋀ k i=1 ( Σ, Θ) ⊧ M a i i ∧ ( Σ, κ, Θ) ⊧ P ∧ ∀v 1 , ⋯, vr ∶ ⋀ r i=1 vi ∈ Θ( )(a i ) ⇒ ∀ ′ ∈ L ∶ ( , ⟨⟨v 1 , ⋯, vr ⟩⟩) ∈ κ( ′ ) ( Σ, κ, Θ) ⊧ ⟨⟨M a 1 1 , ⋯, M a r r ⟩⟩ ▷ L. P ⋀ j i=1 ( Σ, Θ) ⊧ M a i i ∧ ∀( ′ , ⟨⟨v 1 , ⋯, vr ⟩⟩) ∈ κ( ) ∶ Comp( ′ , ) ⇒ ⋀ r i=j+1 vi ⊗ τ (x i , ) ∈ Σ (x i ) ∧ ( Σ, κ, Θ) ⊧ P ( Σ, κ, Θ) ⊧ (M a 1 1 , ⋯, M a j j ; x a j+1 j+1 , ⋯, x a r r ). P

An estimate is valid for multi-output if it is valid for the continuation of P and the set of messages communicated by the node to each node ′ in L, includes all the messages obtained by the evaluation of the r-tuple ⟨⟨M

a 1 1 , ⋯, M a r

r ⟩⟩. More precisely, the rule (i) finds the sets Θ( )(a i ) for each term M a i i , and (ii) for all tuples of values (v 1 , ⋯, vr ) in Θ( )(a 1 ) × ⋯ × Θ( )(a r ) it checks if they belong to κ( ′ ) for each ′ ∈ L. Symmetrically, the rule for input requires that the values inside messages that can be sent to the node , passing the pattern matching, are included in the estimates of the variables x j+1 , ⋯, x r . More in detail, the rule analyses each term M a i i , and requires that for any message that it can receive, i.e. ( ′ , ⟨⟨v 1 , ⋯, vj , vj+1 , . . . , vr ⟩⟩) in κ( ) and Comp( ′ , ), vj+1 , . . . , vr are included and combined with the estimates of x j+1 , ⋯, x r .

Example 4.3. Consider again our example of the storehouse in Sect. 2. A valid estimate ( Σ, κ, Θ) must include the following entries, where τ (z 0 , ) = , ν

F τ (avg,ν ,ν ◇ ,ν ◇ ,ν ◇ ) = ν ,

and Avg, and Db are the labels of the terms avg(z 0 , z 1 , z 2 , z 3 ) and db.

Σ 1 (z 0 ) ⊇ {ν } Σ 2 (x) ⊇ {ν } Σ 2 (db) ⊇ {ν } Σ 3 (temp) ⊇ {ν } Θ( 1 )(Avg) ⊇ {ν } Θ( 2 )(Db) ⊇ {ν } κ( 3 ) ⊇ {( 1 , ν )} κ( 3 ) ⊇ {( 2 , ν )}

Indeed, an estimate must satisfy the checks of the CFA rules. For instance, the validation of P c requires, in particular, that ν is in Σ 1 (z 0 ) for the rule for variables, ( 1 , ν ) ∈ κ( 3 ) for the rule for output, and ν ∈ Θ( 1 )(Avg) for the rule of functions (see Appendix).

Our analysis respects the operational semantics of IoT-LySa. As usual, we prove a subject reduction result for our analysis. The proofs follow the usual schema and benefit from an instrumented denotational semantics for expressions, the values of which are pairs ⟨v, v⟩ where v is a concrete value and v is the corresponding abstract value. The formal definition can be found in Appendix. The store (Σ i with an undefined ⊥ value) is accordingly extended. Of course, we compute the second component of the pair by using the operators introduced in Definition 4.2, while, the semantics used in Table 2 uses the projection on the first component.

The following subject reduction theorem establishes the correctness of our CFA, by relying on the relation ⋈ that says when the concrete and the abstract stores agree. Its definition is immediate, since the analysis only considers the second component of the extended store, i.e. the abstract one: Σ

i ⋈ Σ iff w ∈ X ∪ I such that Σ i (w) ≠ ⊥ implies (Σ i (w)) ↓ 2 ∈ Σ (w). Theorem 4.4 (Subject reduction). If ( Σ, κ, Θ) ⊧ N and N → N ′ and ∀Σ i in N it is Σ i ⋈ Σ , then ( Σ, κ, Θ) ⊧ N ′ and ∀Σ i ′ in N ′ it is Σ i ′ ⋈ Σ .

Checking taint We now show that by inspecting the results of our CFA, we detect whether a variable receives a value coming from a tamperable source, and hence is not trustworthy. Then, we rephrase the confidentiality property of [3] in terms of propagation of sensitive contents, in order to prevent their leaks. Finally, we check when a computation, considered security critical by designers, depends on a tamperable source (a generalisation of the first property).

In the following, we denote with N ′ when the message ⟨⟨v 1 , . . . , v r ⟩⟩ is sent from the node 1 to the node 2 .

First we characterise when a variable x of a node is untrustworthy, i.e. when it stores a value propagated from a tamperable source, and therefore with taint label in { , }. To statically check this property we can simply inspect the abstract store Σ and the taint labels of the corresponding abstract values. Definition 4.5. Let N be a system of nodes with labels in L, and T = {T | ∈ L} be the set of tamperable sources. Then, the variable

x of a node ∈ L is untrustworthy w.r.t. T , if for all derivatives N ′ s.t. N → * N ′ it holds that Σ ′i (x)↓ 2 ∈ { , } where Σ ′i is the store of in N ′ .

Theorem 4.6. Let N be a system of nodes with labels in L, and let T = {T | ∈ L} the set of tamperable sources. Then a variable x of the node is untrustworthy w.r.t. T if ( Σ, κ, Θ) ⊧ N , and Σ (x) ↓ 2 ⊆ { , }.

We define the confidentiality property in terms of sensitive taint information. There are no leaks when messages do not expose values with taint label in { , }. We statically verify it, by inspecting the labels of the corresponding abstract values in the component κ. The last property characterises when computations considered security critical are not directly or indirectly affected by tainted data i.e. they are reached by untrustworthy data. We assume that the designer identify a set P ⊆ A of critical points in the application, i.e. points where possibly tainted data should flow, unless they are checked for validity. We statically verify this property, by inspecting the taint labels of the values in Θ for each critical point.

a i i ]] Σ i ) ↓ 2 ⊆ { , , }

for some i ∈ {1, . . . , r} and ∈ L. . By using the labelled version of abstract values and the operator ⊗, as defined on the right part in Table 1, we can determine where the possibly tampered data originally come from, i.e. which are the "bad sinks". In our example, we would obtain 1 and 2 , because Σ 3 (temp) ⊇ {ν 1 } and Σ 2 (db) ⊇ {ν 2 }. Similarly, we can trace back possible leakages.

Conclusions

Based on IoT-LySa, we developed a CFA for static taint analysis to track the propagation of sensitive data and data coming from possibly tampered sensors or variables, as illustrated by a motivating example that provides a simple but non-trivial application of our approach. Taint tracking is a relevant issue in security and can be used for checking both confidentiality and integrity of data (see [6,11,10] to cite only a few). The idea underlying our approach requires that the designer classify data sources as untainted, sensitive, tamperable. Static analysis has the advantage of giving hints on propagation of tainted data in a early phase of system design, thus guiding designers to understand the potential vulnerabilities and to direct their efforts towards the suitable modifications and validity checks. It is not meant to being a substitute of dynamic checks, but only as a supporting technique.

Here we relied on CFA approximations for checking various security properties about data propagation. In particular, we could detect if a variable is untrustworthy, because affected by possibly tampered data; if sensitive data are leaked; and if computations in critical point depend on tamperable data. Our CFA could also provide a prescriptive usage (along the lines of [1]), by embedding taints in data and forcing some data to be accepted only if they come with the expected taints.

In future we would like to extend our analysis to also deal with implicit flows of tainted data and to understand the relationships with the properties based on implicit flow (see [10]).

(S-store)

Σ ∥ i a ∶= v a ′ . S i ∥ B → Σ{v/i} ∥ S i ∥ B (Asgm) [[E]] Σ = v Σ ∥ x a ∶= E. P ∥ B → Σ{v/x} ∥ P ∥ B (Cond1) [[E]] Σ = true Σ ∥ E? P 1 ∶ P 2 ∥ B → Σ ∥ P 1 ∥ B (Cond2) [[E]] Σ = false Σ ∥ E? P 1 ∶ P 2 ∥ B → Σ ∥ P 2 ∥ B (A-com) γ ∈ Γ ⟨j, γ⟩. P ∥ (|j, Γ|). A ∥ B → P ∥ γ. A ∥ B (Act) γ.A → A (Int) τ. X → X (Decr) [[E]] Σ = {v 1 , ⋯, v r } k 0 ∧ ⋀ j i=1 v i = [[E ′ i ]] Σ Σ ∥ decrypt E as {E ′ 1 , ⋯, E ′ j ; x a j+1 j+1 , ⋯, x a r r } k 0 in P ∥ B → Σ{v j+1 /x j+1 , ⋯, v r /x r }∥ P ∥ B (Ev-out) ⋀ r i=1 v i = [[E i ]] Σ Σ ∥ ⟨⟨E 1 , ⋯, E r ⟩⟩ ▷ L. P ∥ B → Σ ∥ ⟨⟨v 1 , ⋯, v r ⟩⟩ ▷ L.0 ∥ P ∥ B (Multi-com) 2 ∈ L ∧ Comp( 1 , 2 ) ∧ ⋀ j i=1 v i = [[E i ]] Σ 2 1 ∶ [⟨⟨v 1 , ⋯, v r ⟩⟩ ▷ L. 0 ∥ B 1 ] | 2 ∶ [Σ 2 ∥ (E 1 , ⋯, E j ; x a j+1 j+1 , ⋯, x a r r ).Q ∥ B 2 ] → 1 ∶ [⟨⟨v 1 , ⋯, v r ⟩⟩ ▷ L \ { 2 }. 0 ∥ B 1 ] | 2 ∶ [Σ 2 {v j+1 /x j+1 , ⋯, v r /x r } ∥ Q ∥ B 2 ] (Node) B → B ′ ∶ [B] → ∶ [B ′ ] (ParN) N 1 → N ′ 1 N 1 |N 2 → N ′ 1 |N 2 (ParB) B 1 → B ′ 1 B 1 ∥B 2 → B ′ 1 ∥B 2 (CongrY) Y ′ 1 ≡ Y 1 → Y 2 ≡ Y ′ 2 Y ′ 1 → Y ′ 2

Table 2: Reduction semantics (the upper part on nodes, the lower one on node components), where X ∈ {S, A} and Y ∈ {N, B}.

one relation. We assume the standard denotational interpretation [[E]] Σ for evaluating terms, while for the proof of the subject reduction results we resort to an instrumented denotational semantics for expressions, the values of which are pairs ⟨v, v⟩ where v is a concrete value and v is the corresponding abstract value:

[[v a ]] i Σ i = (v a , ν ◇ ) [[i a ]] i Σ i = Σ i (i) [[x a ]] i Σ i = Σ i (x) [[f a (E 1 , ⋯, E r )]] i Σ i = (f a ([[E 1 ]] i Σ i ↓ 1 , ⋯, [[E r ]] i Σ i ↓ 1 ), F τ (f a , [[E 1 ]] i Σ i ↓ 2 , ⋯, [[E r ]] i Σ i ↓ 2 )) [[{E 1 , ⋯, E r } a k ]] i Σ i = ({[[E 1 ]] i Σ i ↓ 1 , ⋯, [[E r ]] i Σ i ↓ 1 } k , ⌊{[[E 1 ]] i Σ i ↓ 2 , ⋯, [[E r ]] i Σ i ↓ 2 } b ⌋ d ) where b = Enc τ ([[E 1 ]] i Σ i ↓ 2 , ⋯, [[E r ]] i Σ i ↓ 2 )

The first two semantic rules implement the (atomic) asynchronous update of shared variables inside nodes, by using the standard notation Σ{−/−}. According to (S-store), the i th sensor uploads the value v, gathered from the environment, into its store location i. According to (Asgm), a control process updates the variable x with the value of E. The rules for conditional (Cond1) and (Cond2) are as expected. In the rule (A-com) a process with prefix ⟨j, γ⟩ commands the j th actuator to perform the action γ, if it is one of its actions. The rule (Act) says that the actuator performs the action γ. Similarly, for the rules (Int) for internal actions for representing activities we are not interested in. The rules (Ev-out) and (Multi-com) that drive asynchronous multi-communications are discussed in Section 3. The rule (Decr) tries to decrypt the result {v 1 , ⋯, v r } k of the evaluation of E with the key k 0 , and matches it against the pattern {E ′ 1 , ⋯, E ′ j ; x j+1 , ⋯, x r } k 0 . As for communication, when this match succeeds the variables after the semicolon ";" are assigned to values resulting from the decryption. The last rules propagate reductions across parallel composition ((ParN) and (ParB)) and nodes (Node), while (CongrY) is the standard reduction rule for congruence for nodes and node components.

A.2 Control Flow Analysis of IoT-LySa

Our CFA is specified in a logical form through a set of inference rules expressing the validity of the analysis results, where the function ⌊−⌋ d to cut all the terms with a depth greater than a given threshold d (with the special abstract values ⊤ b ) is defined as follows. ⌊⊤ The result or estimate of our CFA is a triple ( Σ, κ, Θ) (a pair ( Σ, Θ) when analysing a term) that satisfies the judgements defined by the axioms and rules of Tables 3 and 4.

We do not comment the clauses discussed in Section 4. The judgement ( Σ, Θ) ⊧ M a , defined by the rules in Table 3, requires that Θ( )(a) includes all the abstract values v associated to M a . In the case of sensor identifiers i a and values v a , Θ( )(a) must include τ (i, ) and τ (v, ), respectively, i.e. the corresponding taint classifications decided by the designer. The rule for analysing compound terms requires that the components are in turn analysed. The penultimate rule deals with the application of an r-ary encryption. To do that (i) it analyses each term M

Formally, abstract values are pairs in V defined as follows, where b ∈ B. V ∋ v ∶∶= abstract terms (⊤, b) abstract value denoting cut (see below) (ν, b) abstract value for clear data ({v 1 , ⋯, vn } k 0 , b) abstract value for encrypted data For simplicity, hereafter we write them as ⊤ b , ν b , {v 1 , ⋯, vn } b k 0 , and indicate with ↓ i the projection function on the i th component of the pair. We naturally extend the projection to sets, i.e. V↓ i = {v ↓ i |v ∈ V }, where V ⊆ V. In the abstract value ν b , ν abstracts the concrete value from sensors or computed by a function in the concrete semantics, while the first value of the pair {v 1 , ⋯, vn } b k 0 abstracts encrypted data. Since the dynamic semantics may introduce encrypted terms with an arbitrarily nesting level, we have the special abstract values ⊤

M a 1 11,...,M ar r − −−−−−−−− → N ′ when all the terms M a i i are evaluated inside node , and with N ⟨⟨v 1 ,...,v r ⟩⟩ −−−−−−−→ 1 , 2 N

Definition 4 . 7 .47Let N be a system of nodes with labels in L, and S = {S | ∈ L} the set of its sensitive sensors. Then N has no leaks w.r.t. S if N → * N ′ and, for all 1 , 2 ∈ L, there isno transition N ′ ⟨⟨v 1 ,...,v n ⟩⟩ − −−−−−−− → 1 , 2 N ′′ such that v i ↓ 2 ∈ { , } for some i.

Theorem 4 . 8 .48Let N be a system of nodes with labels in L, and S = {S | ∈ L} the set of its sensitive sensors. Then N has no leaks w.r.t. S if ( Σ, κ, Θ) ⊧ N , and ∀ 1 , 2 ∈ L such that ( 2 , ⟨⟨v 1 , ⋯, vr ⟩⟩) ∈ κ( 2 ) we have that ∀i. vi↓ 2 ∈ {◇, }.

Definition 4 . 9 .M a 1 1491Let N be a system of nodes with labels in L, S = {S | ∈ L} the set of its sensitive sensors, T = {T | ∈ L} the set of its tamperable sources, and P a set of program critical points. Then N does not use tainted values in a critical point if N → * N ′ and there is no transition N ′ ,...,M ar r − −−−−−−−− → N ′′ s.t. a i ∈ P and ([[M

Theorem 4 . 10 .410Let N be a system of nodes with labels in L, S = {S | ∈ L} the set of its sensitive sensors, T = {T | ∈ L} the set of its tamperable sources, and P a set of program critical points. Then N does not use tainted values in a program critical point if ( Σ, κ, Θ) ⊧ N , and Θ( )(a) ↓ 2 = {◇} for all labels a ∈ P and ∈ L.

Example 4 . 11 .411Back to our example, consider the critical point (temp / ∈ validRange(db)) a in the process P k . Our CFA detects that this decision is based on possibly tainted values, since Θ( 3 )(a) ↓ 2 = { }. More in detail, the analysis of the condition depends on the analysis of temp and on the one of validRange(db), which in turn depends on the one of db (temp and db are both untrustworthy, see Ex.4.3)

τ 1 1 1 1Table 3 :113(i, ) ∈ Θ( )(a) ( Σ, Θ) ⊧ i a τ (v, ) ∈ Θ( )(a) ( Σ, Θ) ⊧ v a τ (x, ) ⊗ Σ (x) ⊆ Θ( )(a) ( Σ, Θ) ⊧ x a ⋀ k i=1 ( Σ, Θ) ⊧ M a i i ∧ ∀ v1 , .., vr ∶ ⋀ r i=1 vi ∈ Θ( )(a i ) ⇒ ⌊{v 1 , .., vr } Enc τ (v 1 ,..,v r ) k 0 ⌋ d ∈ Θ( )(a) ( Σ, Θ) ⊧ {M a , Θ) ⊧ M i ∧ ∀ v1 , .., vr ∶ ⋀ r i=1 vi ∈ Θ( )(a i ) ⇒ ν F τ (f,v 1 ,..,v r ) ∈ Θ( )(a) ( Σ, Θ) ⊧ f (M a Analysis of labelled terms ( Σ, Θ) ⊧ M a .

1 ,1⋯, vr } b k 0 ⌋ 0 = ⊤ b ⌊{v 1 , ⋯, vr } b k 0 ⌋ d = {⌊v 1 ⌋ d−1 , ⋯, ⌊v r ⌋ d−1 } b k 0

i, and (ii) for each r-tuple of values (v 1 , ⋯, vr ) in Θ( )(a 1 ) × ⋯ × Θ( )(a r ), it requires that the abstract structured value {v 1 , ⋯, vr } b k 0 , cut at depth d, belongs to Θ( )(a), where b = Enc τ (v 1 , .., vr ). The special abstract value ⊤ b will end up in Θ( )(a) if the depth of the term exceeds d. The last rule is for the application of an r-ary function f . Also in this case, (i) it analyses each term M a i i , and (ii) for all r-tuples of values (v 1 , ⋯, vr ) in Θ( )(a 1 )×⋯×Θ( )(a r ), it requires that the abstract value ν b belongs to Θ( )(a), where b = F τ (f, v1 , .., vr ).

Table 4: Analysis of nodes ( Σ, κ, Θ) ⊧ N , and of node components ( Σ, κ, Θ) ⊧ B.

The judgements for nodes with the form ( Σ, κ, Θ) ⊧ N are defined by the rules in Table 4. The rules for the inactive node and for parallel composition are standard. The rule for a single node ∶ [B] requires that its internal components B are in turn analysed; in this case we the use rules with judgements ( Σ, κ, Θ) ⊧ B, where is the label of the enclosing node. The rule connecting actual stores Σ with abstract ones Σ requires the locations of sensors to contain the corresponding abstract values. The rule for sensors is trivial, because we are only interested in the users of their values. The rule for actuators is equally trivial, because we model actuators as passive entities. The rules for processes require to analyse the immediate sub-processes. The rule for decryption is similar to the one for communication: it also requires that the keys coincide. The rule for assignment requires that all the values v in the estimate Θ( )(a) for M a belong to Σ (x), once combined with the variable taint with the operator ⊗. The rules for the inactive process, for parallel composition, and for iteration are standard (we assume that each iteration variable h is uniquely bound to the body P ).

Detecting and preventing type flaws at static time CBodei LBrodo PDegano HGao Journal of Computer Security 18 2 2010 Static validation of security protocols CBodei MBuchholtz PDegano FNielson HRiisNielson Journal of Computer Security 13 3 2005 A step towards checking security in IoT CBodei PDegano G-L.Ferrari LGalletta Procs. of ICE 2016 s. of ICE 2016 EPTCS 2016 223 Where do your IoT ingredients come from? CBodei PDegano G-L.Ferrari LGalletta Procs. of Coordination s. of Coordination Springer 2016. 2016 9686 The cost of securing IoT communications CBodei LGalletta Procs.of Italian Conference on Theoretical Computer Science CEUR Proceedings s.of Italian Conference on Theoretical Computer Science 2016 1720 Taintdroid: An information-flow tracking system for realtime privacy monitoring on smartphones WEnck PGilbert SHan VasantTendulkar Byung-GonChun LPCox JJung PMcdaniel ANSheth ACM Trans. Comput. Syst 32 2 29 2014 Wait-free synchronization MHerlihy ACM Trans. Program. Lang. Syst 13 1 1991 Internet of things: a process calculus approach ILanese LBedogni MDi Felice Procs of the 28th Annual ACM Symposium on Applied Computing, SAC '13 s of the 28th Annual ACM Symposium on Applied Computing, SAC '13 ACM 2013 A semantic theory of the Internet of Things RLanotte MMerro Procs. of Coordination 2016 s. of Coordination 2016 Springer 2016 9686 Explicit secrecy: A policy for taint tracking DSchoepe MBalliu BCPierce ASabelfeld IEEE European Symposium on Security and Privacy

EuroS&P

2016. 2016 All you ever wanted to know about dynamic taint analysis and forward symbolic execution (but might have been afraid to ask) EJSchwartz TAvgerinos DBrumley 31st IEEE Symposium on Security and Privacy

S&P

IEEE Computer Society 2010. 2010 multi-output with no receivers and the inactive process, and for the fact that inactive components of a node are all coalesced. (1) (N / ≡ , |, 0) and (B/ ≡ , ∥, 0) are commutative monoids (2) µ h . X ≡ TZillner Semantics of IoT-LySa Our reduction semantics is based on the following Structural congruence ≡ on nodes and node components 2015 ZigBee Exploited. X{µ h . X/h} for X ∈ {P, A, S} (3. ⟩⟩ ∶ ∅. 0 ≡ 0 least relation on nodes and its components satisfying the set of inference rules in Table 2. For the sake of simplicity, we use We have a two-level reduction relation → defined as the