Proceedings of the 2nd Workshop on Artificial Intelligence and Formal Verification, Logics, Automata and Synthesis (OVERLAY), September 25, 2020 On the Formalization of Decentralized Contact Tracing Protocols P. A. Abdulla1 , M. F. Atig2 , G. Delzanno3 , M. Montali4 , and A. Sangnier5 1 Uppsala University 2 Uppsala University 3 DIBRIS, University of Genova 4 Free University of Bozen 5 IRIF, Universitè Paris Denis Diderot Abstract We present a preliminary formalization based on transition systems of de- centralized contact tracing protocols for smart devices equipped with Bluetooth trasmitters. In our model the behaviour of individual users, via their app, is modelled as a timed automata with a local unbounded memory. Protocol config- urations consist of the current state of a shared server and a finite set of local states containing the states of individual users. The transition system models the interaction between devices in the same physical location and between a sigle device and the shared server. In the paper we address different research directions concerning semi-automated verification based on automated reasoning tools of the considered class of protocols, theoretical issues related to the expressiveness of the resulting class of formal models, and data-driven analysis of the logs collected on the server as well as on user devices. 1 Introduction The goal of contact tracing is to make people aware of possible contacts with positively diagnosed people so that they should possibly have an infection test [16]. For pandemic diseases, reporting a positive diagnosis is mandatory. Contact tracing methods, e.g., smartphone apps, are aimed at investigating who could have been contaminated by a positively diagnosed person and alert them. After the COVID-19 pandemic, several protocols have been proposed as an automated support for this task. The Pan-European Privacy-Preserving Proximity Tracing (PEPP-PT) proposed both centralized and decentralized solutions. As a decentralized solution, they propose the Decentralized Privacy-Preserving Proximity Tracing (DP3T), and the specifications of ROBERT4 and NTK5. Both centralized and non-centralized systems require a central server for alerts and typically differ in the way ephemeral identifiers are generated. In general the system architecture is based on a smart app, a notification server as in Publish/Subscribe architectures, and a possibly trusted server. An important notion here is that of ephemeral identifiers, i.e., identifiers that are frequently changed either by a trusted authority/server or by the app itself in order to reduce the risk of a malicious use of private user data from third parties (e.g. the Paparazzi’s attack [3, 16]). Ephemeral identifiers are kept locally in the smart app. In centralized versions of the protocol, ephemeral Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). identifiers are downloaded from a trusted server. In decentralized versions, ephemeral identifiers are created by the app. The smart app installed on the user device, e.g. the Immuni app in Italy, makes use of the Bluetooth interface to broadcast the currently valid ephemeral identifier. The communication range of the Bluetooth interface is tuned so as to reach the distance for a possible contagious contact. The app collects all identifiers received by other apps located in the same physical place and stores them locally. Identifiers whose temporal marks fall out of the incubation period are deleted from memory. When users are diagnosed to be infected, they can release a report to the shared server. The server is used to make the report (a log containing ephemeral identifiers) available to each user interested in checking potential contacts with infected users. This step can be implemented in several different ways depending on the system architecture and on the role of the authority in the whole process. One possibility is to send the report to a central log database. Users can then consult the database to check if they crossed their way with infected users, i.e., they share some ephemeral identifier with one of them. Taking inspiration from [1, 10, 11, 12, 9, 8], in this extended abstract we present a preliminary formalisation of decentralized contact tracing protocols based on transition systems, a mathematical tool typically used in abstract semantics of concurrent and distributed systems. In our specification, users apps are modelled as timed automata with a local memory. System configurations consist then of the current shared server state (a sort of global memory) and a multiset of local states of user instances. In the first part of the paper we describe a mathematical presentation of contact tracing protocols aimed at introducing a formal language to describe protocol properties and possible verification procedure for the resulting models. In the last part we highlight possible future directions concerning both semi-automated verification and more theoretical issues. 2 Contact Tracing Systems In this section we introduce a possible formalization of a decentralized contact tracing protocol. To specify the protocol rules and behavior (computation), we will use a transition system defined on configurations consisting of a set of states of individual users. User configurations contain a local memory needed to store the list of broadcasted and collected identifiers. System configurations contain a global memory that gathers user logs and a set of user states that corresponds to the current number of registered indivuals. To simplify the model, we fix the number of agents and simulate dynamic injection by randomly selecting the initial value of local clocks used to define validity of ephemeral identifiers (i.e. local clocks may have different values). We will use the standard notation ∪, ∩, \, ⊆, ⊂, and ∈ to indicate set operations. Furthermore, we will use P(D) to indicate the powerset of D. For simplicity, for an element e ∈ D and a set M , we will use e ∪ M as an abbreviation for {e} ∪ M . Finally, we will use he1 , . . . , en i to denote tuples of elements in D. In the rest of the paper we assume that users are uniquely identified via a denumerable set U of identifiers (user identifiers assigned upon registration). Similarly, we assume that ephemeral identifiers are represented via a denumerable set E of labels. Both sets come equipped with standard comparison operators such as = and 6=. Furthermore, we introduce a timestamp domain T needed to identify validity periods of ephemeral identifiers. For simplicity, we consider days as time units and model time elapsing transitions according to them. Elements in T are totally ordered and can be compared with equality = and with the less than ordering <. As a convention, we add a special value ⊥ (undefined time instant). We fix a parameter δ > 0 for defining validity periods of ephemeral identifiers. Namely, ephemeral identifiers are considered to be valid if and only if they have been generated in the period [ct − δ, ct], where ct is the current time. The parameter δ represents the estimated incubation period of the virus. Identifiers with time stamp ct − δ need not to be considered in the query step. We will also fix a second parameter γ with 0 < γ < δ to denote the frequency for rotating ephemeral in the user app. Rotation is needed for security reasons, e.g., to reduce the risk of data leakage (visited locations, contact details). During the validity time of an ephemeral identifier, the same user can broadcast several different identifiers (δ/γ fresh identifiers). All these identifiers must be maintained in memory to be used when crosschecking them with identifiers of other users. 70 2.1 Transition System Contact tracing systems are designed so as to be equipped with a shared (centralized or distributed) server that is used to propagate reports to all registered users (a sort of notification system as in Publish/Subscribe architectures). The server can be viewed as a (distributed) register in which reports of positive users are first collected and then made publicly available to other users for inquiries about their health status. To abstract away from possible implementations, we model the shared server as a global memory defined via the map G : U × E → T such that G(u, e) denotes the last time user u generated or received (depending on the type of adopted protocol) the ephemeral identifier e. We assume here that identifiers are unique and that they are generated in a given day (a timestamp included in the report). Users (or, better, user apps) can be modelled as individual processes in a concurrent system. Processes interact by sending and receiving messages (identifiers) only when they are in the same physical location. For the sake of this presentation, we abstract away from physical locations and model physical contact using non-determinism, i.e., at each step we non-deterministically select the set of users that receive the identifier broadcasted by a given app. In a sense, a receiver group can be viewed as an abstract representation of a physical location. Users need to maintain in local memory the current ephemeral identifiers, updated every γ time units, as well as the sets of broadcasted and received identifiers. Based on all the above mentioned assumptions, user states are tuples of the following form: hq, u, e, c, S, Ri where • q ∈ {0, 1, 2} is the user status: negative, possibly positive, and positive, respectively; • u ∈ U is the user id; • e ∈ E is the current valid ephemeral identifier that the user can broadcast to his/her neighbors; • c ∈ T is the current value of the timer used to update the identifier; • S : E → T and R : E → T are mappings that represent the local memory, namely the set of sent and received identifiers by user U . Timestamps are used here for checking their validity. – If S(e) > 0, than S(e) maintains the last time user u sent e; – if R(e) > 0, than R(e) maintains the last time user u received e. We will use C to denote the set of all possible user states defined on U and E, and P(C) the powerset of C. The global state of a contact tracing system can be defined as a tuple hG, ∆, ti, where G is the server state, ∆ ∈ P(C) is a set of user states, each one carrying a unique user identifier, and t ∈ T is the current (global) time. We assume here that in the initial global state t = 0, user states contain distinct ephemeral identifiers, and local timers are initialized with random values to simulate injections of users at different times, and the local mappings S and R of user states are such that S(x) = R(x) = ⊥ for all x ∈ E. We now define a transition system to describe the operational semantics of contact tracing protocols. In this preliminary study, we abstract away from cryptographic operations and focus on data management steps. To simplify the definitions, for ∆ ∈ P(C), we will use ∆ ⇐ x, t to denote the set obtained by updating every user state σ = hq, u, e, c, S, Ri ∈ ∆ into the new user state σ 0 = hq, u, e, c, S, R0 i s.t. R0 (x) = t and R0 (y) = R(y) for every y 6= x. Similarly, for σ = hq, u, e, c, S, Ri ∈ C we will use σ ⇒ x, t to denote the new user state hq, u, e, c, S 0 , Ri s.t. S 0 (x) = t and S 0 (y) = S(y) for y 6= x. Let us first consider the phase in which a device broadcasts ephemeral identifiers to other devices in the same location. Let ∆ ∈ P(C) be the current set of user states. We non-deterministically single out the sender user σ ∈ C and a group of receivers Π ∈ P(C), containing only users in state q < 2, by requiring that ∆ = σ ∪ Π ∪ Γ for σ ∈ ∆, Π, Γ ∈ P(C). The effect of the transitions is to update the state of sender and receivers: hG, σ ∪ Π ∪ Γ, ti → hG, σ 0 ∪ Π0 ∪ Γ, ti where σ = h0, u, e, c, S, Ri, Π contains only user states with local state q < 2, σ 0 = (σ ⇒ e, t), and Π0 = (Π ⇐ e, t). Global and local memory can be reset from time to time as specified in the following three rules. hG, ∆, ti → hG0 , ∆, ti 71 where, for every u ∈ U, e ∈ E, G0 is obtained by setting G0 (u, e) = ⊥ if G(u, e) < t−δ, and G0 (u, e) = G(u, e) otherwise. hG, σ ∪ ∆, ti → hG, σ 0 ∪ ∆, ti where σ = hq, u, e, c, S, Ri and σ 0 = hq, u, e, c, S 0 , R0 i s.t., for every e ∈ E, S 0 (e) = ⊥ if S(e) < t − δ, and S 0 (e) = S(e) otherwise, and R0 (e) = ⊥ if R(e) < t − δ, and R0 (e) = R(e) otherwise. We consider now the report phase in a protocol formulation in which users who are diagnosed as positive submit as a report a copy of the log of all broadcasted identifiers i.e. they copy S in the global memory G. Other formulations are possible, e.g., the report may consist of the mapping R instead of S. hG, σ ∪ ∆, ti → hG0 , σ 0 ∪ ∆, ti where σ = h0, u, e, c, S, Ri, σ 0 = h2, u, e, c, S, Ri and, G0 (u, x) = S(x) for every x ∈ E, and for every v 6= u and y ∈ E, G0 (v, y) = G(v, y). We assume here that users in state 2 need to register to re-enter, i.e., return to state 0, into the system. Let us now consider the query phase. In this phase users crosscheck the set of received identifiers with the global memory. If valid ephemeral identifiers are contained in the intersection with a time stamp compatible with the incubation period, than the user updates its state to 1. hG, σ ∪ ∆, ti → hG, σ 0 ∪ ∆, ti where σ = h0, u, e, c, S, Ri, there exists v, x s.t. G(v, x) = R(x) = s, t − δ ≤ s ≤ t and σ 0 = h1, u, e, c, S, Ri. We assume here that users in state 1 need to register again into the system to return to state 0. The final part of the specification is related to time elapsing steps and ephemeral identifiers update. We model this step by incrementing global time, local clocks, and controlling whether or not the validity time of identifiers expired. hG, ∆, ti → hG, ∆0 , t + 1i where ∆0 is obtained from ∆ as follows. Assume that ∆ = {σ1 , . . . , σk }. Furthermore, let F = {x1 , x2 , . . . , xk } be a set of fresh identifiers, i.e., such that they do not occur in ∆. We define ∆0 = {σ10 , . . . , σk0 } in such a way that for each i : 1, . . . , k if σi = hq, u, e, c, S, Ri, then • if q = 0 then – if c = γ − 1, then σi0 = hq, u, xi , 0, S, Ri, i.e., we update the current ephemeral identifier with a fresh value; – otherwise if c < γ − 1, then σi0 = hq, u, e, c + 1, S, Ri, i.e., we update the local counter. • if q > 0, then σi0 = hq, u, e, c, S, Ri, i.e., we leave the user state unchanged. A computation is a sequence of configurations σ0 σ1 . . . that, starting from an initial configuration σ0 with a certain number of users, contains configurations such that σi → σi+1 for i ≥ 0. The above described formal semantics can be extended in order to model different ways to handle fresh identifiers creation (via a shared server, etc) as well as to model explicitly physical locations and user movement from location to location. The proposed transitions system however can already by used as an abstract representation of a decentralized version of the protocol. As an example, consider a configurations σ0 = hG, ∆, 0i in which ∆ consists of three user states a, b, and c in their initial state (local state equal to 0, empty memory, distinct ephemeral identifiers ia , ib , and ic respectively). We can then apply a broadcast step to a sender a and a non-deterministically selected subset of ∆, e.g. {b, c}. User a will update Sa s.t. Sa (ia ) = 0 whereas b, c will update their R memories s.t. Rb (ia ) = Rc (ia ) = 0. Starting from the new configuration σ1 , we can apply k time elapsing steps and reach configuration σ2 with clock k. We can then apply another broadcast step to sender b and receiver a so that Sb (ib ) = Ra (ib ) = k. We can then apply a report step moving a to state 2 and copying its memory S to G, thus having G(a, ia ) = 0. In the query step user b will move to state 1, i.e., receives a notification by the server, only if k ≤ δ. 72 3 Verification and Data-driven Analysis The formalization of contact-tracing protocols has several interesting applications. First of all, we plan to apply Automated Reasoning techniques based on SMT solvers in order to validate a parameterized formulations of decentralized versions of the protocol in the input language of Infinite-state Model Checkers such as Cubicle as discussed in previous works on distributed algorithhms [6], and Publish/Subscribe architectures [8]. In this setting it is possibile to exploit the combination of the Theory of Arrays and Arithmetics to finitely express and then validate automatically such a protocol for an arbitrary number of users. In this context the idea is to represent the state of individual users as cells of an unbounded arrays possibly containing a subfields unbounded arrays representing their local memory. Indeed, in Cubicle it is possible to express transition systems operating on multidimensional unbounded arrays. In the considered case study, the set of user states can be represented as unbounded tables indexed on process identifiers and, for each process, the local memory can be represented again as an unbounded table e.g. containing the list of sent or received identifiers. In tools such as Cubicle, and more in general in several other examples of infinite-state model checkers, symbolic search procedures can be applied in order to validate the protocol specification againts safety and restricted types of liveness properties. As an example of correcteness property, let us consider the following correspondence property (typical of cryptograohic protocols): if user u is in state 1, then there exists a positive user v that emitted an identifier in state t received by u at time t0 s.t. t − t0 < δ or sim. This property can be formulated as a coverability problem Integrity properties can also be modelled by modeling malicious agents and equipping them with the possibility of sniffing data sent by the users, e.g., identifiers, reports, user credentials in order to obtain information on user position and contacts etc. Concerning computability issues, it seems possible to define further abstractions or restricted versions of the proposed formalization in order to exploit decidability and complexity results for ad hoc networks (they model receivers groups via broadcast messages) [10, 11, 1], and asynchronous broadcast networks (they model local memories) [13]. Furthermore, it would be interesting to consider abstractions based on Population Protocols [14] a class of concurrent systems that could be used to model a pandemic behaviour in a population. Another interesting connection arises from the fact that this class of protocols appears to naturally enjoy various boundedness notions that have been studied towards the verification of data-aware business processes [5]. On the one hand, ephemeral identifiers are subject to expiration, and consequently are akin to fading fluents in [7], and go well with the notion of recency bound in [2]. On the other hand, while a user can in principle encounter unboundedly many other users over the whole timeline, it is reasonable to assume that only boundedly many will be met in a fixed timespan, a property resembling that of bound-by-resources in [15]. Data-driven analysis of the user and server logs are also an interesting research direction more related to more classical Artifical Intelligence techniques. In particular data mining and prediction techniques could be applied here in order to infer potential clusters and links between them as well as to extract prediction models for the diffusion of the virus as proposed by Yoshua Bengio in [4]. References [1] P. A. Abdulla, G. D., O. Rezine, A. Sangnier, and R. Traverso. On the verification of timed ad hoc networks. In Formal Modeling and Analysis of Timed Systems - 9th International Conference, FORMATS 2011, Aalborg, Denmark, September 21-23, 2011. Proceedings, volume 6919 of Lecture Notes in Computer Science, pages 256–270. Springer, 2011. [2] Parosh Aziz Abdulla, C. Aiswarya, Mohamed Faouzi Atig, Marco Montali, and Othmane Rezine. Recency-bounded verification of dynamic database-driven systems. In Tova Milo and Wang-Chiew Tan, editors, Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 195–210. ACM, 2016. 73 [3] G. Avitabile, V. Botta, V. Iovino, and I. Visconti. Towards defeating mass surveillance and sars-cov-2: The pronto-c2 fully decentralized automatic contact tracing system. IACR Cryptol. ePrint Arch., 2020:493, 2020. [4] Y. Bengio. Peer-to-peer AI-tracing of COVID-19. https://yoshuabengio.org/2020/03/23/ peer-to-peer-ai-tracing-of-covid-19. [5] Diego Calvanese, Giuseppe De Giacomo, and Marco Montali. Foundations of data-aware process analysis: a database theory perspective. In Richard Hull and Wenfei Fan, editors, Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2013, New York, NY, USA - June 22 - 27, 2013, pages 1–12. ACM, 2013. [6] S. Conchon, G. Delzanno, and A. Ferrando. Declarative parameterized verification of distributed protocols via the cubicle model checker. Accepted for publication in Fund. Inf., 2020. [7] Giuseppe De Giacomo, Yves Lespérance, and Fabio Patrizi. Bounded situation calculus action theories. Artif. Intell., 237:172–203, 2016. [8] G. Delzanno. Towards the automated verification of publish/subscribe networks. In Proceedings of the 1st Workshop on Artificial Intelligence and Formal Verification, Logic, Automata, and Synthesis, co-located with the 18th International Conference of the Italian Association for Artificial Intelligence, OVERLAY@AI*IA 2019, Rende, Italy, November 19-20, 2019, volume 2509 of CEUR Workshop Proceedings, pages 35–40, 2019. [9] G. Delzanno, A. Sangnier, and R. Traverso. Parameterized verification of broadcast networks of register automata. In RP, pages 109–121, 2013. [10] G. Delzanno, A. Sangnier, and G. Zavattaro. Parameterized verification of ad hoc networks. In CONCUR 2010 - Concurrency Theory, 21th International Conference, CONCUR 2010, Paris, France, August 31-September 3, 2010. Proceedings, pages 313–327, 2010. [11] G. Delzanno, A. Sangnier, and G. Zavattaro. On the power of cliques in the parameterized verification of ad hoc networks. In Foundations of Software Science and Computational Structures - 14th International Conference, FOSSACS 2011, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2011, Saarbrücken, Germany, March 26-April 3, 2011. Proceedings, pages 441–455, 2011. [12] G. Delzanno, A. Sangnier, and G. Zavattaro. Verification of ad hoc networks with node and communication failures. In FORTE/FMOODS’12, volume 7273 of LNCS, pages 235–250. Springer, 2012. [13] G. Delzanno and R. Traverso. Decidability and complexity results for verification of asynchronous broadcast networks. In Language and Automata Theory and Applications - 7th International Con- ference, LATA 2013, Bilbao, Spain, April 2-5, 2013. Proceedings, volume 7810 of Lecture Notes in Computer Science, pages 238–249. Springer, 2013. [14] J. Esparza, P.Ganty, J. Leroux, and R. Majumdar. Verification of population protocols. Acta Informatica, 54(2):191–215, 2017. [15] Marco Montali and Andrey Rivkin. Model checking petri nets with names using data-centric dynamic systems. Formal Asp. Comput., 28(4):615–641, 2016. [16] Serge Vaudenay. Centralized or decentralized? the contact tracing dilemma. IACR Cryptol. ePrint Arch., 2020:531, 2020. 74