=Paper=
{{Paper
|id=Vol-2699/paper29
|storemode=property
|title=Graph Analytics on Proximity Data to Fight Contagion
|pdfUrl=https://ceur-ws.org/Vol-2699/paper29.pdf
|volume=Vol-2699
|authors=Snehasis Banerjee,Vivek Chandel,Avik Ghose
|dblpUrl=https://dblp.org/rec/conf/cikm/BanerjeeCG20
}}
==Graph Analytics on Proximity Data to Fight Contagion==
Graph Analytics on Proximity Data to Fight Contagion Snehasis Banerjeea , Vivek Chandela and Avik Ghosea a TCS Research & Innovation, Tata Consultancy Services, Kolkata, India Abstract As social distancing has become the new normal, it is eminent for enterprises and administrations to make decisions based on analytics on human proximity data. Graph Analytics can serve as an essential technology to understand human network behaviors and contact tracing inferences in pandemic and post-pandemic scenarios. A large amount of work has happened on epidemiology of pandemics. However, a formal application of graph analytics backed by visualization and mapping of graph algorithms has remained neglected in pandemic scenarios. This paper has addressed this important need backed by graph modeling, methods and analytics. Keywords graph analytics, contact tracing, COVID-19 pandemic, human proximity data person can be at a location for a certain time which is visited by another person after some time β this events will be missed in standard proximity detection solu- tions. Also, depending on the infection type (in this case COVID-19), different persons and places will have different levels of susceptibility to risks, based on the infectionβs properties and incubation period. Hence it is important to have: (a) a graph model to store and represent proximity events and related attributes, that is robust and scalable (b) fast techniques that can aid in contact tracing (c) methods that can help risk estima- tion by applying graph analytics. This paper presents a system and several methods to tackle the prior points. As seen in Fig. 1, the system intakes proximity re- Figure 1: System Architecture for enabling Graph Analytics lation data (either person to person or person to zone) from a multi-modal sensory input. It can be GPS (Global Positioning System) and/or Bluetooth Low Energy (BLE) 1. Background present in contact tracing smart devices (like phone and watches) or it may be a multitude of hard sen- Social distancing [1] has become an essential need to sors like BLE and LoRa (long range low-power net- manage pandemics like COVID-19. In recent times, work communication) beacons or Radio-frequency iden- the world has witnessed deployment of multiple apps tification (RFID) tags. Crowds (susceptible places of and systems by governments [2]. Contact tracing and high proximity events) at a specific location can be de- related methods are expected to remain prevalent in a tected by radar, microphone captured sound level, in- post pandemic scenario. Adoption in enterprise is ex- frared, Wifi signal as well as soft sensors like location pected to be of the order of mass scale. However, the tagged social media feeds. Sensor feeds are stored in enterprise level settings will be different from a gov- a relational database management systems (RDBMS, ernment monitored contact tracing scenario. As an ex- such as PostGres with PostGIS 1 having spatial index- ample, in a factory, same device may get shared across ing support) in form of proximity event tuples : multiple workers in rotation. Also, only person to per- { entity, entities detected, timestamp, optional attributes}. son proximity events will not be futile, as, an infected Based on pre-set business logic (like duration of aggre- Title of the Proceedings: Proceedings of the CIKM 2020 Workshops, gation), data from RDBMS is updated periodically in a October 19-20, Galway, Ireland. graph database (like OrientDB2 ). Complex queries and Editors of the Proceedings: Stefan Conrad, Ilaria Tiddi. graph analytics are run on the graph from the front- snehasis.banerjee@tcs.com (S. Banerjee) end application to get insights and visualization. 0000-0001-6497-2085 (S. Banerjee) Β© 2020 Copyright for this paper by its authors. Use permitted under Creative 1 https://www.postgresql.org ; https://postgis.net Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) 2 https://www.orientdb.org Proceedings 2. Graph Data Population While the individual events of proximity instances are stored in a RDBMS, the actual inference and analytics happens over the graph database. As any active sen- sor consume power, hence keeping a sensor device al- ways online which sends events to server is impracti- cal. Hence, there is a sleep and wake cycle, when prox- imity events are detected and sent to server database at intervals. However, to perform graph analytics, the data instances need to aggregated to populate the graph with meaningful information. Therefore, if there is no gap greater than a preset time threshold T in terms of two persons being in contact, the proximity event in- stances becomes a continuous event. When the gap is Figure 2: Modeling of nodes and edges in a proximity graph more than T, a new entry is entered in the graph in a similar fashion. Sometimes, due to sensor errors and signal reflection, false proximity events can be recorded. However, as this is two way detection (person 1 de- is how a person proximity event is generated. A per- tecting person 2 and vice versa when they are in close son can have multiple tagged sensors (like two smart contact of say 2 metres) [3], if there is only one event phones). Same sensor device (with a unique device ID) detected, that is dropped from graph insertion β this is can be used by various persons in disjoint time frames. a configurable setting as per sensitivity needs of spe- (3) Static zone tagged sensor: fixed sensors like bea- cific deployments. In order for wake and sleep cycles cons in elevators, office areas, cabins, canteens, gym, across sensors to be consistent, a standard time gap is shops, etc. helps in proximity event detection in two maintained across devices, which is typically a sleep ways: (a) Any detected proximity event confirms posi- of 30 seconds and a wake of 5 seconds synchronized to tively if two or more persons were already reported by 00:00:00 hours of UTC (Coordinated Universal Time). person tagged sensors. (b) Usually infectious diseases The data in relational database and graph is deleted stay active at a space for some period of time. There- based on some pre-set expiry timestamp to maintain fore, a proximity event with a person at time T1 and a sliding window based on business logic or infection another person with time T2 can trigger an indirect risk period (usually 30 days for COVID-19 scenario). proximity link between the two persons, if |π 1 β π 2| Geo-spatial queries may be directly run on spatially in- are within the infectious period relevant to that space. dexed graph database or be optionally run on the spa- (4) Dynamic zone tagged sensor: sensors tagged with tial layer of the relational database (to use spatial geo- transport can keep track of proximity events that oc- metric algorithms) by entangling with graph queries, cur when a person stays in that zone. It is important to however, in enterprise scenarios (small geo-location keep track of the time span a person is in this dynamic coverage), such needs are found to be rare. zone, as the entry and exit times will reveal the spatial zones of entry and disembarkment. This needs to be matched with other persons having overlapping prox- 2.1. Entities as Graph Nodes imity events but different entry or exit points. Conta- A necessary part of graph modeling is identification gion can be tracked in the related entry and exit spaces of entities and its representation for fast graph queries (like bus stops, airports, parking lots, stations) even if and traversal. Here, the nodes are of 4 types: those spaces have bad sensor coverage. (1) Person: humans having features (attributes) like Locations can be sensed from GPS or cellular tower age, risk factors that remain static over a period of estimates in outdoor settings. In indoors, various lo- time. Dynamic attributes are derived or calculated over calization techniques leveraging radar, RFID, Wifi, BLE an interval β risk estimation, proximity activity and can be used. There are other possible node types specif- health estimate. A necessary attribute of a person in a ically mobile objects (like parcels, posts, shopping items, contagion scenario is the personβs current health sta- currency notes) that can play a part in the infection tus - infected, healthy or susceptible. spread. However, as proper sanitization practices can (2) Person tagged sensor: sensors like wearable smart mitigate these occasional object based contagion risks, devices and smartphones are tagged to a person β that it is kept out of scope of the paper. 2.2. Proximity Events as Graph Edges As presented in Fig. 2, a new proximity event is en- tered into the graph as an edge with start time (prefer- ably UTC timestamp) and time duration (say seconds) as necessary attributes. In the graph, relations are main- tained between persons and not devices (bearing sen- sors) as same device may be used by different persons over a period of time. So person and zone resolution at the time of graph population is of high importance β proximity event edge keeps an attribute source (like Figure 3: Visualization of proximity relations (with number device sensor) to handle any future backtrack queries. of interactions and duration) given a person and depth level Indirect proximity events are inferred using transitive relation of person and zones with timestamp constraints. By nature humans are mobile, hence it makes sense to duration of contact, the number of instances of con- keep a private record of all other nodes (places and tact, number of infected visiting a specific zone and the other persons) visited within a sliding interval of time time spent there. Other queries can detect crowds at period as edges. Because, two nodes in this graph can locations based on large number of temporally collid- have multiple edges (proximity instances) and prox- ing proximity events β those zones need to be marked imity events are by definition both way, the resulting as βat riskβ zones. Crowd detection will be based on graph is an undirected Multigraph. Due to social dis- spatial queries on spatio-relational database (say Post- tancing measures, this graph will be sparse and have GIS) if dealing with latitude and longitude values (GPS). local community clusters. Hence, adjacency list is the best way to represent the relations between |V| ver- tices (nodes) and |E| edges, resulting in 2*|E| space com- 3.2. Centrality Measures plexity. A hash index on the nodes will aid in O(1) lookup and fast adjacent node traversals. In a sparse In managing pandemics and enforcing social distanc- graph with community clusters, graph traversal us- ing, it becomes crucial to find the specific zones that ing breadth first search is the optimal approach, which are hotspots and which persons have highest capabil- takes time around (ππ + πΈπ ) β ππ /π·π , for a depth up to ity to spread infection. Usage of graph centrality mea- level ππ . ππ and πΈπ are respective vertices and edges at sures helps in identification of critical nodes. a community cluster in the graph of cluster diameter Degree centrality, measured by the number of unique π·π (largest edge distance or number of nodes in path edge links of a node to other nodes can be thought of between two nodes in that cluster or sub-graph). the relative risk of getting infected if the local commu- nity sees an infection outbreak. Closeness centrality of a node is: C(i) = β ππ π , where 3. Graph Analytics ππ π is the geodesic distance from node i to node j (num- ber of links in the shortest path from node i to node Graph analytics deals with pairwise relationship be- j). Closeness reveals how long it takes for infection to tween two entities at a time; as well as structural char- flow from one node to others in the graph. acteristics of the graph as a whole. Graph Analyt- Betweenness centrality signifies that a node is cru- ics also includes visualization aspects, insights gained cial or forms a bridge for spreading of infection. Exam- from complex query execution and graph algorithms. ples are person nodes who visit many other nodes like doctor, courier and security staff. Examples of bridge 3.1. Graph Queries and Visualization space nodes are market and transport zones. Between- ness centrality is measured as, b(i) = β ππ π π /ππ π , where In contact tracing scenario, one of the primary require- π is number of shortest paths from node j to node k ment is to find the chain of persons an infected person π π (j and k β i) and gπ π π is number of shortest paths from has come across. This can be done by graph traver- node j to node k passing through node i. sal and the results can be showed in front-end visual- There are other graph centrality measures [4] like ization as shown in Fig. 3. For privacy reasons, the Eigenvector Centrality. However, it requires calcula- exact identity of a person is not disclosed and an en- tions over adjacency matrix where as the graph model crypted alias is used. Other related graph queries are shown is based on adjacency list β hence not optimal. time overlap between visits to specific zones, the total 3.3. Graph Structure Measures Proximity events between 2 persons represent a Dyad in graph, where as proximity event between 2 persons and a mutually visited zone forms a Triad. This are the simplest forms of graph structures. Eccentricity is the maximum distance from a given node to all other nodes in a graph. The diameter of Figure 4: Importance of Community Clusters in Contagion a graph is the maximum eccentricity value. Eccen- tricity calculation aids in understanding contagionβs spread from one node to the others. If infection prop- divisive, agglomerative, and optimization algorithms. agates from a node A towards some specific nodes in Some popular algorithms [5] are K-cliques, hierarchi- the graph in the smallest number of steps, it means cal clustering, spectral bisection, graph partioning, mod- that the node A has better propagation efficiency. ularity optimization and link communities. One of the Density serves as an important graph structure mea- methods suitable for large graphs is Louvain Method sure, as more dense a graph is (more connected nodes), [6] that uses graph modularity and has time complex- the more the risk of infection spreading. Density is cal- ity as O(|V|log|V|). Modularity is a measure of the struc- culated as: D(G) = 2 β |πΈ|/(|π | β (|π | β 1)). ture of the graph and is defined as, Q = ( |Eπ π | - |Eπ πβ A clique is a graph (or subgraph) in which every π | ) / E , where |Eπ π | is the number of links connect- node is connected to every other node. Finding maxi- ing nodes that belong to the same community and |Eπ mum cliques in a graph provides the largest set of com- πππ | is the estimated |Eπ π |, if links were random. The mon links β hence, strong communities can be discov- Louvain Method works as follows: (1) By optimizing ered with the following risk β if any one is infected, modularity locally on all the nodes, small communities it will result in a high probability of others also get- of nodes are found. (2) Small communities are then ting infected. Another structure, K-cores, which are grouped into one node, and the first step is repeated. not necessarily cohesive groups, but indicate areas of a graph which contain clique-like structures. Clustering coefficient measures the drift of nodes to 3.5. Contagion and Risk Propagation form dense subgraphs. The clustering coefficient C of Some work has been done on contagion and risk es- a node i shows how its neighbors are connected with timation [7] [8]. However, they are dependent mostly each other. The local (node) clustering coefficient is on external factors to estimate the risk. Here, a method calculated as follows: Cπ = 2*Tπ / ( Kπ * (kπ - 1) ), where is presented that uses a combination of graph model- Tπ is the number of distance triangles with node i and ing, graph structure measures and infection properties Kπ * (kπ - 1) is the maximum number of possible con- to estimate risk for a person or community. Conta- nections in neighbors of i. A large C implies that the gion is dependent on physical proximity, hence net- graph is well connected locally to form a cluster. Av- work diffusion techniques can be helpful here. A re- erage graph clustering coefficient is: C = 1/n * β πΆπ . lated problem is Link Prediction: the task here is to predict whether there will be a link between two nodes, 3.4. Community Detection provided that link does not pre-exist. Node similari- ties can be found using measures [9] for analyzing the In contact tracing and related analytics, it is important proximity of nodes in a graph such as a degree distri- to discover communities from the proximity events so bution, common neighbors, preferential attachment, that at risk communities can be identified and infected Jaccard coefficient and Leicht-Holme-Newman Index. communities can be kept in isolation. As seen from Highly influential nodes (hubs) in the graph (as high- Fig. 4, number of communities increase with the num- lighted in Section 3.2) enable a swift spread of an in- ber of proximity event instances. Also the density of fection through the graph. On the other hand, poorly the communities increases reflecting closely connected connected or periphery nodes would relax the spread- contacts. The task of community detection (or graph ing of infection and permit only a tiny portion of the clustering) is to discover subsets of nodes (clusters) of graph to get exposed to the diffusion. Graphs with lots connected communities in which nodes have many in- of localized clusters (in case of highly fragmented low- ternal edges and few external edges. Detecting com- density graph) may limit the spread of an infection munities in a graph is NP-complete. Community de- and will face hurdles in establishing any momentum, tection methods can be divided into three categories: whereas dense graphs with few gaps (or few boundary spanners) would assist the dissemination of contagion. Result: Risk Propagation to a specific depth Earlier discussions in the paper were around find- under specified time constraint. ing risks in the graph structure, nodes and communi- Parameters: ties. Certain techniques like belief propagation and la- V β set of vertices or nodes in the graph G; bel propagation are popular in general graph analytics, E β set of edges or links in G; however in this scenario, what is needed is a knowl- A β external inputs like node health (Aβ ) and edge wrapper on breadth first algorithm which can attributes like degree; specified with subscript; match the scale of incoming proximity event streams. Aπ β attribute to store last risk propagation Finally the paper presents an approach to propagate node label; risk across proximity graph (Algorithm 1). Due to stor- software β (.) link to software modules and age of last propagation update attribute at each con- libraries; nected node, if there is a sudden stop due to server load M β diameter of graph, i.e. maximum or time constraints, the risk propagation can start from eccentricity value; last checkpoint. This avoids updating the same node x, y, z β damping factors to balance risk risk attribute more than once. The risk estimate propa- propagation bias; gation deals with three damping factors: x β takes care Begin: of contagion speed and is determined by disease stage D β 0 // current graph traversal depth level; of infected person (incubation, prodromal, illness, de- v β V(i) β node i of set V is infected; hence cline and convalescence), y β to handle influence of Aβ (i) = infect; selected centrality measure, z β to handle relative com- v(Aπ ) β 1 * x * w // infected person with munity densityβs impact on risk. weighted risk w; In conclusion, the paper described various aspects Q β queue is created and intitialized with V(i) of Graph Analytics for proximity graph based contact as root node; tracing in pandemic scenarios. R β currrent root node at depth level D; while Q β empty And time constraint = satisfied do References v( Aπ ) β V(i) label //helps recover from sudden halts ; [1] M. J. M. Chowdhury et. al., Covid-19 contact trac- v β visited //if node is not visited earlier; ing: Challenges and future directions (2020). v β Q.dequeue( ) /* Removing vertex v [2] N. e. a. Ahmed, A survey of covid-19 contact trac- from queue whose neighbour will be ing apps, arXiv preprint arXiv:2006.10306 (2020). visited next */; [3] V. Chandel, S. Banerjee, A. Ghose, Proxitrak: A D β D + 1 //increment depth ; robust solution to enforce real-time social distanc- v( Aπ ) β R( Aπ ) * (1 - D / M) //risk ing & contact tracing in enterprise scenario, CPD propagation; Workshop, Ubicomp 2020 (2020). if v has relative Low Centrality Measure [4] K. Das, S. Samanta, M. Pal, Study on centrality then measures in social networks: a survey, Social net- v( Aπ ) β y * v( Aπ ) //y is a damping work analysis and mining 8 (2018) 13. factor; [5] Y. Zhao, A survey on theoretical advances of end community detection in networks, Computational if v is part of a relative Large Community Statistics 9 (2017) e1403. then [6] S. Ghosh et. al., Distributed louvain algorithm for v( Aπ ) β z * v( Aπ ) //z is a damping graph community detection, in: IPDPS 2018, IEEE, factor; 2018, pp. 885β895. end [7] N. Fenton et. al., A privacy-preserving bayesian foreach neighbors W of v β G do network model for personalised covid19 risk as- if w β visited And w( Aπ ) β v( Aπ ) sessment and contact tracing, medRxiv (2020). //checks infected node label matches for [8] T. Bhattasali, Pandemic analytics to assess risk of previous halts; covid-19 outbreak (2020). then [9] V. MartΓnez et. al., A survey of link prediction Q.enqueue( w ) //Stores w in Q; in complex networks, ACM computing surveys end (CSUR) 49 (2016) 1β33. end end Algorithm 1: Contagion Risk Propagation