-

Graph Analytics on Proximity Data to Fight Contagion

Snehasis Banerjee

0 1

Vivek Chandel

0 1

Avik Ghose

0 1 0 TCS Research & Innovation, Tata Consultancy Services , Kolkata , India 1 [4] K. Das , S. Samanta, M. Pal, Study on centrality

As social distancing has become the new normal, it is eminent for enterprises and administrations to make decisions based on analytics on human proximity data. Graph Analytics can serve as an essential technology to understand human network behaviors and contact tracing inferences in pandemic and post-pandemic scenarios. A large amount of work has happened on epidemiology of pandemics. However, a formal application of graph analytics backed by visualization and mapping of graph algorithms has remained neglected in pandemic scenarios. This paper has addressed this important need backed by graph modeling, methods and analytics.

eol>graph analytics contact tracing COVID-19 pandemic human proximity data

Title of the Proceedings: Proceedings of the CIKM 2020 Workshops, October 19-20, Galway, Ireland.

Editors of the Proceedings: Stefan Conrad, Ilaria Tiddi. snehasis.banerjee@tcs.com (S. Banerjee) 0000-0001-6497-2085 (S. Banerjee)

© 2020 Copyright for this paper by its authors. Use permitted under Creative CPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g CCoEmUmoRns WLiceonrsekAsthtriobuptioPnr4o.0cIneteerdnaitniognasl ((CCC EBYU4R.0)-.WS.org)

1https://www.postgresql.org ; https://postgis.net 2https://www.orientdb.org

2. Graph Data Population While the individual events of proximity instances are stored in a RDBMS, the actual inference and analytics happens over the graph database. As any active sensor consume power, hence keeping a sensor device always online which sends events to server is impractical. Hence, there is a sleep and wake cycle, when proximity events are detected and sent to server database at intervals. However, to perform graph analytics, the data instances need to aggregated to populate the graph with meaningful information. Therefore, if there is no gap greater than a preset time threshold T in terms of two persons being in contact, the proximity event instances becomes a continuous event. When the gap is more than T, a new entry is entered in the graph in a Figure 2: Modeling of nodes and edges in a proximity graph similar fashion. Sometimes, due to sensor errors and signal reflection, false proximity events can be recorded.

However, as this is two way detection (person 1 de- is how a person proximity event is generated. A pertecting person 2 and vice versa when they are in close son can have multiple tagged sensors (like two smart contact of say 2 metres) [3], if there is only one event phones). Same sensor device (with a unique device ID) detected, that is dropped from graph insertion – this is can be used by various persons in disjoint time frames. a configurable setting as per sensitivity needs of spe- ( 3 ) Static zone tagged sensor: fixed sensors like beacific deployments. In order for wake and sleep cycles cons in elevators, ofice areas, cabins, canteens, gym, across sensors to be consistent, a standard time gap is shops, etc. helps in proximity event detection in two maintained across devices, which is typically a sleep ways: (a) Any detected proximity event confirms posiof 30 seconds and a wake of 5 seconds synchronized to tively if two or more persons were already reported by 00:00:00 hours of UTC (Coordinated Universal Time). person tagged sensors. (b) Usually infectious diseases The data in relational database and graph is deleted stay active at a space for some period of time. Therebased on some pre-set expiry timestamp to maintain fore, a proximity event with a person at time T1 and a sliding window based on business logic or infection another person with time T2 can trigger an indirect risk period (usually 30 days for COVID-19 scenario). proximity link between the two persons, if | 1 − 2| Geo-spatial queries may be directly run on spatially in- are within the infectious period relevant to that space. dexed graph database or be optionally run on the spa- (4) Dynamic zone tagged sensor: sensors tagged with tial layer of the relational database (to use spatial geo- transport can keep track of proximity events that ocmetric algorithms) by entangling with graph queries, cur when a person stays in that zone. It is important to however, in enterprise scenarios (small geo-location keep track of the time span a person is in this dynamic coverage), such needs are found to be rare. zone, as the entry and exit times will reveal the spatial zones of entry and disembarkment. This needs to be matched with other persons having overlapping prox2.1. Entities as Graph Nodes imity events but diferent entry or exit points. ContaA necessary part of graph modeling is identification gion can be tracked in the related entry and exit spaces of entities and its representation for fast graph queries (like bus stops, airports, parking lots, stations) even if and traversal. Here, the nodes are of 4 types: those spaces have bad sensor coverage.

( 1 ) Person: humans having features (attributes) like Locations can be sensed from GPS or cellular tower age, risk factors that remain static over a period of estimates in outdoor settings. In indoors, various lotime. Dynamic attributes are derived or calculated over calization techniques leveraging radar, RFID, Wifi, BLE an interval – risk estimation, proximity activity and can be used. There are other possible node types specifhealth estimate. A necessary attribute of a person in a ically mobile objects (like parcels, posts, shopping items, contagion scenario is the person’s current health sta- currency notes) that can play a part in the infection tus - infected, healthy or susceptible. spread. However, as proper sanitization practices can ( 2 ) Person tagged sensor: sensors like wearable smart mitigate these occasional object based contagion risks, devices and smartphones are tagged to a person – that it is kept out of scope of the paper.

2.2. Proximity Events as Graph Edges

As presented in Fig. 2, a new proximity event is entered into the graph as an edge with start time (preferably UTC timestamp) and time duration (say seconds) as necessary attributes. In the graph, relations are maintained between persons and not devices (bearing sensors) as same device may be used by diferent persons over a period of time. So person and zone resolution at the time of graph population is of high importance – proximity event edge keeps an attribute source (like Figure 3: Visualization of proximity relations (with number device sensor) to handle any future backtrack queries. of interactions and duration) given a person and depth level Indirect proximity events are inferred using transitive relation of person and zones with timestamp constraints.

By nature humans are mobile, hence it makes sense to duration of contact, the number of instances of conkeep a private record of all other nodes (places and tact, number of infected visiting a specific zone and the other persons) visited within a sliding interval of time time spent there. Other queries can detect crowds at period as edges. Because, two nodes in this graph can locations based on large number of temporally collidhave multiple edges (proximity instances) and prox- ing proximity events – those zones need to be marked imity events are by definition both way, the resulting as ‘at risk’ zones. Crowd detection will be based on graph is an undirected Multigraph. Due to social dis- spatial queries on spatio-relational database (say Posttancing measures, this graph will be sparse and have GIS) if dealing with latitude and longitude values (GPS). local community clusters. Hence, adjacency list is the best way to represent the relations between |V| vertices (nodes) and |E| edges, resulting in 2*|E| space com- 3.2. Centrality Measures plexity. A hash index on the nodes will aid in O( 1 ) lookup and fast adjacent node traversals. In a sparse In managing pandemics and enforcing social distancgraph with community clusters, graph traversal us- ing, it becomes crucial to find the specific zones that ing breadth first search is the optimal approach, which are hotspots and which persons have highest capabiltakes time around ( + ) ∗ / , for a depth up to ity to spread infection. Usage of graph centrality mealevel . and are respective vertices and edges at sures helps in identification of critical nodes. a community cluster in the graph of cluster diameter Degree centrality, measured by the number of unique (largest edge distance or number of nodes in path edge links of a node to other nodes can be thought of between two nodes in that cluster or sub-graph). the relative risk of getting infected if the local community sees an infection outbreak.

Closeness centrality of a node is: C(i) = ∑ , where 3. Graph Analytics is the geodesic distance from node i to node j (number of links in the shortest path from node i to node Graph analytics deals with pairwise relationship be- j). Closeness reveals how long it takes for infection to tween two entities at a time; as well as structural char- lfow from one node to others in the graph. acteristics of the graph as a whole. Graph Analyt- Betweenness centrality signifies that a node is cruics also includes visualization aspects, insights gained cial or forms a bridge for spreading of infection. Examfrom complex query execution and graph algorithms. ples are person nodes who visit many other nodes like doctor, courier and security staf. Examples of bridge 3.1. Graph Queries and Visualization space nodes are market and transport zones. Betweenness centrality is measured as, b(i) = ∑ / , where is number of shortest paths from node j to node k (j and k ≠ i) and g is number of shortest paths from node j to node k passing through node i.

There are other graph centrality measures [4] like Eigenvector Centrality. However, it requires calculations over adjacency matrix where as the graph model shown is based on adjacency list – hence not optimal.

In contact tracing scenario, one of the primary requirement is to find the chain of persons an infected person has come across. This can be done by graph traversal and the results can be showed in front-end visualization as shown in Fig. 3. For privacy reasons, the exact identity of a person is not disclosed and an encrypted alias is used. Other related graph queries are time overlap between visits to specific zones, the total

3.3. Graph Structure Measures Proximity events between 2 persons represent a Dyad in graph, where as proximity event between 2 persons and a mutually visited zone forms a Triad. This are the simplest forms of graph structures.

Eccentricity is the maximum distance from a given a graph is the maximum eccentricity value. Eccentricity calculation aids in understanding contagion’s spread from one node to the others. If infection propagates from a node A towards some specific nodes in the graph in the smallest number of steps, it means that the node A has better propagation eficiency.

Density serves as an important graph structure mea

sure, as more dense a graph is (more connected nodes), the more the risk of infection spreading. Density is calculated as: D(G) = 2 ∗ | |/(| | ∗ (| | − 1)).

A clique is a graph (or subgraph) in which every node is connected to every other node. Finding maxi- mum cliques in a graph provides the largest set of common links – hence, strong communities can be discov- node to all other nodes in a graph. The diameter of Figure 4: Importance of Community Clusters in Contagion ered with the following risk – if any one is infected, it will result in a high probability of others also getting infected. Another structure, K-cores, which are not necessarily cohesive groups, but indicate areas of a graph which contain clique-like structures.

Clustering coeficient measures the drift of nodes to form dense subgraphs. The clustering coeficient C of a node i shows how its neighbors are connected with each other. The local (node) clustering coeficient is

calculated as follows: C = 2*T / ( K * (k - 1) ), where

T is the number of distance triangles with node i and K * (k - 1) is the maximum number of possible con

nections in neighbors of i. A large C implies that the graph is well connected locally to form a cluster. Average graph clustering coeficient is: C = 1/n * ∑ .

3.4. Community Detection In contact tracing and related analytics, it is important

to discover communities from the proximity events so that at risk communities can be identified and infected communities can be kept in isolation. As seen from tection methods can be divided into three categories: divisive, agglomerative, and optimization algorithms. Some popular algorithms [5] are K-cliques, hierarchical clustering, spectral bisection, graph partioning, modularity optimization and link communities. One of the methods suitable for large graphs is Louvain Method [6] that uses graph modularity and has time complexity as O(|V|log|V|). Modularity is a measure of the struc ture of the graph and is defined as, Q = ( |E | ) / E , where |E | is the number of links connect| - |E ing nodes that belong to the same community and |E | is the estimated |E

|, if links were random. The Louvain Method works as follows: ( 1 ) By optimizing modularity locally on all the nodes, small communities of nodes are found. ( 2 ) Small communities are then grouped into one node, and the first step is repeated. −

3.5. Contagion and Risk Propagation

Some work has been done on contagion and risk estimation [7] [8]. However, they are dependent mostly on external factors to estimate the risk. Here, a method is presented that uses a combination of graph modeling, graph structure measures and infection properties to estimate risk for a person or community. Contagion is dependent on physical proximity, hence network difusion techniques can be helpful here. A related problem is Link Prediction: the task here is to predict whether there will be a link between two nodes, provided that link does not pre-exist. Node similarities can be found using measures [9] for analyzing the proximity of nodes in a graph such as a degree distribution, common neighbors, preferential attachment,

Jaccard coeficient and Leicht-Holme-Newman Index.

Highly influential nodes (hubs) in the graph (as highlighted in Section 3.2) enable a swift spread of an infection through the graph. On the other hand, poorly connected or periphery nodes would relax the spreading of infection and permit only a tiny portion of the graph to get exposed to the difusion. Graphs with lots of localized clusters (in case of highly fragmented lowdensity graph) may limit the spread of an infection and will face hurdles in establishing any momentum, whereas dense graphs with few gaps (or few boundary of Graph Analytics for proximity graph based contact as root node; tracing in pandemic scenarios. spanners) would assist the dissemination of contagion.

Earlier discussions in the paper were around finding risks in the graph structure, nodes and communities. Certain techniques like belief propagation and label propagation are popular in general graph analytics, however in this scenario, what is needed is a knowledge wrapper on breadth first algorithm which can match the scale of incoming proximity event streams. Finally the paper presents an approach to propagate risk across proximity graph (Algorithm 1). Due to storage of last propagation update attribute at each connected node, if there is a sudden stop due to server load or time constraints, the risk propagation can start from last checkpoint. This avoids updating the same node risk attribute more than once. The risk estimate propagation deals with three damping factors: x – takes care of contagion speed and is determined by disease stage of infected person (incubation, prodromal, illness, decline and convalescence), y – to handle influence of selected centrality measure, z – to handle relative community density’s impact on risk.

In conclusion, the paper described various aspects

ing & contact tracing in enterprise scenario, CPD measures in social networks: a survey, Social network analysis and mining 8 (2018) 13.

A survey on theoretical advances of community detection in networks, Computational [6] S. Ghosh et. al., Distributed louvain algorithm for graph community detection, in: IPDPS 2018, IEEE, 2018, pp. 885–895. [7] N. Fenton et. al., A privacy-preserving bayesian network model for personalised covid19 risk assessment and contact tracing, medRxiv (2020). [8] T. Bhattasali, Pandemic analytics to assess risk of covid-19 outbreak (2020).

A survey of link prediction in complex networks, ACM computing surveys node label;

Result: Risk Propagation to a specific depth

under specified time constraint. Parameters:

E ← set of edges or links in G; V ← set of vertices or nodes in the graph G; A ← external inputs like node health (Aℎ) and

attributes like degree; specified with subscript; A

← attribute to store last risk propagation software ← (.) link to software modules and Begin:

D ← 0 // current graph traversal depth level;

v ← V(i) ← node i of set V is infected; hence v(A ) ← 1 * x * w // infected person with

Q ← queue is created and intitialized with V(i)

satisfied do R ← currrent root node at depth level D; while Q ≠ empty And time constraint = sudden halts ; v( A ) ← V(i) label //helps recover from v ← visited //if node is not visited earlier; v ← Q.dequeue( ) /* Removing vertex v from queue whose neighbour will be visited next */; v( A ) ← R( A ) * (1 - D / M) //risk if v has relative Low Centrality Measure v( A ) ← y * v( A ) //y is a damping then end then end v( A ) ← z * v( A ) //z is a damping if w ≠ visited And w( A ) ≠ v( A ) previous halts; end

Q.enqueue( w ) //Stores w in Q;

[1] M. J. M. Chowdhury et . al., Covid-19 contact tracing: Challenges and future directions ( 2020 ).

[2]

N. e. a.

Ahmed , A survey of covid-19 contact tracing apps , arXiv preprint arXiv: 2006 . 10306 ( 2020 ).

[3]

Chandel ,

Banerjee ,

Ghose , Proxitrak: A robust solution to enforce real-time social distanc [5]

Zhao , [9]

Martínez et. al., foreach neighbors W of v ∈ G do / /checks infected node label matches for end