=Paper= {{Paper |id=Vol-2699/paper29 |storemode=property |title=Graph Analytics on Proximity Data to Fight Contagion |pdfUrl=https://ceur-ws.org/Vol-2699/paper29.pdf |volume=Vol-2699 |authors=Snehasis Banerjee,Vivek Chandel,Avik Ghose |dblpUrl=https://dblp.org/rec/conf/cikm/BanerjeeCG20 }} ==Graph Analytics on Proximity Data to Fight Contagion== https://ceur-ws.org/Vol-2699/paper29.pdf
Graph Analytics on Proximity Data to Fight Contagion
Snehasis Banerjeea , Vivek Chandela and Avik Ghosea
a TCS Research & Innovation, Tata Consultancy Services, Kolkata, India



                                          Abstract
                                          As social distancing has become the new normal, it is eminent for enterprises and administrations to make decisions based
                                          on analytics on human proximity data. Graph Analytics can serve as an essential technology to understand human network
                                          behaviors and contact tracing inferences in pandemic and post-pandemic scenarios. A large amount of work has happened
                                          on epidemiology of pandemics. However, a formal application of graph analytics backed by visualization and mapping of
                                          graph algorithms has remained neglected in pandemic scenarios. This paper has addressed this important need backed by
                                          graph modeling, methods and analytics.

                                          Keywords
                                          graph analytics, contact tracing, COVID-19 pandemic, human proximity data


                                                                                                                   person can be at a location for a certain time which is
                                                                                                                   visited by another person after some time – this events
                                                                                                                   will be missed in standard proximity detection solu-
                                                                                                                   tions. Also, depending on the infection type (in this
                                                                                                                   case COVID-19), different persons and places will have
                                                                                                                   different levels of susceptibility to risks, based on the
                                                                                                                   infection’s properties and incubation period. Hence it
                                                                                                                   is important to have: (a) a graph model to store and
                                                                                                                   represent proximity events and related attributes, that
                                                                                                                   is robust and scalable (b) fast techniques that can aid in
                                                                                                                   contact tracing (c) methods that can help risk estima-
                                                                                                                   tion by applying graph analytics. This paper presents a
                                                                                                                   system and several methods to tackle the prior points.
                                                                                                                      As seen in Fig. 1, the system intakes proximity re-
Figure 1: System Architecture for enabling Graph Analytics                                                         lation data (either person to person or person to zone)
                                                                                                                   from a multi-modal sensory input. It can be GPS (Global
                                                                                                                   Positioning System) and/or Bluetooth Low Energy (BLE)
1. Background                                                                                                      present in contact tracing smart devices (like phone
                                                                                                                   and watches) or it may be a multitude of hard sen-
Social distancing [1] has become an essential need to                                                              sors like BLE and LoRa (long range low-power net-
manage pandemics like COVID-19. In recent times,                                                                   work communication) beacons or Radio-frequency iden-
the world has witnessed deployment of multiple apps                                                                tification (RFID) tags. Crowds (susceptible places of
and systems by governments [2]. Contact tracing and                                                                high proximity events) at a specific location can be de-
related methods are expected to remain prevalent in a                                                              tected by radar, microphone captured sound level, in-
post pandemic scenario. Adoption in enterprise is ex-                                                              frared, Wifi signal as well as soft sensors like location
pected to be of the order of mass scale. However, the                                                              tagged social media feeds. Sensor feeds are stored in
enterprise level settings will be different from a gov-                                                            a relational database management systems (RDBMS,
ernment monitored contact tracing scenario. As an ex-                                                              such as PostGres with PostGIS 1 having spatial index-
ample, in a factory, same device may get shared across                                                             ing support) in form of proximity event tuples :
multiple workers in rotation. Also, only person to per-                                                            { entity, entities detected, timestamp, optional attributes}.
son proximity events will not be futile, as, an infected                                                           Based on pre-set business logic (like duration of aggre-
Title of the Proceedings: Proceedings of the CIKM 2020 Workshops,                                                  gation), data from RDBMS is updated periodically in a
October 19-20, Galway, Ireland.                                                                                    graph database (like OrientDB2 ). Complex queries and
Editors of the Proceedings: Stefan Conrad, Ilaria Tiddi.                                                           graph analytics are run on the graph from the front-
   snehasis.banerjee@tcs.com (S. Banerjee)                                                                         end application to get insights and visualization.
 0000-0001-6497-2085 (S. Banerjee)
                                    Β© 2020 Copyright for this paper by its authors. Use permitted under Creative      1 https://www.postgresql.org ; https://postgis.net
                                    Commons License Attribution 4.0 International (CC BY 4.0).
 CEUR
 Workshop
               http://ceur-ws.org
               ISSN 1613-0073       CEUR Workshop Proceedings (CEUR-WS.org)                                           2 https://www.orientdb.org
 Proceedings
2. Graph Data Population
While the individual events of proximity instances are
stored in a RDBMS, the actual inference and analytics
happens over the graph database. As any active sen-
sor consume power, hence keeping a sensor device al-
ways online which sends events to server is impracti-
cal. Hence, there is a sleep and wake cycle, when prox-
imity events are detected and sent to server database
at intervals. However, to perform graph analytics, the
data instances need to aggregated to populate the graph
with meaningful information. Therefore, if there is no
gap greater than a preset time threshold T in terms of
two persons being in contact, the proximity event in-
stances becomes a continuous event. When the gap is
                                                           Figure 2: Modeling of nodes and edges in a proximity graph
more than T, a new entry is entered in the graph in a
similar fashion. Sometimes, due to sensor errors and
signal reflection, false proximity events can be recorded.
However, as this is two way detection (person 1 de- is how a person proximity event is generated. A per-
tecting person 2 and vice versa when they are in close son can have multiple tagged sensors (like two smart
contact of say 2 metres) [3], if there is only one event phones). Same sensor device (with a unique device ID)
detected, that is dropped from graph insertion – this is can be used by various persons in disjoint time frames.
a configurable setting as per sensitivity needs of spe-       (3) Static zone tagged sensor: fixed sensors like bea-
cific deployments. In order for wake and sleep cycles      cons   in elevators, office areas, cabins, canteens, gym,
across sensors to be consistent, a standard time gap is    shops,    etc. helps in proximity event detection in two
maintained across devices, which is typically a sleep      ways:    (a) Any detected proximity event confirms posi-
of 30 seconds and a wake of 5 seconds synchronized to      tively  if two  or more persons were already reported by
00:00:00 hours of UTC (Coordinated Universal Time).        person    tagged    sensors. (b) Usually infectious diseases
The data in relational database and graph is deleted       stay  active   at  a  space for some period of time. There-
based on some pre-set expiry timestamp to maintain         fore,  a  proximity     event with a person at time T1 and
a sliding window based on business logic or infection      another     person     with  time T2 can trigger an indirect
risk period (usually 30 days for COVID-19 scenario).       proximity     link   between    the two persons, if |𝑇 1 βˆ’ 𝑇 2|
Geo-spatial queries may be directly run on spatially in-   are within    the   infectious   period relevant to that space.
dexed graph database or be optionally run on the spa-         (4) Dynamic       zone  tagged   sensor: sensors tagged with
tial layer of the relational database (to use spatial geo- transport    can   keep   track  of  proximity events that oc-
metric algorithms) by entangling with graph queries,       cur when     a person    stays  in that zone. It is important to
however, in enterprise scenarios (small geo-location       keep   track  of  the  time  span   a person is in this dynamic
coverage), such needs are found to be rare.                zone,   as the  entry   and   exit times will reveal the spatial
                                                           zones of entry and disembarkment. This needs to be
                                                           matched with other persons having overlapping prox-
2.1. Entities as Graph Nodes                               imity events but different entry or exit points. Conta-
A necessary part of graph modeling is identification gion can be tracked in the related entry and exit spaces
of entities and its representation for fast graph queries (like bus stops, airports, parking lots, stations) even if
and traversal. Here, the nodes are of 4 types:             those spaces have bad sensor coverage.
   (1) Person: humans having features (attributes) like       Locations can be sensed from GPS or cellular tower
age, risk factors that remain static over a period of estimates in outdoor settings. In indoors, various lo-
time. Dynamic attributes are derived or calculated over calization techniques leveraging radar, RFID, Wifi, BLE
an interval – risk estimation, proximity activity and can be used. There are other possible node types specif-
health estimate. A necessary attribute of a person in a ically mobile objects (like parcels, posts, shopping items,
contagion scenario is the person’s current health sta- currency notes) that can play a part in the infection
tus - infected, healthy or susceptible.                    spread. However, as proper sanitization practices can
   (2) Person tagged sensor: sensors like wearable smart mitigate these occasional object based contagion risks,
devices and smartphones are tagged to a person – that it is kept out of scope of the paper.
2.2. Proximity Events as Graph Edges
As presented in Fig. 2, a new proximity event is en-
tered into the graph as an edge with start time (prefer-
ably UTC timestamp) and time duration (say seconds)
as necessary attributes. In the graph, relations are main-
tained between persons and not devices (bearing sen-
sors) as same device may be used by different persons
over a period of time. So person and zone resolution
at the time of graph population is of high importance
– proximity event edge keeps an attribute source (like Figure 3: Visualization of proximity relations (with number
device sensor) to handle any future backtrack queries. of interactions and duration) given a person and depth level
Indirect proximity events are inferred using transitive
relation of person and zones with timestamp constraints.
By nature humans are mobile, hence it makes sense to duration of contact, the number of instances of con-
keep a private record of all other nodes (places and tact, number of infected visiting a specific zone and the
other persons) visited within a sliding interval of time time spent there. Other queries can detect crowds at
period as edges. Because, two nodes in this graph can locations based on large number of temporally collid-
have multiple edges (proximity instances) and prox- ing proximity events – those zones need to be marked
imity events are by definition both way, the resulting as β€˜at risk’ zones. Crowd detection will be based on
graph is an undirected Multigraph. Due to social dis- spatial queries on spatio-relational database (say Post-
tancing measures, this graph will be sparse and have GIS) if dealing with latitude and longitude values (GPS).
local community clusters. Hence, adjacency list is the
best way to represent the relations between |V| ver-
tices (nodes) and |E| edges, resulting in 2*|E| space com-
                                                           3.2. Centrality Measures
plexity. A hash index on the nodes will aid in O(1)
lookup and fast adjacent node traversals. In a sparse In managing pandemics and enforcing social distanc-
graph with community clusters, graph traversal us- ing, it becomes crucial to find the specific zones that
ing breadth first search is the optimal approach, which are hotspots and which persons have highest capabil-
takes time around (𝑉𝑐 + 𝐸𝑐 ) βˆ— 𝑑𝑙 /𝐷𝑐 , for a depth up to ity to spread infection. Usage of graph centrality mea-
level 𝑑𝑙 . 𝑉𝑐 and 𝐸𝑐 are respective vertices and edges at sures helps in identification of critical nodes.
a community cluster in the graph of cluster diameter           Degree centrality, measured by the number of unique
𝐷𝑐 (largest edge distance or number of nodes in path edge links of a node to other nodes can be thought of
between two nodes in that cluster or sub-graph).           the relative risk of getting infected if the local commu-
                                                           nity sees an infection outbreak.
                                                               Closeness centrality of a node is: C(i) = βˆ‘ 𝑑𝑖 𝑗 , where
3. Graph Analytics                                         𝑑𝑖 𝑗 is the geodesic distance from node i to node j (num-
                                                           ber of links in the shortest path from node i to node
Graph analytics deals with pairwise relationship be-
                                                           j). Closeness reveals how long it takes for infection to
tween two entities at a time; as well as structural char-
                                                           flow from one node to others in the graph.
acteristics of the graph as a whole. Graph Analyt-
                                                               Betweenness centrality signifies that a node is cru-
ics also includes visualization aspects, insights gained
                                                           cial or forms a bridge for spreading of infection. Exam-
from complex query execution and graph algorithms.
                                                           ples are person nodes who visit many other nodes like
                                                           doctor, courier and security staff. Examples of bridge
3.1. Graph Queries and Visualization                       space nodes are market and transport zones. Between-
                                                           ness centrality is measured as, b(i) = βˆ‘ 𝑔𝑗 𝑖 π‘˜ /𝑔𝑗 π‘˜ , where
In contact tracing scenario, one of the primary require-
                                                           𝑔 is number of shortest paths from node j to node k
ment is to find the chain of persons an infected person 𝑗 π‘˜
                                                           (j and k β‰  i) and g𝑗 𝑖 π‘˜ is number of shortest paths from
has come across. This can be done by graph traver-
                                                           node j to node k passing through node i.
sal and the results can be showed in front-end visual-
                                                               There are other graph centrality measures [4] like
ization as shown in Fig. 3. For privacy reasons, the
                                                           Eigenvector Centrality. However, it requires calcula-
exact identity of a person is not disclosed and an en-
                                                           tions over adjacency matrix where as the graph model
crypted alias is used. Other related graph queries are
                                                           shown is based on adjacency list – hence not optimal.
time overlap between visits to specific zones, the total
3.3. Graph Structure Measures
Proximity events between 2 persons represent a Dyad
in graph, where as proximity event between 2 persons
and a mutually visited zone forms a Triad. This are the
simplest forms of graph structures.
   Eccentricity is the maximum distance from a given
node to all other nodes in a graph. The diameter of           Figure 4: Importance of Community Clusters in Contagion
a graph is the maximum eccentricity value. Eccen-
tricity calculation aids in understanding contagion’s
spread from one node to the others. If infection prop-        divisive, agglomerative, and optimization algorithms.
agates from a node A towards some specific nodes in           Some popular algorithms [5] are K-cliques, hierarchi-
the graph in the smallest number of steps, it means           cal clustering, spectral bisection, graph partioning, mod-
that the node A has better propagation efficiency.            ularity optimization and link communities. One of the
   Density serves as an important graph structure mea-        methods suitable for large graphs is Louvain Method
sure, as more dense a graph is (more connected nodes),        [6] that uses graph modularity and has time complex-
the more the risk of infection spreading. Density is cal-     ity as O(|V|log|V|). Modularity is a measure of the struc-
culated as: D(G) = 2 βˆ— |𝐸|/(|𝑉 | βˆ— (|𝑉 | βˆ’ 1)).               ture of the graph and is defined as, Q = ( |E𝑖 𝑛 | - |E𝑖 π‘›βˆ’
   A clique is a graph (or subgraph) in which every
                                                              𝑅 | ) / E , where |E𝑖 𝑛 | is the number of links connect-
node is connected to every other node. Finding maxi-          ing nodes that belong to the same community and |E𝑖
mum cliques in a graph provides the largest set of com-
                                                              π‘›π‘Žπ‘… | is the estimated |E𝑖 𝑛 |, if links were random. The
mon links – hence, strong communities can be discov-          Louvain Method works as follows: (1) By optimizing
ered with the following risk – if any one is infected,        modularity locally on all the nodes, small communities
it will result in a high probability of others also get-      of nodes are found. (2) Small communities are then
ting infected. Another structure, K-cores, which are          grouped into one node, and the first step is repeated.
not necessarily cohesive groups, but indicate areas of
a graph which contain clique-like structures.
   Clustering coefficient measures the drift of nodes to      3.5. Contagion and Risk Propagation
form dense subgraphs. The clustering coefficient C of         Some work has been done on contagion and risk es-
a node i shows how its neighbors are connected with           timation [7] [8]. However, they are dependent mostly
each other. The local (node) clustering coefficient is        on external factors to estimate the risk. Here, a method
calculated as follows: C𝑖 = 2*T𝑖 / ( K𝑖 * (k𝑖 - 1) ), where   is presented that uses a combination of graph model-
T𝑖 is the number of distance triangles with node i and        ing, graph structure measures and infection properties
K𝑖 * (k𝑖 - 1) is the maximum number of possible con-          to estimate risk for a person or community. Conta-
nections in neighbors of i. A large C implies that the        gion is dependent on physical proximity, hence net-
graph is well connected locally to form a cluster. Av-        work diffusion techniques can be helpful here. A re-
erage graph clustering coefficient is: C = 1/n * βˆ‘ 𝐢𝑖 .       lated problem is Link Prediction: the task here is to
                                                              predict whether there will be a link between two nodes,
3.4. Community Detection                                      provided that link does not pre-exist. Node similari-
                                                              ties can be found using measures [9] for analyzing the
In contact tracing and related analytics, it is important     proximity of nodes in a graph such as a degree distri-
to discover communities from the proximity events so          bution, common neighbors, preferential attachment,
that at risk communities can be identified and infected       Jaccard coefficient and Leicht-Holme-Newman Index.
communities can be kept in isolation. As seen from               Highly influential nodes (hubs) in the graph (as high-
Fig. 4, number of communities increase with the num-          lighted in Section 3.2) enable a swift spread of an in-
ber of proximity event instances. Also the density of         fection through the graph. On the other hand, poorly
the communities increases reflecting closely connected        connected or periphery nodes would relax the spread-
contacts. The task of community detection (or graph           ing of infection and permit only a tiny portion of the
clustering) is to discover subsets of nodes (clusters) of     graph to get exposed to the diffusion. Graphs with lots
connected communities in which nodes have many in-            of localized clusters (in case of highly fragmented low-
ternal edges and few external edges. Detecting com-           density graph) may limit the spread of an infection
munities in a graph is NP-complete. Community de-             and will face hurdles in establishing any momentum,
tection methods can be divided into three categories:         whereas dense graphs with few gaps (or few boundary
spanners) would assist the dissemination of contagion.      Result: Risk Propagation to a specific depth
   Earlier discussions in the paper were around find-                  under specified time constraint.
ing risks in the graph structure, nodes and communi-        Parameters:
ties. Certain techniques like belief propagation and la-    V ← set of vertices or nodes in the graph G;
bel propagation are popular in general graph analytics,     E ← set of edges or links in G;
however in this scenario, what is needed is a knowl-        A ← external inputs like node health (Aβ„Ž ) and
edge wrapper on breadth first algorithm which can            attributes like degree; specified with subscript;
match the scale of incoming proximity event streams.        A𝑙 ← attribute to store last risk propagation
Finally the paper presents an approach to propagate          node label;
risk across proximity graph (Algorithm 1). Due to stor-     software ← (.) link to software modules and
age of last propagation update attribute at each con-        libraries;
nected node, if there is a sudden stop due to server load   M ← diameter of graph, i.e. maximum
or time constraints, the risk propagation can start from     eccentricity value;
last checkpoint. This avoids updating the same node         x, y, z ← damping factors to balance risk
risk attribute more than once. The risk estimate propa-      propagation bias;
gation deals with three damping factors: x – takes care     Begin:
of contagion speed and is determined by disease stage       D ← 0 // current graph traversal depth level;
of infected person (incubation, prodromal, illness, de-     v ← V(i) ← node i of set V is infected; hence
cline and convalescence), y – to handle influence of         Aβ„Ž (i) = infect;
selected centrality measure, z – to handle relative com-    v(Aπ‘Ÿ ) ← 1 * x * w // infected person with
munity density’s impact on risk.                             weighted risk w;
   In conclusion, the paper described various aspects       Q ← queue is created and intitialized with V(i)
of Graph Analytics for proximity graph based contact         as root node;
tracing in pandemic scenarios.                              R ← currrent root node at depth level D;
                                                            while Q β‰  empty And time constraint =
                                                             satisfied do
References                                                       v( A𝑙 ) ← V(i) label //helps recover from
                                                                  sudden halts ;
[1] M. J. M. Chowdhury et. al., Covid-19 contact trac-           v ← visited //if node is not visited earlier;
    ing: Challenges and future directions (2020).                v ← Q.dequeue( ) /* Removing vertex v
[2] N. e. a. Ahmed, A survey of covid-19 contact trac-            from queue whose neighbour will be
    ing apps, arXiv preprint arXiv:2006.10306 (2020).             visited next */;
[3] V. Chandel, S. Banerjee, A. Ghose, Proxitrak: A              D ← D + 1 //increment depth ;
    robust solution to enforce real-time social distanc-         v( Aπ‘Ÿ ) ← R( Aπ‘Ÿ ) * (1 - D / M) //risk
    ing & contact tracing in enterprise scenario, CPD             propagation;
    Workshop, Ubicomp 2020 (2020).                               if v has relative Low Centrality Measure
[4] K. Das, S. Samanta, M. Pal, Study on centrality               then
    measures in social networks: a survey, Social net-               v( Aπ‘Ÿ ) ← y * v( Aπ‘Ÿ ) //y is a damping
    work analysis and mining 8 (2018) 13.                              factor;
[5] Y. Zhao, A survey on theoretical advances of                 end
    community detection in networks, Computational               if v is part of a relative Large Community
    Statistics 9 (2017) e1403.                                    then
[6] S. Ghosh et. al., Distributed louvain algorithm for              v( Aπ‘Ÿ ) ← z * v( Aπ‘Ÿ ) //z is a damping
    graph community detection, in: IPDPS 2018, IEEE,                   factor;
    2018, pp. 885–895.                                           end
[7] N. Fenton et. al., A privacy-preserving bayesian             foreach neighbors W of v ∈ G do
    network model for personalised covid19 risk as-                  if w β‰  visited And w( A𝑙 ) β‰  v( A𝑙 )
    sessment and contact tracing, medRxiv (2020).                    //checks infected node label matches for
[8] T. Bhattasali, Pandemic analytics to assess risk of                previous halts;
    covid-19 outbreak (2020).                                         then
[9] V. MartΓ­nez et. al., A survey of link prediction                      Q.enqueue( w ) //Stores w in Q;
    in complex networks, ACM computing surveys                       end
    (CSUR) 49 (2016) 1–33.                                       end
                                                            end
                                                             Algorithm 1: Contagion Risk Propagation