=Paper=
{{Paper
|id=Vol-2083/paper-16
|storemode=property
|title=Topic-aware Network Visualisation to Explore Large Email Corpora
|pdfUrl=https://ceur-ws.org/Vol-2083/paper-16.pdf
|volume=Vol-2083
|authors=Tim Repke,Ralf Krestel
|dblpUrl=https://dblp.org/rec/conf/edbt/RepkeK18
}}
==Topic-aware Network Visualisation to Explore Large Email Corpora==
Topic-aware Network Visualisation to
Explore Large Email Corpora
Tim Repke Ralf Krestel
Hasso Plattner Institute Hasso Plattner Institute
Potsdam, Germany Potsdam, Germany
tim.repke@hpi.de ralf.krestel@hpi.de
ABSTRACT one person as the root, thus an overview of the entire network
Nowadays, more and more large datasets exhibit an intrinsic is lost. CactusTrees [6] on the other hand represent hierarchical
graph structure. While there exist special graph databases to han- structures with the goal of untangling overlaid bundles of inter-
dle ever increasing amounts of nodes and edges, visualising this secting edges, making distant connections more apparent. As
data becomes infeasible quickly with growing data. In addition, higher order dependencies may get lost in traditional visualisa-
looking at its structure is not sufficient to get an overview of a tions, HoNVis [21] adds nodes to encode dependencies in chains
graph dataset. Indeed, visualising additional information about of interactions. Usually, a communication network has many
nodes or edges without cluttering the screen is essential. In this nodes and overlapping connections already, so Yang et al. [23]
paper, we propose an interactive visualisation for social networks rather focus on discovering overlapping cores to improve the
that positions individuals (nodes) on a two-dimensional canvas identification of community boundaries to highlight global latent
such that communities defined by social links (edges) are easily structures. Similarly, Gronemann et al. [11] use the metaphor of
recognisable. Furthermore, we visualise topical relatedness be- islands and hills to visualise clustered graphs, making densely
tween individuals by analysing information about social links, in connected communities clearly noticeable. The edges are bundled
our case email communication. To this end, we utilise document and follow valleys of the resulting topology, thus making rela-
embeddings, which project the content of an email message into tionships between other communities hard to follow. MapSets [7]
a high dimensional semantic space and graph embeddings, which assume a graph that was laid out using embeddings reflecting
project nodes in a network graph into a latent space reflecting communities. An algorithm then draws regions around clusters
their relatedness. of nodes, such that the bounding shapes are contiguous and
non-overlapping, but yet abstract. Another approach to visualise
networks at full scale is to aggregate nodes based on their spa-
1 INTRODUCTION tial distribution and thereby allowing for a simple exploration
In our modern information society we produce substantial amounts with contour lines and heatmap overlays to emphasise latent
of data each day. A large portion of it comes from the communi- structures as proposed by Hildenbrand et al. [13].
cation on social media platforms or through emails. Special graph Document visualisation aims to visualise the content, such that
databases enable the efficient storage of these large communica- users gain quick insights into topics, latent phrases, or trends.
tion networks and provide interfaces to query or analyse the data. Tiara [22] extracts topics and derives time-sensitive keywords
Visualising networks in their entirety on the other hand is a very to depict evolving subjects over time as stacked plots. Other
challenging task. Users investigating a communication network approaches project documents into a latent space, using topic
want to find information about when does who communicate models or embeddings. Creating scatter-plots of embedded docu-
with whom about what. These kind of networks can be found in ments of a large corpus may result in a very dense and unclear
many different shapes. Modern social networks, such as Twitter layout, so Chen et al. [4] developed an algorithm to reduce over-
or Facebook exhibit similar structures as classic, offline social full visualisations by picking representative documents. A dif-
networks [20]. We investigate another type of social network: a ferent approach is taken by Fortuna et al. [8], who do not show
collection of emails. documents directly, but generate a heatmap of the populated
Given the communication data over a year or more, it is prac- canvas and overlay it with salient phrases at more densely popu-
tically impossible to gain an overview or quick insights into lated areas from the underlying documents in that region. Friedl
the latent network structure with a basic approach as shown in et al. [10] extend that concept by drawing clear lines between
1. Also, in such a traditional network visualisation, information regions and colouring them. They also add edges between salient
about the content of messages sent between individuals is lost. phrases based on co-occurrences in the texts. Most recently Car-
Besides these traditional systems, more exotic approaches use tograph [19] was proposed, which is visually very similar to
the metaphor of geographical maps [17] to visualise networks, previous approaches, but uses pre-rendered information of dif-
for example using topology to reflect connectivity of densely ferent resolution and map technology to enable a responsive
connected social communities. The map analogy can also be used interactive visualisation. Regions are coloured based on underly-
to visualise the contents of documents by embedding them into ing ontologies from a knowledge-base.
a high dimensional semantic space [15] and projecting it on the Our goal is to merge approaches for network and document
map as a document landscape. In order to highlight how relation- visualisations in one interactive user interface. This means to
ships form and change based on the interactions, the metaphor integrate multiple dimensions of email datasets including time,
of a growing tree ca be used (ContactTrees [18]). Although this interactions, users and topics into a 2D map representation. Giv-
reflects temporal aspects of dynamic networks well, it focuses on ing an overview over latent structures and topics in one map
© 2018 Copyright held by the owner/author(s). Published in the Workshop may significantly improve the exploration of a corpus by users
Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna,
Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted
under the terms of the Creative Commons license CC-by-nc-nd 4.0.
1
104
Figure 1: Traditional basic visualisation of a communication graph from 2000 emails using force layout
unfamiliar with the domain and terminology. Also domain ex- of documents without any prior knowledge about its content and
perts could benefit from such an overview, e.g. by easily being individuals involved.
able to identify global patterns in the data. In our system, we use the names and email addresses of senders
A specific application scenario that could benefit from such and recipients (individuals), communication network, semantic
integrated, interactive visualisations is the analysis of large, un- vector representations of email contents, and as part of an overlay
structured, heterogeneous data collections. Data-driven journal- the timestamps of emails and propose a graph layout over a
ism [5] often has to deal with leaked, unstructured, very hetero- document landscape that visually describes who talks with whom
geneous data, e.g. in the context of the Panama Papers, where about what at a given time period.
journalists needed to untangle and order huge amounts of infor-
mation, search entities, and visualise found patterns [3]. Similar 3 SYSTEM ARCHITECTURE
datasets are of interest in the context of computational foren-
Visualising communication networks in a topic-aware fashion to
sics [9]. Auditing firms and law enforcement need to sift through
explore documents and salient structures is not straightforward
huge amounts of data to gather evidence of criminal activity,
Different layout objectives may produce contradicting results and
often involving communication networks and documents [14].
the challenges of processing big data need to be addressed [2]. In
this section, we describe algorithmic approaches behind the sys-
2 INTERACTIVE VISUALISATION tem we are working on. For a discussion of engineering aspects
Systems for document exploration largely vary in what they on how to store, serve, and render the map-like data, we refer to
display and how users interact with them. This depends partly the Cartograph stack [19], as we will focus on the process how
on the available raw data, but also on information extracted from to get the information that the map is generated from.
pre-processing or enrichment with external sources. 1 shows We visualise the embedded emails as dots in a two-dimensional
a basic visualisation of the network graph extracted from an landscape in which individuals are placed as nodes connected by
email corpus. Although it is an improvement over only listing edges. All emails between two individuals are reduced into one
connections, large densely connected graphs quickly become edge reducing visual complexity and making it easier to detect
hard to read and information about the email contents is lost. salient structures. However, that comes with the trade-off that
Exploring document collections can be seen as a top-down nodes and edges cannot be perfectly placed in the landscape to
approach, where the system provides abstract overviews of the cover all semantic aspects of the communication between them,
entire document collection and users incrementally refine the but rather an estimate. Our very early prototype placed some
search, narrowing the results to just a few documents of interest. individuals with no dominant topic in a crowded area in the
Such a top-down approach may help users without prior knowl- centre of the landscape as shown in 2, where colours of opaque
edge to get a sense for the data by visualising high level latent dots for emails correspond to that of the sender. Although the
structures of communication networks or the topical distribu- network visualisation at this point does not make connections
tions. more clear than in 1, users can already distinguish individuals
In the scope of this work we primarily consider documents to with similar or unrelated topics.
be emails or data attached to them. The sender, recipients, time, Our proposed algorithm to find a stable network layout has
and content can directly be extracted from the raw data. We call three stages, namely an (i) initialisation phase which creates
these – and results from further processing – dimensions that can the landscape and roughly places nodes and connections, an
be visualised. From the contents one may infer named entities, (ii) update phase which iteratively updates the node placement
topics, embeddings, or salient phrases, while the communication towards a better fit, and finally a (iii) post processing phase where
network spanned by sender-recipient pairs can be used to detect edges become splines to make latent structures more clear and a
salient structures and hierarchies. The temporal information map topology is added.
enables the previously mentioned data to be analysed over time
to detect evolving or changing patterns. Initialising the Landscape. To generate the document land-
There are numerous ways to visualise each dimension on scape, we first process the network graph to roughly determine
its own or in combination with others. The requirement of a regions, where documents will be placed. Therefore we apply
dimension and its priority in a visualisation is dictated by the node2vec [12] to the communication network and embed each in-
system objective. From the wide range of possibilities, we strive dividual’s node. We separate the graph into communities Pi ⊂ P
for a system which supports the exploration of a large collection using the kernel density of the resulting populated space at
105
645-645
1301-1301
951-951
1708-1708
474-474
1033-1033 1193-1193
2093-2093
2256-2256
1362-1362 1719-1719
878-878 1553-1553
1730-1730
2370-2370
1603-1603
1143-1143
2012-2012 1276-1276 1768-1768
995-995 1166-1166
1263-1263 1689
209-209 1110-1110
2490-2490 127-127
676-676 725-725
1195-1195
722-722
1120-1120 794-794
580-580 374
848-848 1402-1402
118-118
1119-1119
34-34
2208-2208
641-641 976-976
825-825 1452-1452
768-768
2577-2577 2622-262
1581-1581 2523-2523
1652-1652
2198-2198
210-210 1770-1770
1515-1515
1717-1717
1076-1076 509-509
1949-1949
970-970
2375-2375 1956-1956 2648-2648
2419-2419 1993-1993
85-85
442-442
124-124 2250-2250 1764-1764
1920-1920 2183-2183
2665-2665
2092-2092
111-111 525-525
2450-2450
2412-2412 63-63
2629-2629 1027-1027
2281-2281 1352-1352
Figure 2: Rendered prototype output after landscape initialisation without prior community segmentation
448-448 1490-1490
1941-1941
2253-2253 1897-1897 2520-2520
1104-1104
978-978 2600-2600
307-307
1100-1100 2283-2283
2070-2070 619-619
667-667 1703-1703
1454-1454
1648-1648 1514-1514 964-964
727-727 752-752 2460-2460
2405-2405 1192-1192 528-528
1380-1380 1502-1502
1615-1615 1498-1498 95-952302-2302
145-145
20
threshold κ, where a higher κ results in more, but smaller com- where d (·, ·) is the distance between two nodes (zero if no con-
953-953 1876-1876 1096-1096
1387-1387 1134-1134 1323-1323
2163-2163
2303-2303 1481-1481 787-787 2487-2487
munities. For each community Pi , pairwise neighbourhood simi- nection exists) or shortest distance from a node to its ideal line.
1369-1369
231-231
519-519 2185-2185
2440-2440 1474-147
2680-2680
1890-1890 267-267 844-844
larities are calculated using euclidean distance between nodes, To adjust the layout towards either a better semantic or structural
1257-1257 2411-2411
1059-1059 753-753
312-312 2663-2663
434-434 253-253 33-33
1425-1425
1720-1720 1286-1286
14-14 1684-1684 175-175 1558-1558
1147-1147 327-327
1736-1736
forming the triangular matrix Si , where skl is the similarity be- fit, we introduce parameters θ and η.
2168-2168 2454-2454
1578-1578 1631-1631
2625-2625 1554-1554
1637-1637 2108
1339-1339
439-439 1599-1599 520-520 2678-2678
tween pk , pl ∈ Pi . In order to minimise 1, we use stochastic gradient descent. In
2200-2200 322-322 1683-1683
1565-1565 1007-1007
1901-1901 218-218 2124-2124
131-131 1171-1171
2691-2691
539-539 1847-1847
Furthermore, we train document embeddings [15, 25] on all each iteration step, we can derive the direction and magnitude
45-45 875-875
1137-1137 108-108
876-876 1673-1673
801-801 264-264
704-704
663-663
98-98 1068-1068
191-191 1701-1701
501-501 1038-1038
904-904
emails and use them to infer high dimensional semantic vec- each node should be moved towards a better semantic fit and
1071-1071
2096-2096
2465-2465 1478-1478 1072-1072
2588-2588 294-294
tor representations. Let Mi be the set of emails that originated closer proximity to it’s neighbourhood in the network. 511-511
1704-1704
2469-2469
1078-1078
in community Pi . For each email m ∈ Mi , the dimensionality The semantic gradient for p j ∈ P is defined by
1405-1405 983-983
1573-1573 266-266
221-221
719-719
732-732
176-176
2659-2659
is reduced using t-SNE [16], which retains possible semantic
1253-1253
1285-1285
2673-2673
1549-1549
clusterings of documents in the higher dimensional space. The δ⃗js := (pm m
(2)
213-
2030
j − p j ) ∥p j − p j ∥
2428-2
2620-26
resulting two-dimensional vectors are then placed as dots on the
map using the centre of embedded network communities as the where pmj is the closest point on mM p j to p j and ∥·∥ denotes the
respective origin, whereas the size is determined by the number euclidean norm, while neighbourhood gradient is defined by 602-602
of related individuals. X
We also initialise communication network’s layout. Thereby, δ⃗jn := (p j − pk ) ∥p j − pk ∥ − s jk (3)
the staring position of a node representing an individual is de- pk ∈P 247-247
termined by the normalised sum of two-dimensional vectors of p j ∼pk
all emails he or she has sent or received. This way, we implic-
where p j ∼ pk denotes that an edge exists between p j and pk .
itly group semantically related individuals into communities as
With the definitions in 2 and 3, we can formulate the update
frequent communication biases this normalised sum. Straight
vector δ⃗j for node p j ∈ P as
edges are added between the nodes if the respective individuals
exchanged emails. Note, that many edges may only represent
a small number of emails. Applying a variable threshold σ can δ⃗j := ξ θ δ⃗jn + ηδ⃗js (4)
reduce the computational load in later stages, as these edges will
where ξ is the learning rate and θ, η as before parameters to
not impact the overall layout very much. They can be added
weight between a better semantic or neighbourhood fit.
again as the user requests a detailed visualisation by zooming in
Most likely, complex network structures might prevent the
or through other interactions.
stochastic gradient descent to find a stable minimum, so the score
In the algorithm’s second stage, we iteratively try to improve
of the objective function should be monitored or intermediate
the layout of the communication network by finding a balance
layouts be visually evaluated to determine a satisfactory result.
between the closeness of nodes to semantic context and densely
connected neighbourhoods a node belongs to. Therefore, for Post Processing. Lastly, we use the post processing stage to
M
each individual p j ∈ P we use linear regression to fit a line m pj enhance the readability of our visualisation. Densely connected
though all two-dimensional vectors of emails he or she has sent communities in the graph are potentially hard to read, thus we
or received. As a node is placed near this line, it remains in a apply edge bundling [1] to visually clear latent structures. We
semantically good position. also apply MapSets [7] to separate the regions for each commu-
Adjusting the Network Layout. The first stage of our proposed nity. Since semantically similar emails may appear in different
algorithm produces a fixed document landscape and roughly fits communities, we apply colouring based on clusters in the original
the communication network on top. We now aim to incrementally global document embedding space to retain this aspect. Choosing
adapt the layout of the graph to better reflect salient structures the colours depends on the number of latent topics that should
in the network while keeping each individual’s node close to the be depicted [24]. If the topic number exceeds 25-30 topics, group-
reflective semantic area in the landscape. ing topics and allowing for zooming within a two-level topic-
Therefore we define a score quantifying how well the current hierarchy ensures distinguishable colors for up to 10 subtopics
layout fits these objectives: (25 × 10 = 25). In order to represent temporal aspects of the data,
X " X #
we calculate the kernel density of the document landscape for
M
η d (pi , m pi ) + θ s ij − d (p ,
i jp ) (1) fixed time-intervals, which can be used to add heat-map overlays
pi ∈P p j ∈P that users can select later on.
106
Figure 3: Semantic landscape of email contents and dominant communication patterns (drawn mockup)
4 CONCLUSION AND VISION [5] Mark Coddington. 2015. Clarifying journalismâĂŹs quantitative turn: A typol-
ogy for evaluating data journalism, computational journalism, and computer-
In this paper, we described an algorithm to lay out a commu- assisted reporting. Digital Journalism 3, 3 (2015), 331–348.
nication network on top of a landscape of semantically embed- [6] Tommy Dang and Angus Forbes. 2017. CactusTree: A tree drawing approach
for hierarchical edge bundling. In Proc. of the Pacific Visualization Symposium.
ded emails. This is still work in progress, thus 3 shows only a IEEE, 210–214.
manually drawn mock-up of the visualisation we envision. In it, [7] Alon Efrat, Yifan Hu, Stephen G Kobourov, and Sergey Pupyrev. 2015. MapSets:
individuals are represented as nodes positioned such that densely Visualizing Embedded and Clustered Graphs. Journal of Graph Algorithms
and Applications 19, 2 (2015), 571–593.
connected communities are visually clustered. Edges describe the [8] Blaz Fortuna, Marko Grobelnik, and Dunja Mladenic. 2005. Visualization of
email traffic, where the opacity and thickness is used to indicate text document corpus. Informatica 29, 4 (2005), 497–502.
the frequency of messages between the nodes they connect. [9] Katrin Franke and Sargur N Srihari. 2007. Computational forensics: Towards
hybrid-intelligent crime investigation. In International Symposium on Infor-
The semantic representations of emails are used to place dots mation Assurance and Security. IEEE, 383–386.
on a background layer which we call the document landscape. [10] Daniel Fried and Stephen G Kobourov. 2014. Maps of computer science. In
Proc. of the Pacific Visualization Symposium. IEEE, 113–120.
This landscape is used as additional input to the graph layout [11] Martin Gronemann and Michael Jünger. 2012. Drawing clustered graphs as
algorithm, aiming to place a node within corresponding semantic topographic maps. In Proc. of the Symposium on Graph Drawing and Network
regions. The colouring of regions in the landscape is derived from Visualization. Springer, 426–438.
[12] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning
densely connected communities in the communication graph. Op- for networks. In Proc. of the Conference on Knowledge Discovery and Data
tionally, representative words are selected for densely populated Mining. ACM, 855–864.
areas in the landscape, so that users get a rough idea about sub- [13] Jan Hildenbrand, Arlind Nocaj, and Ulrik Brandes. 2016. Flexible Level-of-
Detail Rendering for Large Graphs. In Proc. of the Symposium on Graph Draw-
jects in that area. The aforementioned timestamps of emails can ing and Network Visualization. Springer, 625–627.
be used to generate a heatmap overlay to show the activity in a [14] Mukundan Karthik, Mariappan Marikkannan, and Arputharaj Kannan. 2008.
An intelligent system for semantic information retrieval information from
certain time interval which is controlled by a slider. Similar to textual web documents. In International Workshop on Computational Forensics.
modern geographical maps, zooming into a region reveals more Springer, 135–146.
details. In our case, less prominent individuals and their connec- [15] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences
and documents. In Proc. of the International Conference on Machine Learning.
tions are shown along with additional salient phrases from the PMLR, 1188–1196.
document landscape. Selecting a node will not only highlight con- [16] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using
nected edges but may also temporarily show more edges which t-SNE. Journal of Machine Learning Research 9, 11 (2008), 2579–2605.
[17] Patrick Cheong-Iao Pang, Robert P Biuk-Aghai, Muye Yang, and Bin Pang.
were previously hidden at that zoom level. The user will also be 2017. Creating realistic map-like visualisations: Results from user studies.
able to retrieve documents with the help of a selection rectangle Journal of Visual Languages and Computing 43 (2017), 60–70.
[18] Arnaud Sallaberry, Yang-chih Fu, Hwai-Chung Ho, and Kwan-Liu Ma. 2016.
or clicking dots in the document landscape. Contact trees: Network visualization beyond nodes and edges. PLOS ONE 11,
In future work, we hope to evaluate this system using full- 1 (2016), 1–23.
scale real-world data as well as practitioners from journalism and [19] Shilad Sen, Anja Beth Swoap, Qisheng Li, Brooke Boatman, Ilse Dippenaar,
Rebecca Gold, Monica Ngo, Sarah Pujol, Bret Jackson, and Brent Hecht. 2017.
auditing. It may also be interesting to experiment with embedding Cartograph: Unlocking Spatial Visualization Through Semantic Enhancement.
methods, which take both the emails and the network graph as In Proc. of Conference on Intelligence User Interfaces. ACM, 179–190.
input and directly project the inferred representations into the [20] Kaveri Subrahmanyam, Stephanie M Reich, Natalia Waechter, and Guadalupe
Espinoza. 2008. Online and offline social networks: Use of social networking
two-dimensional landscape to simplify the proposed algorithm. sites by emerging adults. Journal of Applied Developmental Psychology 29, 6
(2008), 420–433.
[21] Jun Tao, Jian Xu, Chaoli Wang, and Nitesh V Chawla. 2017. HoNVis: Visualiz-
ing and Exploring Higher-Order Networks. In Proc. of the Pacific Visualization
REFERENCES Symposium. IEEE, 1–10.
[1] Benjamin Bach, Nathalie Henry Riche, Christophe Hurter, Kim Marriott, and [22] Furu Wei, Shixia Liu, Yangqiu Song, Shimei Pan, Michelle X Zhou, Weihong
Tim Dwyer. 2017. Towards unambiguous edge bundling: Investigating con- Qian, Lei Shi, Li Tan, and Qiang Zhang. 2010. TIARA: a visual exploratory
fluent drawings for network visualization. Transactions on Visualization and text analytic system. In Proc. of the Conference on Knowledge Discovery and
Computer Graphics 23, 1 (2017), 541–550. Data Mining. ACM, 153–162.
[2] Nikos Bikakis and Timos Sellis. 2016. Exploration and visualization in the [23] Jaewon Yang and Jure Leskovec. 2014. Overlapping communities explain
web of big linked data: A survey of the state of the art. In Proceedings of the core–periphery publisher of networks. Proc. IEEE 102, 12 (2014), 1892–1902.
Workshops of the EDBT/ICDT 2016 Joint Conference, Vol. 1558. CEUR-WS.org. [24] Achim Zeileis, Kurt Hornik, and Paul Murrell. 2009. Escaping RGBland: se-
[3] Marie-Anne Chabin. 2017. Panama papers: a case study for records manage- lecting colors for statistical graphics. Computational Statistics & Data Analysis
ment? Brazilian Journal of Information Science: Research Trends 11, 4 (2017). 53, 9 (2009), 3259–3270.
[4] Yanhua Chen, Lijun Wang, Ming Dong, and Jing Hua. 2009. Exemplar-based [25] Zhaocheng Zhu and Junfeng Hu. 2017. Context Aware Document Embedding.
visualization of large document corpus. Transactions on Visualization and CoRR abs/1707.01521 (2017).
Computer Graphics 15, 6 (2009), 1161–1168.
107