Topic-aware Network Visualisation to Explore Large Email Corpora Tim Repke Ralf Krestel Hasso Plattner Institute Hasso Plattner Institute Potsdam, Germany Potsdam, Germany tim.repke@hpi.de ralf.krestel@hpi.de ABSTRACT one person as the root, thus an overview of the entire network Nowadays, more and more large datasets exhibit an intrinsic is lost. CactusTrees [6] on the other hand represent hierarchical graph structure. While there exist special graph databases to han- structures with the goal of untangling overlaid bundles of inter- dle ever increasing amounts of nodes and edges, visualising this secting edges, making distant connections more apparent. As data becomes infeasible quickly with growing data. In addition, higher order dependencies may get lost in traditional visualisa- looking at its structure is not sufficient to get an overview of a tions, HoNVis [21] adds nodes to encode dependencies in chains graph dataset. Indeed, visualising additional information about of interactions. Usually, a communication network has many nodes or edges without cluttering the screen is essential. In this nodes and overlapping connections already, so Yang et al. [23] paper, we propose an interactive visualisation for social networks rather focus on discovering overlapping cores to improve the that positions individuals (nodes) on a two-dimensional canvas identification of community boundaries to highlight global latent such that communities defined by social links (edges) are easily structures. Similarly, Gronemann et al. [11] use the metaphor of recognisable. Furthermore, we visualise topical relatedness be- islands and hills to visualise clustered graphs, making densely tween individuals by analysing information about social links, in connected communities clearly noticeable. The edges are bundled our case email communication. To this end, we utilise document and follow valleys of the resulting topology, thus making rela- embeddings, which project the content of an email message into tionships between other communities hard to follow. MapSets [7] a high dimensional semantic space and graph embeddings, which assume a graph that was laid out using embeddings reflecting project nodes in a network graph into a latent space reflecting communities. An algorithm then draws regions around clusters their relatedness. of nodes, such that the bounding shapes are contiguous and non-overlapping, but yet abstract. Another approach to visualise networks at full scale is to aggregate nodes based on their spa- 1 INTRODUCTION tial distribution and thereby allowing for a simple exploration In our modern information society we produce substantial amounts with contour lines and heatmap overlays to emphasise latent of data each day. A large portion of it comes from the communi- structures as proposed by Hildenbrand et al. [13]. cation on social media platforms or through emails. Special graph Document visualisation aims to visualise the content, such that databases enable the efficient storage of these large communica- users gain quick insights into topics, latent phrases, or trends. tion networks and provide interfaces to query or analyse the data. Tiara [22] extracts topics and derives time-sensitive keywords Visualising networks in their entirety on the other hand is a very to depict evolving subjects over time as stacked plots. Other challenging task. Users investigating a communication network approaches project documents into a latent space, using topic want to find information about when does who communicate models or embeddings. Creating scatter-plots of embedded docu- with whom about what. These kind of networks can be found in ments of a large corpus may result in a very dense and unclear many different shapes. Modern social networks, such as Twitter layout, so Chen et al. [4] developed an algorithm to reduce over- or Facebook exhibit similar structures as classic, offline social full visualisations by picking representative documents. A dif- networks [20]. We investigate another type of social network: a ferent approach is taken by Fortuna et al. [8], who do not show collection of emails. documents directly, but generate a heatmap of the populated Given the communication data over a year or more, it is prac- canvas and overlay it with salient phrases at more densely popu- tically impossible to gain an overview or quick insights into lated areas from the underlying documents in that region. Friedl the latent network structure with a basic approach as shown in et al. [10] extend that concept by drawing clear lines between 1. Also, in such a traditional network visualisation, information regions and colouring them. They also add edges between salient about the content of messages sent between individuals is lost. phrases based on co-occurrences in the texts. Most recently Car- Besides these traditional systems, more exotic approaches use tograph [19] was proposed, which is visually very similar to the metaphor of geographical maps [17] to visualise networks, previous approaches, but uses pre-rendered information of dif- for example using topology to reflect connectivity of densely ferent resolution and map technology to enable a responsive connected social communities. The map analogy can also be used interactive visualisation. Regions are coloured based on underly- to visualise the contents of documents by embedding them into ing ontologies from a knowledge-base. a high dimensional semantic space [15] and projecting it on the Our goal is to merge approaches for network and document map as a document landscape. In order to highlight how relation- visualisations in one interactive user interface. This means to ships form and change based on the interactions, the metaphor integrate multiple dimensions of email datasets including time, of a growing tree ca be used (ContactTrees [18]). Although this interactions, users and topics into a 2D map representation. Giv- reflects temporal aspects of dynamic networks well, it focuses on ing an overview over latent structures and topics in one map © 2018 Copyright held by the owner/author(s). Published in the Workshop may significantly improve the exploration of a corpus by users Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna, Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0. 1 104 Figure 1: Traditional basic visualisation of a communication graph from 2000 emails using force layout unfamiliar with the domain and terminology. Also domain ex- of documents without any prior knowledge about its content and perts could benefit from such an overview, e.g. by easily being individuals involved. able to identify global patterns in the data. In our system, we use the names and email addresses of senders A specific application scenario that could benefit from such and recipients (individuals), communication network, semantic integrated, interactive visualisations is the analysis of large, un- vector representations of email contents, and as part of an overlay structured, heterogeneous data collections. Data-driven journal- the timestamps of emails and propose a graph layout over a ism [5] often has to deal with leaked, unstructured, very hetero- document landscape that visually describes who talks with whom geneous data, e.g. in the context of the Panama Papers, where about what at a given time period. journalists needed to untangle and order huge amounts of infor- mation, search entities, and visualise found patterns [3]. Similar 3 SYSTEM ARCHITECTURE datasets are of interest in the context of computational foren- Visualising communication networks in a topic-aware fashion to sics [9]. Auditing firms and law enforcement need to sift through explore documents and salient structures is not straightforward huge amounts of data to gather evidence of criminal activity, Different layout objectives may produce contradicting results and often involving communication networks and documents [14]. the challenges of processing big data need to be addressed [2]. In this section, we describe algorithmic approaches behind the sys- 2 INTERACTIVE VISUALISATION tem we are working on. For a discussion of engineering aspects Systems for document exploration largely vary in what they on how to store, serve, and render the map-like data, we refer to display and how users interact with them. This depends partly the Cartograph stack [19], as we will focus on the process how on the available raw data, but also on information extracted from to get the information that the map is generated from. pre-processing or enrichment with external sources. 1 shows We visualise the embedded emails as dots in a two-dimensional a basic visualisation of the network graph extracted from an landscape in which individuals are placed as nodes connected by email corpus. Although it is an improvement over only listing edges. All emails between two individuals are reduced into one connections, large densely connected graphs quickly become edge reducing visual complexity and making it easier to detect hard to read and information about the email contents is lost. salient structures. However, that comes with the trade-off that Exploring document collections can be seen as a top-down nodes and edges cannot be perfectly placed in the landscape to approach, where the system provides abstract overviews of the cover all semantic aspects of the communication between them, entire document collection and users incrementally refine the but rather an estimate. Our very early prototype placed some search, narrowing the results to just a few documents of interest. individuals with no dominant topic in a crowded area in the Such a top-down approach may help users without prior knowl- centre of the landscape as shown in 2, where colours of opaque edge to get a sense for the data by visualising high level latent dots for emails correspond to that of the sender. Although the structures of communication networks or the topical distribu- network visualisation at this point does not make connections tions. more clear than in 1, users can already distinguish individuals In the scope of this work we primarily consider documents to with similar or unrelated topics. be emails or data attached to them. The sender, recipients, time, Our proposed algorithm to find a stable network layout has and content can directly be extracted from the raw data. We call three stages, namely an (i) initialisation phase which creates these – and results from further processing – dimensions that can the landscape and roughly places nodes and connections, an be visualised. From the contents one may infer named entities, (ii) update phase which iteratively updates the node placement topics, embeddings, or salient phrases, while the communication towards a better fit, and finally a (iii) post processing phase where network spanned by sender-recipient pairs can be used to detect edges become splines to make latent structures more clear and a salient structures and hierarchies. The temporal information map topology is added. enables the previously mentioned data to be analysed over time to detect evolving or changing patterns. Initialising the Landscape. To generate the document land- There are numerous ways to visualise each dimension on scape, we first process the network graph to roughly determine its own or in combination with others. The requirement of a regions, where documents will be placed. Therefore we apply dimension and its priority in a visualisation is dictated by the node2vec [12] to the communication network and embed each in- system objective. From the wide range of possibilities, we strive dividual’s node. We separate the graph into communities Pi ⊂ P for a system which supports the exploration of a large collection using the kernel density of the resulting populated space at 105 645-645 1301-1301 951-951 1708-1708 474-474 1033-1033 1193-1193 2093-2093 2256-2256 1362-1362 1719-1719 878-878 1553-1553 1730-1730 2370-2370 1603-1603 1143-1143 2012-2012 1276-1276 1768-1768 995-995 1166-1166 1263-1263 1689 209-209 1110-1110 2490-2490 127-127 676-676 725-725 1195-1195 722-722 1120-1120 794-794 580-580 374 848-848 1402-1402 118-118 1119-1119 34-34 2208-2208 641-641 976-976 825-825 1452-1452 768-768 2577-2577 2622-262 1581-1581 2523-2523 1652-1652 2198-2198 210-210 1770-1770 1515-1515 1717-1717 1076-1076 509-509 1949-1949 970-970 2375-2375 1956-1956 2648-2648 2419-2419 1993-1993 85-85 442-442 124-124 2250-2250 1764-1764 1920-1920 2183-2183 2665-2665 2092-2092 111-111 525-525 2450-2450 2412-2412 63-63 2629-2629 1027-1027 2281-2281 1352-1352 Figure 2: Rendered prototype output after landscape initialisation without prior community segmentation 448-448 1490-1490 1941-1941 2253-2253 1897-1897 2520-2520 1104-1104 978-978 2600-2600 307-307 1100-1100 2283-2283 2070-2070 619-619 667-667 1703-1703 1454-1454 1648-1648 1514-1514 964-964 727-727 752-752 2460-2460 2405-2405 1192-1192 528-528 1380-1380 1502-1502 1615-1615 1498-1498 95-952302-2302 145-145 20 threshold κ, where a higher κ results in more, but smaller com- where d (·, ·) is the distance between two nodes (zero if no con- 953-953 1876-1876 1096-1096 1387-1387 1134-1134 1323-1323 2163-2163 2303-2303 1481-1481 787-787 2487-2487 munities. For each community Pi , pairwise neighbourhood simi- nection exists) or shortest distance from a node to its ideal line. 1369-1369 231-231 519-519 2185-2185 2440-2440 1474-147 2680-2680 1890-1890 267-267 844-844 larities are calculated using euclidean distance between nodes, To adjust the layout towards either a better semantic or structural 1257-1257 2411-2411 1059-1059 753-753 312-312 2663-2663 434-434 253-253 33-33 1425-1425 1720-1720 1286-1286 14-14 1684-1684 175-175 1558-1558 1147-1147 327-327 1736-1736 forming the triangular matrix Si , where skl is the similarity be- fit, we introduce parameters θ and η. 2168-2168 2454-2454 1578-1578 1631-1631 2625-2625 1554-1554 1637-1637 2108 1339-1339 439-439 1599-1599 520-520 2678-2678 tween pk , pl ∈ Pi . In order to minimise 1, we use stochastic gradient descent. In 2200-2200 322-322 1683-1683 1565-1565 1007-1007 1901-1901 218-218 2124-2124 131-131 1171-1171 2691-2691 539-539 1847-1847 Furthermore, we train document embeddings [15, 25] on all each iteration step, we can derive the direction and magnitude 45-45 875-875 1137-1137 108-108 876-876 1673-1673 801-801 264-264 704-704 663-663 98-98 1068-1068 191-191 1701-1701 501-501 1038-1038 904-904 emails and use them to infer high dimensional semantic vec- each node should be moved towards a better semantic fit and 1071-1071 2096-2096 2465-2465 1478-1478 1072-1072 2588-2588 294-294 tor representations. Let Mi be the set of emails that originated closer proximity to it’s neighbourhood in the network. 511-511 1704-1704 2469-2469 1078-1078 in community Pi . For each email m ∈ Mi , the dimensionality The semantic gradient for p j ∈ P is defined by 1405-1405 983-983 1573-1573 266-266 221-221 719-719 732-732 176-176 2659-2659 is reduced using t-SNE [16], which retains possible semantic 1253-1253 1285-1285 2673-2673 1549-1549 clusterings of documents in the higher dimensional space. The δ⃗js := (pm m (2) 213- 2030 j − p j ) ∥p j − p j ∥ 2428-2 2620-26 resulting two-dimensional vectors are then placed as dots on the map using the centre of embedded network communities as the where pmj is the closest point on mM p j to p j and ∥·∥ denotes the respective origin, whereas the size is determined by the number euclidean norm, while neighbourhood gradient is defined by 602-602 of related individuals. X   We also initialise communication network’s layout. Thereby, δ⃗jn := (p j − pk ) ∥p j − pk ∥ − s jk (3) the staring position of a node representing an individual is de- pk ∈P 247-247 termined by the normalised sum of two-dimensional vectors of p j ∼pk all emails he or she has sent or received. This way, we implic- where p j ∼ pk denotes that an edge exists between p j and pk . itly group semantically related individuals into communities as With the definitions in 2 and 3, we can formulate the update frequent communication biases this normalised sum. Straight vector δ⃗j for node p j ∈ P as edges are added between the nodes if the respective individuals exchanged emails. Note, that many edges may only represent   a small number of emails. Applying a variable threshold σ can δ⃗j := ξ θ δ⃗jn + ηδ⃗js (4) reduce the computational load in later stages, as these edges will where ξ is the learning rate and θ, η as before parameters to not impact the overall layout very much. They can be added weight between a better semantic or neighbourhood fit. again as the user requests a detailed visualisation by zooming in Most likely, complex network structures might prevent the or through other interactions. stochastic gradient descent to find a stable minimum, so the score In the algorithm’s second stage, we iteratively try to improve of the objective function should be monitored or intermediate the layout of the communication network by finding a balance layouts be visually evaluated to determine a satisfactory result. between the closeness of nodes to semantic context and densely connected neighbourhoods a node belongs to. Therefore, for Post Processing. Lastly, we use the post processing stage to M each individual p j ∈ P we use linear regression to fit a line m pj enhance the readability of our visualisation. Densely connected though all two-dimensional vectors of emails he or she has sent communities in the graph are potentially hard to read, thus we or received. As a node is placed near this line, it remains in a apply edge bundling [1] to visually clear latent structures. We semantically good position. also apply MapSets [7] to separate the regions for each commu- Adjusting the Network Layout. The first stage of our proposed nity. Since semantically similar emails may appear in different algorithm produces a fixed document landscape and roughly fits communities, we apply colouring based on clusters in the original the communication network on top. We now aim to incrementally global document embedding space to retain this aspect. Choosing adapt the layout of the graph to better reflect salient structures the colours depends on the number of latent topics that should in the network while keeping each individual’s node close to the be depicted [24]. If the topic number exceeds 25-30 topics, group- reflective semantic area in the landscape. ing topics and allowing for zooming within a two-level topic- Therefore we define a score quantifying how well the current hierarchy ensures distinguishable colors for up to 10 subtopics layout fits these objectives: (25 × 10 = 25). In order to represent temporal aspects of the data, X " X  # we calculate the kernel density of the document landscape for M η d (pi , m pi ) + θ s ij − d (p , i jp ) (1) fixed time-intervals, which can be used to add heat-map overlays pi ∈P p j ∈P that users can select later on. 106 Figure 3: Semantic landscape of email contents and dominant communication patterns (drawn mockup) 4 CONCLUSION AND VISION [5] Mark Coddington. 2015. Clarifying journalismâĂŹs quantitative turn: A typol- ogy for evaluating data journalism, computational journalism, and computer- In this paper, we described an algorithm to lay out a commu- assisted reporting. Digital Journalism 3, 3 (2015), 331–348. nication network on top of a landscape of semantically embed- [6] Tommy Dang and Angus Forbes. 2017. CactusTree: A tree drawing approach for hierarchical edge bundling. In Proc. of the Pacific Visualization Symposium. ded emails. This is still work in progress, thus 3 shows only a IEEE, 210–214. manually drawn mock-up of the visualisation we envision. In it, [7] Alon Efrat, Yifan Hu, Stephen G Kobourov, and Sergey Pupyrev. 2015. MapSets: individuals are represented as nodes positioned such that densely Visualizing Embedded and Clustered Graphs. Journal of Graph Algorithms and Applications 19, 2 (2015), 571–593. connected communities are visually clustered. Edges describe the [8] Blaz Fortuna, Marko Grobelnik, and Dunja Mladenic. 2005. Visualization of email traffic, where the opacity and thickness is used to indicate text document corpus. Informatica 29, 4 (2005), 497–502. the frequency of messages between the nodes they connect. [9] Katrin Franke and Sargur N Srihari. 2007. Computational forensics: Towards hybrid-intelligent crime investigation. In International Symposium on Infor- The semantic representations of emails are used to place dots mation Assurance and Security. IEEE, 383–386. on a background layer which we call the document landscape. [10] Daniel Fried and Stephen G Kobourov. 2014. Maps of computer science. In Proc. of the Pacific Visualization Symposium. IEEE, 113–120. This landscape is used as additional input to the graph layout [11] Martin Gronemann and Michael Jünger. 2012. Drawing clustered graphs as algorithm, aiming to place a node within corresponding semantic topographic maps. In Proc. of the Symposium on Graph Drawing and Network regions. The colouring of regions in the landscape is derived from Visualization. Springer, 426–438. [12] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning densely connected communities in the communication graph. Op- for networks. In Proc. of the Conference on Knowledge Discovery and Data tionally, representative words are selected for densely populated Mining. ACM, 855–864. areas in the landscape, so that users get a rough idea about sub- [13] Jan Hildenbrand, Arlind Nocaj, and Ulrik Brandes. 2016. Flexible Level-of- Detail Rendering for Large Graphs. In Proc. of the Symposium on Graph Draw- jects in that area. The aforementioned timestamps of emails can ing and Network Visualization. Springer, 625–627. be used to generate a heatmap overlay to show the activity in a [14] Mukundan Karthik, Mariappan Marikkannan, and Arputharaj Kannan. 2008. An intelligent system for semantic information retrieval information from certain time interval which is controlled by a slider. Similar to textual web documents. In International Workshop on Computational Forensics. modern geographical maps, zooming into a region reveals more Springer, 135–146. details. In our case, less prominent individuals and their connec- [15] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proc. of the International Conference on Machine Learning. tions are shown along with additional salient phrases from the PMLR, 1188–1196. document landscape. Selecting a node will not only highlight con- [16] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using nected edges but may also temporarily show more edges which t-SNE. Journal of Machine Learning Research 9, 11 (2008), 2579–2605. [17] Patrick Cheong-Iao Pang, Robert P Biuk-Aghai, Muye Yang, and Bin Pang. were previously hidden at that zoom level. The user will also be 2017. Creating realistic map-like visualisations: Results from user studies. able to retrieve documents with the help of a selection rectangle Journal of Visual Languages and Computing 43 (2017), 60–70. [18] Arnaud Sallaberry, Yang-chih Fu, Hwai-Chung Ho, and Kwan-Liu Ma. 2016. or clicking dots in the document landscape. Contact trees: Network visualization beyond nodes and edges. PLOS ONE 11, In future work, we hope to evaluate this system using full- 1 (2016), 1–23. scale real-world data as well as practitioners from journalism and [19] Shilad Sen, Anja Beth Swoap, Qisheng Li, Brooke Boatman, Ilse Dippenaar, Rebecca Gold, Monica Ngo, Sarah Pujol, Bret Jackson, and Brent Hecht. 2017. auditing. It may also be interesting to experiment with embedding Cartograph: Unlocking Spatial Visualization Through Semantic Enhancement. methods, which take both the emails and the network graph as In Proc. of Conference on Intelligence User Interfaces. ACM, 179–190. input and directly project the inferred representations into the [20] Kaveri Subrahmanyam, Stephanie M Reich, Natalia Waechter, and Guadalupe Espinoza. 2008. Online and offline social networks: Use of social networking two-dimensional landscape to simplify the proposed algorithm. sites by emerging adults. Journal of Applied Developmental Psychology 29, 6 (2008), 420–433. [21] Jun Tao, Jian Xu, Chaoli Wang, and Nitesh V Chawla. 2017. HoNVis: Visualiz- ing and Exploring Higher-Order Networks. In Proc. of the Pacific Visualization REFERENCES Symposium. IEEE, 1–10. [1] Benjamin Bach, Nathalie Henry Riche, Christophe Hurter, Kim Marriott, and [22] Furu Wei, Shixia Liu, Yangqiu Song, Shimei Pan, Michelle X Zhou, Weihong Tim Dwyer. 2017. Towards unambiguous edge bundling: Investigating con- Qian, Lei Shi, Li Tan, and Qiang Zhang. 2010. TIARA: a visual exploratory fluent drawings for network visualization. Transactions on Visualization and text analytic system. In Proc. of the Conference on Knowledge Discovery and Computer Graphics 23, 1 (2017), 541–550. Data Mining. ACM, 153–162. [2] Nikos Bikakis and Timos Sellis. 2016. Exploration and visualization in the [23] Jaewon Yang and Jure Leskovec. 2014. Overlapping communities explain web of big linked data: A survey of the state of the art. In Proceedings of the core–periphery publisher of networks. Proc. IEEE 102, 12 (2014), 1892–1902. Workshops of the EDBT/ICDT 2016 Joint Conference, Vol. 1558. CEUR-WS.org. [24] Achim Zeileis, Kurt Hornik, and Paul Murrell. 2009. Escaping RGBland: se- [3] Marie-Anne Chabin. 2017. Panama papers: a case study for records manage- lecting colors for statistical graphics. Computational Statistics & Data Analysis ment? Brazilian Journal of Information Science: Research Trends 11, 4 (2017). 53, 9 (2009), 3259–3270. [4] Yanhua Chen, Lijun Wang, Ming Dong, and Jing Hua. 2009. Exemplar-based [25] Zhaocheng Zhu and Junfeng Hu. 2017. Context Aware Document Embedding. visualization of large document corpus. Transactions on Visualization and CoRR abs/1707.01521 (2017). Computer Graphics 15, 6 (2009), 1161–1168. 107