=Paper= {{Paper |id=Vol-2083/paper-16 |storemode=property |title=Topic-aware Network Visualisation to Explore Large Email Corpora |pdfUrl=https://ceur-ws.org/Vol-2083/paper-16.pdf |volume=Vol-2083 |authors=Tim Repke,Ralf Krestel |dblpUrl=https://dblp.org/rec/conf/edbt/RepkeK18 }} ==Topic-aware Network Visualisation to Explore Large Email Corpora== https://ceur-ws.org/Vol-2083/paper-16.pdf
                                 Topic-aware Network Visualisation to
                                     Explore Large Email Corpora
                                  Tim Repke                                                                     Ralf Krestel
                          Hasso Plattner Institute                                                         Hasso Plattner Institute
                           Potsdam, Germany                                                                 Potsdam, Germany
                            tim.repke@hpi.de                                                                ralf.krestel@hpi.de

ABSTRACT                                                                                  one person as the root, thus an overview of the entire network
Nowadays, more and more large datasets exhibit an intrinsic                               is lost. CactusTrees [6] on the other hand represent hierarchical
graph structure. While there exist special graph databases to han-                        structures with the goal of untangling overlaid bundles of inter-
dle ever increasing amounts of nodes and edges, visualising this                          secting edges, making distant connections more apparent. As
data becomes infeasible quickly with growing data. In addition,                           higher order dependencies may get lost in traditional visualisa-
looking at its structure is not sufficient to get an overview of a                        tions, HoNVis [21] adds nodes to encode dependencies in chains
graph dataset. Indeed, visualising additional information about                           of interactions. Usually, a communication network has many
nodes or edges without cluttering the screen is essential. In this                        nodes and overlapping connections already, so Yang et al. [23]
paper, we propose an interactive visualisation for social networks                        rather focus on discovering overlapping cores to improve the
that positions individuals (nodes) on a two-dimensional canvas                            identification of community boundaries to highlight global latent
such that communities defined by social links (edges) are easily                          structures. Similarly, Gronemann et al. [11] use the metaphor of
recognisable. Furthermore, we visualise topical relatedness be-                           islands and hills to visualise clustered graphs, making densely
tween individuals by analysing information about social links, in                         connected communities clearly noticeable. The edges are bundled
our case email communication. To this end, we utilise document                            and follow valleys of the resulting topology, thus making rela-
embeddings, which project the content of an email message into                            tionships between other communities hard to follow. MapSets [7]
a high dimensional semantic space and graph embeddings, which                             assume a graph that was laid out using embeddings reflecting
project nodes in a network graph into a latent space reflecting                           communities. An algorithm then draws regions around clusters
their relatedness.                                                                        of nodes, such that the bounding shapes are contiguous and
                                                                                          non-overlapping, but yet abstract. Another approach to visualise
                                                                                          networks at full scale is to aggregate nodes based on their spa-
1    INTRODUCTION                                                                         tial distribution and thereby allowing for a simple exploration
In our modern information society we produce substantial amounts                          with contour lines and heatmap overlays to emphasise latent
of data each day. A large portion of it comes from the communi-                           structures as proposed by Hildenbrand et al. [13].
cation on social media platforms or through emails. Special graph                             Document visualisation aims to visualise the content, such that
databases enable the efficient storage of these large communica-                          users gain quick insights into topics, latent phrases, or trends.
tion networks and provide interfaces to query or analyse the data.                        Tiara [22] extracts topics and derives time-sensitive keywords
Visualising networks in their entirety on the other hand is a very                        to depict evolving subjects over time as stacked plots. Other
challenging task. Users investigating a communication network                             approaches project documents into a latent space, using topic
want to find information about when does who communicate                                  models or embeddings. Creating scatter-plots of embedded docu-
with whom about what. These kind of networks can be found in                              ments of a large corpus may result in a very dense and unclear
many different shapes. Modern social networks, such as Twitter                            layout, so Chen et al. [4] developed an algorithm to reduce over-
or Facebook exhibit similar structures as classic, offline social                         full visualisations by picking representative documents. A dif-
networks [20]. We investigate another type of social network: a                           ferent approach is taken by Fortuna et al. [8], who do not show
collection of emails.                                                                     documents directly, but generate a heatmap of the populated
   Given the communication data over a year or more, it is prac-                          canvas and overlay it with salient phrases at more densely popu-
tically impossible to gain an overview or quick insights into                             lated areas from the underlying documents in that region. Friedl
the latent network structure with a basic approach as shown in                            et al. [10] extend that concept by drawing clear lines between
1. Also, in such a traditional network visualisation, information                         regions and colouring them. They also add edges between salient
about the content of messages sent between individuals is lost.                           phrases based on co-occurrences in the texts. Most recently Car-
Besides these traditional systems, more exotic approaches use                             tograph [19] was proposed, which is visually very similar to
the metaphor of geographical maps [17] to visualise networks,                             previous approaches, but uses pre-rendered information of dif-
for example using topology to reflect connectivity of densely                             ferent resolution and map technology to enable a responsive
connected social communities. The map analogy can also be used                            interactive visualisation. Regions are coloured based on underly-
to visualise the contents of documents by embedding them into                             ing ontologies from a knowledge-base.
a high dimensional semantic space [15] and projecting it on the                               Our goal is to merge approaches for network and document
map as a document landscape. In order to highlight how relation-                          visualisations in one interactive user interface. This means to
ships form and change based on the interactions, the metaphor                             integrate multiple dimensions of email datasets including time,
of a growing tree ca be used (ContactTrees [18]). Although this                           interactions, users and topics into a 2D map representation. Giv-
reflects temporal aspects of dynamic networks well, it focuses on                         ing an overview over latent structures and topics in one map
© 2018 Copyright held by the owner/author(s). Published in the Workshop                   may significantly improve the exploration of a corpus by users
Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna,
Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted
under the terms of the Creative Commons license CC-by-nc-nd 4.0.
                                                                                     1




                                                                                    104
          Figure 1: Traditional basic visualisation of a communication graph from 2000 emails using force layout


unfamiliar with the domain and terminology. Also domain ex-                  of documents without any prior knowledge about its content and
perts could benefit from such an overview, e.g. by easily being              individuals involved.
able to identify global patterns in the data.                                   In our system, we use the names and email addresses of senders
   A specific application scenario that could benefit from such              and recipients (individuals), communication network, semantic
integrated, interactive visualisations is the analysis of large, un-         vector representations of email contents, and as part of an overlay
structured, heterogeneous data collections. Data-driven journal-             the timestamps of emails and propose a graph layout over a
ism [5] often has to deal with leaked, unstructured, very hetero-            document landscape that visually describes who talks with whom
geneous data, e.g. in the context of the Panama Papers, where                about what at a given time period.
journalists needed to untangle and order huge amounts of infor-
mation, search entities, and visualise found patterns [3]. Similar           3    SYSTEM ARCHITECTURE
datasets are of interest in the context of computational foren-
                                                                             Visualising communication networks in a topic-aware fashion to
sics [9]. Auditing firms and law enforcement need to sift through
                                                                             explore documents and salient structures is not straightforward
huge amounts of data to gather evidence of criminal activity,
                                                                             Different layout objectives may produce contradicting results and
often involving communication networks and documents [14].
                                                                             the challenges of processing big data need to be addressed [2]. In
                                                                             this section, we describe algorithmic approaches behind the sys-
2   INTERACTIVE VISUALISATION                                                tem we are working on. For a discussion of engineering aspects
Systems for document exploration largely vary in what they                   on how to store, serve, and render the map-like data, we refer to
display and how users interact with them. This depends partly                the Cartograph stack [19], as we will focus on the process how
on the available raw data, but also on information extracted from            to get the information that the map is generated from.
pre-processing or enrichment with external sources. 1 shows                      We visualise the embedded emails as dots in a two-dimensional
a basic visualisation of the network graph extracted from an                 landscape in which individuals are placed as nodes connected by
email corpus. Although it is an improvement over only listing                edges. All emails between two individuals are reduced into one
connections, large densely connected graphs quickly become                   edge reducing visual complexity and making it easier to detect
hard to read and information about the email contents is lost.               salient structures. However, that comes with the trade-off that
   Exploring document collections can be seen as a top-down                  nodes and edges cannot be perfectly placed in the landscape to
approach, where the system provides abstract overviews of the                cover all semantic aspects of the communication between them,
entire document collection and users incrementally refine the                but rather an estimate. Our very early prototype placed some
search, narrowing the results to just a few documents of interest.           individuals with no dominant topic in a crowded area in the
Such a top-down approach may help users without prior knowl-                 centre of the landscape as shown in 2, where colours of opaque
edge to get a sense for the data by visualising high level latent            dots for emails correspond to that of the sender. Although the
structures of communication networks or the topical distribu-                network visualisation at this point does not make connections
tions.                                                                       more clear than in 1, users can already distinguish individuals
   In the scope of this work we primarily consider documents to              with similar or unrelated topics.
be emails or data attached to them. The sender, recipients, time,                Our proposed algorithm to find a stable network layout has
and content can directly be extracted from the raw data. We call             three stages, namely an (i) initialisation phase which creates
these – and results from further processing – dimensions that can            the landscape and roughly places nodes and connections, an
be visualised. From the contents one may infer named entities,               (ii) update phase which iteratively updates the node placement
topics, embeddings, or salient phrases, while the communication              towards a better fit, and finally a (iii) post processing phase where
network spanned by sender-recipient pairs can be used to detect              edges become splines to make latent structures more clear and a
salient structures and hierarchies. The temporal information                 map topology is added.
enables the previously mentioned data to be analysed over time
to detect evolving or changing patterns.                                        Initialising the Landscape. To generate the document land-
   There are numerous ways to visualise each dimension on                    scape, we first process the network graph to roughly determine
its own or in combination with others. The requirement of a                  regions, where documents will be placed. Therefore we apply
dimension and its priority in a visualisation is dictated by the             node2vec [12] to the communication network and embed each in-
system objective. From the wide range of possibilities, we strive            dividual’s node. We separate the graph into communities Pi ⊂ P
for a system which supports the exploration of a large collection            using the kernel density of the resulting populated space at




                                                                       105
                                                                                                645-645




                                                       1301-1301

                                                                                                                                                              951-951



                                                                                                                         1708-1708


                                                                   474-474
                                                                                                                                                                                                                                        1033-1033                                                1193-1193



                                                                                                                                                                                                              2093-2093
                                                    2256-2256
                                                                                                                                                                                                                                                                                 1362-1362                                              1719-1719




                                                                                                  878-878                                                                                                                                                                                                   1553-1553
                                                                                      1730-1730


                                                                                                                                                                                                                                                       2370-2370
                                                                                                                                                                                                                                                                                                                                                                                                                       1603-1603


                                                                                                                                                                                                                                                                          1143-1143
                                                                                                                                                                                     2012-2012                                                                     1276-1276                                                                                                                                    1768-1768
        995-995                                                                                                                                                                                                                                                                                                                                                                       1166-1166
                                                                                                                                                                                                                       1263-1263                                                                                                                                                                                                                                                                                1689

                                                                                                                                                                                                                                                                                                                             209-209                    1110-1110

                                                                                                                                                                                                                                                                                                                                               2490-2490                                          127-127
                                                                                                                                         676-676                                                                                                                                                       725-725
                                                                                                                                                                                                                                                                                                                                                                    1195-1195
                                                                                                                                                                                                                                                                                       722-722
                                                                                                                                                                                                                                                                                           1120-1120                                                                            794-794

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                580-580                          374
                                                                                                                                                                                                                                                                                                                                  848-848                                                                                            1402-1402
                                                                                                                                                                                                                                                                                                           118-118
                                                                                                                                                                                                        1119-1119
                                                                                                                                                                                          34-34
                                                                                                                                                                                                                                                                                                                                                                                                                                                                       2208-2208
                                                                        641-641                                                                                                                                                                                            976-976
                                                                                                                                                                                                                              825-825                         1452-1452
                                                                                                                                                                                                                                                               768-768
                                                                                                                                     2577-2577                                                                                                                                                                                                                                                                                                                                                              2622-262
                                                       1581-1581                                                                                                                                                                                                                                                                                                                                                                                                               2523-2523
                     1652-1652
                                                                                                                                                                                                                                                                                                   2198-2198
                                                                                                                                                                                                                                                                               210-210                                                                                                                        1770-1770
                                                                                                                                                                                                                                                                                                                                                                                                              1515-1515
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       1717-1717
                                                                                                                                                                                                                                                     1076-1076                                                                       509-509
                                                                                                                                                                                                                                                                                                                                     1949-1949
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           970-970
                                                                                                                                                 2375-2375                                                                                                                      1956-1956                                                                                                                                                                                                       2648-2648
                                                                                                                                                                                                                                                                                                                2419-2419                                                            1993-1993
                                                                                                                                                                                                                                                                                                                                                                                                                                                                       85-85
                                                                                                                                                                                                                                                                                                                                                            442-442
                                                                                                                                                                                                                                                                                                                                                                                                          124-124                2250-2250                                 1764-1764
                                               1920-1920                                                                                                                                                                                                                                       2183-2183
                                                                                                                                                                                                                                                                                                                                                                                                                                                                               2665-2665
                                                                                                                                                                                                                                                                                                                     2092-2092
                                          111-111                                                                                                                                                                                                                                                                                                                                                        525-525
                                                                                                                                                                                                                                                                                                                                                                                                         2450-2450
                                                                                                                                                                                                                                                                                                                                            2412-2412                                                                            63-63
                                                                                                                                                                                                                                                                                                                                                                         2629-2629          1027-1027
                                                                                                                                                                                                                                                                                                                                                                2281-2281                                                                                                           1352-1352


        Figure 2: Rendered prototype output after landscape initialisation without prior community segmentation
                                                                                                            448-448                                                                                                                                                                                                                                                                                                                                1490-1490
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 1941-1941
                                                                                                       2253-2253                                                                                                                                                                                                                                                                                                                                           1897-1897                 2520-2520
                                                                                                                                                                                                                                                                               1104-1104
                                                                                                                                                                                                                    978-978                                                                                                                                                                                                                                   2600-2600
                                                                                                                      307-307
                                                                                                                                                                                                                                                                                                                                                                       1100-1100                                                                          2283-2283
                                                                                                                                                                                                                                                                                                                                                                                          2070-2070 619-619
                                                                                             667-667                                                                                                                                                                                                                                                                                                                                                                   1703-1703
                                                                                                                                                                                                                                                                                                                                                                                                                                                                             1454-1454
                                                                                                                                                                                                                                                                                                                                                                                   1648-1648             1514-1514                                                                  964-964
                                                                                                                                       727-727                                                                                                                                                                                                                                                      752-752                      2460-2460
                                                                                                                                                                                                                                                                                                                            2405-2405               1192-1192                                                                                                528-528
                                                                                                                                                                                                  1380-1380                                                                                                                                                                                                                                           1502-1502
                                                                                                                                                                                                                                                      1615-1615                                                                                                                       1498-1498                                                                            95-952302-2302
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               145-145
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  20



threshold κ, where a higher κ results in more, but smaller com-                                                                                               where d (·, ·) is the distance between two nodes (zero if no con-
                                                                                                                                                                                                                                                                                                                953-953                                        1876-1876                                     1096-1096
                                                                                                                                                                                                                                                                                               1387-1387                                                   1134-1134     1323-1323

                                                                                                                                                                                                       2163-2163
                                                                                                                                                                                                    2303-2303                  1481-1481                                        787-787                                                                                                                                                       2487-2487



munities. For each community Pi , pairwise neighbourhood simi-                                                                                                nection exists) or shortest distance from a node to its ideal line.
                                                                                                                                                                         1369-1369
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   231-231
                                                                                                                                                                                                                                                                                                                        519-519                                 2185-2185
                                                                                                                                                                                                                                                                                                                                                                                                                                                                               2440-2440                    1474-147
                                                                                                                                                                                                                                                  2680-2680
                                                                                                                                                                                                                                                                                                                                                                                                    1890-1890               267-267                                                               844-844



larities are calculated using euclidean distance between nodes,                                                                                               To adjust the layout towards either a better semantic or structural
                                                                                                                                                                                                                                                                                                                                                                                     1257-1257                                                                                                       2411-2411
                                                                                                                                                                                                                                                                 1059-1059                                                                                                                                                                                                                                 753-753
                                                                                                                                                                                                                                                                                                                                                                                                                                         312-312                                    2663-2663
                                                                                                                                                                        434-434                                                                                                                                                                                     253-253 33-33
                                                                                                                                                                                                                                                                                                                                  1425-1425
                                                                                                                                                                                                                                                                                                                                  1720-1720                                                                                                               1286-1286
                                                                                     14-14                                                        1684-1684                                                                                                                   175-175                                                                                                                                    1558-1558
                                                                                                                                                                                                                      1147-1147                                                                                  327-327
                                                                                                                                                                                                                                                                                                                                                                                                                    1736-1736



forming the triangular matrix Si , where skl is the similarity be-                                                                                            fit, we introduce parameters θ and η.
                                                                                                                                                                                                                        2168-2168                                                                                                                                                                                                                                                               2454-2454
                                                                                                                                                                                                                                                                                                       1578-1578                                                                                                                                                                             1631-1631
                                                                                                                                                                                                                                                                                                                                                  2625-2625                                                                                                                          1554-1554
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   1637-1637              2108
                                                                                                                                                                                                                                                                                                                                                                                                              1339-1339
                                                                                                                                                                                                                                                                                                                                  439-439                                       1599-1599                                                          520-520                                           2678-2678




tween pk , pl ∈ Pi .                                                                                                                                              In order to minimise 1, we use stochastic gradient descent. In
                                                                                                                                                                                                                                                                                                              2200-2200                                                                                                322-322                                                                      1683-1683
                                                                                                                                                                                                        1565-1565                                                                              1007-1007
                                                                                                                                                                                                                                                                                               1901-1901            218-218                                     2124-2124
                                                                                                                                                                                                                                               131-131 1171-1171
                                                                                                                                                                                                                                             2691-2691
                                                                                                                                                                                                                                                                                                                           539-539            1847-1847



   Furthermore, we train document embeddings [15, 25] on all                                                                                                  each iteration step, we can derive the direction and magnitude
                                                                                                                                                                                  45-45                                                                                                                                                                    875-875
                                                                                                                                                                                                                                                                                                                                                               1137-1137                                        108-108
                                                                                                                                                                                                                                        876-876                                      1673-1673
                                                                                                                                                                                                                                                                                     801-801                                                                                                                                                                                                       264-264
                                                                                                                                                                                                                                                                                                                                                                                                                                 704-704
                                                                                                                                                                                                                                                                                                                                                                                                                                 663-663
                                                                                                                                                                                                                                                                                                                            98-98                                           1068-1068
                                                                                                                                                                                                                                                                                                           191-191                                                          1701-1701
                                                                                                                                                 501-501                                                                                                                                                                                            1038-1038
                                                                                  904-904


emails and use them to infer high dimensional semantic vec-                                                                                                   each node should be moved towards a better semantic fit and
                                                                                                                                                                                                                                                                                                                                                                                                                     1071-1071
                                                                                   2096-2096

                                                                                                                                                                                                                                                                                                                                       2465-2465                                                     1478-1478           1072-1072
                                                                                                                                                                                                                                                                                                                                                                                                           2588-2588                                                                             294-294



tor representations. Let Mi be the set of emails that originated                                                                                              closer proximity to it’s neighbourhood in the network.                                                                                       511-511
                                                                                                                                                                                                                                                                                                                                   1704-1704
                                                                                                                                                                                                                                                                                                                                                                      2469-2469
                                                                                                                                                                                                                                                                                                                                                                          1078-1078




in community Pi . For each email m ∈ Mi , the dimensionality                                                                                                      The semantic gradient for p j ∈ P is defined by
                                                                                                                                                                                                                                                                                                            1405-1405                     983-983
                                                                                                                                                                                                                                                                                                                                            1573-1573                                          266-266
                                                                                                                                                                                                                                                                                                                                                                                               221-221
                                                                                                                                                                                                                                                                                                                                                                                               719-719
                                                                                                                                                                                                                                                                              732-732
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             176-176
                                                                                                                                                                                                                                                                                                                                                                                                                     2659-2659



is reduced using t-SNE [16], which retains possible semantic
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    1253-1253
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             1285-1285

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  2673-2673
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  1549-1549



clusterings of documents in the higher dimensional space. The                                                                                                                                                                           δ⃗js := (pm           m
                                                                                                                                                                                                                                                                                                                                                                                                                                      (2)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                213-
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                2030



                                                                                                                                                                                                                                                  j − p j ) ∥p j − p j ∥
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              2428-2
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             2620-26




resulting two-dimensional vectors are then placed as dots on the
map using the centre of embedded network communities as the                                                                                                   where pmj is the closest point on mM  p j to p j and ∥·∥ denotes the
respective origin, whereas the size is determined by the number                                                                                               euclidean norm, while neighbourhood gradient is defined by                                                             602-602




of related individuals.                                                                                                                                                           X                                 
   We also initialise communication network’s layout. Thereby,                                                                                                            δ⃗jn :=     (p j − pk ) ∥p j − pk ∥ − s jk            (3)
the staring position of a node representing an individual is de-                                                                                                                                                                    pk ∈P                        247-247




termined by the normalised sum of two-dimensional vectors of                                                                                                                                                                        p j ∼pk
all emails he or she has sent or received. This way, we implic-
                                                                                                                                                              where p j ∼ pk denotes that an edge exists between p j and pk .
itly group semantically related individuals into communities as
                                                                                                                                                                With the definitions in 2 and 3, we can formulate the update
frequent communication biases this normalised sum. Straight
                                                                                                                                                              vector δ⃗j for node p j ∈ P as
edges are added between the nodes if the respective individuals
exchanged emails. Note, that many edges may only represent                                                                                                                                                                                                                  
a small number of emails. Applying a variable threshold σ can                                                                                                                                                                                        δ⃗j := ξ θ δ⃗jn + ηδ⃗js                                                                                                                                                          (4)
reduce the computational load in later stages, as these edges will
                                                                                                                                                              where ξ is the learning rate and θ, η as before parameters to
not impact the overall layout very much. They can be added
                                                                                                                                                              weight between a better semantic or neighbourhood fit.
again as the user requests a detailed visualisation by zooming in
                                                                                                                                                                 Most likely, complex network structures might prevent the
or through other interactions.
                                                                                                                                                              stochastic gradient descent to find a stable minimum, so the score
   In the algorithm’s second stage, we iteratively try to improve
                                                                                                                                                              of the objective function should be monitored or intermediate
the layout of the communication network by finding a balance
                                                                                                                                                              layouts be visually evaluated to determine a satisfactory result.
between the closeness of nodes to semantic context and densely
connected neighbourhoods a node belongs to. Therefore, for                                                                                                       Post Processing. Lastly, we use the post processing stage to
                                                               M
each individual p j ∈ P we use linear regression to fit a line m pj                                                                                           enhance the readability of our visualisation. Densely connected
though all two-dimensional vectors of emails he or she has sent                                                                                               communities in the graph are potentially hard to read, thus we
or received. As a node is placed near this line, it remains in a                                                                                              apply edge bundling [1] to visually clear latent structures. We
semantically good position.                                                                                                                                   also apply MapSets [7] to separate the regions for each commu-
   Adjusting the Network Layout. The first stage of our proposed                                                                                              nity. Since semantically similar emails may appear in different
algorithm produces a fixed document landscape and roughly fits                                                                                                communities, we apply colouring based on clusters in the original
the communication network on top. We now aim to incrementally                                                                                                 global document embedding space to retain this aspect. Choosing
adapt the layout of the graph to better reflect salient structures                                                                                            the colours depends on the number of latent topics that should
in the network while keeping each individual’s node close to the                                                                                              be depicted [24]. If the topic number exceeds 25-30 topics, group-
reflective semantic area in the landscape.                                                                                                                    ing topics and allowing for zooming within a two-level topic-
   Therefore we define a score quantifying how well the current                                                                                               hierarchy ensures distinguishable colors for up to 10 subtopics
layout fits these objectives:                                                                                                                                 (25 × 10 = 25). In order to represent temporal aspects of the data,
            X "                     X                    #
                                                                                                                                                              we calculate the kernel density of the document landscape for
                           M
                 η d (pi , m pi ) +  θ s ij − d (p  ,
                                                   i jp )      (1)                                                                                            fixed time-intervals, which can be used to add heat-map overlays
             pi ∈P               p j ∈P                                                                                                                       that users can select later on.




                                                                                                                                       106
           Figure 3: Semantic landscape of email contents and dominant communication patterns (drawn mockup)


4    CONCLUSION AND VISION                                                                    [5] Mark Coddington. 2015. Clarifying journalismâĂŹs quantitative turn: A typol-
                                                                                                  ogy for evaluating data journalism, computational journalism, and computer-
In this paper, we described an algorithm to lay out a commu-                                      assisted reporting. Digital Journalism 3, 3 (2015), 331–348.
nication network on top of a landscape of semantically embed-                                 [6] Tommy Dang and Angus Forbes. 2017. CactusTree: A tree drawing approach
                                                                                                  for hierarchical edge bundling. In Proc. of the Pacific Visualization Symposium.
ded emails. This is still work in progress, thus 3 shows only a                                   IEEE, 210–214.
manually drawn mock-up of the visualisation we envision. In it,                               [7] Alon Efrat, Yifan Hu, Stephen G Kobourov, and Sergey Pupyrev. 2015. MapSets:
individuals are represented as nodes positioned such that densely                                 Visualizing Embedded and Clustered Graphs. Journal of Graph Algorithms
                                                                                                  and Applications 19, 2 (2015), 571–593.
connected communities are visually clustered. Edges describe the                              [8] Blaz Fortuna, Marko Grobelnik, and Dunja Mladenic. 2005. Visualization of
email traffic, where the opacity and thickness is used to indicate                                text document corpus. Informatica 29, 4 (2005), 497–502.
the frequency of messages between the nodes they connect.                                     [9] Katrin Franke and Sargur N Srihari. 2007. Computational forensics: Towards
                                                                                                  hybrid-intelligent crime investigation. In International Symposium on Infor-
   The semantic representations of emails are used to place dots                                  mation Assurance and Security. IEEE, 383–386.
on a background layer which we call the document landscape.                                  [10] Daniel Fried and Stephen G Kobourov. 2014. Maps of computer science. In
                                                                                                  Proc. of the Pacific Visualization Symposium. IEEE, 113–120.
This landscape is used as additional input to the graph layout                               [11] Martin Gronemann and Michael Jünger. 2012. Drawing clustered graphs as
algorithm, aiming to place a node within corresponding semantic                                   topographic maps. In Proc. of the Symposium on Graph Drawing and Network
regions. The colouring of regions in the landscape is derived from                                Visualization. Springer, 426–438.
                                                                                             [12] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning
densely connected communities in the communication graph. Op-                                     for networks. In Proc. of the Conference on Knowledge Discovery and Data
tionally, representative words are selected for densely populated                                 Mining. ACM, 855–864.
areas in the landscape, so that users get a rough idea about sub-                            [13] Jan Hildenbrand, Arlind Nocaj, and Ulrik Brandes. 2016. Flexible Level-of-
                                                                                                  Detail Rendering for Large Graphs. In Proc. of the Symposium on Graph Draw-
jects in that area. The aforementioned timestamps of emails can                                   ing and Network Visualization. Springer, 625–627.
be used to generate a heatmap overlay to show the activity in a                              [14] Mukundan Karthik, Mariappan Marikkannan, and Arputharaj Kannan. 2008.
                                                                                                  An intelligent system for semantic information retrieval information from
certain time interval which is controlled by a slider. Similar to                                 textual web documents. In International Workshop on Computational Forensics.
modern geographical maps, zooming into a region reveals more                                      Springer, 135–146.
details. In our case, less prominent individuals and their connec-                           [15] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences
                                                                                                  and documents. In Proc. of the International Conference on Machine Learning.
tions are shown along with additional salient phrases from the                                    PMLR, 1188–1196.
document landscape. Selecting a node will not only highlight con-                            [16] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using
nected edges but may also temporarily show more edges which                                       t-SNE. Journal of Machine Learning Research 9, 11 (2008), 2579–2605.
                                                                                             [17] Patrick Cheong-Iao Pang, Robert P Biuk-Aghai, Muye Yang, and Bin Pang.
were previously hidden at that zoom level. The user will also be                                  2017. Creating realistic map-like visualisations: Results from user studies.
able to retrieve documents with the help of a selection rectangle                                 Journal of Visual Languages and Computing 43 (2017), 60–70.
                                                                                             [18] Arnaud Sallaberry, Yang-chih Fu, Hwai-Chung Ho, and Kwan-Liu Ma. 2016.
or clicking dots in the document landscape.                                                       Contact trees: Network visualization beyond nodes and edges. PLOS ONE 11,
   In future work, we hope to evaluate this system using full-                                    1 (2016), 1–23.
scale real-world data as well as practitioners from journalism and                           [19] Shilad Sen, Anja Beth Swoap, Qisheng Li, Brooke Boatman, Ilse Dippenaar,
                                                                                                  Rebecca Gold, Monica Ngo, Sarah Pujol, Bret Jackson, and Brent Hecht. 2017.
auditing. It may also be interesting to experiment with embedding                                 Cartograph: Unlocking Spatial Visualization Through Semantic Enhancement.
methods, which take both the emails and the network graph as                                      In Proc. of Conference on Intelligence User Interfaces. ACM, 179–190.
input and directly project the inferred representations into the                             [20] Kaveri Subrahmanyam, Stephanie M Reich, Natalia Waechter, and Guadalupe
                                                                                                  Espinoza. 2008. Online and offline social networks: Use of social networking
two-dimensional landscape to simplify the proposed algorithm.                                     sites by emerging adults. Journal of Applied Developmental Psychology 29, 6
                                                                                                  (2008), 420–433.
                                                                                             [21] Jun Tao, Jian Xu, Chaoli Wang, and Nitesh V Chawla. 2017. HoNVis: Visualiz-
                                                                                                  ing and Exploring Higher-Order Networks. In Proc. of the Pacific Visualization
REFERENCES                                                                                        Symposium. IEEE, 1–10.
 [1] Benjamin Bach, Nathalie Henry Riche, Christophe Hurter, Kim Marriott, and               [22] Furu Wei, Shixia Liu, Yangqiu Song, Shimei Pan, Michelle X Zhou, Weihong
     Tim Dwyer. 2017. Towards unambiguous edge bundling: Investigating con-                       Qian, Lei Shi, Li Tan, and Qiang Zhang. 2010. TIARA: a visual exploratory
     fluent drawings for network visualization. Transactions on Visualization and                 text analytic system. In Proc. of the Conference on Knowledge Discovery and
     Computer Graphics 23, 1 (2017), 541–550.                                                     Data Mining. ACM, 153–162.
 [2] Nikos Bikakis and Timos Sellis. 2016. Exploration and visualization in the              [23] Jaewon Yang and Jure Leskovec. 2014. Overlapping communities explain
     web of big linked data: A survey of the state of the art. In Proceedings of the              core–periphery publisher of networks. Proc. IEEE 102, 12 (2014), 1892–1902.
     Workshops of the EDBT/ICDT 2016 Joint Conference, Vol. 1558. CEUR-WS.org.               [24] Achim Zeileis, Kurt Hornik, and Paul Murrell. 2009. Escaping RGBland: se-
 [3] Marie-Anne Chabin. 2017. Panama papers: a case study for records manage-                     lecting colors for statistical graphics. Computational Statistics & Data Analysis
     ment? Brazilian Journal of Information Science: Research Trends 11, 4 (2017).                53, 9 (2009), 3259–3270.
 [4] Yanhua Chen, Lijun Wang, Ming Dong, and Jing Hua. 2009. Exemplar-based                  [25] Zhaocheng Zhu and Junfeng Hu. 2017. Context Aware Document Embedding.
     visualization of large document corpus. Transactions on Visualization and                    CoRR abs/1707.01521 (2017).
     Computer Graphics 15, 6 (2009), 1161–1168.




                                                                                       107