Toward Visual Interactive Exploration of Heterogeneous Graphs Irène Burger Ioana Manolescu irene.burger@polytechnique.edu ioana.manolescu@inria.fr Institut Polytechnique de Paris, Inria Inria, Institut Polytechnique de Paris Palaiseau, France Palaiseau, France Emmanuel Pietriga Fabian Suchanek emmanuel.pietriga@inria.fr suchanek@telecom-paris.fr Univ. Paris-Saclay, CNRS, Inria Télécom Paris, Institut Polytechnique de Paris Orsay, France Palaiseau, France ABSTRACT An interesting class of heterogeneous datasets, encountered for instance in data journalism applications, results from the inter- connection of data sources of different data models, ranging from very structured (e.g., relational or graphs) to semistructured (e.g., JSON, HTML, XML) to completely unstructured (text). Such het- erogeneous graphs can be exploited e.g., by keyword search, to uncover connection between search keywords [1]. In this paper, we present a vision toward making such graphs easily comprehensible by human users, such as journalists seek- ing to understand and explore them. Our proposal is twofold: (i) abstracting the graph by recognizing structured entities; this Figure 1: Sample screen shot of ConnectionLens GUI [1] simplifies the graph without information loss; (ii) relying on showing an answer to a three-keyword query. data visualization techniques to help users grasp the graph con- tents. Our work in this area continues; we present preliminary encouraging results. As no single query language can be used on such heteroge- neous data, ConnectionLens supports keyword queries, asking for all the connections that exist, in one or across several data sources, 1 INTRODUCTION between these keywords. This problem has been raised by our col- Data journalists often have to handle sets of different data struc- laboration with Les Décodeurs, Le Monde’s fact-checking team, tures, which they obtain from official organizations or from their with whom we collaborate within the ContentCheck research sources, extract from social media, receive via email or create project. The novelty of ConnectionLens search wrt the literature themselves (typically Excel or Word-style) etc. For instance, jour- on keyword search in databases is to seek answers which may nalists from the Le Monde newspaper want to retrieve connections span over several datasets of different data models, with very dif- between elected people at Assemblée Nationale and companies that ferent or even absent internal structure (the latter is true for text have outposts outside of France; such a query can be answered cur- data). rently at a high human effort cost, by inspecting e.g. a JSON list For instance, Figure 1 shows an answer to the three-keyword of Assemblée elected officials (available from NosDeputes.fr) and query: {“Macron”, “Kohler”, “Costa”}. Different colors indicate manually connecting the names with those found in a national nodes from data sources of different data models (JSON, RDF and registry of companies. This huge effort may still miss connections text, respectively); the nodes filled with solid color are those that that could be found if one added information about politicians’ match the query keywords. Nodes are either derived from the and business people’s spouses, information sometimes available original dataset, e.g., a node for each tuple and attribute from a in public knowledge bases such as DBpedia, or in a journalists’ relational dataset, a node for each map or list in JSON, for each personal notes. node in an RDF graph etc., or they may be entity occurrences, ConnectionLens heterogeneous graphs. The ConnectionLens identified by a Named Entity Recognition module capable for project [1] aims to enable journalists to work with data sources as now to identify People, Organizations and Locations. We extract they come, and quickly due to the speed of the news cycle. This entity occurrences from any text, whether a document or a phrase precludes the time to understand, analyze, and extract the data or a name appearing in a JSON node. The dataset interconnection into a single unified data model. Instead, in ConnectionLens we is materialized by the two red edges labeled same-as between consider that the data model of a given dataset is “an accident” pairs of entity occurrences. related to its creation history, and should not be a barrier toward The problem: ConnectionLens graph exploration. While exploiting it. such a raw node-link diagram visualization allows users to find some results in heterogeneous ConnectionLens graphs, the sup- © 2020 Copyright for this paper by its author(s). Published in the Workshop Proceed- port they provide for exploring and understanding the graph to ings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, non-expert users, such as journalists, is quite limited. Denmark) on CEUR-WS.org. Use permitted under Creative Commons License At- tribution 4.0 International (CC BY 4.0) (1.) The visualization only shows the answer tree; the full graph is much larger, and a simple node-link diagram as this one does not scale beyond a dozen edges, as illustrated on a Con- between users and datasets, e.g., user Alice hasAuthored dataset nectionLens graph1 of barely more than 200 edges. This problem article1, on which user Bob commented etc. is common to any large data graph (not specific to Connection- (C) Containers are rather uninteresting. Here, containers denote: Lens) [6]. Many techniques have been developed to address this a table (or set of tuples) if the data is relational, or a JSON array. issue, going from a simple navigable node-centric GUI2 for a Intuitively, containers serve to group together several more-or- large RDF knowledge base [7], to more elaborate visualization less comparable “things”, such as albums of a singer, or members techniques, e.g., [4]. of a committee. The container node itself is less useful. (2.) While some ConnectionLens nodes are meaningful for (D) Rich entity nodes are desirable for data exploration. Here, human users, e.g., the entity occurrences recognized in Figure 1, an entity node (EN, in short) denotes an instance of an Entity, in others are not, as they merely correspond to internal containers, the traditional Entity-Relationship sense known from conceptual such as JSON maps or arrays, relational tuples etc., which do design. For instance, “the person François Ruffin, having the not carry human-understandable significance; examples are the birthplace Amiens and the twitter handle @François_Ruffin” is nodes labeled 2|1 and 2|1.3 in the figure. Showing these nodes at an EN. Observe that an EN is a larger and more complex notion the same level, therefore, is not appropriate. This problem is spe- than an entity occurrence currently identified by the extraction; cific to ConnectionLens: it originates in its very goal of connecting in particular, an entity occurrence is always a leaf (added as a information, no matter what format it comes from. child of the text node where the extractor found it), and has no In this paper, we outline a vision for dramatically increasing attributes. the usability of ConnectionLens graphs, in particular through As customary in the Entity-Relationship model, we assume novel interactive visualizations based on recent advances in the that some of the nodes surrounding an EN n, and reachable from n, area of expressive node-link diagram authoring [5] for multi- e.g., the twitter handle above, only serve to describe n and are not variate networks [3]. We identified two steps toward realizing standalone nodes. However, we depart from the classical notion that vision: graph abstraction (Section 2), and graph visualization of an Entity used in relational database design, by allowing ENs (Section 3).We present some preliminary results, and perspectives of a given type to have different sets of attributes, and/or to have for our work. more than one value for a given attribute. Note also that in some datasets, Amiens may also be an EN, e.g., if the dataset specifies 2 ABSTRACTING CONNECTIONLENS things about it such as its population, geographical coordinates GRAPHS etc.; in other datasets, e.g., one centered around individuals, or We present our analysis of the ways in which ConnectionLens around shops distributed across a country, Amiens may be just a graphs can be simplified, or abstracted, based on a set of obser- dimension characterizing the ENs (people, respectively, shops). vations made by journalists (from Le Monde and TF1) and data We believe ENs are useful paradigms for exploring the dataset journalists / data visualization professionals (from WeDoData, a because: French company) with whom we shared them. • They correspond to a natural paradigm of “things” character- ized by their properties (attributes); 2.1 Abstraction Principles • They group together pieces of information related to a common We present a set of guiding principles, which we identify by a resource, e.g., the twitter and Facebook handles of a given capital letter to be able to refer to them in the paper. person; (A) Entities are interesting. Users want to know who, or what, • They lay the foundation for many interesting visualizations, a given dataset, or multi-dataset graph, is about. It is natural to where attributes can be used e.g. to place shops on a map be interested in people (e.g., a politician or a businessperson), according to their location, or events on a timeline etc. organizations (e.g., an army or a company), or a place (e.g., the (E) Relationships between ENs are interesting. Here, we extend the city where one lives, or a country such as Panama). Depending on intuition that just like ENs, relationships are intuitive constructs the dataset and the application, of course, other kinds of entities that allow to structure and analyze a CL graph. For instance, it may be considered, e.g., a Web site, a Facebook or Twitter account, is interesting to find that Alice (an EN) supervised Bob (another a specific kind of organization (e.g., companies from Panama) EN). etc. Based on the above principles, below we describe an algorith- (B) Datasets may or may not be interesting. Conceptually, we mic approach which, starting from a ConnectionLens graph, (1) like to think of “data” as an abstract set of information, say, a identifies Entity Nodes, (2) assigns them attributes among the large graph. From a practical viewpoint, however, data comes nodes in their neighborhood, and (3) connects them through in datasets, which delimit “original subsets” of the global graph. relationships. The contours of a dataset are sometimes clear, e.g., a tweet, a Web page, or a speech; in other cases, they are more fuzzy. For instance, if a relational database holds three tables, should we view this as 2.2 Abstraction Algorithm one or three datasets? From a user perspective, one can choose to As an example, Figure 2 shows a ConnectionLens graph resulting ignore the datasets (make their boundaries invisible); this may be from a JSON document describing Nobel laureates. The red node appropriate, for instance, if in a social network graph, we want is the dataset; blue nodes are maps or arrays, while green nodes to focus just on connections between users and/or hash tags. Al- correspond to values (literals). Further, named entity occurrences ternatively, datasets can play a very significant role: (i) if entities identified by the extraction are outlined by a black contour, e.g., co-occurring in a dataset denote an interesting association, e.g., Jean S., Inria, CNRS etc. These form the basis of the entity nodes two recipients of the same email; or (ii) if we identify connections we want to identify. The figure also shows that ConnectionLens creates a single node for all the nodes with the same label found 1 http://pages.saclay.inria.fr/ioana.manolescu/DOT-obtained-image.pdf 2 https://gate.d5.mpi-inf.mpg.de/webyago3spotlx/SvgBrowser on the same root-to-leaf path in a dataset: thus, there is a single node EN (o) we are trying to build. For each type τ ∈ T , we compare λ, using Embedded Word Distances [2], to a manually chosen set of keywords Wτ , and select the type τλ to which λ is closest. When λ is “Laureates”, τλ is Person. Then: • If τλ , τ (o), the parent p comprising o is “not about o”; p likely describes something else. In our example, where τ (o) is Person, if τλ is Organization, p may describe, e.g., an organization in which o plays some role, but not o itself, therefore the siblings of o are not its attributes. In this case, from o, we can find no more attributes of EN (o). • On the contrary, if τλ = τ (o), we consider that the p may be about o, and try to find p children (siblings of o) to attach to EN (o) as attributes. For that purpose, we Figure 2: Input graph G(d) from a JSON document. examine all the entity children c of p with type τ (c) = τλ . If o is the only one, as is "Jean S.", then its siblings that are leaf children of p and are not entities themselves are considered as attributes of EN (o). Otherwise, e.g., "Maria M." has a sibling "Matthieu L." which is also of a type τ = PERSON, then for each edge p − → c, where c is the child of p with type τλ and a is the label of the edge, we compare a to a manually chosen set of keywords whose meaning is similar to "key", e.g., “ID”, “name” etc. The label a with the smallest distance to this set determines the “winner” entity occurrence (child of p) which captures its siblings as attributes. In our example, "Name" is closer in meaning to "key" than "Mentor", so "Maria M." is chosen. Figure 3: Modified graph G(d) after extracting entities In application of our principle (C), when EN (o) captures from the graph in Figure 2. an attribute, EN (o) replaces its container parent p in G(d), node labeled “Researcher”. This decision materializes the con- and is pushed back in Q paired with p’s parent. nection which exists between Jean S. and Maria M.: both are (c) If p is a JSON value from which several entities occur- researchers. rences were extracted (e.g., p is a long text, say, a politician’s Given a graph G(d) such as shown in the figure, corresponding speech), no attribute of EN (o) is extracted from p. to a dataset d, a graph EG(d) consisting of entity nodes and relationships between them can be built as follows. This algorithm creates a set of ENs, each encompassing several in- (1) Following (A) and (D), we seek to create ENs starting from the terconnected nodes from the original graph G(d). It also modifies entity occurrences. To that purpose, we use a priority queue the structure of G(d) as shown in Figure 3. Q in which we push all pairs (entity occurrence, its parent Finally, to satisfy principle (E), we look for relationships (links) node) from G(d); the priority is computed as the length of the to be added to EG, based on paths found in the modified graph path from the occurrence to the parent until the dataset root G as shown above. For now we consider two simple approaches; (the longer the path, the higher the pair’s priority). If an entity • If the shortest path between two ENs in G has a length below a occurrence has several parents, it is pushed in Q once for each fixed threshold, we create an edge between them in EG, labeled parent. with the concatenation of labels on this path. For example we (2) While Q is not empty: Pop from Q the pair (o, p) with the high- create the edge "Co-workers" between "Maria M." and "Joseph est priority. The type of o, denoted τ (o), is available in the graph L." G(d); the set of entity occurrence types currently supported • If two ENs have identically labeled paths to their nearest com- is T = {Person, Location, Organization}. We need to create mon ancestor nca, we create an edge between them, whose a rich entity node EN (o) out of o. As o is a leaf node (created label is "co-" concatenated with the sequence of labels on this by ConnectionLens extraction), to find possible attributes of path. If nca happens to be an array, we add to this the first EN (o), we need to climb up from o one or a few levels, then label encountered on the path from nca to the datasource. go down to find o’s attributes in the vicinity. Finally, to avoid overloading EG with edges, we do not create (a) If p is an array (list), e.g., in the case when o is the entity oc- an edge between two entities if the shortest path between them in currence "Mathilde B.", we conclude that o has no attributes G goes through another EN. Figure 4 shows the resulting graph. among its siblings, as one does not expect to find, in an The above algorithm handles one dataset, a JSON one. It di- array, entities and attributes of the same (or comparable) rectly applies to relational data (where a tuple plays the role entities. of a map and a relation that of an array), (X)HTML documents, (b) If p is a map, e.g., when o is the entity occurrence "Jean S.", and text documents which we view as shallow trees (one dataset we find the first edge with a non-empty label λ on the path node with entity occurrence children). Further, two ENs e 1, e 2 from p to the dataset node; in our example, λ is "Laureate". extracted from datasets d 1, d 2 such that the originating entity oc- We make the assumption that λ carries useful information currences o 1, o 2 were connected by same-As in G(D) are unified about the content (meaning) of the map p. in the final graph. Next, we need to understand how p relates to the entity Figure 5: Impact of the graph abstraction algorithms. Figure 4: Graph of extracted entities EG from example 3 VISUALIZING ENTITY GRAPHS National Assembly member and a tweet from F. Ruffin respec- As mentioned earlier, node-link diagram representations of graphs tively, as well as two texts. The resulting ConnectionLens graph do not scale beyond a few dozen nodes and edges. Alternative (the one in1 ) is not easy to understand; it has many structural representations of graphs exist, such as adjacency matrices, but nodes and also uninteresting metadata part of the Tweet content. are much less familiar to users and require some training to in- DS2 is a large dataset representing a list of Nobel Laureates. This terpret. Furthermore, adjacency matrices complicate tasks that comprises many Person and Location entity occurrences, as well involve following paths in graphs, which is an essential aspect of as many attributes: birth dates, professions etc. DS3 is a RDF analysis in ConnectionLens. Thus, we explore ways to support the graph describing French politicians. DS4 describes the career of visual exploration of ConnectionLens graphs by improving the more one politician, most nodes describe him, while other entities are familiar node-link diagram representation, addressing problems cities where he was elected. of scalability both in terms of number of nodes and links, and in Figure 5 shows, for each dataset, the number of edges and terms of attributes of the multivariate network. nodes in G(d), respectively, EG(d). We see that graph abstraction The abstraction strategy we explored so far (Section 2) can be brings about an order of magnitude reduction in the number seen as a pre-processing before generating a visual representa- of nodes and edges. The “deleted nodes” are neither recognized tion of the graph. Another strategy would consist in converting ENs nor attributes of one. A text with no entity occurrence, or some links in the graph that point to leaf nodes (literal values) children of a map or array which does not contain entities, is not into attributes of the source node. Conceptually similar to what present in EG, nor in the visualization. GSS [4] does for RDF graphs, applying such a strategy means that the node attributes that have been preprocessed this way can 5 PERSPECTIVES then be visually conveyed by encoding their value using visual Attribute assignment needs to be refined: we currently assume variables of the nodes themselves, as is typically done in mul- a map refers to at most one EN. However, a map can describe tivariate network visualization [3]. Then, expressive node-link several ENs, or a link between several ENs, in which case the diagram authoring environments such as Graphies [5], which attributes should be added to the link instead of the ENs. For feature a rich palette of visual mapping options, will let users example, if we consider a contract, the id of the contract should be create elaborate visual representations in which nominal, or- an attribute of the edge link the ENs that are part of this contract. dered and quantitative attributes can be conveyed using different We aim to improve our work in order to take these occurences properties of the graphical nodes: shape (predefined shapes or into account. sketches), size, fill and stroke color hue/saturation/brightness, To improve the visualization, we also seek to simplify the label font properties, even position depending on the choice of labels of the edges between entities. Indeed, in large graphs, con- layout strategy. catenating labels found on a path will lead to long and unintelli- The second strategy consists in incrementally building the gible labels. One idea is to simplify these labels by recognizing graph based on one or a few named entities of interest, iteratively a more general type of link such as "worksIn", "writesAbout", adding subsets of nodes and relationships based on their type "worksWith", etc. or other criterion (e.g., a threshold value for a given attribute), Acknowledgements. This work was partially supported by specified interactively by the user in the visualization environ- ANR-15-CE23-0025 (ContentCheck) and ANR-16-CE23-0007 (DI- ment. Thus, users populate the initially-empty canvas with nodes COS). and links of particular interest, adding/removing elements by direct manipulation. Combined with a history mechanism that REFERENCES enables backtracking and forking to explore different subsets of [1] Camille Chanial, Rédouane Dziri, Helena Galhardas, Julien Leblay, Minh Huong Le Nguyen, and Ioana Manolescu. 2018. ConnectionLens: the original ConnectionLens graph, this bottom-up approach to Finding Connections Across Heterogeneous Data Sources (demonstration). building the visual representation makes it possible to explore VLDB (2018). graphs that would not be amenable to visualization considered [2] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv (2013). in their entirety. Here again, we rely on Graphies [5] to enable [3] C. Nobre, M. Meyer, M. Streit, and A. Lex. 2019. The State of the Art in such an incremental construction of the graph. While Graphies Visualizing Multivariate Networks. Computer Graphics Forum 38, 3 (2019). is fully functional, we are still in the early stages of integrating it [4] E. Pietriga. 2006. Semantic Web Data Visualization with Graph Style Sheets. In SoftVis. with ConnectionLens. Preliminary results, that rely on a manual [5] Hugo Romat, Caroline Appert, and Emmanuel Pietriga. 2019. Expressive Au- processing pipeline for now, are encouraging. thoring of Node-Link Diagrams with Graphies. TVCG (2019). [6] T. von Landesberger, A. Kuijper, T. Schreck, J. Kohlhammer, J.J. van Wijk, J.-D. 4 PRELIMINARY EXPERIMENTS Fekete, and D.W. Fellner. 2011. Visual Analysis of Large Graphs: State-of-the-Art and Future Research Challenges. Computer Graphics Forum 30, 6 (2011). We implemented in Python the algorithm outlined previously. We [7] Gerhard Weikum, Johannes Hoffart, and Fabian M. Suchanek. 2016. Ten present some quantitative results based on tests made over four Years of Knowledge Harvesting: Lessons and Challenges . In Data Engineering Bulletin . heterogeneous datasets denoted DS1, DS2, DS3 and DS4. DS1 is composed of four datasets : two JSON documents describing