Enabling Hierarchical Exploration for Large-Scale Multidimensional Data with Abstract Parallel Coordinates Gaëlle Richer, Joris Sansen, Frédéric Lalanne, David Auber, Romain Bourqui {gaelle.richer,joris.sansen,frederic.lalanne,david.auber,romain.bourqui}@u-bordeaux.fr LaBRI, Univ. Bordeaux, CNRS, UMR5800 France   ABSTRACT  2 2 2 2   1 1 2 6  As data collection grows more common in various domains, there    5 2 2 6  is a call for adapted or newer methods of visualization to tackle  6 3 3 1  magnitudes exceeding the number of available pixels on screens and challenging interactivity. Exploratory visualization of large (a) Toy data (b) Line-based (c) Abstract data present two major challenges: perceptual scalability and processing scalability. The first is concerned with overcoming Figure 1: Examples of (b) line-based and (c) abstract paral- the fundamental limitation of screens and human perception. lel coordinate plots for the same dataset (a). On the line- The second deals with efficiently processing large volumes of based plot, colors match those of tuples on (a). On the data to achieve responsive interactions. Multiscale visualizations abstract plot, the height of aggregates corresponds to the are an effective technique for solving the first challenge that number of tuples they cover. builds on several levels of data abstraction to provide the user with an initial overview and subsequent incremental detail. The focus of this paper is on multidimensional data, a ubiquitous form subsets of items across axes. Selection relates to interactive means of data among large-scale data sets, and parallel coordinates, a of choosing subsets of items and enhancing them such that they representation largely used for this type of data. For this represen- can be discriminated from the rest. tation, defining abstractions and interactively generating levels is In this paper, we are interested in supporting the interactive not straightforward. Building upon several previous aggregated exploration of multidimensional datasets with a large number parallel coordinates representations, we propose a unifying and of items and a moderate number of dimensions. We built upon thinking model for conceiving and describing multiscale par- the assumption that large datasets, scaling to billions of records, allel coordinates and their interactions. Using this formalism, lead to major overplot when displayed using the traditional line- we present a focus+context representation which bounds the based parallel coordinate representation. Indeed, as the number number of visual items with a fixed resolution parameter while of records increases, the plot becomes cluttered and analyses supporting exploration up to the item-level. Processing scala- may be hindered as specific patterns are concealed [10]. Heinrich bility is addressed by carrying out computation in a distributed and Weiskopf [18] listed several clutter reduction approaches for manner on a remote data-intensive infrastructure. Bounding the parallel coordinates. One solution lies in the display of aggregates visual items ensures perceptual scalability but also bounds the computed per-dimension as in [20, 29, 32] instead of single items data transfer between this infrastructure and the rendering client. (see Fig 1c). Despite the aggregation, these abstract plots success- fully provide overview information similar to a traditional plot KEYWORDS and perceptually scale for any size of input data. Rendering time and transfer time in client-server architectures are also reduced visualization, large-scale data, parallel coordinates, multiscale since they are bounded by the number of aggregates. Sansen visualization, distributed computing, focus+context et al. [32] leveraged this bound by precomputing the result of several interactions and thus clear the dependency of linear scans 1 INTRODUCTION of the data upon interaction. Indeed, past a certain number of Multidimensional data, a form of data common in many domains, records, data cannot fit in a desktop computer memory and linear encompasses all lists of individuals composed of several attributes scans over the data affect performance more negatively since (possibly temporal). Multidimensional items are studied from sev- they either involve reads on disk or network transfer between eral aspects: the particular behavior of individual items relative several computing units. to the whole, the relationship between values from two dimen- As for any abstract visualization, deeper analyses are limited sions and the distribution of values along each dimension [7]. by the amount of information conveyed by the visual aggregates. Parallel coordinates, introduced by Inselberg and Dimsdale [19], Two types of interactions can alleviate this limitation: changing is a well-known technique of visualization for such data. Each the level of detail (show more details) and adjusting the aggre- item is represented by a polyline which anchors on axes are gation (show different details). Supporting these interactions is positioned at the corresponding attribute value of the item (see essential but strongly increases the number of states of the visu- Fig 1b). Axes are traditionally aligned in parallel forming a se- alization and reasonable precomputing and storing of all these quence of two-dimensional subplots sharing one axis with their states as performed by Sansen et al. [32] is challenging. The predecessor. Usual interactions are axis reordering to analyze processing scalability challenge here has been addressed by Rü- relationships between all dimension pairs and selection to trace bel et al. [31] on a modern high-performance computing (HPC) platform. HPC platforms are large and expensive computing sys- © 2018 Copyright held by the owner/author(s). Published in the Workshop tems suited for highly complex and real-time computation. They Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna, Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted are composed of multiple processors connected through a fast under the terms of the Creative Commons license CC-by-nc-nd 4.0. network and use fast memory. As such, they are particularly 76 adapted to tightly coupled tasks where several processors work precomputation (e.g. [15, 24, 30]). This work is related to general on the same task and exchange data. Distributed systems, on the techniques addressing perceptual and processing scalability for contrary, are networks of computing units, usually commodity interactive parallel coordinates. hardware, connected in a shared-nothing architecture (memory and storage are independent to each unit). On these systems, 2.1 Perceptual Scalability in Parallel data transfer between computing units uses a slower network Coordinates connection and thus is critical for performances. Consequently, Various approaches have been proposed to improve the legibility they are most adapted to loosely coupled data-parallel tasks on of parallel coordinate plots by either reducing the clutter pro- large amounts of data. In addition to being cheaper alternatives duced by the multiplicity of overlapping and crossing lines or for data-intensive computing, they offer easier horizontal scalabil- enhancing their patterns. Approaches can be categorized into ity (allocation of additional computing units) as their hardware geometry-based relying on computer graphics techniques and and architecture are less sophisticated. The filtration and ag- data reduction approaches that use approximation or summary of gregation problems at stake in abstract parallel coordinates are the data. Geometry-based approaches display all items with shape data-parallel tasks. We focus on these less expensive and more or position modifications to alleviate the overdraw in-between accessible infrastructures to address them. axes, for instance by bundling lines (e.g. [33]). These techniques Our aim is to enable interactive exploration down to the item- have the advantage of producing no or few loss of information level over large-scale multidimensional data with an abstract but have the drawback of still being prone to over-plot since no parallel coordinate plot. The main contributions of this paper reduction of the number of displayed items is performed. Data are summarized as follows: (i) a unifying model for abstract reduction approaches limit the number of visual items either by parallel coordinate plots based on hierarchical aggregation, (ii) sampling items [9] or using meta-items. Model-based approaches formalization of multiscale and regular interactions in this model, mathematically reduce the data to a continuous function, and (iii) a focus+context representation with bounded visual items meta-items usually represents the density of the underlying data and drill-down/roll-up interactions, extension of the work of [32], (e.g. [27]). Aggregation approaches have been presented for paral- (iv) an example implementation using a distributed infrastructure lel coordinates through different schemes: aggregation of dimen- with components that benefits from horizontal scalability. sions (e.g. [2]), aggregation of items or values (e.g. [13, 28, 29]) The paper is organized as follows: we first present previous or combinations of both (e.g. [21, 26]). work on perceptual scalability in parallel coordinate plots and In this paper, we are interested in scalability relative to the their interactions with a focus on multiscale approaches (Sec. number of items, not dimensions, hence we focus on aggregation 2). Then, we describe the proposed graph-based formalism for over items and their values. Previous work using aggregation abstract parallel coordinates based on hierarchical aggregation have used kernel density estimation, independently applied to (Sec. 3). In Sec. 4, we study how different interactions are ex- dimension [29], different hierarchical clustering algorithms over pressed in this model. This yields a prototype implementation multidimensional items or dimension (e.g. [13]) but also binning and design of a scalable parallel coordinate plot based on hierar- on two-dimensional sub-spaces (e.g. [14, 28]). Aggregates have chical aggregation, using a so-called big data infrastructure. This been represented by their statistical properties: extrema (e.g. [3]), visualization is described and its scalability evaluated in Sec. 5. cardinality (e.g. [20]), mean (e.g. [13]), and/or other metrics [29]. Finally, in Sec. 6 we draw conclusions and present directions for As reflected by previous work, multiple dimensions allow for future work. different levels of grouping: the item level (multidimensional), the value level (per-dimension), the line level (two-dimensional). For 2 RELATED WORK aggregation-based methods, the value level has the advantages Recent works have focused on the scalability of visualization ap- of losing less information, preserving continuity on axes (poten- plications for large-scale data with different techniques, among tially broken by two-dimensional aggregation) as well as an idea which are: data reduction, multi-threading, GPU-acceleration, of the relationship between dimensions (broken by item-level and incremental or approximate data processing. In the case of aggregation). visualization, the scalability of a system often refers to its ca- pacity to accommodate and handle growing amounts of data. 2.2 Multiscale Parallel Coordinates Handling massive datasets brings about two main challenges for Abstract representations are usually conditioned by a parameter exploratory visualization: perceptual scalability and processing that controls the resolution of the display. Multiscale visualiza- scalability as noted by [12, 16, 24]. The first is concerned with the tions allow navigation between multiple levels of abstraction, legibility of visualizations representing numerous items relative using a drill-dow/roll-up interaction (respectively for increas- to the space available on a screen (so-called screen real-estate ing and decreasing levels of detail). Elmqvist and Fekete [11] problem) and human capabilities to apprehend them. The sec- provided general guidelines for multiscale representations and ond relates to the computational cost of processing numerous interactions, based on such hierarchical aggregation. Bikakis et al. items on each user input, that can create latencies responsible for [4] presented a framework for hierarchical aggregation oriented decreasing user performances [23]. A taxonomy of different tech- towards the computational aspects of hierarchical navigation. niques regarding the perceptual scalability aspect was established Interactively changing the level of detail can be integrated in dif- by Ellis and Dix [10]. A general solution is data reduction, which ferent ways (see [8] for a general review). A first approach, similar can be categorized into two approaches: either representing a to geometry zooming, either (1) filters part of the representation subset of the data items (sampling, filtering) or meta-items (aggre- to maintain a fixed number of visual items on screen or instead gation, mathematical models). Several work proposed methods (2) purposely displays increasing number of items (e.g. [6, 29]). (called multiresolution, multiscale, hierarchical or even stratified) Filtering has the advantage of being scalable when the amount of for navigating through multiple levels of detail supported by displayed items is controlled; however, the overview and context 77 are lost along the way. One drawback of displaying an uncon- a   2 2 2 2    X = b  = trolled number of items is that the screen pixel limitation could  1 1 2 6  always be reached for a sufficiently massive dataset and infor- c   5 2 2 6   d1 d2 d3 d4 mation would be lost. Additionally, from a certain point it may overwhelm human capacities. Consequently, zooming without (a) (b) filtering does not allow interactive exploration to the item-level d3 complying with perceptual scalability. For zooming with filter- ing, an alternative to the loss of context is the overview+detail d1 d2 d3 d4 approach which displays both the filtered zoomed view and an d1 overview (e.g. dimension zooming of [13]). A second approach, called focus+context, consists in displaying d2 d4 heterogeneous levels of detail: a selected portion of the data is d3 d1 d4 d2 shown with greater details to the detriment of the rest. This ap- (c) (d) proach allows to gain detail locally while preserving the overview. a1 Fisheye lenses used in the work of [25] are an example applied in a1 screen space. In data space, [13] and [28], although using differ- ent aggregation strategies, presented techniques where a subset a4 2 of items can be enhanced and displayed with fine-scale details a4 a 2, c 2 a 3, c 3, b3 layered over the rest of the data, abstracted to some level. a2 = 2 a3 = 2 c3 = 2 c1 2 c2 = 2 To the best of our knowledge, none of these solutions pro- b3 = 2 b1 pose to interactively change the level of detail locally, in a fo- c1 b1 b2 cus+context fashion, while bounding the number of visual items. b2 b4, c 4 c4 = 6 b4 = 6 (e) (f) Figure 2: Rationale behind the graph model. (a) Example data. (b) Opacity-based parallel coordinates. (c) Parallel co- 2.3 Processing Scalability of Multiscale ordinates matrix where all 2-dimensional subspaces are Parallel Coordinates represented. (d) Many-to-many plot. (e) Graph-based rep- For the past ten years, large-scale or massive has been used to resentation G (X ). Shadowed areas cover identical values qualify increasingly big data, now up to tera or petabytes in from the same dimension and form the partition R. (f) size [12]. For multidimensional data, above about 107 items, with Quotient graph G /R (X ) (cardinality mapped to size). a dozen of dimensions, perceptual scalability and processing scalability becomes problematic for interactive analysis. Abstract 3 A GRAPH-BASED FORMALISM FOR representation and precomputation of interactions are solutions respectively addressing perceptual and processing scalability, ABSTRACT PARALLEL COORDINATES as [32] presented. In addition to the enhancement of subsets of Consider X ∈ Rn,d , a set of multidimensional quantitative data items and the reordering of axes, abstract representations require with n items and d dimensions. On a traditional parallel coor- a drilling interaction to allow fine data analysis. dinate plot, only a fraction of the input data subspaces is rep- Processing scalability can be tackled by parallelism, distributed resented. Fig. 2c and 2d show two examples of representations processing and precomputing of complete or partial results for in- that display all subspaces at once: respectively the parallel coor- stance. For parallel coordinates, most interactions can be applied dinate matrix [17] and the many-to-many plot [22]. The model to two-dimensional subplots independently and on separate por- we propose integrates all subspaces without assumption on the tions of the dataset without needing any communication (they layout of axes and supports abstraction. A value in the matrix are pleasantly parallel). This property has been exploited by Rübel inherently belongs to a tuple (or item), and a dimension. We et al. [31] on a HPC platform, and [32] on a distributed platform. focus on abstraction based on aggregation of tuples and do not Rübel et al. [31] addressed both scalability challenges with an directly address aggregation of dimensions. The rationale behind histogram-based representation adapted from [28]. Histograms this approach is that dimensions hold semantic meaning not em- are two-dimensional aggregations of the data, precomputed in bedded in their numerical content. Therefore, automatically and parallel, potentially for different resolutions. On more affordable meaningfully aggregate them is not straightforward: some types hardware, Sansen et al. [32] addressed the same challenges with a of values may not make sense aggregated together. Consequently, parallel sets representation for an Hadoop+Elasticsearch ecosys- we target datasets that challenge scalability by their size in tuples tem that also relies on full precomputation of certain types of (more than millions of items) and hold moderate amounts of interactions. Both works presented good scalability evaluations well-chosen dimensions. relative to the number of computing units used. However, these techniques only support the display of balanced levels of detail i.e. 3.1 Abstraction in Parallel Coordinates drill-down is global and necessarily increases the number of dis- In parallel coordinates, the values of the input matrix are con- played items. With this approach, gaining detail at the item-level nected to each other by lines to materialize the tuple or tuples does not scale to large datasets. they belong to. Extending this metaphor, an n × d matrix can be seen a graph linking its values based on the relation induced by tuples. This undirected graph is composed of n separate cliques 78 a a  a b   10 7 6 0  e e    5 6 10 10  b X =  c  =  3 2 0 8  b c c    1 4 1 10  d   d  e   8 8 3 1  d d1 d2 d3 d4 d1 d2 d3 d4 d1 d2 d3 d4 (a) (b) (a) (b) b3 Figure 4: Aggregation of tuples. (a) Translation of the b1 tuple partition (color) into a vertex partition refining D b4, c 4, d a a 2, e 2, b2 (shadowed regions). (b) Tuple-oriented abstract represen- e3 tation using mean and extrema values. a 4, e 4 d 3, c 3 a3 a 1, e 1 c 1, d 1 d 2, c 2 D. Fig. 4 shows an example of tuple partition refinement and a d1 d2 d3 d4 tuple-oriented representation. Using our model and the intro- (c) (d) duced notations, the abstract representation is fully defined by: (i) a valid partition of the matrix values, (ii) two aggregate func- Figure 3: Aggregation of dimension values. (a) Exam- tions assigning weights to meta-nodes and meta-edges, (iii) an ple data (b) Parallel coordinate plot augmented by axis layout and visual encodings for meta-node and meta-edge per-dimension partitions. (c) Quotient graph. (d) Value- weights (vertical position, color, etc). oriented abstract representation using extrema values and cardinality (with link opacity). 3.2 Hierarchical Abstraction in Parallel Coordinates Hierarchical aggregations are precomputation of different lev- of d vertices, that is, n complete graphs (see Fig. 2e). We note this els of abstraction that naturally support multiscale representa- graph G = (V , E, w ), where V denotes the vertices, E the edges, tions. Such precomputation outputs a rooted tree structure whose w is the weighting function: w : V → R that stores vertex values. leaves are the data objects to aggregate. Every node of a rooted We note D = {Vi , i ∈ ⟦1, d ⟧} the partition of V that groups ver- tree is said to cover the leaves of the subtree it is the root of. tices along their origin dimension. Given a dimension sequence, An antichain (also called tree cut) in a tree is a set of nodes S, a parallel coordinate plot is built by taking the subgraph of G such that no node of S is an ancestor of another node of S. An composed of only the edges joining vertices of consecutive dimen- antichain is maximal if it cannot be expanded by any other node sions in the sequence. Vertices of this subgraph are positioned without violating the antichain property. The subsets covered on their corresponding axis along a vertical position given by by the nodes of a maximal antichain form a partition of the tree the weighting function. Shadowed regions on Fig. 2e represent a leaves: the antichain property makes them pairwise disjoint, the vertex partition R that groups identical values of the same dimen- maximal property ensures that their union is the set of leaves. sion. Such partition is a refinement of the partition D. Merging In our model, an abstraction is directed by a partition refining vertices of G belonging to the same subset of a partition R corre- the partition D which is equivalent to d partitions, one for each sponds to taking the quotient graph of G relative to R, noted G /R . Vi of D. By augmenting each subset Vi with a tree, such partition Two subsets S, S ′ in the partition R are represented by adjacent can be defined by one maximal antichain for each tree. This is vertices in G /R if, and only if, some vertex in S is adjacent to a equivalent to defining a single maximal and non-trivial antichain vertex in S ′ in G. Further, two aggregation functions are defined in the unified tree where the root’s children are the roots of all to compute the properties of the meta-nodes and meta-edges, i.e. dimension trees. We note this unified tree T (V ), and call its direct vertices and edges of the quotient graph. Aggregation functions subtrees dimension hierarchies. return a tuple of weights given a set of edges or vertices as input. Overall, for some input data modeled by the clique graph Weights (cardinality, extrema, mean, etc) provided by such func- G = (V , E, w ) and its associated hierarchy T (V ), an abstraction tions are subsequently used for assigning visual properties to is defined as previously described with the valid partition of V meta-edges and meta-nodes. For instance, on Fig. 2b, the opacity induced by a maximal antichain in T (V ). This unifying approach of lines on the parallel coordinate plot depends on the cardinal- accommodates multiple layouts of parallel coordinates (see Fig. 2) ity of meta-edges. On this example, the aggregation of identical and several abstract parallel coordinates representations. We pre- values allows to find the optimal number of lines to draw. In- sented two abstract representations: per-dimension aggregation deed, the aggregation done here in data-space relates to the one on Fig. 3d that corresponds to the technique presented by Palmas done in screen-space when rendering the polylines with alpha et al. [29] and on Fig. 4 tuple aggregation like presented by Fua composing. et al. [13]. For tuple aggregation, the hierarchy defined on tu- Given some aggregation functions, the same process is applied ples is used to form all dimension hierarchies. Additionally, to to obtain an abstract representation from any valid partition R represent the same tuple aggregates on all subplots, dimension of V , that is, refinement of D. Fig. 3 presents an example of parti- hierarchies should all be cut at the same level. tion obtained with per-dimension clustering and another type of representation based on extrema values. Tuple-based approaches are incorporated into the model by refining the input tuple par- 4 COMMON INTERACTIONS tition into a refinement of D. Each subset of the tuple partition We take advantage of the hierarchical graph model described in can be decomposed into d groups along their membership in the previous section to explore several hierarchical operations 79 and usual operations for parallel coordinates and investigate the a predefined budget. In this setting, drilling potentially implies incremental computation they require. modifying the current maximal antichain in multiple points such that previously acquired detail collapses to allow drilling-down 4.1 Level-of-Detail Operations when the budget would have been exceeded otherwise. If these changes are restricted to a single dimension hierarchy, the incre- The drill-down and roll-up operations correspond to changes in mental changes only involve one or two subplots. Computing the displayed level of detail. They respectively mean moving these subplots costs a pass over n bottom-level edges for each, in deeper in the hierarchical aggregation to show finer details, and addition to the cost of finding a suitable maximal antichain. upper to present a coarser display of the data. Since the model For a budget defined in number of visual items, the minimal rests upon precomputed hierarchies of dimension values, the budget allowing to define a maximal antichain that does contain weights (cardinality, extrema, etc) of all possible meta-nodes, i.e. a leaf node depends on the arity and depth of the tree. In a tree T nodes of the hierarchies, can be precomputed as well. The number with n leaves and arity a, a maximal antichain containing at least of edges of a quotient graph with respect to a maximal antichain one leaf and minimal in size can be expected to have between is always less than those of the original graph and generally depth(T ) and a · depth(T ) nodes. Considering the toy example greatly depends on the properties of the hierarchies (arity, depth). of binary trees which nodes all have either 0 or 2 children, the In general, the number of maximal antichains is exponential. For minimum depth is ⌈loд2 (n)⌉ + 1 and the maximal depth is n. Con- example, complete binary trees, that are rooted binary trees with sequently, a visual budget allowing gaining detail up to the leaf every level full except the last, have Ω(2n ) maximal antichains level has to be chosen with respect to the depth of the hierarchy. where n is their number of leaves. This makes the precomputation In our context, since this depth can reach orders of magnitude of all possible quotient graphs, i.e. all meta-edge weights, with same as the number of tuples, this method does not scale without respect to every possible maximal antichain for given dimension strong constraints on the properties of hierarchies. hierarchies not possible for the scale of data we aim to tackle. In traditional parallel coordinates plots, and in many derived techniques, each two-dimensional subplot is independent of the others and can be computed and displayed separately. When Dynamic Context Aggregation. An alternative solution to col- axes hold visual elements, like frequency plots, the correspond- lapsing nodes when focusing on deeper nodes is to aggregate ing visual items are usually mirrored and the two composing them dynamically. The meta-nodes from the chosen antichain parts conceptually included in both sides subplots. If no edge that are not in focus (i.e. not the deepest in the hierarchy) are rep- is precomputed, the cost of an abstraction can be decomposed resented aggregated which lowers their impact on the visual item into independent substeps corresponding to all subplots, each budget. In general, these dynamically-computed aggregates, do requiring aggregation over n edges, one for each tuple. not correspond to existing nodes in the precomputed hierarchy. In the context of hierarchical aggregation, gaining detail can This approach limits the visual items on an axis to the siblings be defined locally as the substitution of a visual aggregate for its of each focus and the aggregates of the rest of the meta-nodes children. In our model, the displayed level of detail is given by that forms the context. Thus, for a dimension hierarchy of arity a maximal antichain of the hierarchy. Drilling-down on a meta- k, an axis with f foci corresponds to at most k · f + ( f + 1) visual node replaces it in the current antichain by all of its children nodes, whatever the depth of these foci in the hierarchy is. Form- which still forms a maximal antichain of the tree. This entails ing these aggregates induces an aggregation of their outgoing computing the outgoing edges of these children since they are not edges, therefore it similarly reduces the number of meta-edges. precomputed. It corresponds to aggregating the tuples covered Consequently, with additional constraints on the arity of dimen- by the drilled meta-node. The opposite operation, rolling up or sion hierarchies and the number of concurrent foci on an axis, collapsing, is the replacement of sibling nodes with their parent this approach enables drill-down up to the deepest level while in the antichain. complying with a visual budget. Contrary to the previously de- Globally increasing detail may result in an uncontrolled num- scribed approaches, dynamic context aggregation matches the ber of elements on display, independently of the arity and depth requirement of our system and thus is the one implemented (see of hierarchies. We detail three drilling approaches that tackle this Sec. 5.1 and 5.2). issue by trading context detail for scalability. Detail & Filter. One approach is to define drilling on a meta- node as a filtering operation. This corresponds to computing the expansion of the meta-node over a filtered clique-graph, induced 4.2 Other Operations solely by vertices adjacent to the leaves it covers in the original Axis Reordering can be conceived either as the computation of a clique-graph. This method effectively bounds the number of full plot or, incrementally, as an edition of one or more subplots aggregates on an axis by the arity of the dimension hierarchy, i.e. which requires one or two passes over n bottom-level edges, for the maximal number of siblings in the tree. This also bounds the n the number of tuples. Subset Selection relates to the emphasis number of edges in an abstract plot. The limit of this approach is of a subset of tuples in a given representation. In an aggregate that all subplots are modified since the weights of all meta-nodes visualization setting, an abstraction can be computed over the and meta-edges from the quotient graph are to be updated to subset of tuples and the resulting weights displayed over the account for the filtering. Thus for each subplot, it costs a pass complete abstraction using a discriminating encoding. Using this over as many bottom-level edges as the drilled meta-node covers approach, a subset selection of m tuples requires going through items. m bottom-level edges per subplot, one per tuple, to compute their contribution to the current meta-edge weights. The process is Budgeted Detail. A second approach lies in constraining the the same for meta-node weights. definition of maximal antichains such that their size comply with 80 5 FOCUS+CONTEXT REPRESENTATION & 3 SCALABLE IMPLEMENTATION Our goal is to enable hierarchical exploration in abstract parallel 2 coordinates while complying with the following scaling proper- ties. On the representation-side, exploration should be possible down to the item level, in a top-down manner, while the num- 1 ber of visual items on display should be bounded for any size of input data. On the processing-side, network transfer latency between the displaying unit and the computing unit should be bounded and traditional operations (axis reordering, subset selec- (a) tion) should be supported. In the following, we present first the cornerstone of our method: the budgeted number of visual items. Then, we detail a novel focus+context display with intuitive drill- context node down capabilities using the dynamic context aggregation. And with aggregate of 2 and 3 finally, we describe technical details and performance evaluation 4 of the proposed system. 5.1 Bounded Number of Visual Items 5 focus nodes Bounding visual items is essential for perceptual scalability but (children of 1) also ensures that data transfer between our rendering client and back-end unit remains bounded in size thus predictable in time. To this end, the properties of dimension hierarchies are con- strained and the number of foci restricted to one per dimension. Hierarchy Constraints. We use a user-defined k value that acts (b) (c) as a resolution parameter, bounding the number of visual items per displayed axis. This parameter is enforced as the maximal Figure 5: Focus+context representation. (a) Initial display. arity of dimension hierarchies. k should be chosen large enough (b) Drill-down on 1 . The top context node represents the for the representation to preserve an appropriate amount of infor- aggregation of 2 and 3 . (c) Drill-down on 4 and 5 succes- mation but small enough for the visual items to fit the available sively. screen space. Additionally, dimension hierarchies should order their leaves with respect to their values such that every maximal antichain defines partitions of dimension intervals since the cho- detail. Context regions are displayed with slightly lower width sen visual representation relies on this property. Binning (equal to emphasize the focus regions. Since regions of interest are in- range partitioning) and adaptive binning (equal size partitioning) crementally refined by successive drill-down, they get smaller are examples of partitioning algorithms that can be applied in a in height relative to the whole both in terms of cardinality and bottom-up fashion to produce hierarchies complying with these interval of values. Therefore, the height of focus and context re- two requirements. No depth constraint is required, we adopt the gions are rendered with different scaling factors. Focus nodes are dynamic context aggregation strategy to bound the number of represented colored and augmented by a bidirectional smoothed vertices from quotient graphs displayed. histogram. Context nodes are augmented by a pile of level blocks Level-of-Detail Navigation. We define as focus nodes, the nodes (see Fig. 5c). The wider these blocks are, the lower in the hierar- that have the maximal depth in the current antichain. The rest chy are the nodes they represent (notice on Fig. 5c the difference of the antichain nodes from one dimension are aggregated into in width of the outer blocks from the top and bottom context). the minimal number of context nodes such that their order is pre- The height and vertical order of level blocks follows the same served. In the proposed implementation, the top-level nodes are encoding as those of focus nodes. Level blocks are computed on initially presented as focus nodes and each drill-down triggers the client side and thus not transferred nor counted in the visual a dynamic aggregation (examples are presented in Fig. 5). Con- item budget. sequently, there is no more than k focus nodes and two context In the drilling mode, focus nodes and level blocks are clickable. nodes at once per axis. Thus, the number of nodes on display is Clicking on a focus node triggers its animated expansion which bounded by k + 2 per axis and the number of edges by (k + 2) 2 split it into its children and merges its siblings into level blocks. per subplot. Level blocks represent an aggregation of nodes previously in focus, thus they enable going back to this precise previous state. 5.2 Visual Encoding We propose an extension of the work presented in [32] which 5.3 System Overview included two visual encodings: one oriented towards the distri- Our system consists of two main parts and follows a client/server bution of tuples, the other towards the distribution of values architecture where the client is the visualization endpoint and themselves. Basically, the first encoding maps aggregate height the server is an interface to an on-demand computing and pre- to their cardinality, while the second maps it to their covered processing back-end. The latter could be implemented both in interval size (distance between extrema). We extend the encod- a distributed environment for a Hadoop cluster and as a multi- ing and positioning of aggregates to emphasize the focus regions, threaded application for single machines (desktop computer or composed on each dimension of the data displayed with the finest dedicated server). Past a certain number of input records (for a 81 given number of dimensions and resolution parameter), a dis- 104 in selection (number of tuples) tributed platform should be more efficient while facilitating load 108 expansion. execution time (ms) The client is a WebGL application that displays the represen- tation, computes level blocks and context aggregation on node 102 106 drill-down, and queries the supporting back-end for other inter- actions. The back-end server is a long-lived Spark application distributed single which runs distributed job on demand while keeping prepared 104 selection cardinality data in memory. It computes dimension hierarchies in a pre- 100 computation step and stores the resulting meta-node weights 104 105 106 107 108 109 (extrema and cardinality) for all the hierarchies in a distributed n (number of tuples) database. Contrary to the previous system, the hierarchical as- pect makes the number of displayable meta-edges too large to (a) Execution time for node selection relative to data size for two allow their precomputation in reasonable time (see Sec. 4.1). implementations. The distributed version uses 15 executors. Er- rors bars are not symmetrical and missing their bottom parts on The membership of each input data value is stored in a hierar- two points due to the log-log scale. chy matrix of the same size as the input data, where each value 8,000 holds the list of computed ancestors for the matching input data execution time (ms) mean measured value. This matrix is kept in memory and split among computing 6,000 ideal units which will pass over their slice of the data to filter and 4,000 aggregate results on demand. The aggregation outputs the meta- 2,000 edges (source, target and cardinality) as well as the weights of 0 context nodes since they do not exist in the precomputed hierar- 5 10 15 chies. Upon user interaction and if necessary, the client requests number of executors the server which in turn runs a distributed operation and merges (b) Scalability test using node selection, n = 2 · 108 . the partial results returned by computing units. Finally, the client receives the incremental changes in plain text and updates the view consequently. Figure 6: Performance evaluations. Experimental results are averaged over 1000 runs. Executor resources (12 cores, 31GB memory) are sized to match those of physical units. 5.4 Performance Evaluation Our system supports moving an axis, selecting an aggregate node or edge, drill-down on a focus node and rolling-up by clicking on a context part. Drill-down, roll-up and moving an axis correspond of the largest nodes does not increase linearly with n. We observe to the same low-level operation which consists in computing the that, on our infrastructure, under n = 2 · 108 (about 108 selected edges of two-dimensional subplots given the current focus on tuples) the Spark application performances are stable despite in- each dimension involved. Compared to the other operations that creasing workload. This suggests that execution time for datasets only modify two to three subplots, selection operations affect all of smaller size is dominated by costs related to network and disk subplots, thus are more computationally expensive. I/O, and/or by merging all executor results by the driver unit. Since we are interested in supporting "n ≫ d" datasets, we Indeed, the cost of merging results remains approximately the evaluate our system for varying number of tuples n and using same as it is a function of the output size and the number of task the most expensive interactive operation: node selection of the running in parallel. Another visible aspect of the results on small largest node (in covered tuples). To demonstrate the scalability datasets is their high variation. The oversized allocated memory of the approach we also evaluate performance relative to the can be an explanation: when triggered, garbage collection incurs resources allocated for computation. a substantial delay. Overall, for n less than approximately 106 , the Each node of the distributed platform has 64GB RAM and 2x6 distributed infrastructure underperforms the single-computer de- hyper-threaded cores at 2.10GHz each, connected via a 1Gbit/s spite having more resources. Past this limit, the system is better network. The single computer used for running the multi-threaded tuned and seems to scale linearly relative to the number of tuples implementation has 2x4 hyper-threaded cores at 3.3GHz and in the selection. This observation holds for the other experiments 64GB RAM. Test datasets are generated for varying n (from 104 carried out for edge selection, roll-up, and drill-down. to 109 ), with d = 15 and k = 31. Test datasets are generated We investigate the scalability of the system using a dataset such that pairs of dimensions present a close to null correlation larger than this limit. Fig. 6b presents the median execution time factor [5] which tends to create close to the maximum number of for node selection on a 2 · 108 -tuple dataset with varying numbers edges between dimensions. Dimension hierarchies are generated of executors. The plotted ideal execution time corresponds to a using Canopy clustering [1] applied in a bottom-up manner. linear speedup, that is, halving the execution time when doubling On Fig. 6a, execution times of the selection of the largest top- the number of executors. This experiment shows that the system level node are plotted for increasingly large datasets. We evaluate performs close to the ideal. both a multi-threaded single-computer implementation and the distributed Spark implementation. The stairs-shaped curve of the single computer can be explained by the similar trend of the 6 CONCLUSION selected node cardinality for every test dataset. Indeed, due to the We presented a graph model for hierarchical aggregation and hierarchy constraints and the bottom-up approach for clustering, interaction strategies for parallel-coordinate-based visualization. the number of nodes on the top-level varies and the cardinality This model formalizes aggregation over multidimensional data at 82 the most-expressive level by making use of per-dimension hier- [5] S. Börzsönyi, D. Kossmann, and K. Stocker. 2001. The Skyline Operator. In archies. This approach treats all dimensions equivalently which Proc. of Int. Conf. on Data Engineering. IEEE Computer Society, 421–430. [6] K. Selçuk Candan, L. Di Caro, and M. L. Sapino. 2012. PhC: Multiresolution matches the way dimensions are handled in parallel coordinates. Visualization and Exploration of Text Corpora with Parallel Hierarchical Based on this model, we presented a client/server system for Coordinates. ACM TIST 3, 2 (2012). [7] J. H. T. Claessen and J. J. van Wijk. 2011. Flexible Linked Axes for Multivariate interactive exploration of large multidimensional data using ab- Data Visualization. IEEE Trans. Vis. Comput. Graph. 17, 12 (2011). stract parallel coordinates. We address the limitation of the pre- [8] A. Cockburn, A. K. Karlson, and B. B. Bederson. 2008. A review of vious system, based on fixed partitions of dimension values, with overview+detail, zooming, and focus+context interfaces. Comput. Surveys 41, 1 (2008). hierarchies that allow to interactively change partition. The pro- [9] G. Ellis, E. Bertini, and A. Dix. 2005. The sampling lens. In CHI '05. ACM posed drill-down operation enables the definition of an arbitrary- Press. detailed focus region on each axis while retaining context in [10] G. P. Ellis and A. J. Dix. 2007. A Taxonomy of Clutter Reduction for Information Visualisation. IEEE Trans. Vis. Comput. Graph. 13, 6 (2007). a reduced form. On the client-side, we showed how to display [11] N. Elmqvist and J.-D. Fekete. 2010. Hierarchical Aggregation for Information focus and context regions with intuitive navigation cues. The Visualization: Overview, Techniques, and Design Guidelines. IEEE Trans. Vis. Comput. Graph. 16, 3 (2010). strength of this approach is that it supports exploration down to [12] D. Fisher. 2016. Big data exploration requires collaboration between visualiza- the item-level while controlling the number of visual items, thus tion and data infrastructures. In Proc. Workshop on Human-in-the-Loop Data ensuring perceptual scalability. Analytics. [13] Y.-H. Fua, M.O. Ward, and E.A. Rundensteiner. 1999. Hierarchical parallel On the server-side, the back-end processing is handled on a coordinates for exploration of large datasets. In Proc. Visualization’99. IEEE. distributed platform with components that scale horizontally. Be- [14] Z. Geng, Z. Peng, R. S. Laramee, J. C. Roberts, and R. Walker. 2011. Angular sides drill-down/roll-up operations, the implementation supports Histograms: Frequency-Based Visualizations for Large, High Dimensional Data. IEEE Trans. Vis. Comput. Graph. 17, 12 (2011). standard parallel coordinate operations. Experimental results [15] P. Godfrey, , J. Gryz, P. Lasek, and N. Razavi. 2016. Interactive Visualization of demonstrate the scalability of the back-end system relative to the Big Data. Springer International Publishing, 3–22. [16] P. Godfrey, J. Gryz, and P. Lasek. 2016. Interactive Visualization of Large Data size of the input data and to the resources allocated for computa- Sets. IEEE Trans. Knowl. Data Eng. 28, 8 (2016). tion. The results indicate that the proposed system can support [17] J. Heinrich, J. Stasko, and D. Weiskopf. 2012. The Parallel Coordinates Matrix. increasingly large datasets by expanding its network of comput- EuroVis–Short Papers (2012). [18] J. Heinrich and D. Weiskopf. 2013. State of the Art of Parallel Coordinates. In ing units. To a certain extent, adding computing resources can Eurographics 2013. Eurographics Association. also reduce interaction latencies. [19] A. Inselberg and B. Dimsdale. 1990. Parallel Coordinates: A Tool for Visualizing One focus of the design is the bounded data transfer between Multi-dimensional Geometry. In IEEE Visualization. [20] R. Kosara, F. Bendix, and H. Hauser. 2006. Parallel Sets: Interactive Exploration the client and server parts which relies on the single-focus ap- and Visual Analysis of Categorical Data. IEEE Trans. Vis. Comput. Graph. 12, proach (on each axis) and the precomputation of dimension hi- 4 (2006). [21] A. Lex, M. Streit, C. Partl, K. Kashofer, and D. Schmalstieg. 2010. Comparative erarchies with a bounded arity k of small orders of magnitude. Analysis of Multidimensional, Quantitative Data. IEEE Trans. Vis. Comput. The hierarchies being predefined allows for efficient filtering and Graph. 16, 6 (2010). aggregation of items based on ancestors but may be limiting for [22] M. Lind, J. Johansson, and M. D. Cooper. 2009. Many-to-Many Relational Parallel Coordinates Displays. In IV 2009. IEEE Computer Society. exploration. As a counterbalancing measure, future work could [23] Z. Liu and J. Heer. 2014. The Effects of Interactive Latency on Exploratory add support for user-driven hierarchy refinement. Depending on Visual Analysis. IEEE Trans. Vis. Comput. Graph. 20, 12 (2014). the scope and scale of the refinements this could be handled on [24] Z. Liu, B. Jiang, and J. Heer. 2013. imMens: Real-time Visual Querying of Big Data. Comput. Graph. Forum 32, 3 (2013). the client-side and/or enforced on the server-side. [25] T. V. Long and L. Linsen. 2009. MultiClusterTree: Interactive Visual Exploration Another path for future work is methods for latency reduction of Hierarchical Clusters in Multidimensional Multivariate Data. Comput. Graph. Forum 28, 3 (2009). other than horizontal scaling. Space is a possible trade-off. Even [26] J. Lu, Y. Zhang, J. Xu, G. Xiao, and Q. Liang. 2014. Data Visualization of Web if the number of possible meta-edges makes their total precom- Service with Parallel Coordinates and NodeTrix. In SCC 2014. IEEE Computer putation infeasible, partial precomputation could be investigated: Society. [27] H. Nguyen and P. Rosen. 2017. DSPCP: A Data Scalable Approach for Identi- either beforehand or as a background process targeting meta- fying Relationships in Parallel Coordinates. IEEE Trans. Vis. Comput. Graph. edges that are the most likely to be requested given the current (2017). state. [28] M. Novotny and H. Hauser. 2006. Outlier-Preserving Focus+Context Visual- ization in Parallel Coordinates. IEEE Trans. Vis. Comput. Graph. 12, 5 (2006). [29] G. Palmas, M. Bachynskyi, A. Oulasvirta, H. P. Seidel, and T. Weinkauf. 2014. An Edge-Bundling Layout for Interactive Parallel Coordinates. In 2014 IEEE ACKNOWLEDGMENTS Pacific Vis. Symp. IEEE. This work has been carried out as part of the "REQUEST" (PIAO18062- [30] A. Perrot, R. Bourqui, N. Hanusse, F. Lalanne, and D. Auber. 2015. Large interactive visualization of density functions on big data infrastructure. In 645401) and "SpeedData" (PIAO17298-398711) projects supported Symp. on Large Data Analysis and Visualization. IEEE. by the French "Investissement d’Avenir" Program (Big Data – [31] O. Rübel, Prabhat, K. Wu, H. Childs, J. S. Meredith, C. G. R. Geddes, E. Cormier- Cloud Computing topic). The authors would also like to thank Michel, S. Ahern, G. H. Weber, P. Messmer, H. Hagen, B. Hamann, and E. W. Bethel. 2008. High performance multivariate visual data exploration for the anonymous referees for their valuable comments and helpful extremely large data. In SC ’08. suggestions. [32] J. Sansen, G. Richer, T. Jourde, F. Lalanne, D. Auber, and R. Bourqui. 2017. Visual Exploration of Large Multidimensional Data Using Parallel Coordinates on Big Data Infrastructure. Informatics 4, 3 (2017). REFERENCES [33] J. Wang, X. Liu, H.-W. Shen, and Guang Lin. 2017. Multi-Resolution Climate Ensemble Parameter Analysis with Nested Parallel Coordinates Plots. IEEE [1] A.McCallum, K. Nigam, and L. H. Ungar. 2000. Efficient clustering of high- Trans. Vis. Comput. Graph. 23, 1 (2017). dimensional data sets with application to reference matching. In Knowledge Discovery and Data Mining. ACM, 169–178. [2] K. Andrews, M. Osmic, and G. Schagerl. 2015. Aggregated parallel coordinates: integrating hierarchical dimensions into parallel coordinates visualisations. In Proc. of Int. Conf. on Knowledge Technologies and Data-driven Business. ACM. [3] G. Andrienko and N. Andrienko. 2004. Parallel coordinates for exploring properties of subsets. In Proc. of Int. Conf. on Coordinated & Multiple Views in Exploratory Visualization. 93–104. [4] N. Bikakis, G. Papastefanatos, M. Skourla, and T. Sellis. 2017. A hierarchical aggregation framework for efficient multilevel visual exploration and analysis. Semantic Web 8, 1 (2017). 83