=Paper= {{Paper |id=Vol-2578/BigVis8 |storemode=property |title=Web-based Scalable Visual Exploration of Large Multidimensional Data Using Human-in-the-Loop Edge Bundling in Parallel Coordinates |pdfUrl=https://ceur-ws.org/Vol-2578/BigVis8.pdf |volume=Vol-2578 |authors=Wenqiang Cui,Girts Strazdins,Hao Wang |dblpUrl=https://dblp.org/rec/conf/edbt/CuiSW20 }} ==Web-based Scalable Visual Exploration of Large Multidimensional Data Using Human-in-the-Loop Edge Bundling in Parallel Coordinates== https://ceur-ws.org/Vol-2578/BigVis8.pdf
            Web-based Scalable Visual Exploration of Large
         Multidimensional Data Using Human-in-the-Loop Edge
                   Bundling in Parallel Coordinates
                  Wenqiang Cui                                             Girts Strazdins                               Hao Wang
      Department of ICT and Natural                            Department of ICT and Natural                 Department of Computer Science
                Sciences                                                  Sciences                           Norwegian University of Science
      Norwegian University of Science                          Norwegian University of Science                      and Technology
             and Technology                                           and Technology                                    Norway
                Norway                                                    Norway                                    hawa@ntnu.no
         wenqiang.cui@ntnu.no                                          gist@ntnu.no

ABSTRACT                                                                               data items are mapped to lines (or edges) intersecting the axes at
Visual clutter and overplotting are the main challenges for vi-                        their respective values. The embedding of an arbitrary number of
sualizing large multidimensional data in parallel coordinates,                         parallel axes into the plane allows for the simultaneous display of
which greatly hampers the recognition of patterns in the data.                         many dimensions to provide a good overview of the data, which
Although many automatic clustering and edge-bundling methods                           reveals intrinsic patterns and trends. However, when datasets are
have been used in parallel coordinates to reduce visual clutter and                    large, PCPs create visual clutter and overplotting in which lines
overplotting, a scalable, transparent, and interactive approach                        are crossed and plotted on top of one another, overwhelming
that allows analysts to interact with large data and generate                          the display, and obscuring the underlying patterns. This hides
interpretable results of visualization in real time is lacking. To                     information and hampers the recognition of patterns in the data.
solve this problem, we propose an approach, human-in-the-loop                             Edge bundling [7] and automatic data clustering [10] are two
edge bundling, to visually explore and interpret large multidimen-                     widely used approaches to reduce visual clutter and overplotting
sional data in parallel coordinates. This approach combines data                       in PCPs. Edge bundling bends similar lines to the center of vi-
binning-based clustering and density-based confluent drawing,                          sual clutters in groups to create more informative visualizations.
which substantially reduces data processing and rendering time. It                     Automatic data clustering aggregates data points in groups that
provides novel interactions, such as splitting, adjusting, and merg-                   can be visualized in an illustrative fashion using different forms
ing clusters, to integrate human judgment into the edge-bundling                       of edge bundling.
process. These interactions make the underlying clustering trans-                         However, when datasets become large, these methods face
parent to users, which allows users to generate interpretable visu-                    challenges in supporting real-time interactions (limiting the vi-
alization without complex data clustering. The scalability of our                      sual response to a few milliseconds) along with mechanisms for
approach was evaluated through experiments on several large                            information abstraction. Without interactions, these automatic
datasets. The results show that our approach is scalable for large                     methods provide only groups that may contain interesting com-
multidimensional data, which supports real-time interactions                           binations of dimensions and data points, but do not give analysts
on millions of data items in web browsers without hardware-                            control over the data clustering and visualization processes, and
accelerated rendering and big data infrastructure-based data pro-                      do not offer opportunities for analysts to take advantage of their
cessing. We used a case study to highlight the effectiveness of                        judgments and expertise.
our approach. The results show that our approach provides an                              In this study, we propose a web-based visual analytics sys-
interpretable way of visually exploring large multidimensional                         tem that uses data binning-based clustering and density-based
data in parallel coordinates.                                                          confluent drawing to create a new edge-bundling paradigm in
                                                                                       PCPs for large multidimensional data. To the best of our knowl-
KEYWORDS                                                                               edge, this is the first web-based system that supports the HITL
                                                                                       (human-in-the-loop) edge-bundling process in PCPs through
interactive visualization, human-in-the-loop, visual exploration,
                                                                                       specific interactions, such as splitting, adjusting, and merging
multidimensional data, big data, parallel coordinates
                                                                                       clusters of each dimension, for large multidimensional data. The
                                                                                       contributions of this study are as follows:
1    INTRODUCTION
A multidimensional dataset contains numerical or categorical                               • New paradigm for edge bundling in PCP. Our approach
dimensions (or features), with n (n > 3) dimensions and m data                               provides a novel edge-bundling paradigm (HITL edge
items. To avoid confusion, in this paper, a data item is an n-                               bundling) for the visual exploration of large multidimen-
dimensional point, and a data point is the projection of a data                              sional data in PCPs. With the real-time interactions, such
item to a particular dimension. Parallel coordinate plots (PCPs)                             as splitting, adjusting, and merging clusters, it enables
are widely used, and have become a standard tool for visualizing                             analysts to integrate their judgments and expertise into
multidimensional data [6]. In PCPs, axes corresponding to the                                the data clustering and edge-bundling processes of large
number of dimensions are aligned parallel to each other, and                                 multidimensional data.
                                                                                           • Fast, scalable, and transparent edge-bundling algo-
Copyright © 2020 for this paper by its author(s). Published in the Workshop Proceed-
ings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen,             rithm. To support the real-time interactions of large data
Denmark) on CEUR-WS.org. Use permitted under Creative Commons License At-                    in PCPs, we propose a fast, scalable, and transparent edge-
tribution 4.0 International (CC BY 4.0)
                                                                                             bundling algorithm that consists of two parts: 1) a data
EDBT 2020, March 30-April 2, 2020, Copenhagen, Denmark                                     Wenqiang Cui, Girts Strazdins, and Hao Wang


        binning-based clustering method, and 2) density-based           context visualization in PCPs to represent outliers [9]. In this
        confluent drawing.                                              study, we use one-dimensional (1D) binning to cluster data points
      • A web-based visual analytics system. We build a web-            for each dimension with the following three considerations:
        based visual analytics system to support HITL edge bundling         • In PCPs, for a single dimension, the clusters must be or-
        in PCPs for large multidimensional data.                              dered because the data points are ordered.
      • Experiments, and a case study. We conducted experi-                 • A data point belongs to only one cluster.
        ments and a case study on several datasets to highlight             • For large data, to support HITL edge bundling in PCPs, the
        the benefits of HITL edge bundling in PCPs for large mul-             clustering process must be fast, scalable, and transparent
        tidimensional data.                                                   to analysts.
The remainder of this paper is organized as follows: Section 2          With the first and second considerations, for each axis, the data
presents the proposed approach. Section 3 reports the experi-           With the first and second considerations, for each axis, the data
ments and a case study and discusses the results. Section 4 draws       points are binned into ordered and adjacent clusters, as
the conclusions of this study and discusses directions for future       shown in Figure 3. Since a data point belongs to only one cluster,
work.                                                                   there are no overlaps between clusters. This reduces the overplot-
                                                                        ting of clusters in PCPs created by multidimensional clustering
2     SYSTEM AND METHODS                                                axis, the data points are first grouped into the same number of
In this section, we first describe the HITL edge-bundling process       clusters. For a particular axis, the initial clusters have the same
with our system. Then, we introduce the methods used in the             initial diameters. Users then use the control points to split, adjust,
system and the novel interactions provided by the system.               and merge clusters (see Section 2.4), which makes the clustering
                                                                        process transparent for analysts. For an axis with k initial clusters
2.1     System Overview                                                 (the initial value of k is configured by users), the initial diameter
Figure 1 shows the overview of our system. The system first             L is computed as:
visualizes multidimensional data in a classic PCP without edge                       $L = (d_{max} - d_{min}) / k$
bundling. For example, in Figure 1 (A), the Cars dataset [1] is
visualized in a classic PCP without edge bundling. The system           where $d_{max}$ and $d_{min}$ are the maximum and minimum, respectively,
then bundles the edges according to the initial clusters for each       of the data points on the corresponding axis. For an axis, the
dimension as shown in Figure 1 (B). The system supports HITL            initial control points $P_i$ denote the boundaries of clusters, which
edge bundling by allowing analysts to split, adjust, and merge          are computed as:
clusters for each dimension, as shown in Figure 1 (C).                     $P_i = d_{min} + i \times L, \quad i = 1, 2, \ldots, k - 1$
During the HITL edge-bundling process, the system can update
the visualization according to the corresponding interactions           Then, a data point $d$ is grouped into a cluster $C_i$ as:
in real time for large multidimensional data. This makes the
underlying clustering process transparent to analysts. With the            $d \in C_i$ if $\begin{cases} P_{i-1} < d < P_i, & i = 1, 2, \ldots, k - 1 \\ d > P_{i-1}, & i = k \end{cases}$
interactions, analysts can integrate their judgments and expertise
into the edge-bundling process to generate visualizations that          To reveal the internal patterns and distribution of data, we com-
can be better interpreted. For example, in Figure 1 (C), by creating    pute the density of each pair of clusters and use it for density-
an empty cluster that ranges from 6 to 8 and a cluster with 0           based confluent drawing (see Section 2.3). For two adjacent axes
diameter (ranges from 8 to 8) at 8 on the axis cylinders, we found      $axis_n$ and $axis_{n+1}$, a cluster pair $(C^i_{axis_n}, C^j_{axis_{n+1}})$ consists of a
that all cars with eight cylinders in the dataset weighed between       cluster in $axis_n$ and another in $axis_{n+1}$, where $C^i_{axis_n}$ is the i-th
3354 and 5140 kilograms. Moreover, by highlighting the subset           cluster in $axis_n$, and $C^j_{axis_{n+1}}$ is the j-th cluster in $axis_{n+1}$. For
that contains cars with eight cylinders in red, the patterns of         two adjacent axes, an edge containing two data points $(d_n, d_{n+1})$
other features of these cars are clearly highlighted.                   that belongs to a pair of clusters is defined as:
   The foundation of our system is the combination of data binning-
based data clustering and density-based confluent drawing, which           $(d_n, d_{n+1}) \in (C^i_{axis_n}, C^j_{axis_{n+1}})$ if $d_n \in C^i_{axis_n} \wedge d_{n+1} \in C^j_{axis_{n+1}}$
supports the real-time interactions for large multidimensional
data without hardware-accelerated rendering and big data infra-         The density $D^{i,j}$ of a pair of clusters is computed as:
structure-based data processing. Figure 2 shows the workflow of
our system, where the HITL process is highlighted in the dashed            $D^{i,j} = \frac{N(C^i_{axis_n}, C^j_{axis_{n+1}})}{\sum_{i=1} \sum_{j=1} N(C^i_{axis_n}, C^j_{axis_{n+1}})}, \quad n = 1, 2, \ldots$
line rectangle. The system first uses data binning to cluster data
points for each dimension with the default settings. Then the           where $N(C^i_{axis_n}, C^j_{axis_{n+1}})$ is the number of edges that belong to
density of each pair of clusters on two adjacent axes is computed,      the cluster pair $(C^i_{axis_n}, C^j_{axis_{n+1}})$.
and the edges are bundled and rendered through density-based               The clustering process, including computing the clusters and
confluent drawing. Finally, users create a more interpretable vi-       the density of cluster pairs, is linearly dependent on the number
sualization of edge bundling through the interactions, including        of dimensions, the number of data points, and the number of
splitting, adjusting, and merging clusters.
                                                                        clusters (see Section 3.1). This fast and scalable clustering process
2.2     Data Binning-Based Clustering                                   is the basis of real-time interactions (see Section 2.4), which
Data binning groups a number of more or less continuous values          supports HITL edge bundling for large multidimensional data in
into a smaller number of given data intervals (also called "bins") to   PCPs.
transform numerical variables into their categorical counterparts          Categorical variables are not clustered using the above method.
[12]. Multidimensional binning is used to implement focus +             Instead, we treat each category as a cluster.
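
The binning and density computations of Section 2.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's actual implementation: the function names are hypothetical, cluster indices are 0-based, and a data point lying exactly on a control point is assigned to the lower cluster, because the source gives only strict inequalities for the cluster boundaries.

```python
from bisect import bisect_left
from collections import Counter


def initial_control_points(values, k):
    """Return the k - 1 boundaries P_i = d_min + i * L, where L = (d_max - d_min) / k."""
    d_min, d_max = min(values), max(values)
    diameter = (d_max - d_min) / k
    return [d_min + i * diameter for i in range(1, k)]


def cluster_index(d, control_points):
    """Map a data point to the 0-based index of its cluster on one axis.

    Assumption: a point equal to a control point falls into the lower cluster.
    """
    return bisect_left(control_points, d)


def pair_densities(left_values, right_values, left_points, right_points):
    """Density of each non-empty cluster pair between two adjacent axes.

    The density is the number of edges in the pair divided by the total
    number of edges between the two axes.
    """
    counts = Counter(
        (cluster_index(a, left_points), cluster_index(b, right_points))
        for a, b in zip(left_values, right_values)
    )
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


# Made-up values for two adjacent axes, with k = 2 initial clusters per axis.
left = [1, 2, 3, 8, 10]
right = [10, 20, 90, 80, 100]
densities = pair_densities(
    left,
    right,
    initial_control_points(left, 2),
    initial_control_points(right, 2),
)
```

Each data item is visited once per pair of adjacent axes, so the work grows linearly with the number of data items and dimensions; the binary search adds only a logarithmic factor in the number of clusters.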
[Figure 1 image: three parallel coordinate plots (panels A, B, and C) of the Cars dataset with the axes mpg, weight, cylinders, acceleration, displacement, and horsepower. Figure labels: "Initial Edge Bundling" and "Human-in-the-loop Edge Bundling Process".]

Figure 1: Overview of the system that supports HITL edge bundling in PCPs. A. Visualization of the Cars dataset [1] in a classic
PCP. B. Edge bundling of the dataset with 3 initial clusters for each dimension. C. Interpretable edge bundling of the dataset with a
subset highlighted (continuous path over axes) in red, which is generated through user interactions.


[Figure 2 image: workflow diagram with the labels Data, Clustering, Confluent Drawing, Rendering, Interactions, Human Judgment & Expertise, Interpretable Visualization, and Human in the Loop Edge bundling.]
                                                                                                                     [Figure 3 image: labels Cluster, Control Point, Data Point, and Diameter.]


               Figure 2: The workflow of the system.

                                                                                                                     Figure 3: Using 1D binning to cluster data points for each
2.3       Density-based Confluent Drawing                                                                            axis in PCPs. The blue points are data points and the red points
Confluent drawing is a technique for bundling links in node-                                                         are control points. An edge between the axes represents two data
link diagrams. It coalesces groups of lines into common paths                                                        points that belong to two clusters respectively. Elliptical areas
or bundles based on network connectivity to reduce edge clut-                                                        represent clusters in an axis. The initial k is 2. For each axis, the
ter in node-link diagrams [2, 4]. In this study, we use confluent                                                    two initial clusters have the same diameter. The two red clusters
drawing to coalesce edges that belong to a pair of clusters to                                                       form a pair of clusters. Its density is 0.4.
reduce visual clutter in PCPs, where we use the clusters as nodes
and edges between them as links. Each pair of clusters then has
only one bundled edge, which is shown in Figure 4. This elimi-
nates the occlusion and ambiguity near the bundle joints created
by bundling techniques that bundle edges by spatial proximity.                                                       where Wmax is the width of a bundle with the density of one.
More importantly, it reduces rendering time by coalescing edges,                                                     Wmax is a constant and is configured by users.
which supports real-time interactions for HITL edge bundling of                                                         To guarantee C 1 -continuity across axes, we draw bundles as
large multidimensional data in PCPs.                                                                                 Bézier curves. Figure 4 shows the bundled edge of a pair of clus-
   To reveal the information hidden by coalescing of the edges                                                       ters. Between two adjacent axes, the width of a bundle represents
and the distribution of the data points between axes, we use the                                                     the proportion of the data points (coalesced edges) that belong to
density $D^{i,j}$ of a pair of clusters $(C^i_{axis_n}, C^j_{axis_{n+1}})$ to define the                                   the corresponding cluster pair. This reveals the trend and distribu-
width $W_{i,j}$ of the coalesced bundle as follows:                                                                  tion of the data items as well as outliers in large multidimensional
                                   $W_{i,j} = D^{i,j} \times W_{max}$                                                data in PCPs (see Section 3.2).
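
The width rule and the Bézier rendering of Section 2.3 can be illustrated with a short Python sketch that emits SVG path data. It is an assumption-laden example rather than the system's rendering code: the function names are hypothetical, and placing both inner control points at the horizontal midpoint between the axes is one common way to obtain horizontal tangents at the axes, which the source does not specify.

```python
def bundle_width(density, w_max):
    """Width of a coalesced bundle: W = D * W_max."""
    return density * w_max


def bundle_path(x0, y0, x1, y1):
    """SVG path data for a cubic Bezier curve between two cluster centers.

    Assumption: both inner control points sit at the horizontal midpoint,
    so the curve leaves and enters the axes with a horizontal tangent.
    """
    mid = (x0 + x1) / 2
    return f"M {x0},{y0} C {mid},{y0} {mid},{y1} {x1},{y1}"


# A cluster pair with density 0.4, drawn between axes at x = 0 and x = 100.
width = bundle_width(0.4, 20)
path = bundle_path(0, 10, 100, 50)
```

Because each pair of clusters yields exactly one path, the number of drawn curves depends on the number of non-empty cluster pairs rather than on the number of data items.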


[Figure 4 image: a Bézier curve connecting two clusters.                          two new clusters. In Figure 5, the red dashed line circle on
Labels: Cluster, Cluster Center, Control Point, Width.]                           Axis A is a newly added control point by double-clicking.
                                                                                  • Adjust clusters. All control points can be dragged along
                                                                                    the axes. Dragging a control point to a new position ad-
                                                                                    justs the boundaries and the diameters of the two adjacent
                                                                                    clusters. Figure 5 shows dragging the control point on Axis
                                                                                    B to a new position (red dashed line circle on Axis B).
                                                                                  • Merge clusters. All control points can be double-clicked
                                                                                    to be deleted. The two adjacent clusters of the deleted
                                                                                    control point are merged into a new cluster.
                                                                                  • Highlight bundles over axes. Hovering the pointer over
                                                                                    a bundle highlights it and its related bundles in red. Only
                                                                                    bundles with a density greater than a threshold will be
Figure 4: Using the density-based confluent drawing to                              highlighted. The threshold is a constant and is configured
bundle the edges that belong to a pair of clusters. For a                           by users.
pair of clusters, the bundled edge is rendered as a Bézier curve                  • Re-order axes. The labels of axes can be dragged to the
that starts from the center of a cluster and ends at the center of                  front or back of other labels to re-order them to the corre-
another. Its width represents the density of the cluster pair.                      sponding positions.

                                                                            3     EVALUATION
                                                                            In this section, we evaluate the scalability and the effectiveness
                                Bundle                                      of our system through experiments and a case study on the Office
                                Mouseover
      Axis Area                                                             Occupancy Detection dataset [3] and the Cars dataset [1].
      Double Click

                                                                            3.1     Experiments
                                                     Control Point          To examine the scalability of our system, we synthesized several
                                                     Double Click & Drag    large datasets based on the office dataset. All experiments were
                                                                            conducted on the same laptop without big data infrastructure-
                                                                            based data processing and hardware-accelerated rendering.
                                                                               In our system, the HITL edge-bundling process contains two
                                                                            time-consuming processes: the data binning-based clustering and
  Label Area
  Drag                                                                      the density-based confluent drawing (rendering process). We first
                                                                            performed a run time analysis of the clustering process. Table
                     Axis A                 Axis B                 Axis A   1 shows the run times (measured by the second) of the cluster-
                                                                            ing process on large multidimensional datasets (with different
                                                                            number of dimensions, data points, and clusters). According to
Figure 5: Interactions provided by our system for support-                  Table 1, the computation time of data binning-based clustering is
ing HITL edge bundling. Double click on the axis area to add                linearly dependent on the number of dimensions, the number of
a control point to split a cluster. Double click on a control point         data points, and the number of clusters. More importantly, this
to delete it to merge two clusters. Drag a control point along an           data binning-based clustering is much faster than other cluster-
axis to adjust the adjacent clusters. Mouseover on a bundle to              ing algorithms used for bundling edges in PCPs. For example,
highlight a subset with color. Drag an axis label to re-order the           Palmas et al. [10] used a density-based clustering method for
axes.                                                                       each dimension independently to bundle edges in PCPs, which
                                                                            takes approximately 60 seconds to cluster 105 data points for
                                                                            one dimension. By contrast, our clustering method takes approx-
2.4      Interactions for HITL Edge Bundling                                imately 1 seconds to cluster 106 data points for four dimensions.
In our system, in addition to common interactions in PCPs such
as re-ordering the axes and brushing (highlighting) [11], we use               We then examined the efficiency of the rendering process by
specifically designed interactions to allow users to split, adjust,         comparing the rendering time of our method with both the clas-
and merge clusters. Our system updates the visualization ac-                sic PCP and Lima et al.’s edge-bundling PCP [5] that also uses
cording to user interactions in real time, which is the key to              confluent drawing to coalesce edges. To compare the rendering
implement the HITL edge bundling process. These interactions                time, all three PCPs were implemented with the same JavaScript
are supported by the combination of the data binning-based clus-            library (D3.js) and rendered in Chrome. The times needed for
tering and the density-based confluent drawing. Figure 5 shows              rendering the axes, labels, and stickers were not included, which
the interactions provided by our system, which are described as             are constant regardless of the number of data points. Table 2
follows:                                                                    shows the rendering time of the three methods (measured by the
     • Split a cluster. Each axis has a clickable area (called axis         second) on the datasets that has six dimensions and the different
       area) around it, which is shown as gray rectangle area               numbers of data items. For our method and [5], each dimension
       around Axis A in Figure 5. Double-clicking on this area              has 3 clusters. According to Table 2, the classic PCP and [5] take
       adds a new control point to the corresponding position on            1.7672 and 3.6989 seconds to visualize 105 data points. The clas-
       the axis. This control point splits the original cluster into        sic PCP takes 8.7183 seconds to visualize 5 × 105 data points
and crashes the browser when visualizing 10^6 data points. The method [5] crashes the browser when
visualizing 5 × 10^5 data points. By contrast, the rendering time of our method is independent of
the number of data points and is approximately 0.002 seconds for each dataset.

Table 1: Run-time analysis of the data binning-based clustering (run times in seconds)

       Dimensions     Data Points    Clusters    Run-time
             2            10^4           3         0.0169
             2            10^4           4         0.0167
             3            10^4           3         0.0230
             3            10^4           4         0.0277
             2            10^5           3         0.0505
             2            10^5           4         0.0554
             3            10^5           3         0.0937
             3            10^5           4         0.0996
             4            10^5           3         0.1175
             4            10^5           4         0.1404
             4            10^5          10         0.2574
             4            10^5          20         0.4139
             4            10^5          30         0.5495
             4            10^5          40         0.6892
             4            10^5          50         0.8872
             4            10^6           3         0.8211
             4            10^6           4         0.9398

Table 2: Comparison of the rendering time (in seconds)

       Data Points    Our Method     Classic PCP      [5]
           10^3         0.00243         0.0273       0.0503
           10^4         0.00231         0.1916       0.3740
           10^5         0.00230         1.7672       3.6989
         5 × 10^5       0.00229         8.7183        N/A
           10^6         0.00248          N/A          N/A

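The run times above can be related to the structure of the method with a small sketch. The Python below is an illustration under stated assumptions, not the authors' implementation (which is web-based): it bins each dimension into equal-width clusters, counts the data points in each cluster pair between two adjacent axes, and scales the resulting densities by a maximum width as in W_{i,j} = D_{i,j} × W_max. The function names, the parameter `w_max`, and the definition of density as the fraction of all data points that fall in a cluster pair are assumptions made for this example.

```python
from bisect import bisect_right
from collections import Counter


def equal_width_boundaries(values, k):
    """Return the k - 1 interior boundaries of k equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + step * i for i in range(1, k)]


def assign_clusters(values, boundaries):
    """Map each value to the index of its bin (a single pass over the data)."""
    return [bisect_right(boundaries, v) for v in values]


def pair_densities(left_ids, right_ids):
    """Assumed density: share of all points in cluster i (left axis) and cluster j (right axis)."""
    n = len(left_ids)
    counts = Counter(zip(left_ids, right_ids))
    return {pair: count / n for pair, count in counts.items()}


def bundle_widths(densities, w_max):
    """Scale each density by the maximum bundle width (hypothetical parameter w_max)."""
    return {pair: d * w_max for pair, d in densities.items()}


# Toy data for two adjacent axes, ten data points each.
axis_a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
axis_b = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

ids_a = assign_clusters(axis_a, equal_width_boundaries(axis_a, 2))
ids_b = assign_clusters(axis_b, equal_width_boundaries(axis_b, 2))
densities = pair_densities(ids_a, ids_b)
widths = bundle_widths(densities, w_max=20)
print(densities)
print(widths)
```

In this sketch the binning and counting steps touch every data point once per dimension, whereas the number of bundles to draw between two axes is bounded by the product of their cluster counts. That is one plausible reading of why the rendering times in Table 2 stay flat as the number of data points grows.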
3.2    Case Study
To assess the effectiveness of our system, we compared our method with the classic PCP and several
algorithmic analysis methods on the office dataset. The office dataset uses the data on
temperature, humidity, light, and CO2 to detect the occupancy of an office room. It has five
dimensions and 20,560 data points for each dimension.
   Figure 6 shows the visualization of the office dataset in the classic PCP and our system.
Figure 6c shows the visualization in our system, which is generated by a user who does not have
knowledge of the dataset. In Figure 6b and Figure 6c, the red bundles are the subsets highlighted
by hovering the pointer on the widest bundle between the axes of light and occupancy. The
extremely narrow bundles (data points with extremely low densities) are drawn as dashed lines to
detect and highlight the outliers (rare data points that raise suspicions by differing
significantly from the majority of the data [8]) in the dataset. By comparing Figure 6a and
Figure 6c, it is clear that for large multidimensional datasets, our method reduces the visual
clutter and overplotting in the classic PCP and reveals the patterns in the data.

Figure 6: The visualization of the office dataset in the classic PCP and our system.
(a) Visualization of the office dataset in the classic PCP. (b) Visualization of the office
dataset in our system with 4 initial clusters for each dimension. (c) Visualization of the office
dataset in our system generated by a user who does not have knowledge of the dataset.

   Moreover, by integrating human judgments into the edge-bundling process, our method creates an
interpretable visualization in PCPs for the office dataset. For example, during the HITL
edge-bundling process (from Figure 6b to Figure 6c), the user obtained the following findings:
      • Finding 1. The dataset contains outliers, which are highlighted by the dashed lines in
        Figure 6c.
      • Finding 2. When the value of light is smaller than 354 Lux, the room is considered
        unoccupied. When it is between 354 and 1131 Lux, the room is considered occupied.
        The accuracy of this estimation is higher than 90% (the

        estimated sum of the densities of the two widest bundles between the axes of light and
        occupancy).
      • Finding 3. When the temperature is between 19 and 22 ℃, the room is considered
        unoccupied. When the temperature is higher than 22 ℃, the room is considered occupied.
        The accuracy of this estimation is higher than 80% (the estimated sum of the densities of
        the two widest bundles between the axes of temperature and light).
      • Finding 4. Using all features may reduce the accuracy of prediction. Humidity has a much
        weaker correlation with occupancy than other features.

   Candanedo and Feldheim tested linear discriminant analysis, classification and regression
trees, and random forest on the office dataset to detect the occupancy of rooms [3]. In Table 3,
we compared the findings obtained in our system with those obtained in [3] on the office dataset.
It shows that our system obtained more findings about the data than the algorithmic methods
in [3]. We also compared the interpretability of our system with that of the algorithmic methods
in [3]. It shows that, by avoiding the black-box process of training models and integrating human
judgments into the edge-bundling process, our system is more interpretable. Moreover, our system
can obtain the result faster by eliminating the time needed to train the models.

Table 3: Comparison of our system with the algorithmic methods in [3].

 Criteria            Our Method                                  [3]
 Finding 1           Yes                                         No
 Finding 2           Yes                                         Yes
 Finding 3           Yes                                         Yes
 Finding 4           Yes                                         Yes
 Interpretability    Interpretable visualization with            Black-box process of training
                     transparent clustering process.             the models.
 Processing time     Real-time.                                  Time for training and
                                                                 selecting models.

3.3     Discussion
Our approach uses data binning to create initial clusters for each dimension. For a particular
dimension, it divides the entire range of values into a series of consecutive, non-overlapping,
equal-size intervals (clusters/bins). By computing the density of cluster pairs, our approach
counts the number of data points in each cluster, which is represented by the total width of the
bundled edges starting from the cluster. Therefore, the initial clustering result in our approach
is an adapted histogram for each dimension. With an appropriate initial number of clusters, it can
accurately capture the distribution of data points for each dimension. This is the basis for users
to apply their judgment and expertise in the edge-bundling process and generate an interpretable
visualization. With HITL edge bundling, to obtain the final interpretable visualization, for
example, from Figure 6b to Figure 6c, users may need several iterations to adjust the initial
clusters for each dimension, such as merging a cluster with small density into an adjacent
cluster, or splitting a cluster with large density to obtain more details of the data. This
process may take 1 or 2 minutes. However, during this process, users can continuously gain
insights from data and visualization.

4     CONCLUSION AND FUTURE WORK
In this study, we proposed HITL edge bundling and built a system based on it to support the visual
exploration of large multidimensional data in PCPs. The system provides an interpretable
visualization, which reduces the visual clutter and overplotting, and eliminates the occlusion and
ambiguity of large multidimensional data in PCPs. More importantly, the system provides
specifically designed interactions, including splitting, adjusting, and merging clusters, to
integrate human judgments into the edge-bundling process in real time. We evaluated the
scalability and effectiveness of the system through experiments and a case study. We compared our
system with the classic PCP and the algorithmic analysis methods. The results show that our system
provides a scalable and interpretable way of visually exploring large multidimensional data in
PCPs.
   Anchoring bundled edges in different positions, such as the mean/centroid position of all data
points in a cluster, could be investigated in the future to improve the continuity across axes and
reveal more information about clusters. This requires more computation and may delay the visual
response of the interactions. The interactions and color effects (highlighting subsets in
different colors) of the system have not been fully evaluated. This can be done in a qualitative
user study in future work.

REFERENCES
 [1] 2005. Cars DataSet. Retrieved September 20, 2019 from
     http://davis.wpi.edu/xmdv/datasets/cars.html
 [2] B. Bach, N. H. Riche, C. Hurter, K. Marriott, and T. Dwyer. 2017. Towards Unambiguous Edge
     Bundling: Investigating Confluent Drawings for Network Visualization. IEEE Transactions on
     Visualization and Computer Graphics 23, 1 (Jan 2017), 541–550.
     https://doi.org/10.1109/TVCG.2016.2598958
 [3] Luis M. Candanedo and Véronique Feldheim. 2016. Accurate occupancy detection of an office
     room from light, temperature, humidity and CO2 measurements using statistical learning
     models. Energy and Buildings 112 (2016), 28–39. https://doi.org/10.1016/j.enbuild.2015.11.071
 [4] Matthew Dickerson, David Eppstein, Michael T. Goodrich, and Jeremy Y. Meng. 2005. Confluent
     Drawings: Visualizing Non-planar Diagrams in a Planar Way. Journal of Graph Algorithms and
     Applications 9, 1 (2005), 31–52. https://doi.org/10.7155/jgaa.00099
 [5] Rodrigo Santos do Amor Divino Lima, Carlos Gustavo Resque dos Santos, Sandro de Paula
     Mendonça, Jefferson Magalhães de Morais, and Bianchi Serique Meiguins. 2018. Understanding
     Data Dimensions by Cluster Visualization Using Edge Bundling in Parallel Coordinates
     (SAC '18). ACM, New York, NY, USA, 640–647. https://doi.org/10.1145/3167132.3167203
 [6] Julian Heinrich and Daniel Weiskopf. 2013. State of the Art of Parallel Coordinates. In
     Eurographics 2013 - State of the Art Reports, M. Sbert and L. Szirmay-Kalos (Eds.). The
     Eurographics Association. https://doi.org/10.2312/conf/EG2013/stars/095-116
 [7] D. Holten. 2006. Hierarchical Edge Bundles: Visualization of Adjacency Relations in
     Hierarchical Data. IEEE Transactions on Visualization and Computer Graphics 12, 5
     (Sep. 2006), 741–748. https://doi.org/10.1109/TVCG.2006.147
 [8] Ling Liu and M. Tamer Özsu. 2009. Encyclopedia of Database Systems (1st ed.). Springer
     Publishing Company, Incorporated.
 [9] M. Novotny and H. Hauser. 2006. Outlier-Preserving Focus+Context Visualization in Parallel
     Coordinates. IEEE Transactions on Visualization and Computer Graphics 12, 5 (Sep. 2006),
     893–900. https://doi.org/10.1109/TVCG.2006.170
[10] G. Palmas, M. Bachynskyi, A. Oulasvirta, H. P. Seidel, and T. Weinkauf. 2014. An
     Edge-Bundling Layout for Interactive Parallel Coordinates. In 2014 IEEE Pacific Visualization
     Symposium. 57–64. https://doi.org/10.1109/PacificVis.2014.40
[11] R. C. Roberts, R. S. Laramee, G. A. Smith, P. Brookes, and T. D'Cruze. 2019. Smart Brushing
     for Parallel Coordinates. IEEE Transactions on Visualization and Computer Graphics 25, 3
     (March 2019), 1575–1590. https://doi.org/10.1109/TVCG.2018.2808969
[12] Bernard W Silverman. 2018. Density estimation for statistics and data analysis. Routledge.