Eliciting Structure in Data

Anders Holst, RISE SICS, Sweden, anders.holst@ri.se
Ahmad Al-Shishtawy, RISE SICS, Sweden, ahmad.al-shishtawy@ri.se
Juhee Bae, University of Skövde, Sweden, juhee.bae@his.se
Mohamed-Rafik Bouguelia, CAISR, Halmstad, Sweden, mohbou@hh.se
Göran Falkman, University of Skövde, Sweden, goran.falkman@his.se
Sarunas Girdzijauskas, RISE SICS, Sweden, sarunasg@kth.se
Olof Görnerup, RISE SICS, Sweden, olof.gornerup@ri.se
Alexander Karlsson, University of Skövde, Sweden, alexander.karlsson@his.se
Sławomir Nowaczyk, CAISR, Halmstad, Sweden, slawomir.nowaczyk@hh.se
Sepideh Pashami, CAISR, Halmstad, Sweden, sepideh.pashami@hh.se
Alan Said, University of Skövde, Sweden, alansaid@acm.org
Amira Soliman, RISE SICS, Sweden, aaeh@kth.se

ABSTRACT
This paper demonstrates how to explore and visualize different types of structure in data, including clusters, anomalies, causal relations, and higher-order relations. The methods are developed with the goal of being as automatic as possible and applicable to massive, streaming, and distributed data. Finally, a decentralized learning scheme is discussed, which enables finding structure in the data without collecting the data centrally.

CCS CONCEPTS
• Human-centered computing → Visualization; Graphical user interfaces; • Mathematics of computing → Causal networks; • Theory of computation → Unsupervised learning and clustering; • Computing methodologies → Anomaly detection.

KEYWORDS
Information Visualization; Clustering; Anomaly Detection; Causal Inference; Higher-Order Structure; Distributed Analytics.

ACM Reference Format:
Anders Holst, Ahmad Al-Shishtawy, Juhee Bae, Mohamed-Rafik Bouguelia, Göran Falkman, Sarunas Girdzijauskas, Olof Görnerup, Alexander Karlsson, Sławomir Nowaczyk, Sepideh Pashami, Alan Said, and Amira Soliman. 2019. Eliciting Structure in Data. In Joint Proceedings of the ACM IUI 2019 Workshops, Los Angeles, USA, March 20, 2019, 4 pages.

IUI Workshops'19, March 20, 2019, Los Angeles, USA. © 2019 Copyright held for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

1 INTRODUCTION
Information visualization (IV) is concerned with the transformation of abstract data, information, and knowledge into interactive visual representations that amplify human cognition and assist humans in solving problems. IV has proven effective for presenting important information and for making complex analyses in solving big-data problems [7]. In contrast to traditional IV, in Visual Analytics (VA) the user is not just a consumer of information who interprets and acts on visually presented results, but the driving force of the whole problem-solving process, from problem formulation, data selection and cleaning, over hypothesis generation and testing, model building and steering computations, to interpreting results and communicating insights. The scope of VA is very broad, and the area has grown to encompass a wide array of topics. A unifying principle is the need for systems to leverage computational methods in data mining and machine learning for big-data analysis. In a VA system, the human user and the computational process operate in coordination in an integrated fashion, forming a continuous and cooperative problem-solving loop [3].

Here, we focus on understanding the underlying structure of real-world data sets before applying any artificial intelligence or machine learning algorithm. Today, however, making sense of data is hard due to the high-dimensional nature of real-world data sets. It is therefore important to find different kinds of structure in the data as automatically as possible, in order to facilitate information visualization.

The endeavour is similar to that of the Automatic Statistician [14] in that we want a data set to be automatically analyzed from several different aspects. However, while much of the effort around the Automatic Statistician concerns characterizing time series or predicting an output variable from inputs, we focus on unsupervised characterization of the data set as a whole. Also, rather than producing a fixed report, we focus on interactive and exploratory capabilities.

2 STRUCTURE IN THE DATA
"Structure" can mean many different things. The kind of structure we focus on here can be characterized as either horizontal or vertical properties of the data. Considering a (tabular) data set with samples row-wise and features column-wise, properties describing how samples relate to each other can be considered horizontal, whereas properties describing how features relate to each other are vertical. Examples of the former kind of structure that we focus on are clusters among the samples in the data, and anomalous samples or groups of samples. Examples of the latter kind are relations between the features, such as statistical correlations, causal relations, and higher-order similarity relations.

Clusters
Unsupervised clustering of data is a frequently occurring task in statistical machine learning. Real-world data is seldom completely homogeneous, but usually contains clusters representing different varieties of the entities under study [2]. The reasons to detect such clusters are many: clusters may represent a useful categorization of the data that is valuable to discern; the significance and value of correlations between attributes may be misleading when there are clusters; or, simply as a mathematical device, a complex distribution can be approximated by a sum of simpler distributions. At the same time, it is hard to reliably detect clusters. Most algorithms are "unstable" in the sense that different random starting points of the iteration will lead to very different clustering results. It is also inherently hard to find the "best" number of clusters. Furthermore, in streaming data the clusters may not even be fixed, but change over time. Finally, in many situations it is not even possible to objectively measure what the best clustering is, since there may be many equally good alternatives.
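To make the mixture view concrete, the following minimal sketch (an illustration only, not one of the modules described in this paper; it assumes numpy and scikit-learn are available) fits Gaussian mixtures to synthetic data. It shows both how a complex distribution can be approximated by a sum of simpler (Gaussian) distributions and how the result depends on the random initialization and on the assumed number of clusters.

```python
# Minimal sketch: clusters as a sum of simpler (Gaussian) distributions,
# illustrating seed sensitivity and model selection by BIC.
# Assumes numpy and scikit-learn; not the clustering module of this paper.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data with three overlapping clusters.
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.8, size=(200, 2)),
    rng.normal(loc=(4, 1), scale=0.8, size=(200, 2)),
    rng.normal(loc=(2, 4), scale=0.8, size=(200, 2)),
])

# Different random initializations can yield different solutions.
for seed in (0, 1, 2):
    gmm = GaussianMixture(n_components=3, n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}  average log-likelihood={gmm.score(X):.3f}")

# Choosing the "best" number of clusters is itself hard; BIC is one heuristic.
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(f"k={k}  BIC={gmm.bic(X):.1f}")
```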
Anomalies
In many domains, various units, such as vehicles and vessels, generate data over time. To diagnose a unit A at each time period T as anomalous or not, the data for A needs to be compared, in an online fashion, against the data from other similar groups of units over the same time period. Various representations or features can be extracted from each of these units, capturing different kinds of anomalies. Some anomalies may be relevant to inspect further, whereas others can be regarded as irrelevant from the human operator's perspective. However, it is not easy to determine whether an event is anomalous or not, and a human operator often needs to be involved in the process. With the help of a human operator, various deviation levels can be inspected, enriching the view of the potentially important anomalies.
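As a simple illustration of this kind of peer-group comparison, the sketch below flags a unit whose feature value in a given time period deviates strongly from the other units in the same period. The robust z-score and the threshold are illustrative assumptions, not the detection criteria used in our modules.

```python
# Minimal sketch: flag unit A as anomalous in a time period if its feature
# value deviates strongly from the other units in the same period.
# The robust z-score and the threshold of 3 are illustrative assumptions.
import numpy as np

def peer_group_anomaly(values_by_unit, unit, threshold=3.0):
    """values_by_unit: dict mapping unit id -> feature value for one period."""
    peers = np.array([v for u, v in values_by_unit.items() if u != unit])
    median = np.median(peers)
    mad = np.median(np.abs(peers - median)) + 1e-9   # robust spread estimate
    z = 0.6745 * (values_by_unit[unit] - median) / mad
    return abs(z) > threshold, z

# One time period: e.g. fuel consumption for a small fleet of vehicles.
period = {"A": 41.0, "B": 30.2, "C": 29.5, "D": 31.1, "E": 30.7}
is_anomalous, score = peer_group_anomaly(period, "A")
print(f"unit A anomalous: {is_anomalous} (robust z = {score:.1f})")
```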
Causal structure
For systems that not only learn from data but actually act based on the resulting knowledge, it becomes increasingly important to identify cause and effect. This is a very challenging task since, in the general case, it is easy to discover correlations between the attributes in the data, but impossible to distinguish which is the cause and which is the effect based only on observational data. There are, however, in many cases several hints in the data that can potentially be used. We propose a visualization tool which incrementally and gradually discovers causal relations between features as more data becomes available. The results are shown in the form of a causal graph [4], where the strength and uncertainty of causal and correlation links are marked by color and shading [1].

Higher-Order Structures
Appropriately representing data with regard to the features relevant for the task at hand is often critical when applying machine learning. Specifically, by building hierarchical, or higher-order, representations, we are able to capture multiple abstraction levels in data, examine information from different perspectives, and thereby capture various aspects of it. Such representations are applicable both to solving machine learning tasks at the appropriate granularity level and to performing exploratory data analysis on and across multiple levels of abstraction. In this context, we have proposed a fast and lightweight demonstrator estimating word similarities from text by explicitly counting second-order co-occurrences [5]. It visualizes the similarity between words, in a single pass over the data, in the form of a tree that starts from the leaves and builds towards the root.
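The following simplified sketch conveys the idea of second-order similarity: two words are considered similar when they co-occur with similar context words. It builds first-order co-occurrence vectors offline and compares them with cosine similarity; the demonstrator in [5] instead counts second-order co-occurrences explicitly in a single pass over the data, so this is only an expository stand-in.

```python
# Minimal sketch of second-order similarity: two words are similar if they
# co-occur with similar sets of context words. This offline version builds
# first-order co-occurrence vectors and compares them with cosine similarity;
# it is a simplified stand-in, not the single-pass streaming algorithm of [5].
from collections import Counter, defaultdict
from math import sqrt

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog played outside",
]

window = 2
cooc = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooc[w][tokens[j]] += 1

def cosine(a, b):
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print("sim(cat, dog) =", round(cosine(cooc["cat"], cooc["dog"]), 3))
print("sim(cat, mat) =", round(cosine(cooc["cat"], cooc["mat"]), 3))
```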
Visualization and Interaction
We have so far implemented separate modules for all the above analyses. It remains to make the different analyses accessible through a combined tool, to visualize and explore them simultaneously. Figure 1 illustrates one possible way of doing this. (The pictures do not represent any real data, but are manually constructed to illustrate the principles.)

[Figure 1: A mock-up illustration of horizontal and vertical properties of the data, and a possible way to combine them in the same view. (a) Horizontal properties; (b) Vertical properties; (c) Combined view of structure.]

The horizontal properties relating samples to each other as described above, i.e. clusters and anomalies/outliers, are very natural to display in the same view (Figure 1a), by showing each cluster in a different color and then assigning the outliers a (markedly) different color or symbol to make them stand out. Here this is illustrated by a scatter plot, but the same principle also works with time series.

The vertical properties relating features to each other, i.e. correlations, causal relations and higher-order relations, can also be visualized in the same view (Figure 1b). Different shapes and colors can be used for the different types of relations. Here, direct correlations (without established causal direction) are shown in blue, links with established causal direction as narrowing links in purple, and links representing higher-order similarity in hatched orange.

Our suggestion for combining the horizontal and vertical views into one (Figure 1c) is to start with the vertical view showing all features and their relations, and interactively pop up the horizontal properties, as scatter plots or time series, for those features or links that the user wishes to explore further. In this way, all the data and its different types of structure are made readily accessible to the user to explore.
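A minimal sketch of how views of the kind in Figure 1 could be drawn with standard Python tools (matplotlib and networkx) is given below. The data, feature names, and styling are illustrative assumptions, not output from our tool.

```python
# Minimal sketch of the two views in Figure 1: (a) samples colored by cluster
# with outliers marked separately, and (b) a feature-relation graph where edge
# color encodes the relation type. Mock data and styling are illustrative only.
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np

rng = np.random.default_rng(1)
fig, (ax_a, ax_b) = plt.subplots(1, 2, figsize=(10, 4))

# (a) Horizontal view: clusters and outliers.
clusters = [rng.normal((0, 0), 0.5, (80, 2)), rng.normal((3, 2), 0.5, (80, 2))]
outliers = np.array([[5.0, -1.0], [-2.0, 3.0]])
for points, color in zip(clusters, ("tab:blue", "tab:green")):
    ax_a.scatter(points[:, 0], points[:, 1], c=color, s=10)
ax_a.scatter(outliers[:, 0], outliers[:, 1], c="tab:red", marker="x", s=60)
ax_a.set_title("(a) clusters and outliers")

# (b) Vertical view: relations between features, edge color by relation type.
G = nx.Graph()
G.add_edge("speed", "fuel", kind="correlation")
G.add_edge("load", "fuel", kind="causal")
G.add_edge("temp_in", "temp_out", kind="higher-order")
palette = {"correlation": "tab:blue", "causal": "tab:purple", "higher-order": "tab:orange"}
colors = [palette[G[u][v]["kind"]] for u, v in G.edges()]
pos = nx.spring_layout(G, seed=1)
nx.draw_networkx(G, pos, ax=ax_b, edge_color=colors, width=2, node_color="lightgray")
ax_b.set_title("(b) feature relations")
plt.tight_layout()
plt.show()
```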
3 SHARING KNOWLEDGE WITHOUT SHARING DATA
The predominant way of using machine learning involves collecting all the data in a data center, where the model is then trained on powerful servers. However, this data collection process is often privacy-invasive. Users may not want to share private data with companies, making it difficult to use machine learning in some situations. For example, users generate vast amounts of data through interaction with the applications installed on their mobile devices. This data is often deeply private in nature and should not be shared completely with a remote server. Even when privacy is not a concern, having to collect the data in one central storage can be unfeasible. For example, self-driving cars generate too much data to be able to send all of it to a server.

Federated Learning provides a distributed and privacy-friendly approach by allowing users to train models locally on their sensitive data and to communicate intermediate model updates to a central server, without the need to store the data centrally [8, 10, 13]. Specifically, users start by contacting the central server and downloading the learning algorithm and a global model that is common to all users. The algorithm trains the model locally on each device using the user's data and computes an update to the current global model. Afterwards, the new updates to the learning parameters obtained on each user's device are sent to the central server for an aggregation step. The server integrates the new learning outcomes as if it had been trained directly on the data, and sends the aggregated global model back to each user. This distributed approach to model computation removes the need to store the data centrally; computation becomes distributed among the users, and their personal data never leaves their devices.

In this context, we present a decentralized learning scheme that removes the dependency on a central aggregating authority, in order to offer a more scalable Federated Learning approach. Our proposed scheme organizes user devices in a Peer-to-Peer (P2P) network, allowing machine learning tasks to cooperate with each other using gossip protocols [12]. Furthermore, our learning scheme operates as a decentralized learning and knowledge sharing approach that considers how strongly each local update is backed by a representative volume of data. In particular, our scheme integrates HyperLogLog [6] as a cardinality estimation mechanism to keep track of the number of data items used in generating and incrementally updating the models exchanged among participating devices. Figure 2 shows performance analysis results for our scheme (Dec-HLL) compared to two state-of-the-art techniques: a federated learning scheme (Fed-PS) [9] and gossip learning (Golf) [12]. As shown, our scheme achieves higher accuracy than the P2P learning scheme adopted in Golf; in particular, our accuracy is much closer to the accuracy of Fed-PS when the data distribution is very skewed.

[Figure 2: Comparative analysis of performing a clustering task using our proposed scheme (Dec-HLL), a federated learning framework (Fed-PS), and a gossip learning framework (Golf). The plot shows recall versus number of rounds for Fed-PS, Dec-HLL, Dec-HLL-Skewed, Golf, and Golf-Skewed. Fed-PS achieves the highest performance as a result of performing model aggregation centrally. Yet, our scheme achieves higher performance than the P2P learning scheme used in Golf, and the performance gap increases when the data distribution is very skewed.]
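The sketch below illustrates only the core idea of weighting model exchanges by the amount of data behind each model. It uses a toy mean-vector "model" and exact sample counts in place of the HyperLogLog estimates, and it is not the implementation evaluated in Figure 2.

```python
# Minimal sketch of cardinality-weighted model exchange in a P2P setting.
# Each peer keeps a toy "model" (a mean vector) plus the number of samples
# behind it, and an exchange merges two models weighted by those counts.
# Exact counters stand in for the HyperLogLog estimates used in our scheme
# (HyperLogLog sketches can be merged to estimate the size of the union of
# the underlying data sets, which avoids double counting when the same data
# indirectly reaches a peer several times). This is an illustration only,
# not the implementation evaluated in Figure 2.
import numpy as np

class Peer:
    def __init__(self, data):
        self.n = len(data)                 # how much data backs the model
        self.model = data.mean(axis=0)     # toy model: a mean vector

    def receive(self, other_model, other_n):
        """Merge an incoming model, weighted by the data volume behind it."""
        total = self.n + other_n
        self.model = (self.n * self.model + other_n * other_model) / total
        self.n = total

rng = np.random.default_rng(0)
# Skewed data volumes: one peer holds far more data than the others.
volumes = [5, 20, 500, 50, 8]
peers = [Peer(rng.normal(0.5, 1.0, size=(n, 3))) for n in volumes]

# A single gossip-style push from the data-rich peer to a data-poor peer:
# the merged model is dominated by the better supported update.
peers[0].receive(peers[2].model, peers[2].n)
print("peer 0 model after exchange:", np.round(peers[0].model, 3))
print("backed by", peers[0].n, "samples")
```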
4 A PLATFORM FOR ELICITING STRUCTURE IN DATA
We have taken the first steps towards implementing a platform to simplify the exploratory data analysis and visual analytics discussed above. The purpose is to be able to get a first quick overview of various useful structures in a data set. To make it generally applicable, we base it on the existing big-data platform HOPS [11], develop the code in Python, and do the interaction and visualization through Jupyter notebooks. This makes development flexible and easy, and increases the chances of the platform being useful for others. The intention is to publish the entire package as open source.

5 ACKNOWLEDGEMENTS
This research has been conducted within the "A Big Data Analytics Framework for a Smart Society" (BIDAF) project (http://bidaf.sics.se/), supported by the Swedish Knowledge Foundation.

REFERENCES
[1] Juhee Bae, Tove Helldin, and Maria Riveiro. 2017. Understanding Indirect Causal Relationships in Node-Link Graphs. Computer Graphics Forum (2017). https://doi.org/10.1111/cgf.13198
[2] Mohamed-Rafik Bouguelia, Alexander Karlsson, Sepideh Pashami, Sławomir Nowaczyk, and Anders Holst. 2018. Mode Tracking Using Multiple Data Streams. Inf. Fusion 43, C (Sept. 2018), 33–46. https://doi.org/10.1016/j.inffus.2017.11.011
[3] Remco Chang, David S. Ebert, and Daniel Keim. 2014. Introduction to the Special Issue on Interactive Computational Visual Analytics. ACM Trans. Interact. Intell. Syst. 4, 1, Article 3 (April 2014), 3 pages. https://doi.org/10.1145/2594648
[4] Diego Colombo and Marloes H. Maathuis. 2014. Order-independent Constraint-based Causal Structure Learning. J. Mach. Learn. Res. 15, 1 (Jan. 2014), 3741–3782.
[5] Olof Görnerup, Daniel Gillblad, and Theodore Vasiloudis. 2017. Domain-agnostic Discovery of Similarities and Concepts at Scale. Knowl. Inf. Syst. 51, 2 (May 2017), 531–560. https://doi.org/10.1007/s10115-016-0984-2
[6] Hazar Harmouch and Felix Naumann. 2017. Cardinality Estimation: An Experimental Survey. Proceedings of the VLDB Endowment 11, 4 (2017), 499–512.
[7] Daniel A. Keim, Huamin Qu, and Kwan-Liu Ma. 2013. Big-Data Visualization. IEEE Computer Graphics and Applications 33, 4 (2013), 20–21.
[8] Jakub Konečný, H. Brendan McMahan, Daniel Ramage, and Peter Richtárik. 2016. Federated Optimization: Distributed Machine Learning for On-Device Intelligence. arXiv preprint arXiv:1610.02527 (2016).
[9] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. In OSDI, Vol. 14. 583–598.
[10] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. 2016. Communication-Efficient Learning of Deep Networks from Decentralized Data. arXiv preprint arXiv:1602.05629 (2016).
[11] Salman Niazi, Mahmoud Ismail, Seif Haridi, Jim Dowling, Steffen Grohsschmiedt, and Mikael Ronström. 2017. HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases. In FAST. 89–104.
[12] Róbert Ormándi, István Hegedűs, and Márk Jelasity. 2013. Gossip Learning with Linear Models on Fully Distributed Data. Concurrency and Computation: Practice and Experience 25, 4 (2013), 556–571.
[13] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S. Talwalkar. 2017. Federated Multi-Task Learning. In Advances in Neural Information Processing Systems. 4424–4434.
[14] Christian Steinruecken, Emma Smith, David Janz, James Lloyd, and Zoubin Ghahramani. [n.d.]. The Automatic Statistician. 175–188.