Eliciting Structure in Data

Anders Holst, RISE SICS, Sweden, anders.holst@ri.se
Ahmad Al-Shishtawy, RISE SICS, Sweden, ahmad.al-shishtawy@ri.se
Juhee Bae, University of Skövde, Sweden, juhee.bae@his.se
Mohamed-Rafik Bouguelia, CAISR, Halmstad, Sweden, mohbou@hh.se
Göran Falkman, University of Skövde, Sweden, goran.falkman@his.se
Sarunas Girdzijauskas, RISE SICS, Sweden, sarunasg@kth.se
Olof Görnerup, RISE SICS, Sweden, olof.gornerup@ri.se
Alexander Karlsson, University of Skövde, Sweden, alexander.karlsson@his.se
Sławomir Nowaczyk, CAISR, Halmstad, Sweden, slawomir.nowaczyk@hh.se
Sepideh Pashami, CAISR, Halmstad, Sweden, sepideh.pashami@hh.se
Alan Said, University of Skövde, Sweden, alansaid@acm.org
Amira Soliman, RISE SICS, Sweden, aaeh@kth.se

ABSTRACT
This paper demonstrates how to explore and visualize different types of structure in data, including clusters, anomalies, causal relations, and higher-order relations. The methods are developed with the goal of being as automatic as possible and applicable to massive, streaming, and distributed data. Finally, a decentralized learning scheme is discussed, which enables finding structure in the data without collecting the data centrally.

CCS CONCEPTS
• Human-centered computing → Visualization; Graphical user interfaces; • Mathematics of computing → Causal networks; • Theory of computation → Unsupervised learning and clustering; • Computing methodologies → Anomaly detection.

KEYWORDS
Information Visualization; Clustering; Anomaly Detection; Causal Inference; Higher-Order Structure; Distributed Analytics.

ACM Reference Format:
Anders Holst, Ahmad Al-Shishtawy, Juhee Bae, Mohamed-Rafik Bouguelia, Göran Falkman, Sarunas Girdzijauskas, Olof Görnerup, Alexander Karlsson, Sławomir Nowaczyk, Sepideh Pashami, Alan Said, and Amira Soliman. 2019. Eliciting Structure in Data. In Joint Proceedings of the ACM IUI 2019 Workshops, Los Angeles, USA, March 20, 2019, 4 pages.

IUI Workshops'19, March 20, 2019, Los Angeles, USA. © 2019 Copyright held for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

1 INTRODUCTION
Information visualization (IV) is concerned with the transformation of abstract data, information, and knowledge into interactive visual representations that amplify human cognition and assist humans in solving problems. IV has proven effective for presenting important information and for making complex analyses in solving big-data problems [7]. In contrast to traditional IV, in Visual Analytics (VA) the user is not just a consumer of information who interprets and acts on visually presented results, but the driving force of the whole problem-solving process, from problem formulation, data selection and cleaning, over hypothesis generation and testing, model building and steering computations, to interpreting results and communicating insights. The scope of VA is very broad, and the area has grown to encompass a wide array of topics. A unifying principle is the need for systems to leverage computational methods in data mining and machine learning for big-data analysis. In a VA system, the human user and the computational process operate in coordination in an integrated fashion, forming a continuous and cooperative problem-solving loop [3].

Here, we focus on understanding the underlying structure of real-world data sets before applying any artificial intelligence or machine learning algorithm. Today, however, making sense of data is hard due to the high-dimensional nature of real-world data sets. It is therefore important to find different kinds of structure in the data as automatically as possible, in order to facilitate information visualization.

The endeavour is similar to that of the Automatic Statistician [14] in that we want a data set to be automatically analyzed from several different aspects. However, while much of the effort around the Automatic Statistician concerns characterizing time series or predicting an output variable from inputs, we focus on unsupervised characterization of the data set as a whole. Also, rather than producing a fixed report, we focus on interactive and exploratory capabilities.

2 STRUCTURE IN THE DATA
"Structure" can mean many different things. The kind of structure we focus on here can be characterized as either horizontal or vertical properties of the data. Considering a (tabular) data set with samples row-wise and features column-wise, properties describing how samples relate to each other can be considered horizontal, whereas properties describing how features relate to each other are vertical. Examples of the former kind of structure that we focus on are clusters among the samples in the data, and anomalous samples or groups of samples. Examples of the latter kind are relations between the features, such as statistical correlations, causal relations, and higher-order similarity relations.

Clusters
Unsupervised clustering of data is a frequently occurring task in statistical machine learning. Real-world data is seldom completely homogeneous, but usually contains clusters representing different varieties of the entities under study [2]. The reasons to detect such clusters are many: clusters may represent a useful categorization of the data that is valuable to discern; the significance and value of correlations between attributes may be misleading when there are clusters; or, simply as a mathematical device, a complex distribution can be approximated by a sum of simpler distributions. At the same time, it is hard to reliably detect clusters. Most algorithms are "unstable" in the sense that different random starting points of the iteration will lead to very different clustering results. It is also inherently hard to find the "best" number of clusters. Furthermore, in streaming data the clusters may not even be fixed, but change over time. Finally, in many situations it is not even possible to objectively measure what the best clustering is, since there may be many equally good alternatives.
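To make the mixture view concrete, the following minimal sketch (an illustration only, not one of the modules described in this paper; it assumes numpy and scikit-learn are available) fits Gaussian mixtures to synthetic data. It shows both how a complex distribution can be approximated by a sum of simpler (Gaussian) distributions and how the result depends on the random initialization and on the assumed number of clusters.

```python
# Minimal sketch: clusters as a sum of simpler (Gaussian) distributions,
# illustrating seed sensitivity and model selection by BIC.
# Assumes numpy and scikit-learn; not the clustering module of this paper.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data with three overlapping clusters.
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.8, size=(200, 2)),
    rng.normal(loc=(4, 1), scale=0.8, size=(200, 2)),
    rng.normal(loc=(2, 4), scale=0.8, size=(200, 2)),
])

# Different random initializations can yield different solutions.
for seed in (0, 1, 2):
    gmm = GaussianMixture(n_components=3, n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}  average log-likelihood={gmm.score(X):.3f}")

# Choosing the "best" number of clusters is itself hard; BIC is one heuristic.
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(f"k={k}  BIC={gmm.bic(X):.1f}")
```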
Anomalies
In many domains, various units, such as vehicles and vessels, generate data over time. To diagnose a unit A at each time period T as anomalous or not, the data for A needs to be compared, in an online fashion, against the data from other similar groups of units over the same time period. Various representations or features can be extracted from each of these units, capturing different kinds of anomalies. Some anomalies may be relevant to inspect further, whereas others can be regarded as irrelevant from the human operator's perspective. However, it is not easy to determine whether an event is anomalous or not, and a human operator often needs to be involved in the process. With the help of a human operator, various deviation levels can be inspected, enriching the view of the potentially important anomalies.
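As a simple illustration of this kind of peer-group comparison, the sketch below flags a unit whose feature value in a given time period deviates strongly from the other units in the same period. The robust z-score and the threshold are illustrative assumptions, not the detection criteria used in our modules.

```python
# Minimal sketch: flag unit A as anomalous in a time period if its feature
# value deviates strongly from the other units in the same period.
# The robust z-score and the threshold of 3 are illustrative assumptions.
import numpy as np

def peer_group_anomaly(values_by_unit, unit, threshold=3.0):
    """values_by_unit: dict mapping unit id -> feature value for one period."""
    peers = np.array([v for u, v in values_by_unit.items() if u != unit])
    median = np.median(peers)
    mad = np.median(np.abs(peers - median)) + 1e-9   # robust spread estimate
    z = 0.6745 * (values_by_unit[unit] - median) / mad
    return abs(z) > threshold, z

# One time period: e.g. fuel consumption for a small fleet of vehicles.
period = {"A": 41.0, "B": 30.2, "C": 29.5, "D": 31.1, "E": 30.7}
is_anomalous, score = peer_group_anomaly(period, "A")
print(f"unit A anomalous: {is_anomalous} (robust z = {score:.1f})")
```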
Causal structure
For systems that not only learn from data but actually act based on the resulting knowledge, it becomes increasingly important to identify cause and effect. This is a very challenging task since, in the general case, it is easy to discover correlations between the attributes in the data, but impossible to distinguish which is the cause and which is the effect based only on observational data. There are, however, in many cases several hints in the data that can potentially be used. We propose a visualization tool which incrementally and gradually discovers causal relations between features as more data becomes available. The results are shown in the form of a causal graph [4], where the strength and uncertainty of causal and correlation links are marked by color and shading [1].

Higher-Order Structures
Appropriately representing data with regard to the features relevant for the task at hand is often critical when applying machine learning. Specifically, by building hierarchical, or higher-order, representations, we are able to capture multiple abstraction levels in data, examine information from different perspectives, and thereby capture various aspects of it. Such representations are applicable both to solving machine learning tasks at the appropriate granularity level and to performing exploratory data analysis on and across multiple levels of abstraction. In this context, we have proposed a fast and lightweight demonstrator estimating word similarities from text by explicitly counting second-order co-occurrences [5]. It visualizes the similarity between words, in a single pass over the data, in the form of a tree that starts from the leaves and builds towards the root.
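The following simplified sketch conveys the idea of second-order similarity: two words are considered similar when they co-occur with similar context words. It builds first-order co-occurrence vectors offline and compares them with cosine similarity; the demonstrator in [5] instead counts second-order co-occurrences explicitly in a single pass over the data, so this is only an expository stand-in.

```python
# Minimal sketch of second-order similarity: two words are similar if they
# co-occur with similar sets of context words. This offline version builds
# first-order co-occurrence vectors and compares them with cosine similarity;
# it is a simplified stand-in, not the single-pass streaming algorithm of [5].
from collections import Counter, defaultdict
from math import sqrt

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog played outside",
]

window = 2
cooc = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooc[w][tokens[j]] += 1

def cosine(a, b):
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print("sim(cat, dog) =", round(cosine(cooc["cat"], cooc["dog"]), 3))
print("sim(cat, mat) =", round(cosine(cooc["cat"], cooc["mat"]), 3))
```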
Visualization and Interaction
We have so far implemented separate modules for all the above analyses. It remains to make the different analyses accessible through a combined tool, to visualize and explore them simultaneously. Figure 1 illustrates one possible way of doing this. (The pictures do not represent any real data, but are manually constructed to illustrate the principles.)

[Figure 1: A mock-up illustration of horizontal and vertical properties of the data, and a possible way to combine them in the same view. (a) Horizontal properties; (b) Vertical properties; (c) Combined view of structure.]

The horizontal properties relating samples to each other as described above, i.e. clusters and anomalies/outliers, are very natural to display in the same view (Figure 1a), by showing each cluster in a different color and then assigning the outliers a (markedly) different color or symbol to make them stand out. Here this is illustrated by a scatter plot, but the same principle also works with time series.

The vertical properties relating features to each other, i.e. correlations, causal relations and higher-order relations, can also be visualized in the same view (Figure 1b). Different shapes and colors can be used for the different types of relations. Here, direct correlations (without established causal direction) are shown in blue, links with established causal direction as narrowing links in purple, and links representing higher-order similarity in hatched orange.

Our suggestion for combining the horizontal and vertical views into one (Figure 1c) is to start with the vertical view showing all features and their relations, and interactively pop up the horizontal properties, as scatter plots or time series, for those features or links that the user wishes to explore further. In this way, all the data and its different types of structure are made readily accessible to the user to explore.
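A minimal sketch of how views of the kind in Figure 1 could be drawn with standard Python tools (matplotlib and networkx) is given below. The data, feature names, and styling are illustrative assumptions, not output from our tool.

```python
# Minimal sketch of the two views in Figure 1: (a) samples colored by cluster
# with outliers marked separately, and (b) a feature-relation graph where edge
# color encodes the relation type. Mock data and styling are illustrative only.
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np

rng = np.random.default_rng(1)
fig, (ax_a, ax_b) = plt.subplots(1, 2, figsize=(10, 4))

# (a) Horizontal view: clusters and outliers.
clusters = [rng.normal((0, 0), 0.5, (80, 2)), rng.normal((3, 2), 0.5, (80, 2))]
outliers = np.array([[5.0, -1.0], [-2.0, 3.0]])
for points, color in zip(clusters, ("tab:blue", "tab:green")):
    ax_a.scatter(points[:, 0], points[:, 1], c=color, s=10)
ax_a.scatter(outliers[:, 0], outliers[:, 1], c="tab:red", marker="x", s=60)
ax_a.set_title("(a) clusters and outliers")

# (b) Vertical view: relations between features, edge color by relation type.
G = nx.Graph()
G.add_edge("speed", "fuel", kind="correlation")
G.add_edge("load", "fuel", kind="causal")
G.add_edge("temp_in", "temp_out", kind="higher-order")
palette = {"correlation": "tab:blue", "causal": "tab:purple", "higher-order": "tab:orange"}
colors = [palette[G[u][v]["kind"]] for u, v in G.edges()]
pos = nx.spring_layout(G, seed=1)
nx.draw_networkx(G, pos, ax=ax_b, edge_color=colors, width=2, node_color="lightgray")
ax_b.set_title("(b) feature relations")
plt.tight_layout()
plt.show()
```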
3 SHARING KNOWLEDGE WITHOUT SHARING DATA
The predominant way of using machine learning involves collecting all the data in a data center, where the model is then trained on powerful servers. However, this data collection process is often privacy-invasive. Users may not want to share private data with companies, making it difficult to use machine learning in some situations. For example, users generate vast amounts of data through interaction with the applications installed on their mobile devices. This data is often deeply private in nature and should not be shared completely with a remote server. Even when privacy is not a concern, having to collect the data in one central storage can be unfeasible. For example, self-driving cars generate too much data to be able to send all of it to a server.

Federated Learning provides a distributed and privacy-friendly approach by allowing users to train models locally on their sensitive data and to communicate intermediate model updates to a central server, without the need to store the data centrally [8, 10, 13]. Specifically, users start by contacting the central server and downloading the learning algorithm and a global model that is common to all users. The algorithm trains the model locally on each device using the user's data and computes an update to the current global model. Afterwards, the new updates to the learning parameters obtained on each user's device are sent to the central server for an aggregation step. The server integrates the new learning outcomes as if it had been trained directly on the data, and sends the aggregated global model back to each user. This distributed approach to model computation removes the need to store the data centrally; computation becomes distributed among the users, and their personal data never leaves their devices.

In this context, we present a decentralized learning scheme that removes the dependency on a central aggregating authority, in order to offer a more scalable Federated Learning approach. Our proposed scheme organizes user devices in a Peer-to-Peer (P2P) network, allowing machine learning tasks to cooperate with each other using gossip protocols [12]. Furthermore, our learning scheme operates as a decentralized learning and knowledge sharing approach that considers how strongly each local update is backed by a representative volume of data. In particular, our scheme integrates HyperLogLog [6] as a cardinality estimation mechanism to keep track of the number of data items used in generating and incrementally updating the models exchanged among participating devices. Figure 2 shows performance analysis results for our scheme (Dec-HLL) compared to two state-of-the-art techniques: a federated learning scheme (Fed-PS) [9] and gossip learning (Golf) [12]. As shown, our scheme achieves higher accuracy than the P2P learning scheme adopted in Golf; in particular, our accuracy is much closer to the accuracy of Fed-PS when the data distribution is very skewed.

[Figure 2: Comparative analysis of performing a clustering task using our proposed scheme (Dec-HLL), a federated learning framework (Fed-PS), and a gossip learning framework (Golf). The plot shows recall versus number of rounds for Fed-PS, Dec-HLL, Dec-HLL-Skewed, Golf, and Golf-Skewed. Fed-PS achieves the highest performance as a result of performing model aggregation centrally. Yet, our scheme achieves higher performance than the P2P learning scheme used in Golf, and the performance gap increases when the data distribution is very skewed.]
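The sketch below illustrates only the core idea of weighting model exchanges by the amount of data behind each model. It uses a toy mean-vector "model" and exact sample counts in place of the HyperLogLog estimates, and it is not the implementation evaluated in Figure 2.

```python
# Minimal sketch of cardinality-weighted model exchange in a P2P setting.
# Each peer keeps a toy "model" (a mean vector) plus the number of samples
# behind it, and an exchange merges two models weighted by those counts.
# Exact counters stand in for the HyperLogLog estimates used in our scheme
# (HyperLogLog sketches can be merged to estimate the size of the union of
# the underlying data sets, which avoids double counting when the same data
# indirectly reaches a peer several times). This is an illustration only,
# not the implementation evaluated in Figure 2.
import numpy as np

class Peer:
    def __init__(self, data):
        self.n = len(data)                 # how much data backs the model
        self.model = data.mean(axis=0)     # toy model: a mean vector

    def receive(self, other_model, other_n):
        """Merge an incoming model, weighted by the data volume behind it."""
        total = self.n + other_n
        self.model = (self.n * self.model + other_n * other_model) / total
        self.n = total

rng = np.random.default_rng(0)
# Skewed data volumes: one peer holds far more data than the others.
volumes = [5, 20, 500, 50, 8]
peers = [Peer(rng.normal(0.5, 1.0, size=(n, 3))) for n in volumes]

# A single gossip-style push from the data-rich peer to a data-poor peer:
# the merged model is dominated by the better supported update.
peers[0].receive(peers[2].model, peers[2].n)
print("peer 0 model after exchange:", np.round(peers[0].model, 3))
print("backed by", peers[0].n, "samples")
```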
4 A PLATFORM FOR ELICITING STRUCTURE IN DATA
We have taken the first steps towards implementing a platform to simplify the exploratory data analysis and visual analytics discussed above. The purpose is to be able to get a first quick overview of various useful structures in a data set. To make it generally applicable, we base it on the existing big-data platform HOPS [11], develop the code in Python, and do the interaction and visualization through Jupyter notebooks. This makes development flexible and easy, and increases the chances of the platform being useful for others. The intention is to publish the entire package as open source.

5 ACKNOWLEDGEMENTS
This research has been conducted within the "A Big Data Analytics Framework for a Smart Society" (BIDAF) project (http://bidaf.sics.se/), supported by the Swedish Knowledge Foundation.

REFERENCES
[1] Juhee Bae, Tove Helldin, and Maria Riveiro. 2017. Understanding Indirect Causal Relationships in Node-Link Graphs. Computer Graphics Forum (2017). https://doi.org/10.1111/cgf.13198
[2] Mohamed-Rafik Bouguelia, Alexander Karlsson, Sepideh Pashami, Sławomir Nowaczyk, and Anders Holst. 2018. Mode Tracking Using Multiple Data Streams. Inf. Fusion 43, C (Sept. 2018), 33–46. https://doi.org/10.1016/j.inffus.2017.11.011
[3] Remco Chang, David S. Ebert, and Daniel Keim. 2014. Introduction to the Special Issue on Interactive Computational Visual Analytics. ACM Trans. Interact. Intell. Syst. 4, 1, Article 3 (April 2014), 3 pages. https://doi.org/10.1145/2594648
[4] Diego Colombo and Marloes H. Maathuis. 2014. Order-independent Constraint-based Causal Structure Learning. J. Mach. Learn. Res. 15, 1 (Jan. 2014), 3741–3782.
[5] Olof Görnerup, Daniel Gillblad, and Theodore Vasiloudis. 2017. Domain-agnostic Discovery of Similarities and Concepts at Scale. Knowl. Inf. Syst. 51, 2 (May 2017), 531–560. https://doi.org/10.1007/s10115-016-0984-2
[6] Hazar Harmouch and Felix Naumann. 2017. Cardinality Estimation: An Experimental Survey. Proceedings of the VLDB Endowment 11, 4 (2017), 499–512.
[7] Daniel A. Keim, Huamin Qu, and Kwan-Liu Ma. 2013. Big-Data Visualization. IEEE Computer Graphics and Applications 33, 4 (2013), 20–21.
[8] Jakub Konečný, H. Brendan McMahan, Daniel Ramage, and Peter Richtárik. 2016. Federated Optimization: Distributed Machine Learning for On-Device Intelligence. arXiv preprint arXiv:1610.02527 (2016).
[9] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. In OSDI, Vol. 14. 583–598.
[10] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. 2016. Communication-Efficient Learning of Deep Networks from Decentralized Data. arXiv preprint arXiv:1602.05629 (2016).
[11] Salman Niazi, Mahmoud Ismail, Seif Haridi, Jim Dowling, Steffen Grohsschmiedt, and Mikael Ronström. 2017. HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases. In FAST. 89–104.
[12] Róbert Ormándi, István Hegedűs, and Márk Jelasity. 2013. Gossip Learning with Linear Models on Fully Distributed Data. Concurrency and Computation: Practice and Experience 25, 4 (2013), 556–571.
[13] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S. Talwalkar. 2017. Federated Multi-Task Learning. In Advances in Neural Information Processing Systems. 4424–4434.
[14] Christian Steinruecken, Emma Smith, David Janz, James Lloyd, and Zoubin Ghahramani. [n.d.]. The Automatic Statistician. 175–188.