=Paper=
{{Paper
|id=Vol-2327/UIBK4
|storemode=property
|title=Eliciting Structures in Data
|pdfUrl=https://ceur-ws.org/Vol-2327/IUI19WS-UIBK-4.pdf
|volume=Vol-2327
|authors=Ahmad Al-Shishtawy,Juhee Bae,Mohamed-Rafik Bouguelia,Göran Falkman,Sarunas Girdzijauskas,Olof Gönerup,Anders Holst,Alexander Karlsson,Slawomir Nowaczyk,Sepideh Pashami,Alan Said,Amira A. Soliman El Hosary
|dblpUrl=https://dblp.org/rec/conf/iui/Al-ShishtawyBBF19
}}
==Eliciting Structures in Data==
Eliciting Structure in Data
Anders Holst, RISE SICS, Sweden, anders.holst@ri.se
Ahmad Al-Shishtawy, RISE SICS, Sweden, ahmad.al-shishtawy@ri.se
Juhee Bae, University of Skövde, Sweden, juhee.bae@his.se
Mohamed-Rafik Bouguelia, CAISR, Halmstad, Sweden, mohbou@hh.se
Göran Falkman, University of Skövde, Sweden, goran.falkman@his.se
Sarunas Girdzijauskas, RISE SICS, Sweden, sarunasg@kth.se
Olof Görnerup, RISE SICS, Sweden, olof.gornerup@ri.se
Alexander Karlsson, University of Skövde, Sweden, alexander.karlsson@his.se
Sławomir Nowaczyk, CAISR, Halmstad, Sweden, slawomir.nowaczyk@hh.se
Sepideh Pashami, CAISR, Halmstad, Sweden, sepideh.pashami@hh.se
Alan Said, University of Skövde, Sweden, alansaid@acm.org
Amira Soliman, RISE SICS, Sweden, aaeh@kth.se
ABSTRACT

This paper demonstrates how to explore and visualize different types of structure in data, including clusters, anomalies, causal relations, and higher-order relations. The methods are developed with the goal of being as automatic as possible and applicable to massive, streaming, and distributed data. Finally, a decentralized learning scheme is discussed, enabling finding structure in the data without collecting the data centrally.

CCS CONCEPTS

• Human-centered computing → Visualization; Graphical user interfaces; • Mathematics of computing → Causal networks; • Theory of computation → Unsupervised learning and clustering; • Computing methodologies → Anomaly detection.

KEYWORDS

Information Visualization; Clustering; Anomaly Detection; Causal Inference; Higher-Order Structure; Distributed Analytics.

ACM Reference Format:
Anders Holst, Ahmad Al-Shishtawy, Juhee Bae, Mohamed-Rafik Bouguelia, Göran Falkman, Sarunas Girdzijauskas, Olof Görnerup, Alexander Karlsson, Sławomir Nowaczyk, Sepideh Pashami, Alan Said, and Amira Soliman. 2019. Eliciting Structure in Data. In Joint Proceedings of the ACM IUI 2019 Workshops, Los Angeles, USA, March 20, 2019, 4 pages.

IUI Workshops'19, March 20, 2019, Los Angeles, USA
© 2019 Copyright held for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

1 INTRODUCTION

Information visualization (IV) is concerned with the transformation of abstract data, information, and knowledge into interactive visual representations that amplify human cognition and assist humans in solving problems. IV has proven effective for presenting important information and for making complex analyses in solving big-data problems [7]. In contrast to traditional IV, in Visual Analytics (VA) the user is not just a consumer of information who interprets and acts on visually presented results, but is the driving force of the whole problem-solving process: from problem formulation, data selection and cleaning, over hypothesis generation and testing, model building and steering computations, to interpreting results and communicating insights. The scope of VA is very broad, and the area has grown to encompass a wide array of topics. A unifying principle is the need for systems to leverage computational methods in data mining and machine learning for big-data analysis. In a VA system, the human user and the computational process operate in coordination in an integrated fashion, forming a continuous and cooperative problem-solving loop [3].

Here, we focus on understanding the underlying structure of real-world data sets before applying any artificial intelligence or machine learning algorithm. However, making sense of data today is quite hard due to the high-dimensional nature of real-world data sets. It is therefore important to find different structures in the data as automatically as possible, to facilitate information visualization.

The endeavour is similar to that of the Automated Statistician [14], in that we want a data set to be automatically analyzed from several different aspects. However, while much of the effort regarding the Automated Statistician concerns characterizing time series or predicting an output variable from inputs, we focus on unsupervised characterization of the data set as a whole. Also, rather than producing a fixed report, we focus on interactive and exploratory capabilities.

2 STRUCTURE IN THE DATA

"Structure" can mean many different things. The kind of structure we focus on here can be characterized as either horizontal or vertical properties of the data. Considering a (tabular) data set with different samples row-wise and different features column-wise, properties describing how samples relate to each other can be considered horizontal properties, whereas properties of how different features relate to each other are vertical. Examples of the former kind of structure that we focus on are clusters among samples in the data, and anomalous samples or groups of samples. Examples of the latter kind are relations between the features, such as statistical correlations, causal relations, and higher-order similarity relations.

Clusters

Unsupervised clustering of data is a frequently occurring task in statistical machine learning. Real-world data is seldom completely homogeneous, but usually contains clusters representing different varieties of the entities under study [2]. The reasons to detect such clusters are plenty: clusters may represent a useful categorization of the data that is valuable to discern; the significance and value of the correlation between attributes may be misleading when there are clusters; or, as a mathematical trick, a complex distribution can be approximated by a sum of simpler distributions. At the same time, it is hard to reliably detect clusters. Most algorithms are "unstable" in the sense that different random starting points of the iteration will lead to very different clustering results. It is also inherently hard to find the "best" number of clusters. Furthermore, in streaming data, the clusters may not even be fixed but change over time. Finally, in many situations it is not even possible to objectively measure what the best clustering is, since there may be many equally good alternatives.

Anomalies

In many domains, various units, such as vehicles and vessels, generate data over time. To diagnose a unit A at each time period T as anomalous or not, the data for A needs to be compared against the data from other similar groups of units over the same time period, in an online fashion. Various representations or features can be extracted from each of these units, capturing different anomalies. Some anomalies might be relevant to inspect further, whereas others can be regarded as irrelevant from the human operator's perspective. However, it is not easy to determine whether an event is anomalous or not, and a human operator often needs to be involved in the process. With the help of a human operator, various deviation levels can be inspected, enriching the view of the potentially important anomalies.

Causal structure

For systems that not only learn from data but actually act based on the resulting knowledge, it becomes more and more important to identify cause and effect. This is a very challenging task: in the general case it is easy to discover correlations between the attributes in data, but impossible to distinguish which is the cause and which the effect based on observational data alone. There are, however, in many cases several hints in the data that can potentially be used. We propose a visualization tool which incrementally and gradually discovers causal relations between features as more data becomes available. The results are shown in the form of a causal graph [4], where the strength and uncertainty of causal and correlation links are marked by color and shading [1].

Higher-Order Structures

Appropriately representing data with regard to the features relevant for the task at hand is often critical when applying machine learning. Specifically, by building hierarchical (or higher-order) representations, we are able to capture multiple abstraction levels in data, examine information from different perspectives, and thereby capture various aspects of it. Such representations are applicable both for solving machine learning tasks at the appropriate granularity level and for performing exploratory data analysis on and across multiple levels of abstraction. In this context, we have proposed a fast and lightweight demonstrator estimating word similarities from text by explicitly counting second-order co-occurrences [5]. It visualizes the similarity between words, in a single pass over the data, in the form of a tree built from the leaves towards the root.

Visualization and Interaction

We have so far implemented separate modules for all the above analyses. It remains to make the different analyses accessible through a combined tool, to visualize and explore them simultaneously. Figure 1 illustrates one possible way of doing this. (The pictures do not represent any real data, but are manually constructed to illustrate the principles.)

The horizontal properties relating samples to each other as described above, i.e. clusters and anomalies/outliers, are very natural to display in the same view (figure 1a), by showing each cluster in a different color and then assigning the outliers a (markedly) different color or symbol to make them stand out. Here it is illustrated by a scatter plot, but the same principle also works with time series.
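As an illustration of the second-order co-occurrence counting mentioned under Higher-Order Structures, the following is a minimal single-pass sketch. The toy corpus and the cosine similarity over context-count vectors are our own illustrative choices, not the exact measure of [5]: the point is that two words come out as similar when they occur in similar contexts, even if they never co-occur directly.

```python
from collections import Counter, defaultdict
from math import sqrt

def context_counts(tokens, window=1):
    """One pass over the token stream, counting the neighbors of each word."""
    ctx = defaultdict(Counter)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                ctx[w][tokens[j]] += 1
    return ctx

def similarity(ctx, a, b):
    """Cosine similarity of context-count vectors: high when a and b
    share the same neighbors (second-order co-occurrence)."""
    va, vb = ctx[a], ctx[b]
    dot = sum(va[w] * vb[w] for w in va)
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy corpus: "cat" and "dog" never co-occur directly,
# but they share contexts, so they come out as similar.
text = "the cat sat on the mat the dog sat on the rug".split()
ctx = context_counts(text)
print(similarity(ctx, "cat", "dog"))  # high: identical contexts
print(similarity(ctx, "cat", "rug"))  # lower: little context overlap
```

The resulting pairwise similarities could then feed the bottom-up tree construction described above, merging the most similar words first.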
Figure 1: A mock-up illustration of horizontal and vertical properties of the data, and a possible way to combine them in the same view. (a) Horizontal properties. (b) Vertical properties. (c) Combined view of structure.
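The combined cluster/outlier display of figure 1a presupposes some way of flagging deviating units. As a minimal sketch of the peer-group comparison described in the Anomalies subsection (the z-score rule, the threshold, and the fleet readings are our own illustrative choices), each unit is compared against the other units over the same time period:

```python
from statistics import mean, stdev

def flag_anomalies(readings, threshold=3.0):
    """Flag units whose value deviates from their peer group by more
    than `threshold` standard deviations (illustrative rule).
    Each unit is scored against the peers excluding itself, so a
    single extreme unit does not inflate its own baseline."""
    flags = {}
    for unit, value in readings.items():
        peers = [v for u, v in readings.items() if u != unit]
        mu, sigma = mean(peers), stdev(peers)
        flags[unit] = sigma > 0 and abs(value - mu) / sigma > threshold
    return flags

# Hypothetical fuel-consumption readings for a fleet over one period.
fleet = {"truck_1": 30.1, "truck_2": 29.8, "truck_3": 30.5,
         "truck_4": 30.0, "truck_5": 45.2}  # truck_5 deviates
print(flag_anomalies(fleet))  # only truck_5 is flagged
```

In the combined view, the flagged units would simply receive the distinct outlier color or symbol, while the remaining units keep their cluster colors.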
The vertical properties relating features to each other, i.e. correlations, causal relations, and higher-order relations, can also be visualized in the same view (figure 1b). Different shapes and colors can be used for the different types of relations. Here, direct correlations (without established causal direction) are shown in blue, links with established causal direction as narrowing links in purple, and links representing higher-order similarity in hatched orange.

Our suggestion for combining the horizontal and vertical views into one (figure 1c) is to start with the vertical view showing all features and their relations, and interactively popping up the horizontal properties, as scatter plots or time series, for those features or links that the user wishes to explore further. In this way, all the data and its different types of structure are made readily accessible to the user to explore.

Figure 2: Comparative analysis of performing a clustering task using our proposed scheme (Dec-HLL), the federated learning framework (Fed-PS), and the gossip learning framework (Golf). Recall is plotted against the number of rounds for Fed-PS, Dec-HLL, Dec-HLL-Skewed, Golf, and Golf-Skewed. Fed-PS achieves the highest performance as a result of performing model aggregation centrally. Yet, our scheme achieves higher performance than the P2P learning scheme used in Golf; in particular, the performance gap increases when the data distribution is very skewed.

3 SHARING KNOWLEDGE WITHOUT SHARING DATA
The predominant way of using machine learning involves collecting all the data in a data center. The model is then trained on powerful servers. However, this data collection process is often privacy-invasive. Users may not want to share private data with companies, making it difficult to use machine learning in some situations. For example, users generate vast amounts of data through interaction with the applications installed on their mobile devices. This data is often deeply private in nature and should not be shared completely with a remote server. Even when privacy is not a concern, having to collect data in one central storage can be unfeasible. For example, self-driving cars generate too much data to be able to send all of it to a server.

The Federated Learning approach provides a distributed and privacy-friendly alternative by allowing users to train models locally on their sensitive data and communicate intermediate model updates to a central server, without the need to centrally store the data [8, 10, 13]. Specifically, users start by contacting the central server and downloading the learning algorithm and a global model that is common to all users. The algorithm trains its model locally on each device using the user's data and computes an update to the current global model. Afterwards, the new updates to the learning parameters obtained from the algorithm on each user's device are sent to the central server for the aggregation step. The server integrates the new learning outcomes as if it had been trained directly on the data, and sends the aggregated global model back to each user. This distributed approach to model computation diminishes the need to centrally store the data; computation becomes distributed among the users, and their personal data never leaves their devices.
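The server-side aggregation step described above can be sketched as follows. This is a minimal FedAvg-style weighted average in the spirit of [10]; the flat parameter-list model and all names are our own illustrative choices, not the paper's implementation. Each user's update is weighted by the amount of local data behind it:

```python
def aggregate(global_model, updates):
    """Server-side aggregation: combine local parameter updates,
    weighting each user's contribution by its local data count, as if
    the global model had been trained on the pooled data directly."""
    total = sum(n for _, n in updates)
    dim = len(global_model)
    # Data-size-weighted average of the locally computed updates.
    avg_update = [sum(delta[i] * n for delta, n in updates) / total
                  for i in range(dim)]
    return [w + d for w, d in zip(global_model, avg_update)]

# Two hypothetical users send (update, local_data_count) pairs;
# the user with more data pulls the average harder.
global_model = [0.0, 0.0]
updates = [([1.0, -1.0], 100),   # user A, 100 local samples
           ([0.0,  1.0], 300)]   # user B, 300 local samples
new_model = aggregate(global_model, updates)
print(new_model)  # [0.25, 0.5]
```

The aggregated model is then pushed back to all users for the next round, so raw data never leaves the devices.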
In this context, we present a decentralized learning scheme that removes the dependency on a central aggregating authority, in order to offer a more scalable Federated Learning approach. Our proposed scheme organizes user devices in a Peer-to-Peer (P2P) network, allowing machine learning tasks to cooperate with each other using Gossip protocols [12]. Furthermore, our learning scheme operates as a decentralized learning and knowledge sharing approach that considers how strongly each local update is backed by a representative volume of data. In particular, our scheme integrates HyperLogLog [6] as a cardinality estimation mechanism to maintain the number of data items used in generating and incrementally updating the models exchanged among participating devices. Figure 2 shows performance analysis results of our scheme (i.e., Dec-HLL) compared to two state-of-the-art techniques: a federated learning scheme (i.e., Fed-PS) [9] and Gossip learning (i.e., Golf) [12]. As shown, our scheme achieves higher accuracy than the P2P learning scheme adopted in Golf; specifically, our accuracy is much closer to the accuracy of Fed-PS when the data distribution is very skewed.

4 A PLATFORM FOR ELICITING STRUCTURE IN DATA

We have taken the first steps to implement a platform to simplify the exploratory data analysis and visual analytics discussed above. The purpose is to be able to get a first quick overview of various useful structures in a data set. To make it generally applicable, we base it on the existing big data platform HOPS [11], develop the code in Python, and do the interaction and visualization through Jupyter notebooks. This makes development flexible and easy, and increases the chances of it being useful for others. The intention is to publish the entire package as open source.

5 ACKNOWLEDGEMENTS

This research has been conducted within the "A Big Data Analytics Framework for a Smart Society" (BIDAF) project (http://bidaf.sics.se/), supported by the Swedish Knowledge Foundation.

REFERENCES

[1] Juhee Bae, Tove Helldin, and Maria Riveiro. 2017. Understanding Indirect Causal Relationships in Node-Link Graphs. Computer Graphics Forum (2017). https://doi.org/10.1111/cgf.13198
[2] Mohamed-Rafik Bouguelia, Alexander Karlsson, Sepideh Pashami, Sławomir Nowaczyk, and Anders Holst. 2018. Mode Tracking Using Multiple Data Streams. Inf. Fusion 43, C (Sept. 2018), 33–46. https://doi.org/10.1016/j.inffus.2017.11.011
[3] Remco Chang, David S. Ebert, and Daniel Keim. 2014. Introduction to the Special Issue on Interactive Computational Visual Analytics. ACM Trans. Interact. Intell. Syst. 4, 1, Article 3 (April 2014), 3 pages. https://doi.org/10.1145/2594648
[4] Diego Colombo and Marloes H. Maathuis. 2014. Order-independent Constraint-based Causal Structure Learning. J. Mach. Learn. Res. 15, 1 (Jan. 2014), 3741–3782.
[5] Olof Görnerup, Daniel Gillblad, and Theodore Vasiloudis. 2017. Domain-agnostic Discovery of Similarities and Concepts at Scale. Knowl. Inf. Syst. 51, 2 (May 2017), 531–560. https://doi.org/10.1007/s10115-016-0984-2
[6] Hazar Harmouch and Felix Naumann. 2017. Cardinality Estimation: An Experimental Survey. Proceedings of the VLDB Endowment 11, 4 (2017), 499–512.
[7] Daniel A. Keim, Huamin Qu, and Kwan-Liu Ma. 2013. Big-Data Visualization. IEEE Computer Graphics and Applications 33, 4 (2013), 20–21.
[8] Jakub Konečný, H. Brendan McMahan, Daniel Ramage, and Peter Richtárik. 2016. Federated Optimization: Distributed Machine Learning for On-Device Intelligence. arXiv preprint arXiv:1610.02527 (2016).
[9] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. In OSDI, Vol. 14. 583–598.
[10] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. 2016. Communication-Efficient Learning of Deep Networks from Decentralized Data. arXiv preprint arXiv:1602.05629 (2016).
[11] Salman Niazi, Mahmoud Ismail, Seif Haridi, Jim Dowling, Steffen Grohsschmiedt, and Mikael Ronström. 2017. HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases. In FAST. 89–104.
[12] Róbert Ormándi, István Hegedűs, and Márk Jelasity. 2013. Gossip Learning with Linear Models on Fully Distributed Data. Concurrency and Computation: Practice and Experience 25, 4 (2013), 556–571.
[13] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S. Talwalkar. 2017. Federated Multi-Task Learning. In Advances in Neural Information Processing Systems. 4424–4434.
[14] Christian Steinrucken, Emma Smith, David Janz, James Lloyd, and Zoubin Ghahramani. [n. d.]. The Automatic Statistician. 175–188.