
March

Clustering, Universalities, and Evolutionary Schema Design

Issei Fujishiro

Naoko Sawada

Makoto Uemura

Hiroshima University, 1-3-2 Kagamiyama, Higashi-Hiroshima, Hiroshima 739-8511, Japan
Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa 223-8522, Japan

2023

31 2023

Exploring data features using visual clustering is a significant challenge of big data analytics. In this vision paper, we focus primarily on the relationship among visual data clustering, the discovery of universalities, and the design of an evolutionary database to propose an inter-disciplinary method for scientific data management. The feasibility of the proposed method is empirically proven through application to a practical visual analytics environment for time-varying multi-dimensional datasets of blazar observations.

Keywords: visual data clustering, universality, evolutionary database schema design

1. Introduction

Feature exploration is a significant challenge of big data analytics. In response, visual data clustering [1] has become a useful approach for such a task, because it enables the identification of salient features coupled with appropriate user intervention. Careful visual data clustering can lead to the discovery of universalities hidden in target datasets. In this vision paper, we strive to demonstrate how evolutionary database design [2] can fully support this kind of valuable scientific activity.

2. Evolutionary Schema Design

This section proposes our evolutionary schema design in relation to the visual discovery of universalities. We use Unified Modeling Language (UML) [3] class diagrams for conceptual design, followed by translations into corresponding relational schemas.

2.1. Sample Class

A data matrix (multi-dimensional data samples) can be formulated as the class Samples, consisting of attributes, as shown in Fig. 1. Samples have observational relationships with each other, and these can be abstracted by a recursive association, called Samples_Transit, also shown in Fig. 1.

The corresponding relational schema consists of the following two third normal form (3NF) relation schemas:

Samples(sample-ID, sa-1, sa-2, ..., sa-n)

Samples_Transit(sample-ID_s, sample-ID_d, t-info).
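The two relation schemas above can be sketched as SQLite tables. This is only an illustrative sketch under stated assumptions: the underscore column names, the choice of n = 2 sample attributes, the REAL types, and the composite primary key on Samples_Transit are all hypothetical and are not specified in the text.

```python
import sqlite3

# Hypothetical mapping of the paper's attribute names: hyphens become
# underscores (sample-ID -> sample_id), and n = 2 attributes are assumed.
DDL = """
CREATE TABLE Samples (
    sample_id INTEGER PRIMARY KEY,
    sa_1 REAL,
    sa_2 REAL
);
CREATE TABLE Samples_Transit (
    sample_id_s INTEGER NOT NULL REFERENCES Samples(sample_id),
    sample_id_d INTEGER NOT NULL REFERENCES Samples(sample_id),
    t_info REAL,
    -- Assumed key: at most one transit per ordered pair of samples.
    PRIMARY KEY (sample_id_s, sample_id_d)
);
"""


def create_sample_schema() -> sqlite3.Connection:
    """Create an in-memory database holding the two relations."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the recursive association
    conn.executescript(DDL)
    return conn


conn = create_sample_schema()
conn.executemany(
    "INSERT INTO Samples VALUES (?, ?, ?)",
    [(1, 0.1, 0.2), (2, 0.3, 0.4), (3, 0.5, 0.6)],
)
conn.executemany(
    "INSERT INTO Samples_Transit VALUES (?, ?, ?)",
    [(1, 2, 1.0), (2, 3, 2.5)],
)
edges = conn.execute(
    "SELECT sample_id_s, sample_id_d, t_info FROM Samples_Transit ORDER BY 1, 2"
).fetchall()
```

Both foreign keys of Samples_Transit point back to Samples, which is how the recursive association is expressed relationally.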

Actual instances of Samples and the mutual relationships between them form a weighted directed graph and are usually visualized with a node-and-link diagram. When there are many Samples with dense mutual relationships, such a diagram often suffers from visual clutter artifacts.
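To make the graph view concrete, the following minimal sketch builds a weighted directed graph from transit tuples and measures how densely the samples are linked. The tuple layout (source, destination, t-info), the function names, and the density measure are assumptions introduced here for illustration; the text does not prescribe them.

```python
from collections import defaultdict


def build_digraph(transits):
    """Map each source sample to a {destination: weight} dictionary."""
    graph = defaultdict(dict)
    for src, dst, t_info in transits:
        graph[src][dst] = t_info
    return dict(graph)


def link_density(n_samples, transits):
    """Fraction of the n(n-1) possible directed links that are present.

    Assumes no self-loops and at most one link per ordered pair.
    """
    possible = n_samples * (n_samples - 1)
    return len(transits) / possible if possible else 0.0


# Hypothetical transit tuples: (sample-ID_s, sample-ID_d, t-info).
transits = [(1, 2, 1.0), (2, 3, 2.5), (3, 1, 0.7)]
graph = build_digraph(transits)
density = link_density(3, transits)
```

A density close to 1 means almost every ordered pair of samples is linked, which is the situation in which a node-and-link diagram becomes cluttered.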

2.2. Cluster Class

In a conventional visualization, visual clutter artifacts remain unresolved, because each cluster may be accentuated by an ellipse while the original inter-instance links usually remain unchanged.

Here, we consider making explicit the universalities found in the Samples instances. Specifically, if associations between Samples instances are commonly observed in the same pair of Clusters, we propose to upgrade the mutual relationships between Samples to mutual associations between Clusters, also shown in Fig. 2.

At this point, provided that an evolutionary data management environment is available, the corresponding relational schema can be reformulated using the following three 3NF relation schemas:

Samples(sample-ID, cluster-ID, sa-1, sa-2, ..., sa-n)

Clusters(cluster-ID, ca-1, ca-2, ..., ca-n)

Clusters_Transit(cluster-ID_s, cluster-ID_d, meta_t-info).

Note that the aggregation Belong_to is realized via the foreign key cluster-ID in the new definition of the relation schema Samples. Note also that the relation schema Clusters_Transit has meta_t-info, which can be derived from the t-info values of the Samples that belong to the Clusters concerned.

It would be interesting to describe the occurrence probability as an attribute of meta_t-info. As a by-product of such a universality specification, the number of inter-cluster associations can be reduced drastically, resulting in a simplified visualization.

2.3. Subsample Class

For each instance of Clusters, idiosyncratic attributes may have to be specified. To manage such attributes efficiently, we propose to define a new class, Subsamples, as a specialization of Samples, as shown in Fig. 3.

Figure 3: Subsamples class

The corresponding relational schema consists of the following (m + 1) 3NF relation schemas:

Samples(sample-ID, sa-1, sa-2, ..., sa-n)

Subsamples_i(sample-ID, ssa-1, ssa-2, ..., ssa-n) (i = 1, ..., m).

Note that the specialization IS_A is naturally realized by the common primary key sample-ID in the relation schemas. The idiosyncratic attributes of Subsamples may be used to derive new attributes of Clusters. From the viewpoint of big data visual analytics, a remarkable advantage of separating idiosyncratic attributes lies in its ability to avoid the explosion of inapplicable null values in the single relation Samples.

3. Case Study

Blazars are among the brightest and most energetic objects in the universe. To demystify the physics of the magnetic field within a relativistic jet ejected from the central black hole of a blazar, the light from a blazar is regularly observed. The Hiroshima Astrophysical Science Center (HASC) has scrutinized optical photo-polarimetric and near-infrared observation datasets to identify characteristic blazar behaviors, such as light bursts (i.e., flares) and rotated polarization (i.e., rotation), to explore recurring time-variation patterns. TimeTubesX [4, 5] is an integrated visual analytics environment that allows blazar researchers to analyze long-term, multi-dimensional blazar observation datasets efficiently and in detail. This section strives to apply the evolutionary schema design in Sec. 2 to sophisticated data management in the TimeTubesX system.

3.1. Data

The HASC has observed the polarization, intensity, and color (C) of the light from a blazar, where the linear polarization is described by three Stokes parameters, I, Q, and U, with I denoting the total intensity of the polarized and unpolarized components, Q the intensity of the linear horizontal or vertical polarization components, and U the intensity of the linear +π/4 or −π/4 polarization components. Instead of Q and U, we mainly utilize q and u, which can be obtained by dividing Q and U by I, because q and u explain blazar behaviors better than Q and U. The observation errors of q and u are described as e_q and e_u, respectively. The space spanned by q and u is termed the Stokes plane (Fig. 4a). When analyzing time variations in the Stokes [...]

Figure 4(a): Stokes plane

3.2. Visual Clustering

To enable blazar researchers to examine universalities in blazar datasets, TimeTubesX provides them with time-varying multi-dimensional subsequence clustering methods [5], together with a designated set of visual analysis methods, including the advanced sample retrieval functionalities query-by-example and query-by-sketch [4]. The clustering methods extract subsequences of various lengths from a long-term observation dataset, considering missing data and observation frequencies, and then filter subsequences with overlapping features. The clustering methods consider correlations among variables and compute means of subsequences without smoothing out their features.

The timeline view of TimeTubesX in Fig. 5 summarizes the temporal distributions of the six clusters found, each shown in a different stripe color.

3.3. Inter-flare Cluster Transitions

The corresponding relational schema consists of the following five 3NF relation schemas, where the composition is naturally realized via the foreign key ss-ID in the relation schema Samples:

Samples(sample-ID, ss-ID, time, Q, U, e_q, e_u, I, C)

FlareSamples(sample-ID, PD, PA, q, u)

Subsequences(ss-ID, cluster-ID, flareID, length, cor, angle)

Clusters(cluster-ID, #subsequences, cluster_prototype)

Is_followed_by(cluster-ID_s, cluster-ID_d, transit-prob).

4. Conclusion

In this paper, we demonstrated the possibility of bridging three worlds, i.e., visual analytics, universality discovery, and database refactoring. Through the application of the present methodology to the practical problem of blazar observation, we empirically showed that universality identification based on visual data clustering is strongly supported by evolutionary schema design.

Acknowledgments

This work has been partially supported by the Grant-in-Aid for Challenging Research (Pioneering) JP20K20481.

References

[1] Sips, Visual clustering, in: Encyclopedia of Database Systems, Springer, Boston, MA, 2009, pp. 3350-3360. doi:10.1007/978-0-387-39940-9_1124.

[2] S. W. Ambler, P. J. Sadalage, Refactoring Databases: [...], 2006.

[3] Booch, Rumbaugh, I. Jacobson, The Unified Modeling Language User Guide, 2nd ed., Addison-Wesley, 2005.

[4] Sawada, Uemura, Beyer,

Pfister , I. Fu-