<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>March</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Clustering, Universalities, and Evolutionary Schema Design</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Issei Fujishiro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Naoko Sawada</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Makoto Uemura</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hiroshima University</institution>
          ,
          <addr-line>1-3-2 Kagamiyama, Higashi-Hiroshima, Hiroshima 739-8511</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Keio University</institution>
          ,
          <addr-line>3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa 223-8522</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>31</volume>
      <issue>2023</issue>
      <abstract>
        <p>Exploring data features using visual clustering is a significant challenge of big data analytics. In this vision paper, we focus primarily on the relationship among visual data clustering, the discovery of universalities, and the design of an evolutionary database to propose an inter-disciplinary method for scientific data management. The feasibility of the proposed method is empirically proven through application to a practical visual analytics environment for time-varying multi-dimensional datasets of blazar observations.</p>
      </abstract>
      <kwd-group>
        <kwd>visual data clustering</kwd>
        <kwd>universality</kwd>
        <kwd>evolutionary database</kwd>
        <kwd>schema design</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Feature exploration is a significant challenge of big data
analytics. In response, visual data clustering [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] has
become a useful approach for such a task, because it enables
the identification of salient features coupled with
appropriate user intervention. Careful visual data clustering
can lead to the discovery of universalities hidden in target
datasets. In this vision paper, we strive to demonstrate
how evolutionary database design [2] can fully support
this kind of valuable scientific activity.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Evolutionary Schema Design</title>
      <p>This section proposes our evolutionary schema design
in relation to the visual discovery of universalities. We
use Unified Modeling Language (UML) [3] class
diagrams for conceptual design, followed by translations
into corresponding relational schemas.</p>
      <sec id="sec-2-1">
        <title>2.1. Sample Class</title>
        <p>A data matrix (multi-dimensional data samples) can be
formulated as the class Samples, consisting of n
attributes, as shown in Fig. 1. Samples have observational
relationships with each other, and these can be abstracted
by a recursive association, called Samples_Transit, also
shown in Fig. 1.</p>
        <p>The corresponding relational schema consists of the
following two third normal form (3NF) relation schemas:</p>
        <p>Samples(sample-ID, sa-1, sa-2, ..., sa-n)</p>
        <p>Samples_Transit(sample-ID_s, sample-ID_d, t-info).</p>
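        <p>The two relation schemas above can be sketched as executable table definitions. The following is a minimal illustration, not the authors' implementation: it assumes SQLite accessed through Python's standard sqlite3 module, n = 2 numeric attributes, and a numeric t-info. The snake_case column names and the helper create_sample_schema are hypothetical.</p>
        <preformat><![CDATA[
```python
import sqlite3


def create_sample_schema(conn, n_attributes=2):
    """Create Samples and Samples_Transit (illustrative names and types)."""
    attribute_columns = ", ".join(
        f"sa_{i} REAL" for i in range(1, n_attributes + 1)
    )
    conn.execute(
        f"CREATE TABLE Samples (sample_id INTEGER PRIMARY KEY, {attribute_columns})"
    )
    # Recursive association: both ends of a transition reference Samples.
    conn.execute(
        """
        CREATE TABLE Samples_Transit (
            sample_id_s INTEGER NOT NULL REFERENCES Samples(sample_id),
            sample_id_d INTEGER NOT NULL REFERENCES Samples(sample_id),
            t_info REAL,
            PRIMARY KEY (sample_id_s, sample_id_d)
        )
        """
    )


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
create_sample_schema(conn)

# Made-up rows, used only to exercise the schema.
conn.executemany(
    "INSERT INTO Samples VALUES (?, ?, ?)",
    [(1, 0.1, 0.2), (2, 0.3, 0.4), (3, 0.5, 0.6)],
)
conn.executemany(
    "INSERT INTO Samples_Transit VALUES (?, ?, ?)",
    [(1, 2, 1.0), (2, 3, 0.5)],
)

# Samples reachable from sample 1 in a single transition.
successors = conn.execute(
    "SELECT sample_id_d, t_info FROM Samples_Transit WHERE sample_id_s = ?",
    (1,),
).fetchall()
```
]]></preformat>
        <p>Declaring both ends of Samples_Transit as foreign keys to Samples is one way to express the recursive association. The composite primary key assumes at most one transition per ordered pair of samples.</p>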
        <p>Actual instances of Samples and the mutual relationships
between the instances clearly form a weighted directed
graph and are usually visualized with a node-and-link
diagram. In the case of many Samples and dense mutual
relationships, such a diagram often suffers from visual
clutter artifacts.</p>
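        <p>To make the graph view concrete, the sketch below builds a weighted directed graph from Samples_Transit-style rows and computes its edge density, used here as a rough and purely illustrative proxy for how cluttered a node-and-link diagram may become. The function names, the tuple layout (source, destination, t-info), and the choice of density as a clutter indicator are assumptions made for this example. Self-loops are assumed to be absent.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict


def build_transit_graph(transit_rows):
    """Return {source: {destination: weight}} from (source, destination, t_info) rows."""
    graph = defaultdict(dict)
    for source, destination, weight in transit_rows:
        graph[source][destination] = weight
        graph.setdefault(destination, {})  # keep nodes without outgoing links
    return dict(graph)


def edge_density(graph):
    """Fraction of possible directed links (self-loops excluded) that are present."""
    node_count = len(graph)
    if node_count < 2:
        return 0.0
    edge_count = sum(len(targets) for targets in graph.values())
    return edge_count / (node_count * (node_count - 1))


# Made-up transitions among three samples.
rows = [(1, 2, 1.0), (2, 3, 0.5), (3, 1, 0.2)]
graph = build_transit_graph(rows)
density = edge_density(graph)
```
]]></preformat>
        <p>With three nodes and three of the six possible directed links present, the density here is 0.5. A value approaching 1 on a large sample set corresponds to the dense case in which the diagram becomes hard to read.</p>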
        <p>In a conventional visualization, clustering alone does not
resolve visual clutter artifacts: each cluster may be
accentuated by an ellipse, but the original inter-instance links
usually remain unchanged.</p>
        <p>Here, we consider making explicit the universalities
found in the Samples instances. Specifically, if
associations between Samples instances are commonly
observed in the same pair of Clusters, we propose to
upgrade the mutual relationships between Samples to
mutual associations between Clusters, also shown in
Fig. 2.</p>
        <p>At this point, provided that an evolutionary data
management environment is available, the corresponding
relational schema can be reformulated using the
following three 3NF relation schemas:</p>
        <p>Samples(sample-ID, cluster-ID, sa-1, sa-2, ..., sa-n)</p>
        <p>Clusters(cluster-ID, ca-1, ca-2, ..., ca-n)</p>
        <p>Clusters_Transit(cluster-ID_s, cluster-ID_d, meta_t-info).</p>
        <p>Note that the aggregation Belong_to is realized via the
foreign key cluster-ID in the new definition of the
relation schema Samples. Note also that the relation schema
Clusters_Transit has meta_t-info, which can be derived
from the t-info values of the Samples belonging to the clusters.</p>
        <p>It would be interesting to describe the occurrence
probability as an attribute of meta_t-info. As a by-product
of such a universality specification, the number of
inter-cluster associations can be reduced drastically, resulting
in a simplified visualization.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Subsample Class</title>
        <p>For each instance of Clusters, idiosyncratic attributes
may have to be specified. To manage such attributes
efficiently, we propose to define a new class, Subsamples,
as a specialization of Samples, as shown in Fig. 3.</p>
        <p>Figure 3: Subsamples class.</p>
        <p>The corresponding relational schema consists of the
following (m + 1) 3NF relation schemas:</p>
        <p>Samples(sample-ID, sa-1, sa-2, ..., sa-n)</p>
        <p>Subsamples_i(sample-ID, ssa-1, ssa-2, ..., ssa-n_i) (i = 1, ..., m).</p>
        <p>Note that the specialization IS_A is naturally realized
by the common primary key sample-ID in the relation
schemas. The idiosyncratic attributes of Subsamples
may be used to derive new attributes of Clusters. From
the viewpoint of big data visual analytics, a remarkable
advantage of idiosyncratic attribute separation lies in its
ability to avoid the explosion of inapplicable null values
in the single relation Samples.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Case Study</title>
      <p>Blazars are the brightest and most energetic objects in
the universe. To demystify the physics of the magnetic
field within a relativistic jet ejected from the central black
hole of a blazar, the light from a blazar is regularly
observed. The Hiroshima Astrophysical Science Center
(HASC) has scrutinized optical photo-polarimetric and
near-infrared observation datasets to identify
characteristic blazar behaviors, such as light bursts (i.e., flares) and
rotated polarization (i.e., rotation), and to explore recurring
time-variation patterns. TimeTubesX [4, 5] is an
integrated visual analytics environment that allows blazar
researchers to analyze long-term, multi-dimensional blazar
observation datasets efficiently and in detail. This
section applies the evolutionary schema design of
Sec. 2 to sophisticated data management in the
TimeTubesX system.</p>
      <sec id="sec-3-1">
        <title>3.1. Data</title>
        <p>The HASC has observed the polarization, intensity, and
color (C) of the light from a blazar, where the linear
polarization is described by three Stokes parameters, I,
Q, and U, with I denoting the total intensity of the
polarized and unpolarized components, Q the intensity of
the linear horizontal or vertical polarization components,
and U the intensity of the linear +π/4 or −π/4
polarization components. Instead of Q and
U, we mainly utilize q and u, which are obtained by
dividing Q and U by I, because q and u explain blazar
behaviors better than Q and U. The observation errors
of q and u are denoted by e_q and e_u, respectively. The
space spanned by q and u is termed the Stokes plane
(Fig. 4a).</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Visual Clustering</title>
        <p>To enable blazar researchers to examine universalities in
blazar datasets, TimeTubesX provides them with
time-varying multi-dimensional subsequence clustering
methods [5], together with a designated set of visual
analysis methods, including the advanced sample retrieval
functionalities query-by-example and query-by-sketch [4].
The clustering methods extract subsequences of
various lengths from a long-term observation dataset,
taking missing data and observation frequencies into account, and
then filter subsequences with overlapping features.
The clustering methods consider correlations among
variables and compute the means of subsequences without
smoothing out their features.</p>
        <p>The timeline view of TimeTubesX in Fig. 5
summarizes the temporal distributions of the six clusters found,
which are distinguished by different stripe colors.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Inter-flare Cluster Transitions</title>
        <p>The corresponding relational schema consists of
the following five 3NF relation schemas, where the
composition is naturally realized via the foreign key
ss-ID in the relation schema Samples:</p>
        <p>Samples(sample-ID, ss-ID, time, Q, U, e_q, e_u, I, C)</p>
        <p>FlareSamples(sample-ID, PD, PA, q, u)</p>
        <p>Subsequences(ss-ID, cluster-ID, flareID, length, cor, angle)</p>
        <p>Clusters(cluster-ID, #subsequences, cluster_prototype)</p>
        <p>Is_followed_by(cluster-ID_s, cluster-ID_d, transit-prob).</p>
      </sec>
    </sec>
    <sec id="sec-conclusion">
      <p>In this paper, we demonstrated the possibility of bridging
three worlds, i.e., visual analytics, universality
discovery, and database refactoring. Through the application
of the present methodology to the practical problem of
blazar observation, we empirically showed that
universality identification based on visual data clustering is
strongly supported by evolutionary schema design.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work has been partially supported by the
Grant-in-Aid for Challenging Research (Pioneering) JP20K20481.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sips</surname>
          </string-name>
          ,
          <article-title>Visual clustering</article-title>
          , in:
          <source>Encyclopedia of Database Systems</source>
          , Springer, Boston, MA,
          <year>2009</year>
          , pp.
          <fpage>3350</fpage>
          -
          <lpage>3360</lpage>
          . doi:10.1007/978-0-387-39940-9_1124.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. W.</given-names>
            <surname>Ambler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Sadalage</surname>
          </string-name>
          ,
          <source>Refactoring Databases</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Booch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rumbaugh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Jacobson</surname>
          </string-name>
          ,
          <source>The Unified Modeling Language User Guide</source>
          , 2nd ed., Addison-Wesley,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Sawada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Uemura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pfister</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Fujishiro</surname>
          </string-name>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>