<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Visualizing Redundancies caused by Multivalued Dependencies in Relational Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christoph Köhnen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chair of Scalable Database Systems, University of Passau</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Based on a novel visualization of redundancies caused by functional dependencies in relational data, this work motivates an extension to multivalued dependencies. The earlier work is inspired by the visualization of dental plaque and colors cells in relational data with hues depending on their grade of redundancy, which corresponds to a lower information content. Lost values in a relation instance can be restored using given functional dependencies and other tuples. This efect also holds for multivalued dependencies. However, restoring data using this dependency tends to require more data from the rest of the relation. Thus, the redundancies caused by multivalued dependencies are much lower than the ones caused by functional dependencies. The optimizations to reduce the computation efort from the original work generally do not hold for multivalued dependencies, so a future challenge is to adapt these conditions to these dependencies. Our vision is to ofer the plaque test for more general kinds of dependencies as an eficient tool to visualize redundant data.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Functional Dependencies</kwd>
        <kwd>Multivalued Dependencies</kwd>
        <kwd>Relational Model</kwd>
        <kwd>Information Theory</kwd>
        <kwd>Entropy</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>ID AlbumTitle Band</p>
      <p>BY</p>
      <p>RY</p>
      <p>TR TrackTitle</p>
      <p>
        1
1 Not That. . . Anastacia 1999 2000 1 Not Th. . . 0.9
Visualization in relational data is an important topic. 1 Not That. . . Anastacia 1999 2000 2 I’m Out. . . 0.8
There exist several models to visualize functional de- 1 Not That. . . Anastacia 1999 2000 3 Cowboy. . .
pendencies, a cornerstone of database normalization, e.g., 2 Wish You. . . Pink Floyd 1965 1975 1 Shine O. . . 0.7
in sunburst diagrams or graph-based visualizations [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. 3 Freak of. . . Anastacia 1999 2001 1 Paid my. . . 0.6
These approaches can be helpful for data exploration. Figure 1: Plaque test for the given relation with the functional
However, these models visualize the functional depen- dependencies from (‹). Cell hues correspond to entropy values,
dencies stand-alone, without taking specific relation in- defined in the color scale. Figure adapted from [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
stances into account. Hence, redundancies in databases
can hardly be identified in such a visualization.
      </p>
      <p>
        Database schemas in suficiently high normal form the entropy values in relation instances.
avoid redundancies and prevent update anomalies. Are- The concept of “plaque test” for functional
dependennas and Libkin [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] define a schema as well-designed if and cies [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is illustrated in Figure 1. The relation1 shown
only if there is no possible instance containing redun- manages a CD collection. Each CD has an identifier (ID),
dancies, which, in the case of functional dependencies, a title (AlbumTitle), and was released in a given year
equivalently means that the schema is in Boyce-Codd (RYear/RY) by a band (Band), which was founded in a
normal form. Such redundancies caused by functional given year (BYear/BY). Each CD has tracks (Track/TR)
dependencies are visualized by a novel principle, which is and each track has a title (TrackTitle).
presented in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and called “plaque test”. In this approach, We consider the following functional dependencies:
each cell in a relation instance is colored according to its
value of information content (or entropy), which depends ID Ñ AlbumTitle, Band, BYear, RYear
aonviasucahloizsaetniosnetcoafnfbuelfilhleedlpffuunlcttoioanssailsdt edpaetanbdaesnecnieosr.mSualcihza- ID, Track Ñ TrackTitle and Band Ñ BYear (‹)
tion and even for teaching normal forms. The theory of
computing this information content originates from [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
and applies not only to functional dependencies but even
to generalized dependencies. In this work, we take the
obvious next step by extending the “plaque test” to
multivalued dependencies and discuss their contribution to
      </p>
      <sec id="sec-1-1">
        <title>The result of the “plaque test” for the functional de</title>
        <p>pendencies in (‹) is shown in the original instance in
Figure 1. The color hues are chosen corresponding to the
cell’s entropy, as defined in the legend on the right: the
more redundant the value, the smaller the entropy and
the deeper the blue. For white cells with entropy 1 we
say that this cell has full information content.
35th GI-Workshop on Foundations of Databases (Grundlagen von
Datenbanken), May 22-24, 2024, Herdecke, Germany.
$ christoph.koehnen@uni-passau.de (C. Köhnen)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).</p>
      </sec>
      <sec id="sec-1-2">
        <title>1Example from the German Wikipedia site on database normaliza</title>
        <p>tion, at https://de.wikipedia.org/wiki/Normalisierung_(Datenbank),
last accessed April 2024.</p>
        <p>
          If we consider the attribute “BYear”, lost values can be entropy-related information content in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] we need to
recovered by checking the values in other tuples together identify each tuple unambiguously. We introduce the
with two functional dependencies, while the values in computation of the information content and present
al“AlbumTitle”, “Band” and “RYear” can only be recovered gorithmic optimizations and an approximation approach.
using one functional dependency. This is a reason for
the lower entropy values in “BYear”. Additionally, this 2.1. Relations, Positions and Variables
column has diferently high redundancies. This is due
to the fact that the founding year in the row with ID 1 In the following, the set of positive integers is denoted by
can be restored either with the functional dependency N :“ t1, 2, . . . u. We define a partial map, representing
“ID Ñ BYear” or with “Band Ñ BYear”, while for the an order-preserving relation, allowing duplicate tuples.
row with ID 3 this can only be done with the latter one. Definition 2.1. A relation  of arity  is specified by a
fi
        </p>
        <p>
          As stated above, the entropies depend on the choice nite set sortpq :“ t1, . . . , u of attributes . The
of functional dependencies. Adding the dependency domain of an attribute  is denoted by Dompq. Then
“BYear Ñ Band”, which is fulfilled in this instance, causes an instance  of the relation  is defined as a partial map
lower entropies in “Band”. In general, adding dependen-  : N á Domp1qˆ. . .ˆDompq with finite domain
cies to the chosen set might reduce the entropies but of definition Defpq :“ t P N | pq is defined u.
can never increase them. Choosing the set of all fulfilled For simplification, we assume Dompq “ N for all
functional dependencies would cause the lowest possible  P sortpq, i.e., each tuple consists of positive integers.
entropies. This case is considered as an example in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>
          A logical next step is the extension of the “plaque test” Definition 2.2. Let  be a relation and  an instance of .
to other dependencies, since the theoretical framework For a subset  Ď sortpq of attributes, we have the
from [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] applies to general constraints. We will consider projection   pq of the tuples in  to the attributes in  .
a relation with multivalued dependencies and compute Given an instance  of a relation  we denote
the entropies with these dependencies as input. If the by  rs the value of the attribute  in the
constraint set consists only of functional dependencies, th tuple  :“ pq of . Similarly, for a subset
there are two rules presented in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]: immediately identify  “ t1, . . . , u Ď sortpq of attributes, we
cells with an entropy of 1 and delete irrelevant rows and denote by  r1 . . . s or  r s the tuple of values
columns to execute the computation on a smaller table.  r1s, . . . ,  rs, that is, the -th tuple   ppqq of
However, these rules cannot be applied in the same way if the projection   pq. For subsets ,  Ď sortpq we
the set of constraints contains multivalued dependencies. define  r s :“  r Y  s.
        </p>
        <p>Contributions. We build on a visualization approach Definition 2.3. Let  be an instance of a relation  with
for redundancies in relational databases caused by func- attributes sortpq “ t1, . . . , u. A position in  is
tional dependencies, which relies on an existing theo- a pair p, q with  P Defpq and  P t1, . . . , u; it
retical framework. We discuss the generalization of this represents the cell of the -th attribute of the -th tuple,
approach to multivalued dependencies, which is in the with value  rs. The instance obtained by putting the
scope of the theoretical framework, but has never been value  at position  “ p, q is denoted by Ð.
implemented or algorithmically optimized for implemen- Let  “ t1, . . . , u be positions,  “ p1, . . . , q
tation. We identify challenges in transferring simplifica- distinct variables, and  “ p1, . . . , q values in N.
tions for the redundancy computation, which are valid for The instance obtained by replacing each position  by
functional dependencies, to multivalued dependencies. the variable  resp. by value  for 1 ď  ď  is denoted</p>
        <p>Structure. In Section 2 we define the preliminaries, by Ð resp. Ð . Hence, the instance obtained by
such as functional and multivalued dependencies and in- replacing the positions from  by variables, and then
formation content. In Section 3 we consider related work position  by the value , is pÐ qÐ.
on functional dependencies and information theory. In
Section 4 we present results for schemas with functional
dependencies and a possible extension to multivalued
dependencies. In Section 5 we discuss the next steps.</p>
      </sec>
      <sec id="sec-1-3">
        <title>We choose to define  as a set, but  and  as tuples.</title>
        <p>
          The reason is that the information content defined in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ],
which we will introduce in Section 2.3, considers sets of
lost values without duplicates and a fixed order, while
variables and values are determined for a fixed position
and, in the case of values, duplicates are possible.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminaries</title>
      <sec id="sec-2-1">
        <title>We first introduce relations and functional dependencies as in [3], that is, we take into account the order of the tuples. We also use this principle for the definition of multivalued dependencies. Due to the definition of the</title>
        <sec id="sec-2-1-1">
          <title>2.2. Dependencies</title>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>In this section, we define syntax and semantics of functional and multivalued dependencies. We start by defining functional dependencies as in [4].</title>
        <p>Definition 2.4. A functional dependency in a relation  and multivalued dependencies, for a considered position
is an expression of the form  :“  Ñ  , where ,  Ď  P Pos, we define two probability spaces as follows.
sortpq are sets of attributes. It is called simple, if  First, let ℬp, q :“ ppPos ztuq, ℬq, where ℬ
consists of exactly one attribute. is the uniform distribution on the set of all subsets of</p>
        <p>An instance  (without variables) of a relation  is Pos ztu. This space models the possible cases of a lost
said to fulfill the functional dependency  (write  |ù  ) set of possible values from the instance  on positions
if for all 1, 2 P Defpq it holds other than the considered one .</p>
        <p>
          Then, for  P N we let  p, q :“ pt1, . . . , u, q,
1 r s “ 2 r s ñ 1 r s “ 2 r s. where the conditional probability of  P t1, . . . , u given
Next, we define multivalued dependencies as in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].  Ď Pos ztu is equally distributed over  :“ t P
t1, . . . , u | pÐ qÐ |ù  u and zero otherwise.
        </p>
        <p>Definition 2.5. A multivalued dependency in a relation  This space models the possible values in t1, . . . , u at ,
is an expression of the form  :“  ↠  , where ,  Ď given that  is the set of lost value positions in .
sortpq are sets of attributes. With these two probability spaces, we can use the</p>
        <p>An instance  (without variables) of a relation  is said conditional entropy of ℬp, q given  p, q, which
to fulfill the multivalued dependency  (write  |ù  ) if is denoted as p p, q | ℬp, qq. This conditional
for all 1, 2 P Defpq with 1 r s “ 2 r s there exists entropy measures some kind of uncertainty at position ,
 P Defpq such that if the value is lost and putting in natural numbers up
to  is possible. From another perspective, one can say
1 r s “  r s and 2 r s “  r s, that the conditional entropy is a quantification of the
where  :“ sortpqzp Y  q is the set of all attributes information content at a position, if the value is present.
not covered by  and  . Intuitively, uncertainty, if a value is lost, corresponds to
information content, if the value is present.</p>
        <p>
          From here we consider  as a set of functional and We now present the information content of position
multivalued dependencies.  P Pos in instance  of relation  with respect to a
Definition 2.6. Let  and  be as above. The instance  set  of functional and multivalued dependencies with
(without variables) fulfills  (write  |ù  ) if  |ù  for  |ù  . While in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] the special case, where  consists
all  P  . only of functional dependencies, is used, Arenas and
        </p>
        <p>
          An instance  containing distinct variables at positions Libkin [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] define the information content for general
 “ t1, . . . , u fulfills  resp.  if there exist values dependencies. Thus, the definition also holds for sets 
 “ p1, . . . , q such that it holds Ð |ù  resp. containing functional and multivalued dependencies.
Ð |ù  .
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Definition 2.7. Let  ,  and  be as introduced above.</title>
        <p>
          To check  |ù  in the implementation, we apply the The information content of position  with respect to 
equivalence with  |ù  @ P  , which indeed holds for in instance  is given as
instances  without variables. If  contains variables this
Ietqculievaarlleynhceolcdasnifbe siescculroesdedunudnedreradlodgitiicoanlailmcpolnicdaittiioonnss.. INF p |  q :“ lÑim8 INFlopg|  q ,
If  only contains functional dependencies, it sufices where INFp |  q :“ p p, q | ℬp, qq is the
to require the transitive closeness of  . In the sense conditional entropy of  p, q given ℬp, q.
of the theoretical framework [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] the computation leads The information content for possible values limited to
to the same results regardless if  is a canonical cover,  is normalized by log  and the limit of  to infinity is
closed under logical implications or in between. However, taken to cover the entire set of natural numbers.
applying the above equivalence can lead to incorrect Since the space ℬp, q consists of all subsets of
posiresults for non-closed  . Therefore, we ofer an option to tions other than , its size grows exponentially with the
compute the closure prior to entropy calculations, which size of the relation. In the remainder of the section, we
can be disabled if the closeness of the given set  is clear introduce optimizations and an approximate approach
to reduce the computation time. from [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] to overcome this performance problem.
        </p>
        <sec id="sec-2-3-1">
          <title>2.3. Entropy and Information Content</title>
          <p>
            2.3.1. Optimizations
We now introduce the information content using the We illustrate the optimizations from [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] with an
ex(information-theoretic) entropy. Given two probabil- ample. Consider instance  given in Figure 2 and the
ity spaces, the conditional entropy is introduced in [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] (singleton) set  :“ tA Ñ Cu of functional
deand [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. In an instance  with a set Pos :“ Defpq ˆ pendencies. The two ways to optimize the
computat1, . . . , u of positions and a set  of functional tion are to directly identify positions with entropy 1
and to reduce the size of the problem. Note that it Then Ers “ INF p |  q and for independent
identiis still an open question whether these optimizations cally distributed 1, . . . ,  as above and their average
1 ř
hold for a set containing multivalued dependencies.  :“  “1 , for all ,  ą 0 it holds
Pr `| ´ INF p |  q| ě ˘ ď 
A B C D Identify ones.A position  :“ p, q
7 2 8 4 has an entropy of 1 if and only if for all provided that  ě 2 lnp2{ q{2. Here,  stands for the
5 2 8 6 functional dependencies  “  Ñ  P accuracy of the approximation,  for an error probability,
7 2 8 6  with  P  we have that  r s is thus the accuracy  holds with a confidence of 1 ´  for
unique in the (duplicate allowing) pro- a number of samples  as required in the inequality.
IFnisgtuanrece 2.: vjeacltuieonat popsiqti.onThiscamnenaontsbteharetsatolroesdt iteWrateioshnoswis caonmexpaumtepdlesuinchwthhaicthththeeennturmopbieers ohfasvueficai efixnetd
by other tuples and thus has an infor- accuracy under a specified confidence.
mation content of 1. Hence, if a column does not appear
on the right-hand side of any functional dependency Example 2.8. An accuracy of 0.01 with a confidence of
from  , the positions in the whole column have full 99% can be achieved with at least 2 lnp2{0.01q{0.012 «
information content. This optimization allows to skip 100, 000 iterations. For an accuracy of 0.04 with
the computation for all the positions with the introduced a confidence of 99.9% the suficient sample size is
property. In the example considered, all positions p, q 2 lnp2{0.001q{0.042 « 10, 000.
with  ‰  and position p2, q have full information
content and we only have two positions left to compute. 3. Related Work
          </p>
          <p>
            Reduce problem size. Rows and columns can be deleted
under conditions and the computation can be performed Functional dependencies are a well-studied and
estabon the resulting subtable. Afterwards, the result is em- lished research field. An approach to visualize functional
bedded in the original table. A column can be deleted if dependencies is a sunburst diagram [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. This
visualizathe attribute does not appear in any functional depen- tion can help humans process a large number of
funcdency from the given set  . A row can be deleted if all tional dependencies, but it is only on the schema level.
its positions have entropy 1, which can be checked with Conversely, Lee [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] proposes an information-theoretic
the previous optimization. In the example, the columns approach to analyze relational databases. In contrast
 and  can be deleted, as well as row 2. The resulting to the “plaque test”, given a concrete instance, the
selfsubtable has only four positions, and thus the number of information of a set of attributes using entropies is
comsubsets to be considered is reduced by a factor 212. puted, but without considering the entropies of individual
cells. We are not aware of a visualization of this
frame2.3.2. Approximation work, however, this only considers the instance level.
The “plaque test” (see our earlier work in [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]) takes
For large tables, the entropy computation can be
expenschema and instance into account. For a given instance
sive. The number of subsets  Ď Pos ztu for a consid- and a set of functional dependencies, the information
ered position  grows exponentially with the table size.
content of each cell is calculated and visualized using
Hence, even small tables can have a long computation
the term of entropy. This information content is directly
time. The aforementioned optimizations may possibly be
connected with the redundancy of the cell, and each
insuficient, especially when the rows and columns to be
cell becomes a shade of blue depending on its degree
deleted are rare. In a randomized approach, not all
subof redundancy, like a heat map. Intuitively, this means
sets are considered, but only a fixed number of randomly
that depending on the given functional dependencies and
chosen ones. Depending on the number of samples, the
other tuples in the relation, lost values in a cell can be
result can be suficiently accurate with a high probability.
restored using these dependencies and the other tuples.
          </p>
          <p>
            The approach is called the Monte Carlo method and
inThe more available tuples or functional dependencies we
troduced in [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. Building on that method, we derived the
have, the higher is the chance to restore a lost value, and
entropy computation in our earlier work [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. The method
thus the higher is the redundancy in this considered cell.
holds for general dependencies; thus our derivation also
The theoretical framework for the “plaque test” is
proapplies for multivalued dependencies.
          </p>
          <p>
            Let INF p | , q :“ limÑ8 loglo#g represent vcoidnetdenintg[e2n].erTahlliys fwoorraklliknitnrodsduocfecsonthsetrtaeinrmts iinnfroerlmataiotinoanl
the information content of position  in instance  with databases. The computations for the “plaque test” are
respect to a set  of functional or multivalued dependen- simplified in our earlier work [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] and implemented only
cies with  |ù  , given a fixed subset  Ď Pos ztu of for functional dependencies, and the next obvious step
lost values. We define random variables is to extend this implementation for multivalued
dependencies. While the theoretical considerations hold for
 :  pPos ztuq Ñ r0, 1s,
 ÞÑ INF p | , q.
          </p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>3Example adapted from English Wikipedia site on multivalued depen</title>
        <p>dencies, at https://en.wikipedia.org/wiki/Multivalued_dependency,
last accessed April 2024.</p>
      </sec>
      <sec id="sec-2-5">
        <title>2http://webdatacommons.org/webtables/index.html#results-2015</title>
        <p>Course Book Lecturer Course Book Lecturer ing the “plaque test” for teaching purposes.
AHA Silber. . . John. . . 0.9999 0.9968 0.9958 Optimizations for multivalued dependencies. The
optiAHA Neder. . . John. . . 0.9986 0.9992 0.9957 mizations from Section 2.3.1 hold in relations with
funcAHA Silber. . . Will. . . 0.9992 0.9975 0.9996 tional dependencies, however, cannot be immediately
AAHHAA SNielbdeerr...... CWhilrli.s.... . 00..99999979 00..99999839 00..99999948 transferred to multivalued dependencies. Hence, the
reAHA Neder. . . Chris. . . 0.9999 0.9996 0.9998 duction of the problem size seems to be more complicated
OSO Silber. . . John. . . 1 0.9998 1 with multivalued dependencies involved. Even if only
OSO Silber. . . Will. . . 1 0.9998 1 a part of the attributes appears in a multivalued
depenSQL Introd. . . Raym. . . 1 1 1 dency, the value of all attributes can be relevant for its
(a) The relational input data. (b) Entropies for (‹‹). validation. This reduces the possibility to delete columns
in the original relation. Hence, finding optimizations
Figure 4: Plaque test for the given relation with multivalued holding also for multivalued dependencies, which are
dependencies from (‹‹). Hues correspond to entropy values. backed by formal proofs, requires theoretical research.</p>
        <p>Scalability. Even with the prior simplifications, there
are many scenarios where the entropy computation is</p>
        <p>Figure 4 shows the plaque test, with these two mul- practically out of reach. The Monte Carlo method avoids
tivalued dependencies as input. It contains the relation exponential runtime. For functional dependencies, 150
tuin Figure 4a and a table with entropies and colored cells, ples are manageable. One goal is to execute computations
analogously to entropies for functional dependencies, in for even larger datasets, possible approaches to address
Figure 4b. While full entropies are exact, the other values this are parallelization and dynamic programming.
are rounded to four digits after the dot. Due to many entropies, caused by multivalued
depen</p>
        <p>The resulting table shows small redundancies com- dencies, close to 1, false positive entropies of 1 are likely
pared to those caused by functional dependencies in the when applying the Monte Carlo method only. Without
introductory example. This indicates that the subsets of an optimization to find cells with full information content
lost values such that a value in a considered position can immediately, these false positives cannot be identified.
be restored using the given multivalued dependencies Hence, finding a combination of optimizations and
apand the rest of the relation instance are very rare. proximation, as for functional dependencies but adapted</p>
        <p>Applying the Monte Carlo method can lead to false to multivalued dependencies, is another challenge.
positive cases of full information content, most probably User study. We are planning to use the “plaque test”
if the chosen number of iterations is small. This error for an experiment involving students. Our assumption is
appears when there is no randomly chosen subset such that a visualization of redundancies can be helpful in
findthat the value in a considered position can be restored. ing normalization approaches. The question is whether
This consideration and the small computed redundancies students using such a visualization tool can easily suspect
indicate that the Monte Carlo method in instances with ways for a suitable decomposition of a schema containing
multivalued dependencies should be applied with a better several redundancies and the user study shall give us an
accuracy than with functional dependencies. idea if the “plaque” is helpful. If the results of the study</p>
        <p>Finding reasons for the small diferences in the en- confirm our assumption, we see potential for the “plaque
tropies over the relation can be a challenge. Another test” as a tool to support the teaching of normal forms.
opportunity for future work will be to find optimizations The experimental schemas shall contain functional as
analogously to the case of functional dependencies. It well as multivalued dependencies. The results can be
involves finding a necessary and suficient condition for helpful to understand the interpretability of the plaque.
cells with full information content, so that false positive Summarizing all the possible challenges, a combination
entropies of 1 can be excluded. It also contains finding of theory, algorithmic, and user study requires skills in
conditions for reducing the problem size to make bigger diferent disciplines and allows us to view the problem
problems practically computable. of redundancies from several perspectives.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Open Research Challenges</title>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>We see several future challenges in our visualization ap- I thank Stefanie Scherzinger, Jens Zumbrägel, and Stefan
proach of redundancies in relational data. We discuss Klessinger for their valuable discussions, suggestions,
the problems of optimizing the computation efort for and support on this topic.
constraint sets with multivalued dependencies, the
scalability of the computation when applying the Monte Carlo
method and a user study to elaborate the potential of
us</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kruse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Walter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <article-title>Metacrate: Organize and Analyze Millions of Data Profiles</article-title>
          ,
          <source>in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Arenas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Libkin</surname>
          </string-name>
          ,
          <article-title>An Information-theoretic Approach to Normal Forms for Relational and XML Data, in: Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-</article-title>
          <source>SIGART Symposium on Principles of Database Systems</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Köhnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Klessinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zumbrägel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Scherzinger</surname>
          </string-name>
          ,
          <article-title>A Plaque Test for Redundancies in Relational Data</article-title>
          ,
          <source>in: Joint Proceedings of Workshops at the 49th International Conference on Very Large Data Bases</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Abiteboul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vianu</surname>
          </string-name>
          , Foundations of Databases, Addison-Wesley,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>An Information-Theoretic Analysis of Relational Databases-Part I: Data Dependencies and Information Metric</article-title>
          ,
          <source>IEEE Transactions on Software Engineering SE-13</source>
          (
          <year>1987</year>
          )
          <fpage>1049</fpage>
          -
          <lpage>1061</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitzenmacher</surname>
          </string-name>
          , E. Upfal,
          <article-title>Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis</article-title>
          , Cambridge University Press,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>