1. Introduction

Visualizing Redundancies caused by Multivalued Dependencies in Relational Data

Christoph Köhnen

0 0 Chair of Scalable Database Systems, University of Passau , Germany

Based on a novel visualization of redundancies caused by functional dependencies in relational data, this work motivates an extension to multivalued dependencies. The earlier work is inspired by the visualization of dental plaque and colors cells in relational data with hues depending on their grade of redundancy, which corresponds to a lower information content. Lost values in a relation instance can be restored using given functional dependencies and other tuples. This efect also holds for multivalued dependencies. However, restoring data using this dependency tends to require more data from the rest of the relation. Thus, the redundancies caused by multivalued dependencies are much lower than the ones caused by functional dependencies. The optimizations to reduce the computation efort from the original work generally do not hold for multivalued dependencies, so a future challenge is to adapt these conditions to these dependencies. Our vision is to ofer the plaque test for more general kinds of dependencies as an eficient tool to visualize redundant data.

eol>Functional Dependencies Multivalued Dependencies Relational Model Information Theory Entropy

1. Introduction

ID AlbumTitle Band

TR TrackTitle

1 1 Not That. . . Anastacia 1999 2000 1 Not Th. . . 0.9 Visualization in relational data is an important topic. 1 Not That. . . Anastacia 1999 2000 2 I’m Out. . . 0.8 There exist several models to visualize functional de- 1 Not That. . . Anastacia 1999 2000 3 Cowboy. . . pendencies, a cornerstone of database normalization, e.g., 2 Wish You. . . Pink Floyd 1965 1975 1 Shine O. . . 0.7 in sunburst diagrams or graph-based visualizations [ 1 ]. 3 Freak of. . . Anastacia 1999 2001 1 Paid my. . . 0.6 These approaches can be helpful for data exploration. Figure 1: Plaque test for the given relation with the functional However, these models visualize the functional depen- dependencies from (‹). Cell hues correspond to entropy values, dencies stand-alone, without taking specific relation in- defined in the color scale. Figure adapted from [ 3 ]. stances into account. Hence, redundancies in databases can hardly be identified in such a visualization.

Database schemas in suficiently high normal form the entropy values in relation instances. avoid redundancies and prevent update anomalies. Are- The concept of “plaque test” for functional dependennas and Libkin [ 2 ] define a schema as well-designed if and cies [ 3 ] is illustrated in Figure 1. The relation1 shown only if there is no possible instance containing redun- manages a CD collection. Each CD has an identifier (ID), dancies, which, in the case of functional dependencies, a title (AlbumTitle), and was released in a given year equivalently means that the schema is in Boyce-Codd (RYear/RY) by a band (Band), which was founded in a normal form. Such redundancies caused by functional given year (BYear/BY). Each CD has tracks (Track/TR) dependencies are visualized by a novel principle, which is and each track has a title (TrackTitle). presented in [ 3 ] and called “plaque test”. In this approach, We consider the following functional dependencies: each cell in a relation instance is colored according to its value of information content (or entropy), which depends ID Ñ AlbumTitle, Band, BYear, RYear aonviasucahloizsaetniosnetcoafnfbuelfilhleedlpffuunlcttoioanssailsdt edpaetanbdaesnecnieosr.mSualcihza- ID, Track Ñ TrackTitle and Band Ñ BYear (‹) tion and even for teaching normal forms. The theory of computing this information content originates from [ 2 ] and applies not only to functional dependencies but even to generalized dependencies. In this work, we take the obvious next step by extending the “plaque test” to multivalued dependencies and discuss their contribution to

The result of the “plaque test” for the functional de

pendencies in (‹) is shown in the original instance in Figure 1. The color hues are chosen corresponding to the cell’s entropy, as defined in the legend on the right: the more redundant the value, the smaller the entropy and the deeper the blue. For white cells with entropy 1 we say that this cell has full information content. 35th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), May 22-24, 2024, Herdecke, Germany. $ christoph.koehnen@uni-passau.de (C. Köhnen) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1Example from the German Wikipedia site on database normaliza

tion, at https://de.wikipedia.org/wiki/Normalisierung_(Datenbank), last accessed April 2024.

If we consider the attribute “BYear”, lost values can be entropy-related information content in [ 2 ] we need to recovered by checking the values in other tuples together identify each tuple unambiguously. We introduce the with two functional dependencies, while the values in computation of the information content and present al“AlbumTitle”, “Band” and “RYear” can only be recovered gorithmic optimizations and an approximation approach. using one functional dependency. This is a reason for the lower entropy values in “BYear”. Additionally, this 2.1. Relations, Positions and Variables column has diferently high redundancies. This is due to the fact that the founding year in the row with ID 1 In the following, the set of positive integers is denoted by can be restored either with the functional dependency N :“ t1, 2, . . . u. We define a partial map, representing “ID Ñ BYear” or with “Band Ñ BYear”, while for the an order-preserving relation, allowing duplicate tuples. row with ID 3 this can only be done with the latter one. Definition 2.1. A relation of arity is specified by a fi

As stated above, the entropies depend on the choice nite set sortpq :“ t1, . . . , u of attributes . The of functional dependencies. Adding the dependency domain of an attribute is denoted by Dompq. Then “BYear Ñ Band”, which is fulfilled in this instance, causes an instance of the relation is defined as a partial map lower entropies in “Band”. In general, adding dependen- : N á Domp1qˆ. . .ˆDompq with finite domain cies to the chosen set might reduce the entropies but of definition Defpq :“ t P N | pq is defined u. can never increase them. Choosing the set of all fulfilled For simplification, we assume Dompq “ N for all functional dependencies would cause the lowest possible P sortpq, i.e., each tuple consists of positive integers. entropies. This case is considered as an example in [ 3 ].

A logical next step is the extension of the “plaque test” Definition 2.2. Let be a relation and an instance of . to other dependencies, since the theoretical framework For a subset Ď sortpq of attributes, we have the from [ 2 ] applies to general constraints. We will consider projection pq of the tuples in to the attributes in . a relation with multivalued dependencies and compute Given an instance of a relation we denote the entropies with these dependencies as input. If the by rs the value of the attribute in the constraint set consists only of functional dependencies, th tuple :“ pq of . Similarly, for a subset there are two rules presented in [ 3 ]: immediately identify “ t1, . . . , u Ď sortpq of attributes, we cells with an entropy of 1 and delete irrelevant rows and denote by r1 . . . s or r s the tuple of values columns to execute the computation on a smaller table. r1s, . . . , rs, that is, the -th tuple ppqq of However, these rules cannot be applied in the same way if the projection pq. For subsets , Ď sortpq we the set of constraints contains multivalued dependencies. define r s :“ r Y s.

Contributions. We build on a visualization approach Definition 2.3. Let be an instance of a relation with for redundancies in relational databases caused by func- attributes sortpq “ t1, . . . , u. A position in is tional dependencies, which relies on an existing theo- a pair p, q with P Defpq and P t1, . . . , u; it retical framework. We discuss the generalization of this represents the cell of the -th attribute of the -th tuple, approach to multivalued dependencies, which is in the with value rs. The instance obtained by putting the scope of the theoretical framework, but has never been value at position “ p, q is denoted by Ð. implemented or algorithmically optimized for implemen- Let “ t1, . . . , u be positions, “ p1, . . . , q tation. We identify challenges in transferring simplifica- distinct variables, and “ p1, . . . , q values in N. tions for the redundancy computation, which are valid for The instance obtained by replacing each position by functional dependencies, to multivalued dependencies. the variable resp. by value for 1 ď ď is denoted

Structure. In Section 2 we define the preliminaries, by Ð resp. Ð . Hence, the instance obtained by such as functional and multivalued dependencies and in- replacing the positions from by variables, and then formation content. In Section 3 we consider related work position by the value , is pÐ qÐ. on functional dependencies and information theory. In Section 4 we present results for schemas with functional dependencies and a possible extension to multivalued dependencies. In Section 5 we discuss the next steps.

We choose to define as a set, but and as tuples.

The reason is that the information content defined in [ 2 ], which we will introduce in Section 2.3, considers sets of lost values without duplicates and a fixed order, while variables and values are determined for a fixed position and, in the case of values, duplicates are possible.

2. Preliminaries We first introduce relations and functional dependencies as in [3], that is, we take into account the order of the tuples. We also use this principle for the definition of multivalued dependencies. Due to the definition of the 2.2. Dependencies In this section, we define syntax and semantics of functional and multivalued dependencies. We start by defining functional dependencies as in [4].

Definition 2.4. A functional dependency in a relation and multivalued dependencies, for a considered position is an expression of the form :“ Ñ , where , Ď P Pos, we define two probability spaces as follows. sortpq are sets of attributes. It is called simple, if First, let ℬp, q :“ ppPos ztuq, ℬq, where ℬ consists of exactly one attribute. is the uniform distribution on the set of all subsets of

An instance (without variables) of a relation is Pos ztu. This space models the possible cases of a lost said to fulfill the functional dependency (write |ù ) set of possible values from the instance on positions if for all 1, 2 P Defpq it holds other than the considered one .

Then, for P N we let p, q :“ pt1, . . . , u, q, 1 r s “ 2 r s ñ 1 r s “ 2 r s. where the conditional probability of P t1, . . . , u given Next, we define multivalued dependencies as in [ 5 ]. Ď Pos ztu is equally distributed over :“ t P t1, . . . , u | pÐ qÐ |ù u and zero otherwise.

Definition 2.5. A multivalued dependency in a relation This space models the possible values in t1, . . . , u at , is an expression of the form :“ ↠ , where , Ď given that is the set of lost value positions in . sortpq are sets of attributes. With these two probability spaces, we can use the

An instance (without variables) of a relation is said conditional entropy of ℬp, q given p, q, which to fulfill the multivalued dependency (write |ù ) if is denoted as p p, q | ℬp, qq. This conditional for all 1, 2 P Defpq with 1 r s “ 2 r s there exists entropy measures some kind of uncertainty at position , P Defpq such that if the value is lost and putting in natural numbers up to is possible. From another perspective, one can say 1 r s “ r s and 2 r s “ r s, that the conditional entropy is a quantification of the where :“ sortpqzp Y q is the set of all attributes information content at a position, if the value is present. not covered by and . Intuitively, uncertainty, if a value is lost, corresponds to information content, if the value is present.

From here we consider as a set of functional and We now present the information content of position multivalued dependencies. P Pos in instance of relation with respect to a Definition 2.6. Let and be as above. The instance set of functional and multivalued dependencies with (without variables) fulfills (write |ù ) if |ù for |ù . While in [ 3 ] the special case, where consists all P . only of functional dependencies, is used, Arenas and

An instance containing distinct variables at positions Libkin [ 2 ] define the information content for general “ t1, . . . , u fulfills resp. if there exist values dependencies. Thus, the definition also holds for sets “ p1, . . . , q such that it holds Ð |ù resp. containing functional and multivalued dependencies. Ð |ù .

Definition 2.7. Let , and be as introduced above.

To check |ù in the implementation, we apply the The information content of position with respect to equivalence with |ù @ P , which indeed holds for in instance is given as instances without variables. If contains variables this Ietqculievaarlleynhceolcdasnifbe siescculroesdedunudnedreradlodgitiicoanlailmcpolnicdaittiioonnss.. INF p | q :“ lÑim8 INFlopg| q , If only contains functional dependencies, it sufices where INFp | q :“ p p, q | ℬp, qq is the to require the transitive closeness of . In the sense conditional entropy of p, q given ℬp, q. of the theoretical framework [ 2 ] the computation leads The information content for possible values limited to to the same results regardless if is a canonical cover, is normalized by log and the limit of to infinity is closed under logical implications or in between. However, taken to cover the entire set of natural numbers. applying the above equivalence can lead to incorrect Since the space ℬp, q consists of all subsets of posiresults for non-closed . Therefore, we ofer an option to tions other than , its size grows exponentially with the compute the closure prior to entropy calculations, which size of the relation. In the remainder of the section, we can be disabled if the closeness of the given set is clear introduce optimizations and an approximate approach to reduce the computation time. from [ 3 ] to overcome this performance problem.

2.3. Entropy and Information Content

2.3.1. Optimizations We now introduce the information content using the We illustrate the optimizations from [ 3 ] with an ex(information-theoretic) entropy. Given two probabil- ample. Consider instance given in Figure 2 and the ity spaces, the conditional entropy is introduced in [ 2 ] (singleton) set :“ tA Ñ Cu of functional deand [ 3 ]. In an instance with a set Pos :“ Defpq ˆ pendencies. The two ways to optimize the computat1, . . . , u of positions and a set of functional tion are to directly identify positions with entropy 1 and to reduce the size of the problem. Note that it Then Ers “ INF p | q and for independent identiis still an open question whether these optimizations cally distributed 1, . . . , as above and their average 1 ř hold for a set containing multivalued dependencies. :“ “1 , for all , ą 0 it holds Pr `| ´ INF p | q| ě ˘ ď A B C D Identify ones.A position :“ p, q 7 2 8 4 has an entropy of 1 if and only if for all provided that ě 2 lnp2{ q{2. Here, stands for the 5 2 8 6 functional dependencies “ Ñ P accuracy of the approximation, for an error probability, 7 2 8 6 with P we have that r s is thus the accuracy holds with a confidence of 1 ´ for unique in the (duplicate allowing) pro- a number of samples as required in the inequality. IFnisgtuanrece 2.: vjeacltuieonat popsiqti.onThiscamnenaontsbteharetsatolroesdt iteWrateioshnoswis caonmexpaumtepdlesuinchwthhaicthththeeennturmopbieers ohfasvueficai efixnetd by other tuples and thus has an infor- accuracy under a specified confidence. mation content of 1. Hence, if a column does not appear on the right-hand side of any functional dependency Example 2.8. An accuracy of 0.01 with a confidence of from , the positions in the whole column have full 99% can be achieved with at least 2 lnp2{0.01q{0.012 « information content. This optimization allows to skip 100, 000 iterations. For an accuracy of 0.04 with the computation for all the positions with the introduced a confidence of 99.9% the suficient sample size is property. In the example considered, all positions p, q 2 lnp2{0.001q{0.042 « 10, 000. with ‰ and position p2, q have full information content and we only have two positions left to compute. 3. Related Work

Reduce problem size. Rows and columns can be deleted under conditions and the computation can be performed Functional dependencies are a well-studied and estabon the resulting subtable. Afterwards, the result is em- lished research field. An approach to visualize functional bedded in the original table. A column can be deleted if dependencies is a sunburst diagram [ 1 ]. This visualizathe attribute does not appear in any functional depen- tion can help humans process a large number of funcdency from the given set . A row can be deleted if all tional dependencies, but it is only on the schema level. its positions have entropy 1, which can be checked with Conversely, Lee [ 5 ] proposes an information-theoretic the previous optimization. In the example, the columns approach to analyze relational databases. In contrast and can be deleted, as well as row 2. The resulting to the “plaque test”, given a concrete instance, the selfsubtable has only four positions, and thus the number of information of a set of attributes using entropies is comsubsets to be considered is reduced by a factor 212. puted, but without considering the entropies of individual cells. We are not aware of a visualization of this frame2.3.2. Approximation work, however, this only considers the instance level. The “plaque test” (see our earlier work in [ 3 ]) takes For large tables, the entropy computation can be expenschema and instance into account. For a given instance sive. The number of subsets Ď Pos ztu for a consid- and a set of functional dependencies, the information ered position grows exponentially with the table size. content of each cell is calculated and visualized using Hence, even small tables can have a long computation the term of entropy. This information content is directly time. The aforementioned optimizations may possibly be connected with the redundancy of the cell, and each insuficient, especially when the rows and columns to be cell becomes a shade of blue depending on its degree deleted are rare. In a randomized approach, not all subof redundancy, like a heat map. Intuitively, this means sets are considered, but only a fixed number of randomly that depending on the given functional dependencies and chosen ones. Depending on the number of samples, the other tuples in the relation, lost values in a cell can be result can be suficiently accurate with a high probability. restored using these dependencies and the other tuples.

The approach is called the Monte Carlo method and inThe more available tuples or functional dependencies we troduced in [ 6 ]. Building on that method, we derived the have, the higher is the chance to restore a lost value, and entropy computation in our earlier work [ 3 ]. The method thus the higher is the redundancy in this considered cell. holds for general dependencies; thus our derivation also The theoretical framework for the “plaque test” is proapplies for multivalued dependencies.

Let INF p | , q :“ limÑ8 loglo#g represent vcoidnetdenintg[e2n].erTahlliys fwoorraklliknitnrodsduocfecsonthsetrtaeinrmts iinnfroerlmataiotinoanl the information content of position in instance with databases. The computations for the “plaque test” are respect to a set of functional or multivalued dependen- simplified in our earlier work [ 3 ] and implemented only cies with |ù , given a fixed subset Ď Pos ztu of for functional dependencies, and the next obvious step lost values. We define random variables is to extend this implementation for multivalued dependencies. While the theoretical considerations hold for : pPos ztuq Ñ r0, 1s, ÞÑ INF p | , q.

3Example adapted from English Wikipedia site on multivalued depen

dencies, at https://en.wikipedia.org/wiki/Multivalued_dependency, last accessed April 2024.

2http://webdatacommons.org/webtables/index.html#results-2015

Course Book Lecturer Course Book Lecturer ing the “plaque test” for teaching purposes. AHA Silber. . . John. . . 0.9999 0.9968 0.9958 Optimizations for multivalued dependencies. The optiAHA Neder. . . John. . . 0.9986 0.9992 0.9957 mizations from Section 2.3.1 hold in relations with funcAHA Silber. . . Will. . . 0.9992 0.9975 0.9996 tional dependencies, however, cannot be immediately AAHHAA SNielbdeerr...... CWhilrli.s.... . 00..99999979 00..99999839 00..99999948 transferred to multivalued dependencies. Hence, the reAHA Neder. . . Chris. . . 0.9999 0.9996 0.9998 duction of the problem size seems to be more complicated OSO Silber. . . John. . . 1 0.9998 1 with multivalued dependencies involved. Even if only OSO Silber. . . Will. . . 1 0.9998 1 a part of the attributes appears in a multivalued depenSQL Introd. . . Raym. . . 1 1 1 dency, the value of all attributes can be relevant for its (a) The relational input data. (b) Entropies for (‹‹). validation. This reduces the possibility to delete columns in the original relation. Hence, finding optimizations Figure 4: Plaque test for the given relation with multivalued holding also for multivalued dependencies, which are dependencies from (‹‹). Hues correspond to entropy values. backed by formal proofs, requires theoretical research.

Scalability. Even with the prior simplifications, there are many scenarios where the entropy computation is

Figure 4 shows the plaque test, with these two mul- practically out of reach. The Monte Carlo method avoids tivalued dependencies as input. It contains the relation exponential runtime. For functional dependencies, 150 tuin Figure 4a and a table with entropies and colored cells, ples are manageable. One goal is to execute computations analogously to entropies for functional dependencies, in for even larger datasets, possible approaches to address Figure 4b. While full entropies are exact, the other values this are parallelization and dynamic programming. are rounded to four digits after the dot. Due to many entropies, caused by multivalued depen

The resulting table shows small redundancies com- dencies, close to 1, false positive entropies of 1 are likely pared to those caused by functional dependencies in the when applying the Monte Carlo method only. Without introductory example. This indicates that the subsets of an optimization to find cells with full information content lost values such that a value in a considered position can immediately, these false positives cannot be identified. be restored using the given multivalued dependencies Hence, finding a combination of optimizations and apand the rest of the relation instance are very rare. proximation, as for functional dependencies but adapted

Applying the Monte Carlo method can lead to false to multivalued dependencies, is another challenge. positive cases of full information content, most probably User study. We are planning to use the “plaque test” if the chosen number of iterations is small. This error for an experiment involving students. Our assumption is appears when there is no randomly chosen subset such that a visualization of redundancies can be helpful in findthat the value in a considered position can be restored. ing normalization approaches. The question is whether This consideration and the small computed redundancies students using such a visualization tool can easily suspect indicate that the Monte Carlo method in instances with ways for a suitable decomposition of a schema containing multivalued dependencies should be applied with a better several redundancies and the user study shall give us an accuracy than with functional dependencies. idea if the “plaque” is helpful. If the results of the study

Finding reasons for the small diferences in the en- confirm our assumption, we see potential for the “plaque tropies over the relation can be a challenge. Another test” as a tool to support the teaching of normal forms. opportunity for future work will be to find optimizations The experimental schemas shall contain functional as analogously to the case of functional dependencies. It well as multivalued dependencies. The results can be involves finding a necessary and suficient condition for helpful to understand the interpretability of the plaque. cells with full information content, so that false positive Summarizing all the possible challenges, a combination entropies of 1 can be excluded. It also contains finding of theory, algorithmic, and user study requires skills in conditions for reducing the problem size to make bigger diferent disciplines and allows us to view the problem problems practically computable. of redundancies from several perspectives.

5. Open Research Challenges Acknowledgments

We see several future challenges in our visualization ap- I thank Stefanie Scherzinger, Jens Zumbrägel, and Stefan proach of redundancies in relational data. We discuss Klessinger for their valuable discussions, suggestions, the problems of optimizing the computation efort for and support on this topic. constraint sets with multivalued dependencies, the scalability of the computation when applying the Monte Carlo method and a user study to elaborate the potential of us

[1]

Kruse ,

Hahn ,

Walter ,

Naumann , Metacrate: Organize and Analyze Millions of Data Profiles , in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management , 2017 .

[2]

Arenas ,

Libkin , An Information-theoretic Approach to Normal Forms for Relational and XML Data, in: Proceedings of the Twenty-Second ACM SIGMOD-SIGACT- SIGART Symposium on Principles of Database Systems , 2003 .

[3]

Köhnen ,

Klessinger ,

Zumbrägel ,

Scherzinger , A Plaque Test for Redundancies in Relational Data , in: Joint Proceedings of Workshops at the 49th International Conference on Very Large Data Bases , 2023 .

[4]

Abiteboul ,

Hull ,

Vianu , Foundations of Databases, Addison-Wesley, 1995 .

[5]

Lee , An Information-Theoretic Analysis of Relational Databases-Part I: Data Dependencies and Information Metric , IEEE Transactions on Software Engineering SE-13 ( 1987 ) 1049 - 1061 .

[6]

Mitzenmacher , E. Upfal, Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis , Cambridge University Press, 2017 .