Visualizing Dependencies during Incremental Discovery Processes Bernardo Breve Loredana Caruccio Stefano Cirillo Department of Computer Science Department of Computer Science Department of Computer Science University of Salerno University of Salerno University of Salerno Fisciano (SA), Italy Fisciano (SA), Italy Fisciano (SA), Italy bbreve@unisa.it lcaruccio@unisa.it scirillo@unisa.it Vincenzo Deufemia Giuseppe Polese Department of Computer Science Department of Computer Science University of Salerno University of Salerno Fisciano (SA), Italy Fisciano (SA), Italy deufemia@unisa.it gpolese@unisa.it ABSTRACT attributes, which might be exploited in several advanced data- Functional dependencies (fds), and their extensions relaxed func- base operations, such as query optimization, data cleansing, and tional dependencies (rfds), represent an important semantic so forth. In particular, rfds relax some constraints of canonical property of data. They have been widely used over the years for fds by admitting the possibility for a functional dependency to several advanced database operations. Thanks to the availability hold on a subset of data (also referred to as rfds relaxing on the of discovery algorithms for inferring them from data, in the last extent), and/or by relying on approximate paradigms to compare years (relaxed) fds have been exploited in many new application pairs of tuples (also referred to as rfds relaxing on the attribute contexts, including data cleansing and query relaxation. One of comparison) [3]. the main problems in this context is the possible “big” number fds were traditionally specified at design time, based on the of rfds that might hold on a given dataset, which might make it semantics of attributes, and could only change upon schema difficult for a user getting insights from them. On the other hand, evolution operations [7]. rfds can also be specified at design time, one of the main challenges that has recently arisen is the possi- but due to the thresholds they require this would be a complex bility of monitoring how dependencies change during discovery task. Moreover, since they are mainly used in the big data context, processes run over data streams, also known as continuous dis- it is desirable to have means to automatically discover them from covery processes. To this end, in this paper we present a tool for data, so as to be able to capture their evolutions upon changes to visualizing the evolution of discovered rfds during continuous the database instances [2][16]. discovery processes. It permits to analyze detailed results for dif- Although the problem of discovering fds and rfds from data ferent types of rfds, and uses quantitative measures to monitor is extremely complex, the recent definition of efficient algorithms how discovery results evolve. Finally, in order to facilitate the enabled their discovery from “big” data collections. Among these, analysis of results in long discovery processes, the tool enables it is worth to mention [20, 21, 26] for fd discovery, and [4, 6, the comparison among rfds holding in different time-slots. The 15, 25] for rfd discovery. Moreover, recent proposals dealt with effectiveness of the proposed tool has been evaluated in a case incremental or continuous data profiling scenarios [2, 22]. We study focused on dependencies discovered from streams of data particularly focus on such kind of scenarios, where the goal is associated to the tweets posted over the Twitter social network. to get holding fds/rfds even when the input data dynamically change over time, permitting the discovery of rfds also from data KEYWORDS streams. Nevertheless, in the context of continuous profiling, the possibly huge quantity of holding rfds at each point in time yields relaxed functional dependencies, visual analytics, continuous the necessity to graphically visualize them. Indeed, a proper discovery, data profiling analysis of how rfds change over time cannot be accomplished by looking at such a huge number of holding rfds over millions 1 INTRODUCTION of time instants. To the best of our knowledge, the literature does not offer solutions for mastering such a problem. Metadata have been recognized as a fundamental mean to assess In this paper, we present devis (DEpendency VISualizer), a the quality and integrity of data. They range from basic statis- tool for analyzing and comparing rfd discovery results. It permits tics on domain distributions and cardinalities to more complex to analyze the set of rfds extracted from data streams and their properties of data, such as data dependencies [18]. The necessity evolution over time. Moreover, devis enables the user to i) visual- to collect metadata from big data collections is the goal of data ize the trend related to the number of holding rfds over time, ii) profiling approaches. Among the different types of data profil- compare the rfds holding at different time-slots, and iii) interact ing tasks, in this paper we focus on the discovery of functional with the discovery results by hiding details and/or attributes, and dependencies (fds), and their extensions relaxed functional de- by filtering them according to a specific Right-Hand-Side (RHS) pendencies (rfds). They describe relationships among database attribute. devis and its visual components can be used over all the available incremental rfd discovery algorithms, also thanks © 2020 Copyright held by the owner/author(s). Published in the Workshop Proceed- ings of the EDBT/ICDT 2020 Joint Conference, (March 30-April 2, 2020 Copenhagen, to a parsing module enabling the standardization of discovery Denmark) on CEUR-WS.org . results. Distribution of this paper is permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0) . The paper is organized as follows. Section 2 describes related (X implies Y ), such that, given an instance r of R, X → Y is visualization approaches and tools. Section 3 presents definitions satisfied in r if and only if for every pair of tuples (t 1 , t 2 ) in r , and theoretical foundations of fds and rfds. Section 4 presents whenever t 1 [X ] = t 2 [X ], then t 1 [Y ] = t 2 [Y ]. X and Y represent devis, whereas Section 5 reports the experimental case study on the Left-Hand-Side (LHS) and Right-Hand-Side (RHS) of the fd, streams of tweets. Finally, summary and concluding remarks are respectively. included in Section 6. Starting from the canonical definition of fd over 35 extended definitions have been provided in the literature, which have been 2 RELATED WORK generalized under the concept of Relaxed Functional Dependency (rfd) [3]. In particular, rfds enable the consideration of two main The availability of efficient rfd (or fd) discovery algorithms relaxation criteria. The first one generalizes the equality con- yields the necessity to manage big result sets of rfds, most of straint used into the fd definition. It requires that the projections which differing only for some values of relaxation parameters. of two tuples over a subset of attributes are compared by means However, only recently such algorithms are becoming capable to of the equality function. Instead, rfds consider a constraint as a scale over big data sources, but there are no many solutions in the predicate evaluating whether the distance or similarity between literature for handling the complexity related to the visualization two values of an attribute satisfies a given threshold. Thus, it of possibly huge numbers of discovered rfds. Among these, one involves a similarity or difference function, such as the Jaro or of the most effective platforms for data profiling is the Metanome the Edit distance [11], whose results are verified according to a project [19], which embeds several algorithms to automatically comparison operator and a threshold. discover complex metadata, including functional and inclusion The second relaxation criterion, defined as relaxation on the dependencies. Moreover, it embeds various result management extent, admits the possibility that the dependency might hold for techniques, such as list-based ranking techniques, and interactive a subset rather than all the tuples. In particular, the satisfiability diagrams of discovery results. To this end, a representative scal- degree of an rfd can be formally specified by means of condi- able platform for analyzing data profiles is Metacrate [14], which tions restricting the application domain of the rfd or through a permits the storage of different meta-data and their integrations, coverage measure. The latter measures the minimum number or enabling users to perform several ad-hoc analysis. In this context, percentage of tuples on which the rfd must hold. Examples of the first proposal for visualizing large sets of rfds is described widely used coverage measures are the confidence and д3-error in [5]. It presents several metaphors for representing rfds at [13]. different levels of detail. Starting from a high-level visualization rfd definition. Consider a relational database schema R, and of attribute correlations, details are interactively revealed, also a relation schema R = (A1, . . . , Am ) of R. An rfd φ on R is including details on the relaxation criteria. denoted by A related research context is represented by visual data min- Ψ ≤ε ing. In the literature, there are many approaches and tools that Dc : X Φ1 −−−−→ YΦ2 (1) aim to improve the understanding of data mining algorithms where and results (see [10] for a survey). Among these, it is worth to m • Dc = t ∈ dom(R) | c i (t[Ai ])}  mention Association Rules (ARs) visualization approaches, since Ó the concept of AR is somehow related to that of rfd. Effective i=1 with c = (c 1, . . . , cm ), and each c i is a predicate on dom(Ai ) examples are the tool in [24], which provides multiple views filtering the tuples on which φ applies; to visually inspect the overall set of ARs; and the hierarchical • X = B 1, . . . , Bh and Y = C 1, . . . , Ck , with X, Y ⊆ attr (R) matrix-based visualization technique presented in [8]. and X ∩ Y = ∅; Although all of these approaches represent effective tools to • Φ1 = ϕ i [Bi ] (Φ2 = ϕ j [C j ], resp.), where ϕ i (ϕ j , Ó Ó visualize and explore properties and metadata after the execution B i ∈X C j ∈Y of mining/discovery algorithms, our proposal goes beyond the resp.) is a conjunction of predicates on Ci (C j , resp.) with only result visualization problem, since it allows users to explore i = 1, . . . , h (j = 1, . . . , k, resp.). For any pair of tuples (t 1 , how rfds change over time, and to perform result comparisons t 2 )∈ dom(R), the constraint Φ1 (Φ2 , resp.) is true if t 1 [Bi ] among different time-slots. This meets the recent need of re- and t 2 [Bi ] (t 1 [C j ] and t 2 [C j ], resp.) satisfy the constraint searchers to focus on more complex scenarios and issues, such ϕ i (ϕ j , resp.) ∀ i ∈ [1, h] (j ∈ [1, k], resp.). as the possibility to explore how data change [1] and to perform • Ψ is a coverage measure defined on dom(R), quantify- continuous profiling [18]. For these reasons, even if in the liter- ing the amount of tuples violating or satisfying φ. It can ature several time-related visualization approaches/tools have be defined as a function Ψ : dom(X ) × dom(Y ) → R+ , been proposed [9, 12, 17, 23], they focus on different application where dom(X ) is the cartesian product of the domains of scenarios. Thus, to the best of our knowledge devis is the first attributes composing X . proposal that enables the exploration and comparison of rfd • ε is a threshold indicating the upper bound (or lower bound discovery results in dynamic scenarios. in case the comparison operator is ≥) for the result of the coverage measure. 3 PRELIMINARES Given r ⊆ Dc a relation instance on R, r satisfies the rfd φ, Before describing the proposed tool we will review the definition denoted by r |= φ, if and only if: ∀ t 1, t 2 ∈ r , if Φ1 indicates of fd, and the general definition of rfd. true, then almost always Φ2 indicates true. Here, almost always fd definition. Let R be a relational database schema defined is expressed by the constraint Ψ ≤ ε. A more general definition over a set of attributes attr (R), and r an instance of R, then we covering more types of rfds has been provided in [3]. use {A, B, C, . . . } to denote single attributes in attr (R), {X, Y,W , . . .} In the following we consider fds, rfds relaxing on the tuple sets of attributes in attr (R), t[A] and t[X ] the projection of t comparison only, rfds relaxing only on the extent through a onto A and X , respectively. An fd over R is a statement X → Y coverage measure, and a hybrid version of them, with the general acronym rfd. In fact, they can be described according to the equation (1). In this study, we do not consider rfds relaxing on the extent through a condition of domain values. For rfds relaxing on the tuple comparison only, when no cou- ple of tuples yields an rfd violation, the expression Ψ(X, Y ) = 0 is omitted from the rfd expression. Moreover, for sake of simplic- ity in what follows we always consider the д3-error as coverage measure, the operator ≤ for tuple comparison, and several dis- tance functions, such as the edit distance for textual values, the absolute difference for numerical values, and so forth. Finally, without loss of generality, we can consider rfds with a single value on the RHS. As an example, in a database of tweets, it is likely to have the same number of followed accounts (#Friends) for accounts with the same name and location. Thus, an fd {Name, Loc}→ #Friends might hold. However, this property might also hold for names and locations stored using different abbreviations and/or similar Figure 1: The system architecture. numbers of friends, hence the following rfd might hold: {Name ≤2, Loc ≤4 } → − #Friends ≤10 Moreover, since accounts might change location during their account’s life, or there might be changes in the number of friends, the previous rfd should tolerate possible exceptions. This can be modeled by introducing a different coverage measure into the rfd: ψ (N ame,Loc,#F r iends)≤0.03 Figure 2: Examples of rfds processed by the parser mod- {Name ≤2, Loc ≤4 } −−−−−−−−−−−−−−−−−−−−−−−−−−→ #Friends ≤10 ule. Among all rfds holding on a given relation database schema R, only a subset of them can be considered as the most mean- each point in time can be significantly big, 2) the presence of ingful ones. In fact, in the context of fds, Armstrong’s inference several visualization components could require frequent updates rules have been theoretically defined in order to derive the mini- in short time, and 3) discovery algorithms rely on different im- mal set of fds. The latter represents the set of fds from which plementation technologies. To this end, devis has been designed all valid ones can be derived. Into the context of rfds, an rfd ψ (X ,A)≤ε aiming to enable users monitor results during the execution of X Φ1 −−−−−−−−−→ Aϕ2 is said to be minimal if and only if 1) it is rfd discovery algorithms through a responsive visual interface. non-trivial, i.e. no attributes are shared between the LHS and Moreover, it is based on a client-server architecture, whereas each RHS; 2) it has the minimum possible number of LHS attributes; of its modules is standalone and shares information with other 3) it has LHS attributes with maximum possible threshold values; modules by using the JSON standard. High component modular- and 4) it has the RHS attribute with minimum possible threshold ity and maintainability allow for the substitution of any back-end value. component, provided that its output is formatted according to Algorithms for discovering rfds typically return the set of the JSON standard defined for the interaction. minimal rfds satisfied by the database instance provided in the Figure 1 shows the architecture of devis. In particular, the input. However, as said before, the number of holding rfds can be client is a web application based on the model-view-controller huge, especially when relaxation criteria settings increase. This (MVC) architecture that communicates with the back-end mod- prevents a proper analysis of rfds, and mostly their evolution ules through live queries. The back-end server is a long-lived over time. Aiming to solve this problem, we propose devis, which Node.js application running distributed jobs while keeping a live is described in the next section. connection with the database. Back-end modules receive data from live queries, process and represent them on the devis’s 4 A TOOL FOR ANALYZING AND dashboard. COMPARING RFD DISCOVERY RESULTS Although the architecture is flexible, to make devis compatible In this section, we describe devis. It enables monitoring of rfd with most fd and rfd discovery algorithms, it is necessary to use discovery results over time. In particular, we first present the a parsing module that uniforms the syntax of the dependencies system architecture (Section 4.1), and then provide details on regardless of possible thresholds. Figure 2 shows an example i) the visual interface (Section 4.2), ii) the feature enabling the of a parsing operation. In particular, it shows three rfds on its comparison between two different time-slots (Section 4.3), and left-bottom part. The first one is an fd, and the other two are iii) the interactions that a user can perform (Section 4.4). rfds relaxing on the attribute comparison, but with different thresholds. The module receives the dependencies, manipulates 4.1 System architecture their syntax, and extracts a compact version so as to store it in the real-time database RethinkDB1 . The latter guarantees high Visualizing rfd discovery results during the execution of incre- scalability, REST API, and real-time monitoring for the storage mental algorithms is a quite complex problem. There are several operations. It allowed us to build multi-language connection issues related to discovery processes that lead to specific choices for the system architecture: 1) the amount of rfds processed at 1 https://rethinkdb.com modules to integrate different discovery algorithms into devis. As column labeled “RHS” informs the user about the RHS attribute said before, devis focuses on continuous data profiling algorithms of the dependency. Instead, the other columns display details on [18] and therefore it requires to maximize fluidity and to minimize all the attributes that can appear in a dependency. processing times within the visual interface. Thus, all selected We provided devis with statistical counters displaying the technologies for both client- and server-side support real-time number of rfds that are valid, invalid and minimal, on a given updating of data. time instant (Figure 3(a)). Furthermore, at the bottom of the interface, the number of minimal dependencies are grouped by 4.2 rfds visualization each RHS attribute appearing in the discovered rfds (Figure 3(b)). Systems for data relationship visualization vary in what they dis- play and how users interact with them. However, most of these 4.3 Comparing dependencies over time data visualization systems are not intended for the dynamic elab- An important feature of devis is the possibility to monitor the oration of data, restricting themselves to a static representation variation of the dependency status in two different time intervals. of large sets of data. In the case of rfd discovery algorithms, a This process has no impact on the discovery algorithm, which static representation only allows for the visualization of results. continues to provide new status variations while the user is However, in the context of continuous data profiling it provides focused on the analysis. much more useful insights studying how dependencies evolve A comparison between discovery results at different time in- over time. tervals by processing a dedicated button placed in the side menu, The dynamic representation of a large portion of data requires which duplicates both the line-plot and the dependency table. the application of interactive graphs, capable of highlighting the Both the line plots and the dependency tables are continuously arrival of new information without losing track of pre-existent updated, but they react differently to the interaction of the user. information. Hence, a dynamic visual representation has been de- Figure 5 shows the disposition of the line plots. By interacting fined and implemented through a line plot, showing the variation with the bottom line-plots, the user can select the time intervals on the number of dependencies being discovered or invalidated. to be analyzed, and visually compare the dependencies through Moreover, a dependency table has been used to highlight further the two associated dependency tables. details concerning discovered dependencies. Figure 4 shows the line plot, which is responsible to display information about the number of valid dependencies discovered 4.4 Interaction in depth over time. The line plot block is divided into two sections, the As mentioned above, users interact with the interface through upper one is the main line plot, showing the number of valid the bottom line plot(s) by selecting time intervals. To this end, the rfds on its y-axis and the time-stamp on its x-axis; the line plot user places the mouse pointer in correspondence with the start- at the bottom of the figure is a replica of the upper one, but it ing point of the interval and drags it by holding the left mouse enables the interaction with the user, allowing him/her to select button up to the end point of the time interval. The selection is an interval of time through a brush in order to visualize details on highlighted with a grey rectangle on the bottom line plot. During how the number of discovered rfds changed during the selected the process the top line plot reacts consequently, reducing the period. Both these line plots are updated, so that users can see scale on the x-axis in order to adapt the range boundaries spec- how the trend changes at any time instant, as the rfd discovery ified by the user. Nevertheless, the discovery process is still in process progresses. progress, and the user can verify this through the UI when s/he The dependency table displayed below the line plot is struc- stops the brushing process by pressing the left mouse button at tured so that each row represents an rfd, whereas each column any point of the bottom line plot. represents an attribute of the dataset. More specifically, the first Concerning the dependency table, Figure 6 highlights all the interaction functionalities offered by devis. In order to allow the user to reduce the table content, we added two different types of filters: a search text field (Figure 6, yellow rectangle) allowing for a global search over the attributes in the table, and a column- based filter (Figure 6, green rectangle). Based on the text the user inserts in the search text field, the table content is adapted (a) Dependency validation counters. showing only the rows containing that specific value. If the user writes a value in the column-based filter, the table content shows only the rows having that value for the attribute corresponding to the specific column. Instead, if more column filters get compiled, only the rows having those values for all corresponding columns are shown. We planned a different behavior in the case the user decides to filter out the dependency on the table typing a value for the column “RHS”. Indeed, for this particular column only, as the user provides information about an attribute name, the query also affects the top line plot, presenting a new blue line showing information about how the number of dependencies having that attribute on their RHS has changed (Figure 7). (b) Minimal dependencies on single RHS attributes. The dependencies are sorted by default in a way that the most recently altered ones get added on top of the table. However, users Figure 3: Statistical counter included into devis. can also define column-based sorting criteria by clicking on the Figure 4: Line plot visualization. Figure 5: Comparing dependencies between two different time-slots. Figure 6: Interacting with the dependency table. chosen column and selecting either an ascending or descending be displayed in a single table page, preventing devis’s UI to be order. overfilled with too many elements. Due to the conspicuous amount of dependencies that might Finally, the buttons below the table allow the user to decide be discovered, we also added a paging functionality (Figure 6, red which column to hide or show in the table (Figure 6, orange rectangle), allowing the user to decide how many rows should rectangle). Pressing on the “x” symbol hides the column and Figure 7: Interacting with the line plot. Initial End #Analyzed #Minimal Ñ A B A∩B ID #Tweet Time Time #Validations fds fds 150 175 90 C A∩C B∩C A∩B∩C A 1347 11:42 16:51 967077 3731 150 172 87 114 76 B 1450 19:00 23:59 876810 3600 175 C 1053 11:40 17:05 604652 3503 172 D A∩D B∩D A∩B∩D D 1150 19:38 23:47 472650 2885 159 159 117 85 81 Table 1: Statistics on the different evaluation sessions. C∩D A∩C∩D B∩C∩D A∩B∩C∩D 79 73 70 68 Table 2: Statistics about the minimal fds obtained from the tweet streams of the two experimental sessions. makes the corresponding rectangle turn to black. Another click on the “x” symbol reverses this process. 5 ANALYZING FDS FROM TWITTER algorithm and the number of distinct fds involved in the total STREAMS: A CASE STUDY session time. Last column reports the number of minimal fds In order to verify the effectiveness of devis on a real-world sce- discovered at the end of the given session. We can notice that nario we analyzed discovery results on the data stream of Twitter. at the end the number of minimal fds are quite similar, even We selected this scenario for its stressful data load, and the possi- if the number of performed validations is not necessarily close. bility to monitor fd and rfd discovery algorithms for long time. This is due to the fact that we had no influence on tweets caught In particular, we monitored the results of the incremental fd on the stream, and therefore each session may yield different discovery algorithm described in [2], extended to discover fds in validation/invalidation processes. data streams. To get some knowledge on the fd validation trend, devis en- More specifically, we selected 11 attributes concerning tweet’s abled us to analyze line plots. As an example, Figure 8 shows the account details from the items of Twitter, avoiding all the key at- line plots of two executions, both performed in the morning. As tributes, images and so on, (i.e., Name, Email, Description, #Follow- expected, the two plots present a high variation at the beginning, ers, #FriendsCount, CreatedAt, #Favourites, TimeZone, Lang, #Sta- and then tend to be stable. However, by comparing the two plots tuses, #Listed), and filtered tweets by considering only those con- we can notice that the two trends present several differences, taining specific keywords (“2020”, “newyear”, “leapyear”, “happy”). especially for the first hour of execution. Tests have been performed by considering 4 execution sessions of As said above, the final number of minimal fds discovered the algorithm, totally lasting about 4 hours, during which devis in the different sessions is quite similar. Thus, devis allowed continuously updated the set of discovered dependencies as new us to compare the holding and minimal fds at the end of the tweets were read. single execution. Quantitative comparison results between all Table 1 shows some statistics concerning the performed eval- considered sessions are reported in Table 2, which show that the uation, such as the initial and the end-time of each session, the different sessions shared many fds. This result is not obvious, number of the tweets caught in the given period, and some de- since in most cases the tweets of the sessions are created by tails on the number of validations performed by the fd discovery different accounts. (a) Session A. (b) Session B. Figure 8: Line plots of two execution sessions. Some examples of minimal fds shared between the two consid- different categories, depending on the application context in ered sessions are shown in Table 3. The first fd is quite intuitive, which rfds are exploited. So far rfds have been mainly destinated whereas the other ones require some explanation. As an example, at data engineers and scientists. The aim of such interviews will the second fd states that given two tweets, if they are created by also be to identify new filtering criteria for comparing rfds, and the same account (Name) and the number of followers is the same to embed them within devis. Finally, based on the results of (#Followers), then the number of the account’s friends does not users interviews, we would like to design new visual metaphors change (#Friends). This might be due to the fact that in the con- to graphically represent the general evolution of holding rfds, sidered period of time the users focused on writing tweets (this so as to better support the identified analysis tasks. is the baseline of our scenario) without following new accounts. The breadth of the time interval also affected the other three Examples of fds fds. In particular, by comparing the last two fds we can deduce TimeZone → Lang that the number of likes (#Favourites) changes more frequently Name, #Statutes → #Friends wrt. the number of lists to which the account owner is registered Descr, Lang, #Statuses → CreatedAt (#Listed). In fact, #Favourites required another attribute #Lang on Descr, #Followers, CreatedAt → #Listed the LHS to be determined. Descr, #Followers, Lang, CreatedAt → #Favourites From these experiments of Twitter data devis allowed us to Table 3: Some meaningful fds shared into the sessions. figure out the monitored characteristics of different accounts obey to similar fds. This was an interesting insight, because such findings might be obvious for fake accounts, but not for the genuine accounts we used in the experiments. REFERENCES [1] Tobias Bleifuß, Leon Bornemann, Theodore Johnson, Dmitri V Kalashnikov, Felix Naumann, and Divesh Srivastava. 2018. Exploring change: a new di- 6 CONCLUSION AND VISION mension of data analytics. Proceedings of the VLDB Endowment 12, 2 (2018), 85–98. Continuous profiling permits to consider extremely complex sce- [2] Loredana Caruccio, Stefano Cirillo, Vincenzo Deufemia, and Giuseppe Polese. narios, where data are dynamically produced. In particular, each 2019. Incremental Discovery of Functional Dependencies with a Bit-vector Algorithm. In Proceedings of the 27th Italian Symposium on Advanced Database modification to data might produce the evolution of many fds Systems. and/or rfds. In some cases, due to both the quantity of holding [3] Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2016. Relaxed fds and rfds and the high-frequency of updates, it is unthinkable Functional Dependencies – A Survey of Approaches. IEEE Transactions on Knowledge and Data Engineering 28, 1 (2016), 147–165. to analyze the minimal fds and rfds for each time instant. Thus, [4] Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2019. Mining to facilitate the interpretation of discovered dependencies over relaxed functional dependencies from data. Data Mining and Knowledge time, we have proposed the tool devis, which relies on several Discovery (2019), To appear. [5] Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2019. Visual- visual components to let users actively analyze and explore the ization of (multimedia) dependencies from big data. Multimedia Tools and evolution of holding rfds discovered through incremental fd or Applications 78, 23 (2019), 33151–33167. [6] Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2020. Discover- rfd discovery algorithms. ing Relaxed Functional Dependencies based on Multi-attribute Dominance. In the future, we would like to investigate the analysis tasks IEEE Transactions on Knowledge and Data Engineering (2020), To appear. involved during the incremental discovery of rfds by conducting [7] Loredana Caruccio, Giuseppe Polese, and Genoveffa Tortora. 2016. Synchro- nization of queries and views upon schema evolutions: A survey. ACM Trans- interviews to potential users. The latter can belong to many actions on Database Systems (TODS) 41, 2 (2016), 1–41. [8] Wei Chen, Cong Xie, Pingping Shang, and Qunsheng Peng. 2017. Visual analysis of user-driven association rule mining. Journal of Visual Languages & Computing 42 (2017), 76–85. [9] Luca Corcella, Marco Manca, Fabio Paternò, and Carmen Santoro. 2018. A Visual Tool for Analysing IoT Trigger/Action Programming. In International Conference on Human-Centred Software Engineering. Springer, 189–206. [10] MC Ferreira De Oliveira and Haim Levkowitz. 2003. From visual data explo- ration to visual data mining: a survey. IEEE Transactions on Visualization and Computer Graphics 9, 3 (2003), 378–394. [11] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. 2007. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering 19, 1 (2007), 1–16. [12] Daniel A Keim, Jörn Schneidewind, and Mike Sips. 2004. CircleView: a new ap- proach for visualizing time-related multidimensional data sets. In Proceedings of the working conference on Advanced visual interfaces. ACM, 179–182. [13] Jyrki Kivinen and Heikki Mannila. 1995. Approximate inference of functional dependencies from relations. Theoretical Computer Science 149, 1 (1995), 129– 149. [14] Sebastian Kruse, David Hahn, Marius Walter, and Felix Naumann. 2017. Metacrate: Organize and analyze millions of data profiles. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 2483–2486. [15] Sebastian Kruse and Felix Naumann. 2018. Efficient discovery of approximate dependencies. Proceedings of the VLDB Endowment 11, 7 (2018), 759–772. [16] Chien-Min Lin, Yu-Lung Hsieh, Kuo-Cheng Yin, Ming-Chuan Hung, and Don- Lin Yang. 2013. ADMiner: An Incremental Data Mining Approach Using a Compressed FP-tree. JSW 8, 8 (2013), 2095–2103. [17] Adam Marcus, Michael S Bernstein, Osama Badar, David R Karger, Samuel Madden, and Robert C Miller. 2012. Processing and visualizing the data in tweets. ACM SIGMOD Record 40, 4 (2012), 21–27. [18] Felix Naumann. 2014. Data profiling revisited. ACM SIGMOD Record 42, 4 (2014), 40–49. [19] Thorsten Papenbrock, Tanja Bergmann, Moritz Finke, Jakob Zwiener, and Felix Naumann. 2015. Data profiling with Metanome. Proceedings of the VLDB Endowment 8, 12 (2015), 1860–1863. [20] Thorsten Papenbrock and Felix Naumann. 2016. A hybrid approach to func- tional dependency discovery. In Proceedings of the 2016 International Conference on Management of Data. ACM, 821–833. [21] Hemant Saxena, Lukasz Golab, and Ihab F Ilyas. 2019. Distributed Discovery of Functional Dependencies. In IEEE 35th International Conference on Data Engineering (ICDE ’19). IEEE, 1590–1593. [22] Philipp Schirmer, Thorsten Papenbrock, Sebastian Kruse, Dennis Hempfing, Torben Meyer, Daniel Neuschäfer-Rube, and Felix Naumann. 2019. DynFD: Functional Dependency Discovery in Dynamic Datasets. In Proceedings of the 22nd International Conference on Extending Database Technology (EDBT ’19). 253–264. [23] Vinícius Segura and Simone DJ Barbosa. 2017. Historyviewer: Instrumenting a visual analytics application to support revisiting a session of interactive data analysis. Proceedings of the ACM on Human-Computer Interaction 1, EICS (2017), 11. [24] Yoones A. Sekhavat and Orland Hoeber. 2013. Visualizing Association Rules Using Linked Matrix, Graph, and Detail Views. International Journal of Intelli- gence Science 3 (2013), 34–49. [25] Shaoxu Song and Lei Chen. 2013. Efficient discovery of similarity constraints for matching dependencies. Data & Knowledge Engineering 87 (2013), 146–166. [26] Ziheng Wei and Sebastian Link. 2019. Discovery and ranking of functional dependencies. In IEEE 35th International Conference on Data Engineering (ICDE ’19). IEEE, 1526–1537.