=Paper= {{Paper |id=Vol-2578/BigVis5 |storemode=property |title=Visualizing Dependencies during Incremental Discovery Processes |pdfUrl=https://ceur-ws.org/Vol-2578/BigVis5.pdf |volume=Vol-2578 |authors=Bernardo Breve,Loredana Caruccio,Stefano Cirillo,Vincenzo Deufemia,Giuseppe Polese |dblpUrl=https://dblp.org/rec/conf/edbt/BreveCCDP20 }} ==Visualizing Dependencies during Incremental Discovery Processes== https://ceur-ws.org/Vol-2578/BigVis5.pdf
        Visualizing Dependencies during Incremental Discovery
                              Processes
                Bernardo Breve                                       Loredana Caruccio                               Stefano Cirillo
      Department of Computer Science                          Department of Computer Science                Department of Computer Science
           University of Salerno                                   University of Salerno                         University of Salerno
            Fisciano (SA), Italy                                    Fisciano (SA), Italy                          Fisciano (SA), Italy
             bbreve@unisa.it                                        lcaruccio@unisa.it                             scirillo@unisa.it

                                        Vincenzo Deufemia                                    Giuseppe Polese
                                  Department of Computer Science                     Department of Computer Science
                                       University of Salerno                              University of Salerno
                                        Fisciano (SA), Italy                               Fisciano (SA), Italy
                                        deufemia@unisa.it                                   gpolese@unisa.it

ABSTRACT                                                                             attributes, which might be exploited in several advanced data-
Functional dependencies (fds), and their extensions relaxed func-                    base operations, such as query optimization, data cleansing, and
tional dependencies (rfds), represent an important semantic                          so forth. In particular, rfds relax some constraints of canonical
property of data. They have been widely used over the years for                      fds by admitting the possibility for a functional dependency to
several advanced database operations. Thanks to the availability                     hold on a subset of data (also referred to as rfds relaxing on the
of discovery algorithms for inferring them from data, in the last                    extent), and/or by relying on approximate paradigms to compare
years (relaxed) fds have been exploited in many new application                      pairs of tuples (also referred to as rfds relaxing on the attribute
contexts, including data cleansing and query relaxation. One of                      comparison) [3].
the main problems in this context is the possible “big” number                           fds were traditionally specified at design time, based on the
of rfds that might hold on a given dataset, which might make it                      semantics of attributes, and could only change upon schema
difficult for a user getting insights from them. On the other hand,                  evolution operations [7]. rfds can also be specified at design time,
one of the main challenges that has recently arisen is the possi-                    but due to the thresholds they require this would be a complex
bility of monitoring how dependencies change during discovery                        task. Moreover, since they are mainly used in the big data context,
processes run over data streams, also known as continuous dis-                       it is desirable to have means to automatically discover them from
covery processes. To this end, in this paper we present a tool for                   data, so as to be able to capture their evolutions upon changes to
visualizing the evolution of discovered rfds during continuous                       the database instances [2][16].
discovery processes. It permits to analyze detailed results for dif-                     Although the problem of discovering fds and rfds from data
ferent types of rfds, and uses quantitative measures to monitor                      is extremely complex, the recent definition of efficient algorithms
how discovery results evolve. Finally, in order to facilitate the                    enabled their discovery from “big” data collections. Among these,
analysis of results in long discovery processes, the tool enables                    it is worth to mention [20, 21, 26] for fd discovery, and [4, 6,
the comparison among rfds holding in different time-slots. The                       15, 25] for rfd discovery. Moreover, recent proposals dealt with
effectiveness of the proposed tool has been evaluated in a case                      incremental or continuous data profiling scenarios [2, 22]. We
study focused on dependencies discovered from streams of data                        particularly focus on such kind of scenarios, where the goal is
associated to the tweets posted over the Twitter social network.                     to get holding fds/rfds even when the input data dynamically
                                                                                     change over time, permitting the discovery of rfds also from data
KEYWORDS                                                                             streams. Nevertheless, in the context of continuous profiling, the
                                                                                     possibly huge quantity of holding rfds at each point in time yields
relaxed functional dependencies, visual analytics, continuous
                                                                                     the necessity to graphically visualize them. Indeed, a proper
discovery, data profiling
                                                                                     analysis of how rfds change over time cannot be accomplished
                                                                                     by looking at such a huge number of holding rfds over millions
1    INTRODUCTION                                                                    of time instants. To the best of our knowledge, the literature does
                                                                                     not offer solutions for mastering such a problem.
Metadata have been recognized as a fundamental mean to assess
                                                                                         In this paper, we present devis (DEpendency VISualizer), a
the quality and integrity of data. They range from basic statis-
                                                                                     tool for analyzing and comparing rfd discovery results. It permits
tics on domain distributions and cardinalities to more complex
                                                                                     to analyze the set of rfds extracted from data streams and their
properties of data, such as data dependencies [18]. The necessity
                                                                                     evolution over time. Moreover, devis enables the user to i) visual-
to collect metadata from big data collections is the goal of data
                                                                                     ize the trend related to the number of holding rfds over time, ii)
profiling approaches. Among the different types of data profil-
                                                                                     compare the rfds holding at different time-slots, and iii) interact
ing tasks, in this paper we focus on the discovery of functional
                                                                                     with the discovery results by hiding details and/or attributes, and
dependencies (fds), and their extensions relaxed functional de-
                                                                                     by filtering them according to a specific Right-Hand-Side (RHS)
pendencies (rfds). They describe relationships among database
                                                                                     attribute. devis and its visual components can be used over all
                                                                                     the available incremental rfd discovery algorithms, also thanks
© 2020 Copyright held by the owner/author(s). Published in the Workshop Proceed-
ings of the EDBT/ICDT 2020 Joint Conference, (March 30-April 2, 2020 Copenhagen,     to a parsing module enabling the standardization of discovery
Denmark) on CEUR-WS.org .                                                            results.
Distribution of this paper is permitted under Creative Commons License Attribution
4.0 International (CC BY 4.0) .
   The paper is organized as follows. Section 2 describes related                 (X implies Y ), such that, given an instance r of R, X → Y is
visualization approaches and tools. Section 3 presents definitions                satisfied in r if and only if for every pair of tuples (t 1 , t 2 ) in r ,
and theoretical foundations of fds and rfds. Section 4 presents                   whenever t 1 [X ] = t 2 [X ], then t 1 [Y ] = t 2 [Y ]. X and Y represent
devis, whereas Section 5 reports the experimental case study on                   the Left-Hand-Side (LHS) and Right-Hand-Side (RHS) of the fd,
streams of tweets. Finally, summary and concluding remarks are                    respectively.
included in Section 6.                                                               Starting from the canonical definition of fd over 35 extended
                                                                                  definitions have been provided in the literature, which have been
2    RELATED WORK                                                                 generalized under the concept of Relaxed Functional Dependency
                                                                                  (rfd) [3]. In particular, rfds enable the consideration of two main
The availability of efficient rfd (or fd) discovery algorithms
                                                                                  relaxation criteria. The first one generalizes the equality con-
yields the necessity to manage big result sets of rfds, most of
                                                                                  straint used into the fd definition. It requires that the projections
which differing only for some values of relaxation parameters.
                                                                                  of two tuples over a subset of attributes are compared by means
However, only recently such algorithms are becoming capable to
                                                                                  of the equality function. Instead, rfds consider a constraint as a
scale over big data sources, but there are no many solutions in the
                                                                                  predicate evaluating whether the distance or similarity between
literature for handling the complexity related to the visualization
                                                                                  two values of an attribute satisfies a given threshold. Thus, it
of possibly huge numbers of discovered rfds. Among these, one
                                                                                  involves a similarity or difference function, such as the Jaro or
of the most effective platforms for data profiling is the Metanome
                                                                                  the Edit distance [11], whose results are verified according to a
project [19], which embeds several algorithms to automatically
                                                                                  comparison operator and a threshold.
discover complex metadata, including functional and inclusion
                                                                                     The second relaxation criterion, defined as relaxation on the
dependencies. Moreover, it embeds various result management
                                                                                  extent, admits the possibility that the dependency might hold for
techniques, such as list-based ranking techniques, and interactive
                                                                                  a subset rather than all the tuples. In particular, the satisfiability
diagrams of discovery results. To this end, a representative scal-
                                                                                  degree of an rfd can be formally specified by means of condi-
able platform for analyzing data profiles is Metacrate [14], which
                                                                                  tions restricting the application domain of the rfd or through a
permits the storage of different meta-data and their integrations,
                                                                                  coverage measure. The latter measures the minimum number or
enabling users to perform several ad-hoc analysis. In this context,
                                                                                  percentage of tuples on which the rfd must hold. Examples of
the first proposal for visualizing large sets of rfds is described
                                                                                  widely used coverage measures are the confidence and д3-error
in [5]. It presents several metaphors for representing rfds at
                                                                                  [13].
different levels of detail. Starting from a high-level visualization
                                                                                     rfd definition. Consider a relational database schema R, and
of attribute correlations, details are interactively revealed, also
                                                                                  a relation schema R = (A1, . . . , Am ) of R. An rfd φ on R is
including details on the relaxation criteria.
                                                                                  denoted by
    A related research context is represented by visual data min-
                                                                                                                          Ψ ≤ε
ing. In the literature, there are many approaches and tools that                                             Dc : X Φ1 −−−−→ YΦ2                              (1)
aim to improve the understanding of data mining algorithms
                                                                                  where
and results (see [10] for a survey). Among these, it is worth to                                           m
                                                                                       • Dc = t ∈ dom(R) |   c i (t[Ai ])}
                                                                                             
mention Association Rules (ARs) visualization approaches, since
                                                                                                           Ó
the concept of AR is somehow related to that of rfd. Effective                                                     i=1
                                                                                         with c = (c 1, . . . , cm ), and each c i is a predicate on dom(Ai )
examples are the tool in [24], which provides multiple views
                                                                                         filtering the tuples on which φ applies;
to visually inspect the overall set of ARs; and the hierarchical
                                                                                       • X = B 1, . . . , Bh and Y = C 1, . . . , Ck , with X, Y ⊆ attr (R)
matrix-based visualization technique presented in [8].
                                                                                         and X ∩ Y = ∅;
    Although all of these approaches represent effective tools to
                                                                                       • Φ1 =         ϕ i [Bi ] (Φ2 =         ϕ j [C j ], resp.), where ϕ i (ϕ j ,
                                                                                                 Ó                        Ó
visualize and explore properties and metadata after the execution                               B i ∈X                   C j ∈Y
of mining/discovery algorithms, our proposal goes beyond the                             resp.) is a conjunction of predicates on Ci (C j , resp.) with
only result visualization problem, since it allows users to explore                      i = 1, . . . , h (j = 1, . . . , k, resp.). For any pair of tuples (t 1 ,
how rfds change over time, and to perform result comparisons                             t 2 )∈ dom(R), the constraint Φ1 (Φ2 , resp.) is true if t 1 [Bi ]
among different time-slots. This meets the recent need of re-                            and t 2 [Bi ] (t 1 [C j ] and t 2 [C j ], resp.) satisfy the constraint
searchers to focus on more complex scenarios and issues, such                            ϕ i (ϕ j , resp.) ∀ i ∈ [1, h] (j ∈ [1, k], resp.).
as the possibility to explore how data change [1] and to perform                       • Ψ is a coverage measure defined on dom(R), quantify-
continuous profiling [18]. For these reasons, even if in the liter-                      ing the amount of tuples violating or satisfying φ. It can
ature several time-related visualization approaches/tools have                           be defined as a function Ψ : dom(X ) × dom(Y ) → R+ ,
been proposed [9, 12, 17, 23], they focus on different application                       where dom(X ) is the cartesian product of the domains of
scenarios. Thus, to the best of our knowledge devis is the first                         attributes composing X .
proposal that enables the exploration and comparison of rfd                            • ε is a threshold indicating the upper bound (or lower bound
discovery results in dynamic scenarios.                                                  in case the comparison operator is ≥) for the result of the
                                                                                         coverage measure.
3    PRELIMINARES                                                                    Given r ⊆ Dc a relation instance on R, r satisfies the rfd φ,
Before describing the proposed tool we will review the definition                 denoted by r |= φ, if and only if: ∀ t 1, t 2 ∈ r , if Φ1 indicates
of fd, and the general definition of rfd.                                         true, then almost always Φ2 indicates true. Here, almost always
   fd definition. Let R be a relational database schema defined                   is expressed by the constraint Ψ ≤ ε. A more general definition
over a set of attributes attr (R), and r an instance of R, then we                covering more types of rfds has been provided in [3].
use {A, B, C, . . . } to denote single attributes in attr (R), {X, Y,W , . . .}      In the following we consider fds, rfds relaxing on the tuple
sets of attributes in attr (R), t[A] and t[X ] the projection of t                comparison only, rfds relaxing only on the extent through a
onto A and X , respectively. An fd over R is a statement X → Y                    coverage measure, and a hybrid version of them, with the general
acronym rfd. In fact, they can be described according to the
equation (1). In this study, we do not consider rfds relaxing on
the extent through a condition of domain values.
    For rfds relaxing on the tuple comparison only, when no cou-
ple of tuples yields an rfd violation, the expression Ψ(X, Y ) = 0
is omitted from the rfd expression. Moreover, for sake of simplic-
ity in what follows we always consider the д3-error as coverage
measure, the operator ≤ for tuple comparison, and several dis-
tance functions, such as the edit distance for textual values, the
absolute difference for numerical values, and so forth. Finally,
without loss of generality, we can consider rfds with a single
value on the RHS.
    As an example, in a database of tweets, it is likely to have the
same number of followed accounts (#Friends) for accounts with
the same name and location. Thus, an fd {Name, Loc}→ #Friends
might hold. However, this property might also hold for names
and locations stored using different abbreviations and/or similar                      Figure 1: The system architecture.
numbers of friends, hence the following rfd might hold:
                    {Name ≤2, Loc ≤4 } →
                                       − #Friends ≤10
Moreover, since accounts might change location during their
account’s life, or there might be changes in the number of friends,
the previous rfd should tolerate possible exceptions. This can
be modeled by introducing a different coverage measure into the
rfd:
                         ψ (N ame,Loc,#F r iends)≤0.03                 Figure 2: Examples of rfds processed by the parser mod-
    {Name ≤2, Loc ≤4 } −−−−−−−−−−−−−−−−−−−−−−−−−−→ #Friends ≤10        ule.
   Among all rfds holding on a given relation database schema
R, only a subset of them can be considered as the most mean-
                                                                       each point in time can be significantly big, 2) the presence of
ingful ones. In fact, in the context of fds, Armstrong’s inference
                                                                       several visualization components could require frequent updates
rules have been theoretically defined in order to derive the mini-
                                                                       in short time, and 3) discovery algorithms rely on different im-
mal set of fds. The latter represents the set of fds from which
                                                                       plementation technologies. To this end, devis has been designed
all valid ones can be derived. Into the context of rfds, an rfd
       ψ (X ,A)≤ε                                                      aiming to enable users monitor results during the execution of
X Φ1 −−−−−−−−−→ Aϕ2 is said to be minimal if and only if 1) it is      rfd discovery algorithms through a responsive visual interface.
non-trivial, i.e. no attributes are shared between the LHS and         Moreover, it is based on a client-server architecture, whereas each
RHS; 2) it has the minimum possible number of LHS attributes;          of its modules is standalone and shares information with other
3) it has LHS attributes with maximum possible threshold values;       modules by using the JSON standard. High component modular-
and 4) it has the RHS attribute with minimum possible threshold        ity and maintainability allow for the substitution of any back-end
value.                                                                 component, provided that its output is formatted according to
    Algorithms for discovering rfds typically return the set of        the JSON standard defined for the interaction.
minimal rfds satisfied by the database instance provided in the           Figure 1 shows the architecture of devis. In particular, the
input. However, as said before, the number of holding rfds can be      client is a web application based on the model-view-controller
huge, especially when relaxation criteria settings increase. This      (MVC) architecture that communicates with the back-end mod-
prevents a proper analysis of rfds, and mostly their evolution         ules through live queries. The back-end server is a long-lived
over time. Aiming to solve this problem, we propose devis, which       Node.js application running distributed jobs while keeping a live
is described in the next section.                                      connection with the database. Back-end modules receive data
                                                                       from live queries, process and represent them on the devis’s
4     A TOOL FOR ANALYZING AND                                         dashboard.
      COMPARING RFD DISCOVERY RESULTS                                     Although the architecture is flexible, to make devis compatible
In this section, we describe devis. It enables monitoring of rfd       with most fd and rfd discovery algorithms, it is necessary to use
discovery results over time. In particular, we first present the       a parsing module that uniforms the syntax of the dependencies
system architecture (Section 4.1), and then provide details on         regardless of possible thresholds. Figure 2 shows an example
i) the visual interface (Section 4.2), ii) the feature enabling the    of a parsing operation. In particular, it shows three rfds on its
comparison between two different time-slots (Section 4.3), and         left-bottom part. The first one is an fd, and the other two are
iii) the interactions that a user can perform (Section 4.4).           rfds relaxing on the attribute comparison, but with different
                                                                       thresholds. The module receives the dependencies, manipulates
4.1     System architecture                                            their syntax, and extracts a compact version so as to store it in
                                                                       the real-time database RethinkDB1 . The latter guarantees high
Visualizing rfd discovery results during the execution of incre-
                                                                       scalability, REST API, and real-time monitoring for the storage
mental algorithms is a quite complex problem. There are several
                                                                       operations. It allowed us to build multi-language connection
issues related to discovery processes that lead to specific choices
for the system architecture: 1) the amount of rfds processed at        1 https://rethinkdb.com
modules to integrate different discovery algorithms into devis. As     column labeled “RHS” informs the user about the RHS attribute
said before, devis focuses on continuous data profiling algorithms     of the dependency. Instead, the other columns display details on
[18] and therefore it requires to maximize fluidity and to minimize    all the attributes that can appear in a dependency.
processing times within the visual interface. Thus, all selected           We provided devis with statistical counters displaying the
technologies for both client- and server-side support real-time        number of rfds that are valid, invalid and minimal, on a given
updating of data.                                                      time instant (Figure 3(a)). Furthermore, at the bottom of the
                                                                       interface, the number of minimal dependencies are grouped by
4.2    rfds visualization                                              each RHS attribute appearing in the discovered rfds (Figure 3(b)).
Systems for data relationship visualization vary in what they dis-
play and how users interact with them. However, most of these          4.3    Comparing dependencies over time
data visualization systems are not intended for the dynamic elab-
                                                                       An important feature of devis is the possibility to monitor the
oration of data, restricting themselves to a static representation
                                                                       variation of the dependency status in two different time intervals.
of large sets of data. In the case of rfd discovery algorithms, a
                                                                       This process has no impact on the discovery algorithm, which
static representation only allows for the visualization of results.
                                                                       continues to provide new status variations while the user is
However, in the context of continuous data profiling it provides
                                                                       focused on the analysis.
much more useful insights studying how dependencies evolve
                                                                          A comparison between discovery results at different time in-
over time.
                                                                       tervals by processing a dedicated button placed in the side menu,
   The dynamic representation of a large portion of data requires
                                                                       which duplicates both the line-plot and the dependency table.
the application of interactive graphs, capable of highlighting the
                                                                       Both the line plots and the dependency tables are continuously
arrival of new information without losing track of pre-existent
                                                                       updated, but they react differently to the interaction of the user.
information. Hence, a dynamic visual representation has been de-
                                                                          Figure 5 shows the disposition of the line plots. By interacting
fined and implemented through a line plot, showing the variation
                                                                       with the bottom line-plots, the user can select the time intervals
on the number of dependencies being discovered or invalidated.
                                                                       to be analyzed, and visually compare the dependencies through
Moreover, a dependency table has been used to highlight further
                                                                       the two associated dependency tables.
details concerning discovered dependencies.
   Figure 4 shows the line plot, which is responsible to display
information about the number of valid dependencies discovered          4.4    Interaction in depth
over time. The line plot block is divided into two sections, the       As mentioned above, users interact with the interface through
upper one is the main line plot, showing the number of valid           the bottom line plot(s) by selecting time intervals. To this end, the
rfds on its y-axis and the time-stamp on its x-axis; the line plot     user places the mouse pointer in correspondence with the start-
at the bottom of the figure is a replica of the upper one, but it      ing point of the interval and drags it by holding the left mouse
enables the interaction with the user, allowing him/her to select      button up to the end point of the time interval. The selection is
an interval of time through a brush in order to visualize details on   highlighted with a grey rectangle on the bottom line plot. During
how the number of discovered rfds changed during the selected          the process the top line plot reacts consequently, reducing the
period. Both these line plots are updated, so that users can see       scale on the x-axis in order to adapt the range boundaries spec-
how the trend changes at any time instant, as the rfd discovery        ified by the user. Nevertheless, the discovery process is still in
process progresses.                                                    progress, and the user can verify this through the UI when s/he
   The dependency table displayed below the line plot is struc-        stops the brushing process by pressing the left mouse button at
tured so that each row represents an rfd, whereas each column          any point of the bottom line plot.
represents an attribute of the dataset. More specifically, the first       Concerning the dependency table, Figure 6 highlights all the
                                                                       interaction functionalities offered by devis. In order to allow the
                                                                       user to reduce the table content, we added two different types
                                                                       of filters: a search text field (Figure 6, yellow rectangle) allowing
                                                                       for a global search over the attributes in the table, and a column-
                                                                       based filter (Figure 6, green rectangle). Based on the text the
                                                                       user inserts in the search text field, the table content is adapted
                   (a) Dependency validation counters.
                                                                       showing only the rows containing that specific value. If the user
                                                                       writes a value in the column-based filter, the table content shows
                                                                       only the rows having that value for the attribute corresponding to
                                                                       the specific column. Instead, if more column filters get compiled,
                                                                       only the rows having those values for all corresponding columns
                                                                       are shown. We planned a different behavior in the case the user
                                                                       decides to filter out the dependency on the table typing a value
                                                                       for the column “RHS”. Indeed, for this particular column only, as
                                                                       the user provides information about an attribute name, the query
                                                                       also affects the top line plot, presenting a new blue line showing
                                                                       information about how the number of dependencies having that
                                                                       attribute on their RHS has changed (Figure 7).
            (b) Minimal dependencies on single RHS attributes.
                                                                           The dependencies are sorted by default in a way that the most
                                                                       recently altered ones get added on top of the table. However, users
      Figure 3: Statistical counter included into devis.               can also define column-based sorting criteria by clicking on the
                                                  Figure 4: Line plot visualization.




                              Figure 5: Comparing dependencies between two different time-slots.




                                         Figure 6: Interacting with the dependency table.


chosen column and selecting either an ascending or descending         be displayed in a single table page, preventing devis’s UI to be
order.                                                                overfilled with too many elements.
   Due to the conspicuous amount of dependencies that might              Finally, the buttons below the table allow the user to decide
be discovered, we also added a paging functionality (Figure 6, red    which column to hide or show in the table (Figure 6, orange
rectangle), allowing the user to decide how many rows should          rectangle). Pressing on the “x” symbol hides the column and
                                                Figure 7: Interacting with the line plot.


              Initial End              #Analyzed #Minimal
                                                                                     Ñ        A        B       A∩B
ID #Tweet
               Time Time
                          #Validations
                                          fds      fds                                       150      175       90
                                                                                     C       A∩C      B∩C     A∩B∩C
A     1347    11:42   16:51      967077         3731         150
                                                                                    172       87      114       76
B     1450    19:00   23:59      876810         3600         175
C     1053    11:40   17:05      604652         3503         172                     D      A∩D       B∩D     A∩B∩D
D     1150    19:38   23:47      472650         2885         159                    159      117       85       81
     Table 1: Statistics on the different evaluation sessions.
                                                                                   C∩D A∩C∩D B∩C∩D A∩B∩C∩D
                                                                                     79       73       70       68
                                                                        Table 2: Statistics about the minimal fds obtained from
                                                                        the tweet streams of the two experimental sessions.
makes the corresponding rectangle turn to black. Another click
on the “x” symbol reverses this process.

5    ANALYZING FDS FROM TWITTER                                         algorithm and the number of distinct fds involved in the total
     STREAMS: A CASE STUDY                                              session time. Last column reports the number of minimal fds
In order to verify the effectiveness of devis on a real-world sce-      discovered at the end of the given session. We can notice that
nario we analyzed discovery results on the data stream of Twitter.      at the end the number of minimal fds are quite similar, even
We selected this scenario for its stressful data load, and the possi-   if the number of performed validations is not necessarily close.
bility to monitor fd and rfd discovery algorithms for long time.        This is due to the fact that we had no influence on tweets caught
In particular, we monitored the results of the incremental fd           on the stream, and therefore each session may yield different
discovery algorithm described in [2], extended to discover fds in       validation/invalidation processes.
data streams.                                                               To get some knowledge on the fd validation trend, devis en-
   More specifically, we selected 11 attributes concerning tweet’s      abled us to analyze line plots. As an example, Figure 8 shows the
account details from the items of Twitter, avoiding all the key at-     line plots of two executions, both performed in the morning. As
tributes, images and so on, (i.e., Name, Email, Description, #Follow-   expected, the two plots present a high variation at the beginning,
ers, #FriendsCount, CreatedAt, #Favourites, TimeZone, Lang, #Sta-       and then tend to be stable. However, by comparing the two plots
tuses, #Listed), and filtered tweets by considering only those con-     we can notice that the two trends present several differences,
taining specific keywords (“2020”, “newyear”, “leapyear”, “happy”).     especially for the first hour of execution.
Tests have been performed by considering 4 execution sessions of            As said above, the final number of minimal fds discovered
the algorithm, totally lasting about 4 hours, during which devis        in the different sessions is quite similar. Thus, devis allowed
continuously updated the set of discovered dependencies as new          us to compare the holding and minimal fds at the end of the
tweets were read.                                                       single execution. Quantitative comparison results between all
   Table 1 shows some statistics concerning the performed eval-         considered sessions are reported in Table 2, which show that the
uation, such as the initial and the end-time of each session, the       different sessions shared many fds. This result is not obvious,
number of the tweets caught in the given period, and some de-           since in most cases the tweets of the sessions are created by
tails on the number of validations performed by the fd discovery        different accounts.
                                                                 (a) Session A.




                                                                 (b) Session B.


                                            Figure 8: Line plots of two execution sessions.


    Some examples of minimal fds shared between the two consid-             different categories, depending on the application context in
ered sessions are shown in Table 3. The first fd is quite intuitive,        which rfds are exploited. So far rfds have been mainly destinated
whereas the other ones require some explanation. As an example,             at data engineers and scientists. The aim of such interviews will
the second fd states that given two tweets, if they are created by          also be to identify new filtering criteria for comparing rfds, and
the same account (Name) and the number of followers is the same             to embed them within devis. Finally, based on the results of
(#Followers), then the number of the account’s friends does not             users interviews, we would like to design new visual metaphors
change (#Friends). This might be due to the fact that in the con-           to graphically represent the general evolution of holding rfds,
sidered period of time the users focused on writing tweets (this            so as to better support the identified analysis tasks.
is the baseline of our scenario) without following new accounts.
The breadth of the time interval also affected the other three                    Examples of fds
fds. In particular, by comparing the last two fds we can deduce                   TimeZone → Lang
that the number of likes (#Favourites) changes more frequently                    Name, #Statutes → #Friends
wrt. the number of lists to which the account owner is registered                 Descr, Lang, #Statuses → CreatedAt
(#Listed). In fact, #Favourites required another attribute #Lang on               Descr, #Followers, CreatedAt → #Listed
the LHS to be determined.                                                         Descr, #Followers, Lang, CreatedAt → #Favourites
    From these experiments of Twitter data devis allowed us to               Table 3: Some meaningful fds shared into the sessions.
figure out the monitored characteristics of different accounts
obey to similar fds. This was an interesting insight, because
such findings might be obvious for fake accounts, but not for the
genuine accounts we used in the experiments.                                REFERENCES
                                                                             [1] Tobias Bleifuß, Leon Bornemann, Theodore Johnson, Dmitri V Kalashnikov,
                                                                                 Felix Naumann, and Divesh Srivastava. 2018. Exploring change: a new di-
6   CONCLUSION AND VISION                                                        mension of data analytics. Proceedings of the VLDB Endowment 12, 2 (2018),
                                                                                 85–98.
Continuous profiling permits to consider extremely complex sce-              [2] Loredana Caruccio, Stefano Cirillo, Vincenzo Deufemia, and Giuseppe Polese.
narios, where data are dynamically produced. In particular, each                 2019. Incremental Discovery of Functional Dependencies with a Bit-vector
                                                                                 Algorithm. In Proceedings of the 27th Italian Symposium on Advanced Database
modification to data might produce the evolution of many fds                     Systems.
and/or rfds. In some cases, due to both the quantity of holding              [3] Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2016. Relaxed
fds and rfds and the high-frequency of updates, it is unthinkable                Functional Dependencies – A Survey of Approaches. IEEE Transactions on
                                                                                 Knowledge and Data Engineering 28, 1 (2016), 147–165.
to analyze the minimal fds and rfds for each time instant. Thus,             [4] Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2019. Mining
to facilitate the interpretation of discovered dependencies over                 relaxed functional dependencies from data. Data Mining and Knowledge
time, we have proposed the tool devis, which relies on several                   Discovery (2019), To appear.
                                                                             [5] Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2019. Visual-
visual components to let users actively analyze and explore the                  ization of (multimedia) dependencies from big data. Multimedia Tools and
evolution of holding rfds discovered through incremental fd or                   Applications 78, 23 (2019), 33151–33167.
                                                                             [6] Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2020. Discover-
rfd discovery algorithms.                                                        ing Relaxed Functional Dependencies based on Multi-attribute Dominance.
   In the future, we would like to investigate the analysis tasks                IEEE Transactions on Knowledge and Data Engineering (2020), To appear.
involved during the incremental discovery of rfds by conducting              [7] Loredana Caruccio, Giuseppe Polese, and Genoveffa Tortora. 2016. Synchro-
                                                                                 nization of queries and views upon schema evolutions: A survey. ACM Trans-
interviews to potential users. The latter can belong to many                     actions on Database Systems (TODS) 41, 2 (2016), 1–41.
 [8] Wei Chen, Cong Xie, Pingping Shang, and Qunsheng Peng. 2017. Visual
     analysis of user-driven association rule mining. Journal of Visual Languages
     & Computing 42 (2017), 76–85.
 [9] Luca Corcella, Marco Manca, Fabio Paternò, and Carmen Santoro. 2018. A
     Visual Tool for Analysing IoT Trigger/Action Programming. In International
     Conference on Human-Centred Software Engineering. Springer, 189–206.
[10] MC Ferreira De Oliveira and Haim Levkowitz. 2003. From visual data explo-
     ration to visual data mining: a survey. IEEE Transactions on Visualization and
     Computer Graphics 9, 3 (2003), 378–394.
[11] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. 2007.
     Duplicate record detection: A survey. IEEE Transactions on Knowledge and
     Data Engineering 19, 1 (2007), 1–16.
[12] Daniel A Keim, Jörn Schneidewind, and Mike Sips. 2004. CircleView: a new ap-
     proach for visualizing time-related multidimensional data sets. In Proceedings
     of the working conference on Advanced visual interfaces. ACM, 179–182.
[13] Jyrki Kivinen and Heikki Mannila. 1995. Approximate inference of functional
     dependencies from relations. Theoretical Computer Science 149, 1 (1995), 129–
     149.
[14] Sebastian Kruse, David Hahn, Marius Walter, and Felix Naumann. 2017.
     Metacrate: Organize and analyze millions of data profiles. In Proceedings
     of the 2017 ACM on Conference on Information and Knowledge Management.
     ACM, 2483–2486.
[15] Sebastian Kruse and Felix Naumann. 2018. Efficient discovery of approximate
     dependencies. Proceedings of the VLDB Endowment 11, 7 (2018), 759–772.
[16] Chien-Min Lin, Yu-Lung Hsieh, Kuo-Cheng Yin, Ming-Chuan Hung, and Don-
     Lin Yang. 2013. ADMiner: An Incremental Data Mining Approach Using a
     Compressed FP-tree. JSW 8, 8 (2013), 2095–2103.
[17] Adam Marcus, Michael S Bernstein, Osama Badar, David R Karger, Samuel
     Madden, and Robert C Miller. 2012. Processing and visualizing the data in
     tweets. ACM SIGMOD Record 40, 4 (2012), 21–27.
[18] Felix Naumann. 2014. Data profiling revisited. ACM SIGMOD Record 42, 4
     (2014), 40–49.
[19] Thorsten Papenbrock, Tanja Bergmann, Moritz Finke, Jakob Zwiener, and
     Felix Naumann. 2015. Data profiling with Metanome. Proceedings of the VLDB
     Endowment 8, 12 (2015), 1860–1863.
[20] Thorsten Papenbrock and Felix Naumann. 2016. A hybrid approach to func-
     tional dependency discovery. In Proceedings of the 2016 International Conference
     on Management of Data. ACM, 821–833.
[21] Hemant Saxena, Lukasz Golab, and Ihab F Ilyas. 2019. Distributed Discovery
     of Functional Dependencies. In IEEE 35th International Conference on Data
     Engineering (ICDE ’19). IEEE, 1590–1593.
[22] Philipp Schirmer, Thorsten Papenbrock, Sebastian Kruse, Dennis Hempfing,
     Torben Meyer, Daniel Neuschäfer-Rube, and Felix Naumann. 2019. DynFD:
     Functional Dependency Discovery in Dynamic Datasets. In Proceedings of the
     22nd International Conference on Extending Database Technology (EDBT ’19).
     253–264.
[23] Vinícius Segura and Simone DJ Barbosa. 2017. Historyviewer: Instrumenting
     a visual analytics application to support revisiting a session of interactive
     data analysis. Proceedings of the ACM on Human-Computer Interaction 1, EICS
     (2017), 11.
[24] Yoones A. Sekhavat and Orland Hoeber. 2013. Visualizing Association Rules
     Using Linked Matrix, Graph, and Detail Views. International Journal of Intelli-
     gence Science 3 (2013), 34–49.
[25] Shaoxu Song and Lei Chen. 2013. Efficient discovery of similarity constraints
     for matching dependencies. Data & Knowledge Engineering 87 (2013), 146–166.
[26] Ziheng Wei and Sebastian Link. 2019. Discovery and ranking of functional
     dependencies. In IEEE 35th International Conference on Data Engineering (ICDE
     ’19). IEEE, 1526–1537.