=Paper= {{Paper |id=Vol-2841/BigVis_4 |storemode=property |title= Real-time visualization of profiling metadata upon data insertions |pdfUrl=https://ceur-ws.org/Vol-2841/BigVis_4.pdf |volume=Vol-2841 |authors=Loredana Caruccio,Stefano Cirillo,Vincenzo Deufemia,Giuseppe Polese |dblpUrl=https://dblp.org/rec/conf/edbt/CaruccioCDP21 }} == Real-time visualization of profiling metadata upon data insertions== https://ceur-ws.org/Vol-2841/BigVis_4.pdf
                        Real-time visualization of profiling metadata
                                    upon data insertions
                        Loredana Caruccio, Stefano Cirillo, Vincenzo Deufemia, Giuseppe Polese
                                                            Department of Computer Science
                                                                   University of Salerno
                                                                    Fisciano (SA), Italy
                                                      {lcaruccio,scirillo,deufemia,gpolese}@unisa.it

ABSTRACT                                                                                   Many data cleaning approaches exploit profiling metadata, that
Data-intensive processes must deal with the problem of monitor-                        permit to describe properties that must hold on data, and conse-
ing the quality of data. To this end, metadata can be exploited                        quently they can be used to monitor the quality of the data them-
in order to highlight errors and support the cleaning of data.                         selves. Examples of metadata are unique column combinations,
In this paper, we present a novel DBMS plugin for validating                           functional dependencies, denial constraints, and so on. More-
profiling metadata during data insertions, aiming to assist the                        over, nowadays the exploitation and validation of these metadata
user in a-priori checking the quality of data being inserted into                      are enhanced thanks to recent data profiling algorithms, which
a database. It has been implemented within the MySQL Work-                             enable their automatic discovery from “big” datasets [1, 4, 10, 21].
bench client and is able to intercept and validate data insertion                          DataBase Management Systems (DBMSs) have been recog-
queries. More specifically, using such a plugin it is possible to                      nized as a useful tool enabling easy interactions between the user
verify in real-time whether the data to be inserted into a database                    and the stored data. In particular, they permit to inquire data
instance will produce some violations on specific metadata, such                       with little effort [2]. They can also be used to perform data fusion
as unique column combinations or functional dependencies. A                            activities, after schema matching processes [5]. In such scenarios,
user study highlighted the effectiveness of the proposed plugin                        it would be desirable that the validation of profiling metadata
by also emphasizing its strengths and weaknesses.                                      be managed by users through the definition of some constraints.
                                                                                       However, to the best of our knowledge, DBMSs do not permit to
                                                                                       verify the validity of metadata, such as fd’s satisfiability, on the
KEYWORDS
                                                                                       current database instance, and/or verify whether some of them
Data profiling, Functional dependencies, Data insertions, MySQL                        could be invalidated upon data insertion operations.
workbench                                                                                  In this paper we introduce a novel visual tool named indi-
                                                                                       tio (INteracting with metaData during data InserTIOns), which
                                                                                       permits to monitor in real-time, and possibly validate, specific
1    INTRODUCTION                                                                      profiling metadata upon data insertion operations. In particu-
                                                                                       lar, the tool has been implemented as a plugin of the graphical
Data quality is becoming a vital activity in most application do-                      client MySQL Workbench, and it enables the user to intercept
mains. The presence of errors and inconsistencies in the data                          data insertion queries, in order to validate specified (or uploaded)
drastically reduce the value of data and processing capabilities                       metadata before the insertion is committed. In this way, a user
on them [13]. Nowadays, data-intensive systems are involved                            can check the correctness of the data s/he is planning to insert,
in everyday life and enable complex analysis based on the data                         and decide either to force the insertion nonetheless the data, or
they process. Examples are data warehousing and data mining                            try to fix them. Consequently, inditio manages the specification
systems, through which statistics on data, classification tasks,                       (uploading) of metadata and the visualization of validation re-
and in general novel insights can be generated. Data is a valuable                     sults together with values inducing possible violations. Moreover,
asset for customers, companies, and governments, which yields                          different statistical counters and visual components enable users
their quality to have a fundamental role in data governance mod-                       to have an overview of the general impact of data insertions
els [12]. Nevertheless, the possible presence of dirty data might                      on the considered metadata. The current implementation of the
produce negative effects on their results. To this end, big efforts                    plugin is able to manage unique column combinations (uccs)
have been devoted to cleaning data before employing them in                            and functional dependencies (fds), but it could be extended to
data-intensive processes. In fact, many approaches to automat-                         include other kinds of profiling metadata.
ically perform cleaning tasks, such as de-duplication, missing                             In general, the goal that guided such work is threefold: i)
value imputation, and entity resolution represent a landmark for                       improve the quality of data by driving users in the data insertion
guaranteeing a proper drawing of the reality through the data de-                      process, ii) sensitize more users to the data profiling concepts, and
scribing it. Such approaches permit to automate and/or optimize                        iii) support an implicit analysis on the significance of metadata
data preparation tasks at semantic-level, which would otherwise                        from domain experts.
require bigger efforts, tedious and long lasting pre-processing                            The paper is organized as follows. Section 2 describes ap-
activities. In fact, it is well-known that data scientists spend most                  proaches to data quality and data profiling tools. Section 3 presents
of their time in data preparation tasks [18].                                          the theoretical foundations of considered profiling metadata. Sec-
                                                                                       tion 4 presents inditio, whereas Section 5 reports a user study
© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceed-
                                                                                       we performed to analyze its effectiveness and usability. Finally,
ings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus)       summary and future directions are included in Section 6.
on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0)
2    RELATED WORK                                                        kinds of properties, ranging from statistics, domain cardinalities,
Metadata have been recognized as a fundamental tool for guar-            frequent patterns to clusters, outliers, and data dependencies. All
anteeing good levels of quality to the extension of data, mainly         of them might be exploited in several advanced database oper-
due to their involvement in data cleaning processes [19]. Among          ations, such as query optimization, data cleaning, and so forth.
the different data cleaning activities, it is possible to find: schema   The proposed plugin considers two specific types of metadata,
mapping, de-duplication, classification and mastering, spotting          e.g., Unique Column Combination (ucc) and Functional Depen-
errors and violations (e.g., outliers), repairing incorrect values,      dency (fd), which are described in what follows, after a brief
missing value imputation [16].                                           introduction to relational databases.
   Data cleaning activities have also been included in several                A relational database schema R is defined as a collection of
data preparation commercial tools [15]. Two of the main frame-           relation schemas (𝑅1 ,. . ., 𝑅𝑛 ), where each 𝑅𝑖 is defined over a
works that can be used to perform data cleaning are: Llunatic            set 𝑎𝑡𝑡𝑟 (𝑅𝑖 ) of attributes (𝐴1 ,. . ., 𝐴𝑚 ). Each attribute 𝐴𝑘 has as-
[14] and HoloClean [22]. The first represents one of the first uni-      sociated a domain 𝑑𝑜𝑚(𝐴𝑘 ), which can be finite or infinite. A
form frameworks for data-cleaning activities, which also intro-          relation instance (or simply a relation) 𝑟𝑖 of 𝑅𝑖 is a set of tuples
duced novel semantics and notions in the data cleaning research          (𝑡 1, . . . , 𝑡𝑝 ) such that ∀ 𝐴𝑘 ∈ 𝑎𝑡𝑡𝑟 (𝑅𝑖 ) 𝑡 𝑗 [𝐴𝑘 ] ∈ 𝑑𝑜𝑚(𝐴𝑘 ), where
area. The latter relies on classical concepts, such as integrity         𝑡 𝑗 [𝐴𝑘 ] denotes the projection of 𝑡 𝑗 onto 𝐴𝑘 . A database instance
constraints and external data sources, but also on statistical prop-     𝑟 of R is a collection of relations (𝑟 1 ,. . .,𝑟𝑛 ), where 𝑟𝑖 is a relation
erties of the input data. The main goal of HoloClean is to auto-         instance of 𝑅𝑖 , for 𝑖 ∈ [1, 𝑛].
matically generate a probabilistic program that performs data                 In the context of relational databases, one of the main property
cleaning.                                                                is represented by candidate keys. They permit to define possible
   Data cleaning and metadata verification activities have also          tuple identifiers of a relation instance since no repetition in value
been included into, mainly commercial, data governance suites,           combinations are allowed. They can be effectively identified by
such as Talend [23] and IBM InfoShere [11]. Among the many               exploiting unique column combinations (uccs).
advanced database operations, these suites permit to validate and             ucc definition. A ucc over a relation schema 𝑅 is a sets of
possibly to graphically check data quality constraints.                  attributes 𝐾 ⊆ 𝑎𝑡𝑡𝑟 (𝑅) such that given an instance 𝑟 of 𝑅, for
   The possibility of exploiting metadata is given by the existence      every pair of tuples (𝑡 1 , 𝑡 2 ) in 𝑟 then 𝑡 1 [𝐾] ≠ 𝑡 2 [𝐾].
of algorithms capable of discovering them from big data sources               Another relevant property of a relational database is repre-
[1, 4, 10, 21]. Metadata discovery processes are also provided           sented by functional dependencies (fds), which are used to im-
by effective platforms for data profiling, such as the Metanome          prove the quality of database schemas and to reduce manipulation
project [20], which embeds several algorithms to automatically           anomalies.
discover complex metadata, including functional and inclusion                 fd definition. An fd over a database schema R is a statement
dependencies; and Metacrate [17], which permits the storage              𝑋 → 𝑌 (𝑋 implies 𝑌 ) defined between two sets of attributes
of different meta-data and their integration, enabling users to          𝑋, 𝑌 ⊆ 𝑎𝑡𝑡𝑟 (R), such that, given an instance 𝑟 of R, 𝑋 → 𝑌 is
perform several ad-hoc analysis.                                         satisfied in 𝑟 if and only if for every pair of tuples (𝑡 1 , 𝑡 2 ) in 𝑟 ,
   Nevertheless, since automatic discovery processes could out-          whenever 𝑡 1 [𝑋 ] = 𝑡 2 [𝑋 ], then 𝑡 1 [𝑌 ] = 𝑡 2 [𝑌 ]. 𝑋 and 𝑌 represent
put many metadata, some novel proposals started to visually              the Left-Hand-Side (LHS) and Right-Hand-Side (RHS) of the fd,
manage the complexity related to an effective visualization of           respectively.
metadata holding on a given dataset, by using novel metaphors                 In general, an fd is said to be non-trivial if and only if 𝑋 ∩𝑌 = ∅.
for representing metadata at different levels of detail [9]. More-       Moreover, an fd is said to be minimal if and only if there is no
over, other proposals allow users to explore how metadata change         attribute 𝐵 ∈ 𝑋 such that 𝑋 \𝐵 → 𝑌 holds on 𝑟 .
over time, and to compare the results obtained among different
time-slots [6], or to represent how discovery results change into        4 INDITIO: A MYSQL WORKBENCH PLUGIN
the search space [7].                                                    In this section, we describe inditio a MySQL Workbench plugin
   All of the above-mentioned approaches try to repair/analyze           capable of validating profiling metadata upon the insertion of new
available datasets aiming at improving their quality and/or de-          tuples in a database. MySQL Workbench1 is the official graphical
tect/visualize possible holding metadata. On the contrary, the           client of MySQL [3], which covers many functionalities, among
proposed inditio plugin aims at improving the quality of data,           which the most important are the possibility 𝑖) to graphically
trying to directly make users, i.e., who define data, aware of the       create models of database schemas, 𝑖𝑖) to edit tables, columns,
possible errors that they can introduce into database instances.         indices, and so forth, and 𝑖𝑖𝑖) to create and manage connections to
For this reason, we propose a tool that could be properly inte-          database servers, along with providing the capability to execute
grated into a DBMS. In this way, users and/or domain experts can         SQL queries using the built-in SQL Editor (see Figure 1).
approach this kind of concepts by simply using the environment              inditio has been implemented by exploiting the libraries ex-
they are familiar with. In other words, inditio aims at including        posed from Oracle, to directly interact with the main components
data profiling discovery results within DBMSs, facilitating their        of the software. Thanks to the most recent versions of the MySQL
interpretation to users through a plugin that also includes several      Workbench source packages, it was possible to develop the pro-
visual components. To the best of our knowledge, in the litera-          posed plugin using Python 2.7. Although this is not the latest
ture, there are no other DBMS integrated plugins considering the         version of this programming language, it is the only supported
interaction among data definition statements and metadata.               version by MySQL Workbench. In fact, it is important to notice
                                                                         that MySQL Workbench has been developed by exploiting C++ li-
3    DATA PROFILING                                                      braries, and supports Python libraries only by means of a wrapper
                                                                         that is able to translate Python code into C++ code. This makes
Collecting metadata from big datasets is the goal of the data
profiling research area [19]. Profiling metadata refer to many           1 https://www.mysql.com/it/products/workbench/
                                                                        Finally, it is always possible to “Copy Selected Metadata” as text
                                                                        by means of a specific button.
                                                                           Notice that metadata are described through letters or numbers,
                                                                        e.g., alias, in order to identify attributes. This facilitates users in
                                                                        focusing on attributes and/or in defining new metadata. In fact,
                                                                        possible long (or inappropriate) attribute names could confuse
                                                                        the user. However, inditio provides a suitable form to show the
                                                                        mapping between attribute names and their associated alias (see
        Figure 1: The MySQL Workbench SQL Editor.                       Figure 2(c)).
                                                                           All selected metadata can be validated by clicking on “Run
                                                                        Validation”, which triggers the execution of a validation module
MySQL Workbench a real challenge for external developers look-
                                                                        whose aim is to check if the new tuples violate the selected
ing to extend its functionalities, since it is not always possible to
                                                                        metadata. According to the validation process, each metadata
install new modules. For this reason, the most popular currently
                                                                        can be classified in one of the following categories:
available plugins created by external developers 2 are divided
into two categories: 𝑖) MySQL database documentation genera-                • Valid Metadata, when the new tuples do not produce any
tion; and 𝑖𝑖) MySQL Workbench export. Both types of plugins                   violation;
work in batch and are bound to execute a single operation. On               • Not Valid Metadata, when the new tuples entail at least
the contrary, inditio fully exploits the functionalities of MySQL             one violation; or
Workbench as described in the following section.                            • Impossible to validate, when the metadata cannot be vali-
                                                                              dated. This occurs when the user introduces errors in the
4.1     Interacting with metadata upon data                                   metadata, such as when the attribute names do not exist
                                                                              in the considered database.
        insertions
inditio is able to intercept data insertion queries provided by            Example 4.1. Let us consider a database storing smartphone
users into the SQL Editor. Due to the possible big number of            characteristics. Figure 2(e-g) show validation results of the con-
errors that can be introduced during the insertion of new tuples,       sidered fds (Figure 2(d)) according to the new tuples the user is
DBMSs should enable users to visualize profiling metadata that          planning to insert (see Figure 2(a)). In particular, three out of six
newly inserted data could possibly invalidate, by also giving the       fds are valid, two are invalid, and one cannot be validated. In
possibility to interact with them. For this reason, it should be nec-   fact, K,M → D includes attribute M that does not appear in the
essary to evaluate such metadata upon data insertion operations.        considered database. Instead, C → E, i.e., ram → display_ppi, is
To this end, inditio extends MySQL Workbench functionalities            invalidated if the new tuples are inserted.
by enabling users to validate uccs and fds upon the insertion of
new tuples.                                                                In general, the impact of the new tuples on the considered
   Figure 2 shows the general visual interface of inditio. In gen-      metadata is summarized by inditio in a new form, named re-
eral, it permits to evaluate the impact of new data on a set of         port form, shown in Figure 4. This form graphically shows the
holding metadata. Thus, it enables users to visualize the new tu-       percentage of validation/invalidation produced on the selected
ples being inserted (Figure 2(a)), the metadata to validate (Figure     metadata by the tuples the user is planning to insert. Moreover,
2(b-d)), and the results of the metadata validation process (Figure     the report form ranks database attributes in descending order ac-
2(e-h)). Moreover, inditio also provides several functionalities        cording to the number of invalidated metadata containing them.
enabling users to interact with both SQL statements and meta-           More specifically, the form represents this kind of information
data, as described in the following. A demonstration video of           according to the type of metadata, i.e., by splitting the informa-
inditio is available on YouTube3 .                                      tion about invalidation on LHS and RHS when considering fds
                                                                        (see Figure 4(a)).
    4.1.1 Analysis of metadata. The main novelty introduced by
inditio is the possibility to evaluate some metadata directly into         Example 4.2. Figure 4(a) shows the fd validation report for
the MySQL Workbench. In fact, a user can upload uccs or fds by          the validation results represented in Figure 2. In particular, the
selecting the type of metadata s/he plans to monitor. In particular,    form shows that the impact of invalidations is 33% of the ana-
the metadata will be directly uploaded through a default file, but      lyzed metadata. Moreover, among the attributes involved in the
they can be changed by selecting a new JSON file (Figure 2(b)).         invalidated metadata, attribute C (e.g., ram) is involved in two
    The uploaded metadata are shown in the middle form (Figure          invalidated fds; whereas attributes E (e.g., cpu) and D (e.g., dis-
2(d)), which is customized according to the type of metadata the        play_ppi) are involved in one invalidated fd. This could suggest
user selects. For instance, fds are divided into LHS and RHS, each      to verify the values of attributes ram, cpu, and display_ppi of the
containing some attributes, in order to graphically visualize the       new tuples.
implication property. Instead, the ucc customized form visualizes          4.1.2 Interacting with data insertions. inditio not only en-
each metadata by considering a single group of attributes (Figure       ables users to visualize the impact of new tuples on holding
3). Aside from the metadata uploaded via file, a user can always        metadata, but it also permits them to interact with the new data.
add new metadata. Moreover, through this form it is possible to         First of all, INSERT INTO statements can always be modified
select which metadata must be considered during the validation.         within the inditio interface (see Figure 2(a)), triggering subse-
Indeed, each metadata can be selected by means of the check             quent validation processes with modified tuples. As said above,
box, and/or by using the “Select All” or “Unselect All” buttons.        inditio freezes the execution of INSERT INTO statements while
2 https://github.com/cciro94/MySQL-Workbench-Plugins                    verifying the possibility to correct values being inserted so as
3 https://youtu.be/u03Vftge8pA                                          not to invalidate holding metadata. Nevertheless, inditio always
                                                Figure 2: The inditio visual interface.


                                                                       results represented in Figure 2. In particular, the form shows
                                                                       that four specific values on attribute C (e.g. ram), i.e. 1Gb, 2Gb,
                                                                       3Gb, 512mb, each implies different values of attribute E (e.g. dis-
                                                                       play_ppi). Moreover, it is also possible to see that the value 2Gb
                                                                       is the one implying the highest number of distinct values. In-
                                                                       stead, Figure 5(a) shows that for the ucc K,E,L (e.g. id_brand,
                                                                       display_ppi, id_os) there are six specific value combinations in-
                                                                       ducing duplicate values. In general, this form could suggest to
               Figure 3: Visualization of uccs.                        correct values on the new tuples concerning attributes involved
                                                                       in the considered violated metadata, or to exclude the metadata
                                                                       from the validation process.
gives users the possibility to overlook possible violations of meta-
data and to force the execution of data insertion operations by        5     USER STUDY
means of the “Execute Statements” button (see Figure 2(a)).
                                                                       The user study presented in this section aims to show that inditio
   On the other hand, one of the main goals of inditio is to help
                                                                       makes metadata validation a simple and effective process for
users in correcting possible errors. To this end, after a validation
                                                                       improving data quality.
process (i.e., by clicking “Run Validation”), a user can visualize
data yielding violations by interacting with the “Not Valid Meta-      5.1    Method
data” form (see Figure 2(f)). More specifically, by clicking on any
metadata in such form, inditio shows a violation detail form, as       We recruited 86 students majoring in Computer Science who
shown in Figure 5. In particular, concerning fds, apart from the       just attended the Fundamentals of database. We also recruited 3
details of the selected metadata, the violation form describes for     Ph.D. students and 1 Ph.D., all of which were familiar with the
each LHS value combination involved in a violation: 𝑖) the value       given domain. Statistics about participants have been collected
combination of the LHS, 𝑖𝑖) the corresponding distinct values          through a background survey, as shown in Table 1(Q1-Q4), and
found on the RHS, and 𝑖𝑖𝑖) their total number (see Figure 5(a)).       whose results are reported in Figure 6. In particular, about 85%
Instead, concerning uccs, apart from the details of the selected       of the recruited people were men, 15% were women, and most of
metadata, the violation form describes for each value combination      them were undergraduate students. Moreover, on average they
involved in value duplication: 𝑖) the value combination involved       declared, through a Likert scale, a medium level of knowledge
in a duplication, 𝑖𝑖) the number of duplications (see Figure 5(b)).    concerning MySQL and MySQL Workbench. Before the evalu-
The latter should represent the functionality that drives users in     ation started, participants underwent a 45-min tutorial on the
accomplishing the best possible correction of errors.                  theoretical foundations of data profiling and inditio.
                                                                          Each of the 90 participants was given a database concerning
  Example 4.3. Figure 5(a) shows the violation details of the          personal data, three data insertion statements, and two sets of
fd C → E (e.g. ram → display_ppi) according to the validation          uccs and fds metadata, respectively. Moreover, we requested
      (a) A form showing validation statistics after the fd validation process.                        (b) A form showing validation statistics after the ucc validation process.


                                                        Figure 4: Validation statistics provided by inditio.




(a) A form showing values that invalidate metadata after the fd validation process.                (b) A form showing values that invalidate metadata after the ucc validation process.



                                                                Figure 5: Violation details of inditio.


them to check the correctness of data insertion statements ac-                                     validate the metadata by directly analyzing the data source or
cording to the provided metadata, and if necessary, to correct                                     using the SQL language to compose specific queries. To this end,
statements aiming to guarantee the validity of the provided meta-                                  they could interact only with the tools already integrated into
data. More specifically, we conducted a within subjects study                                      MySQL Workbench, both to validate the metadata and to correct
by considering two scenarios: with and without inditio, and                                        the values within the data.
requested to accomplish the task in one scenario first, and then                                      After completing the assigned tasks, participants were re-
with the other one. Half participants considered first the scenario                                quested to fill some questionnaires, aiming to highlight the ad-
without inditio, while the remaining ones used inditio first.                                      vantages and drawbacks of inditio (see Table 1(Q4-Q20)). More
Notice that, the provided data insertion statements were different                                 specifically, questions from Q4 to Q8 have been filled after par-
but equivalent in complexity. In particular, in the case partici-                                  ticipants performed each task (with and without inditio, or vice
pants performed the tasks without the tool, they were able to                                      versa), whereas the remaining ones have been included in a final
                                                                                                   survey. Moreover, questions from Q4 to Q20 are quantitative, and
                                                                                                   they have been measured through a Likert scale, ranging from 1,
                High School Diploma   Bachelor Degree      Master Degree             Ph.D.         mapping “Strongly disagree” response, to 5, mapping “Strongly
 Q2                                       81                                          5      4 1   agree” response. Finally, to further evaluate the effectiveness of
                                                                                                   both processes (with and without inditio) we measured the time
                                                                                                   required for completing the task and the number of errors.
                   1 = Sufficient     2             3      4         5 = Very Good
 Q3 5       8                         49                                   23                5
                                                                                                   5.2      Results and Discussion
 Q4    5         16                            47                               19            3    Figure 8 shows the results achieved from each participant while
                                                                                                   executing the assigned tasks with both the compared scenarios
                                                                                                   (without and with the tool). The results achieved from the tasks
                                                                                                   performed without the tool have been considered as the compar-
  Figure 6: Statistics concerning involved participants.                                           ative baseline (purple line), while the results obtained with the
                  Survey               Alias   Question
                                               Gender


                          Background
                                        Q1
                                        Q2     Qualification
                                        Q3     Level of knowledge of MySQL
                                        Q4     Level of knowledge of MySQL Workbench
                                               I completed the tasks quickly and easily
                         Comparative

                                        Q5
                                        Q6     The instructions for completing the tasks are clear and easy to read
                                        Q7     The metadata validation process has been simple
                                        Q8     The values that invalidated the metadata have been easy to find
                                       Q9      The tool is simple to use
                                       Q10     The tool is simple to learn
                                       Q11     I was able to retrieve back the process, whenever I made a mistake
                                       Q12     The tool shows the information very clearly
                                       Q13     The tool is pleasant to use
                                       Q14     The user interface is pleasant and informative
                                       Q15     The tool made transparent the execution of the underlying validation processes
                                               It was simple to understand the metadata syntax
                      Final




                                       Q16
                                       Q17     The tool presents all the features I expected
                                       Q18     I am in general satisfied about the tool
                                       Q19     In the future, I would like to use the tool
                                       Q20     The tool simplified the validation process with respect to the manual process
                                       Q21     What is your general impression about the tool?
                                       Q22     Do you have any improvements to suggest?
                                       Q23     Which feature of the tool did you like the least?
                                       Q24     Which feature of the tool did you like the most?
                                                     Table 1: Questions proposed to participants.



proposed tool were described by the bars. In particular, the plot               simple to learn, and pleasant to use (see Figure 9: answers for Q10
highlights the improvement obtained by using inditio, e.g., a                   and Q13). Nevertheless, some work should be made to further
value of 2 indicates that the results achieved with inditio are 2               improve the general usability of the plugin and the transparency
times better than those achieved without it, whereas a value less               of the validation process, according to answers to questions to
than 1 indicates the opposite case. In general, it can be observed              Q11, Q12, and Q15.
that most of the participants performed better with inditio, even                  The open questions in the final questionnaire (see Q21-Q24
if satisfactory results have been achieved also without the use                 in Table 1) aimed at highlighting the strengths and weakness
of the plugin. Concerning the time employed to complete the                     of inditio. In particular, concerning the general impressions
assigned tasks, on average participants took thirty minutes with                about the plugin (see Q21 in Table 1), many participants said that
inditio, ranging from five to sixty-seven minutes, and forty-five               inditio shows a simple and intuitive interface, in which the com-
minutes without it, ranging from five to eighty-four minutes. In                ponents appear well organized in the frame. Another part of them
general, we noticed that the manual validation task times were                  expressed their opinion on the usefulness and efficiency of the
particularly long, especially for users with less knowledge of
MySQL. On the contrary, with inditio almost all users have
reduced the time of the validation processes by more than 50%.                                Without INDITIO                 With INDITIO
    Figure 7 depicts the box plots derived from the answers (on a                5.0
Likert scale from 1 to 5) to questions ranging from Q5 to Q8. In
                                                                                 4.5
particular, users answered the same questions after performing
tasks for each considered scenario (with and without the tool).                  4.0
A boxplot shows the median (horizontal lines), the interquar-
                                                                                 3.5
tile ranges (boxes), the largest and the smallest observations
(whiskers). By comparing the results achieved without the plugin                 3.0
(the first four box plots in Figure 7) against those with the plugin,
                                                                                 2.5
(the remaining box plots) we can conclude that participants felt
more comfortable and effective when working with inditio.                        2.0
    Concerning the quantitative questions in the final question-
                                                                                 1.5
naire, inditio obtained the general agreement of participants,
while evaluating its usability and effectiveness (see Figure 9).                 1.0
In particular, according to answers for Q20, the capability of                         Q5     Q6          Q7    Q8    Q5     Q6         Q7   Q8
simplifying the metadata validation process has been widely rec-
ognized to inditio. The latter has turned out to be comfortable                 Figure 7: Comparative boxplots showing distribution of
and useful according to answers to questions to Q18 and Q19,                    user answers to the quantitative questionnaire.
              2.50
                                                                                                                       With INDITIO   Without INDITIO

              2.25


              2.00


              1.75
Improvement




              1.50


              1.25


              1.00


              0.75


              0.50
                      1
                      2
                      3
                      4
                      5
                      6
                      7
                      8
                      9
                     10
                     11
                     12
                     13
                     14
                     15
                     16
                     17
                     18
                     19
                     20
                     21
                     22
                     23
                     24
                     25
                     26
                     27
                     28
                     29
                     30
                     31
                     32
                     33
                     34
                     35
                     36
                     37
                     38
                     39
                     40
                     41
                     42
                     43
                     44
                     45
                     46
                     47
                     48
                     49
                     50
                     51
                     52
                     53
                     54
                     55
                     56
                     57
                     58
                     59
                     60
                     61
                     62
                     63
                     64
                     65
                     66
                     67
                     68
                     69
                     70
                     71
                     72
                     73
                     74
                     75
                     76
                     77
                     78
                     79
                     80
                     81
                     82
                     83
                     84
                     85
                     86
                     87
                     88
                     89
                     90
                                                                           Person ID


                     Figure 8: Distributing scores achieved by participants for each analyzed scenario (with and without inditio).


plugin, claiming that “The tool has certainly been successful in its               Finally, we have asked users some suggestions for enhanc-
intent, significantly speeding up the time for validating metadata”.            ing inditio (see Q24 in Table 1). To this end, some users have
Moreover, some of the most interesting comments have been pro-                  suggested integrating new metadata, also allowing them to simul-
vided from users that are less familiar with the research context               taneously validate multiple metadata. Other users have suggested
and with MySQL Workbench. In fact, they report that inditio                     improving the integration with systems based on the Linux ar-
is able to improve the understanding of MySQL Workbench and                     chitecture. In fact, it has been found that some of the users using
to show that metadata potentially allows them to extract further                these operating systems tend to view some reports differently
knowledge from data, which often is not clearly visible. Only a                 from users who use Windows systems. However, this is due to
small part of inexperienced users asserted that the plugin inter-               the compatibility problems between the technologies underly-
face might be initially unclear, claiming that “Initially, the tool             ing the MySQL Workbench and different operating systems. In
seems difficult to understand. Then, once I understand how to use it,           the future, these compatibility issues might be solved with new
it is very useful for checking the validity of fd and ucc”. However,            software versions.
by practising with it, inditio allowed them to easily understand                   In summary, the four open questions of the final questionnaire
its features and to become familiar with its environment.                       revealed that some participants remarked some limitations of
    In addition, concerning the specific features that participants             the inditio user interface. Moreover, they would like to receive
liked the most and the least (see Q22-Q23 in Table 1), experienced              more hints during the statement modification process, accord-
users have greatly appreciated the functionality of identifying                 ing to validation results. Conversely, they positively judged the
values yielding the invalidation of metadata. In fact, some of them             intuitiveness the metadata validation and the error detection
claimed that “One of the most interesting features is identifying               processes. Moreover, they welcomed the tool and recognized its
the values that invalidate the metadata. This functionality could               usefulness.
be directly integrated into the MySql Workbench suite.” Instead,
inexperienced users have shown interest for the report forms (see
Figure 5) and for the simplicity through which the plugin could be              6      CONCLUSION
integrated within the MySQL Workbench suite as a simple plugin.                 In this paper we presented inditio, a MySQL plugin enabling
Although most of the comments were positive, we also investi-                   users to assess the quality of data they planned to insert into a
gated the features they liked the least. Among them, many users                 database. In particular, the plugin enables a user to observe how
have highlighted that the Query Editor component appears small                  the new tuples can affect the validation of some metadata, e.g.
and does not clearly show statements. However, this is limited                  fds and uccs, assisting him/her in correcting them, if needed. We
by the graphics components included in the MySQL Workbench.                     evaluated inditio by involving more than eighty participants in
For these reasons, we allow the users to directly interact with                 a user study, which demonstrated its usefulness, giving us the
the SQL editor of MySQL Workbench, and to import their state-                   possibility to detect the main characteristics of the tool to be
ments. Other participants have suggested adding further reports                 improved, mainly focusing on the improvement and/or extension
in the interface in order to enhance their understanding of how                 of some visual components.
data insertion statements affect metadata. Only few users have                     The potentiality of inditio yields many possible future direc-
proposed to extend the interface of inditio with new graphical                  tions. In particular, we are currently working on the enrichment
components in order to improve the interaction with both the                    of the set of possible metadata that can be taken into consid-
plugin and the MySQL Workbench.                                                 eration, with particularly emphasis on relaxed fds (rfds) [8].
                                                       1 = Strongly disagree           2            3          4              5 = Strongly agree

  Q9 1 1                           18                                        36                                                                    34

 Q10 1                        14                             28                                                                          47

 Q11   2                      13                        25                                                          33                                                  17

 Q12   2             7                  11                                        40                                                                    30

 Q13   2 2                         14                                     39                                                                       33

 Q14 1           7                                30                                                           33                                                      19

 Q15     3               7                        27                                                      30                                                      23

 Q16 1       5                               22                                                     41                                                             21

 Q17         8                               18                                            37                                                                27

 Q18 1                   11                                             49                                                                              29

 Q19 1 3                 5                              37                                                                                    44

 Q20 1       5                               21                                                                          63




                 Figure 9: Distributing participant answers to the quantitative questions in the final questionnaire.


Moreover, we would like to directly connect the plugin to in-                                        Applications 78, 23 (2019), 33151–33167.
cremental or continuous discovery algorithms [7], so that when                                  [10] Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2020. Mining
                                                                                                     relaxed functional dependencies from data. Data Mining and Knowledge
the user decides to force the insertions also in presence of in-                                     Discovery 34, 2 (2020), 443–477.
validated metadata, such algorithms could automatically update                                  [11] IBM Corporation. 2021. IBM InfoSphere Information Server. https://www.
                                                                                                     ibm.com/analytics/information-server/. last accessed: Feb 23th, 2021.
the set of holding metadata. This would implicitly represent a                                  [12] Wei Dai, Isaac Wardlaw, Yu Cui, Kashif Mehdi, Yanyan Li, and Jun Long. 2016.
means to automatically analyze the significance of metadata,                                         Data profiling technology of data governance regarding big data: review and
                                                                                                     rethinking. In Information Technology: New Generations. Springer International
while discovering them from actual data.                                                             Publishing, Cham, 439–450.
                                                                                                [13] Wenfei Fan, Floris Geerts, and Xibei Jia. 2008. Semandaq: a data quality
REFERENCES                                                                                           system based on conditional functional dependencies. Proceedings of the VLDB
                                                                                                     Endowment 1, 2 (2008), 1460–1463.
[1] Ziawasch Abedjan, Lukasz Golab, Felix Naumann, and Thorsten Papenbrock.
                                                                                                [14] Floris Geerts, Giansalvatore Mecca, Paolo Papotti, and Donatello Santoro.
    2018. Data profiling. Synthesis Lectures on Data Management 10, 4 (2018),
                                                                                                     2013. The LLUNATIC data-cleaning framework. Proceedings of the VLDB
    1–154.
                                                                                                     Endowment 6, 9 (2013), 625–636.
[2] Ibrahim Ahmed Al-Baltah, Abdul Azim Abd Ghani, Ghilan Mohammed Al-
                                                                                                [15] Mazhar Hameed and Felix Naumann. 2020. Data Preparation: A Survey of
    Gomaei, Fua’ad Hassan Abdulrazzak, and Abdulmonem Ali Al Kharusi. 2020.
                                                                                                     Commercial Tools. SIGMOD Rec. 49, 3 (2020), 18–29.
    A scalable semantic data fusion framework for heterogeneous sensors data.
                                                                                                [16] Ihab F. Ilyas and Xu Chu. 2019. Data Cleaning. ACM, New York, NY, United
    Journal of Ambient Intelligence and Humanized Computing (2020), 1–20.
                                                                                                     States.
[3] Oracle Corporation and/or its affiliates. 2021. MySQL™ Workbench Reference
                                                                                                [17] Sebastian Kruse, David Hahn, Marius Walter, and Felix Naumann. 2017.
    Manual. https://dev.mysql.com/doc/workbench/en/. last accessed: Jan 13th,
                                                                                                     Metacrate: Organize and analyze millions of data profiles. In Proceedings
    2021.
                                                                                                     of the 2017 ACM on Conference on Information and Knowledge Management.
[4] Tobias Bleifuß, Sebastian Kruse, and Felix Naumann. 2017. Efficient denial
                                                                                                     Association for Computing Machinery, New York, NY, USA, 2483–2486.
    constraint discovery with hydra. Proceedings of the VLDB Endowment 11, 3
                                                                                                [18] Jay Lee, Hung-An Kao, Shanhu Yang, et al. 2014. Service innovation and smart
    (2017), 311–323.
                                                                                                     analytics for industry 4.0 and big data environment. Procedia Cirp 16, 1 (2014),
[5] Jens Bleiholder and Felix Naumann. 2009. Data fusion. ACM computing surveys
                                                                                                     3–8.
    (CSUR) 41, 1 (2009), 1–41.
                                                                                                [19] Felix Naumann. 2014. Data profiling revisited. ACM SIGMOD Record 42, 4
[6] Bernardo Breve, Loredana Caruccio, Stefano Cirillo, Vincenzo Deufemia, and
                                                                                                     (2014), 40–49.
    Giuseppe Polese. 2020. Visualizing Dependencies during Incremental Discov-
                                                                                                [20] Thorsten Papenbrock, Tanja Bergmann, Moritz Finke, Jakob Zwiener, and
    ery Processes. In Proceedings of the Workshops of the EDBT/ICDT 2020 Joint
                                                                                                     Felix Naumann. 2015. Data profiling with Metanome. Proceedings of the VLDB
    Conference (CEUR Workshop Proceedings), Vol. 2578. CEUR-WS.org, Aachen,
                                                                                                     Endowment 8, 12 (2015), 1860–1863.
    1–8.
                                                                                                [21] Thorsten Papenbrock and Felix Naumann. 2016. A hybrid approach to func-
[7] Loredana Caruccio and Stefano Cirillo. 2020. Incremental discovery of impre-
                                                                                                     tional dependency discovery. In Proceedings of the 2016 International Conference
    cise functional dependencies. Journal of Data and Information Quality (JDIQ)
                                                                                                     on Management of Data. Association for Computing Machinery, New York,
    12, 4 (2020), 1–25.
                                                                                                     NY, USA, 821–833.
[8] Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2016. Relaxed
                                                                                                [22] Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. 2017. Holo-
    Functional Dependencies – A Survey of Approaches. IEEE Transactions on
                                                                                                     Clean: Holistic Data Repairs with Probabilistic Inference. Proc. VLDB Endow.
    Knowledge and Data Engineering 28, 1 (2016), 147–165.
                                                                                                     10, 11 (2017), 1190–1201.
[9] Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2019. Visual-
                                                                                                [23] Talend. 2021. Talend Studio. https://https://www.talend.com/products/
    ization of (multimedia) dependencies from big data. Multimedia Tools and
                                                                                                     data-quality/. last accessed: Feb 23th, 2021.