Real-time visualization of profiling metadata upon data insertions

Loredana Caruccio, Stefano Cirillo, Vincenzo Deufemia, Giuseppe Polese
Department of Computer Science, University of Salerno, Fisciano (SA), Italy
{lcaruccio,scirillo,deufemia,gpolese}@unisa.it

ABSTRACT
Data-intensive processes must deal with the problem of monitoring the quality of data. To this end, metadata can be exploited to highlight errors and support the cleaning of data. In this paper, we present a novel DBMS plugin for validating profiling metadata during data insertions, aiming to assist the user in checking a priori the quality of data being inserted into a database. It has been implemented within the MySQL Workbench client and is able to intercept and validate data insertion queries. More specifically, the plugin makes it possible to verify in real time whether the data to be inserted into a database instance would violate specific metadata, such as unique column combinations or functional dependencies. A user study highlighted the effectiveness of the proposed plugin, also emphasizing its strengths and weaknesses.

KEYWORDS
Data profiling, Functional dependencies, Data insertions, MySQL Workbench

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

1 INTRODUCTION
Data quality has become a vital concern in most application domains. The presence of errors and inconsistencies drastically reduces the value of data and of the processes operating on it [13]. Nowadays, data-intensive systems are involved in everyday life and enable complex analyses based on the data they process. Examples are data warehousing and data mining systems, through which statistics on data, classification tasks, and in general novel insights can be generated. Data is a valuable asset for customers, companies, and governments, so its quality plays a fundamental role in data governance models [12]. Nevertheless, the possible presence of dirty data might negatively affect analysis results. To this end, big efforts have been devoted to cleaning data before employing it in data-intensive processes. In fact, many approaches that automatically perform cleaning tasks, such as de-duplication, missing value imputation, and entity resolution, represent a landmark for guaranteeing that data properly depicts the reality it describes. Such approaches make it possible to automate and/or optimize data preparation tasks at the semantic level, which would otherwise require big, tedious, and long-lasting pre-processing efforts. In fact, it is well known that data scientists spend most of their time on data preparation tasks [18].

Many data cleaning approaches exploit profiling metadata, which describe properties that must hold on data and can consequently be used to monitor the quality of the data themselves. Examples of such metadata are unique column combinations, functional dependencies, denial constraints, and so on. Moreover, the exploitation and validation of these metadata are nowadays enhanced by recent data profiling algorithms, which enable their automatic discovery from “big” datasets [1, 4, 10, 21].

DataBase Management Systems (DBMSs) have been recognized as a useful tool enabling easy interactions between the user and the stored data. In particular, they allow querying data with little effort [2]. They can also be used to perform data fusion activities after schema matching processes [5]. In such scenarios, it would be desirable for the validation of profiling metadata to be manageable by users through the definition of constraints. However, to the best of our knowledge, DBMSs permit neither verifying the validity of metadata, such as the satisfiability of fds, on the current database instance, nor verifying whether some of them would be invalidated by data insertion operations.

In this paper we introduce a novel visual tool named inditio (INteracting with metaData during data InserTIOns), which permits monitoring in real time, and possibly validating, specific profiling metadata upon data insertion operations. In particular, the tool has been implemented as a plugin of the graphical client MySQL Workbench, and it enables the user to intercept data insertion queries in order to validate specified (or uploaded) metadata before the insertion is committed. In this way, users can check the correctness of the data they are planning to insert, and decide either to force the insertion nonetheless, or to try to fix the data. Consequently, inditio manages the specification (uploading) of metadata and the visualization of validation results, together with the values inducing possible violations. Moreover, several statistical counters and visual components give users an overview of the general impact of data insertions on the considered metadata. The current implementation of the plugin manages unique column combinations (uccs) and functional dependencies (fds), but it could be extended to include other kinds of profiling metadata.

In general, the goal that guided this work is threefold: i) improve the quality of data by driving users in the data insertion process, ii) sensitize more users to data profiling concepts, and iii) support an implicit analysis of the significance of metadata by domain experts.

The paper is organized as follows. Section 2 describes approaches to data quality and data profiling tools. Section 3 presents the theoretical foundations of the considered profiling metadata. Section 4 presents inditio, whereas Section 5 reports a user study we performed to analyze its effectiveness and usability. Finally, a summary and future directions are included in Section 6.

2 RELATED WORK
Metadata have been recognized as a fundamental tool for guaranteeing good levels of quality to the extension of data, mainly due to their involvement in data cleaning processes [19]. Among the different data cleaning activities it is possible to find: schema mapping, de-duplication, classification and mastering, spotting errors and violations (e.g., outliers), repairing incorrect values, and missing value imputation [16].

Data cleaning activities have also been included in several commercial data preparation tools [15]. Two of the main frameworks that can be used to perform data cleaning are Llunatic [14] and HoloClean [22]. The former represents one of the first uniform frameworks for data cleaning activities, and also introduced novel semantics and notions in the data cleaning research area. The latter relies on classical concepts, such as integrity constraints and external data sources, but also on statistical properties of the input data. The main goal of HoloClean is to automatically generate a probabilistic program that performs data cleaning.

Data cleaning and metadata verification activities have also been included in mainly commercial data governance suites, such as Talend [23] and IBM InfoSphere [11]. Among many advanced database operations, these suites permit validating and possibly graphically checking data quality constraints.

The possibility of exploiting metadata is given by the existence of algorithms capable of discovering them from big data sources [1, 4, 10, 21]. Metadata discovery processes are also provided by effective data profiling platforms, such as the Metanome project [20], which embeds several algorithms to automatically discover complex metadata, including functional and inclusion dependencies; and Metacrate [17], which permits storing and integrating different metadata, enabling users to perform several ad-hoc analyses.

Nevertheless, since automatic discovery processes can output many metadata, some novel proposals have started to visually manage the complexity of effectively visualizing the metadata holding on a given dataset, by using novel metaphors that represent metadata at different levels of detail [9]. Moreover, other proposals allow users to explore how metadata change over time and to compare the results obtained among different time slots [6], or to represent how discovery results change within the search space [7].

All of the above-mentioned approaches try to repair/analyze available datasets, aiming to improve their quality and/or to detect/visualize possibly holding metadata. On the contrary, the proposed inditio plugin aims to improve the quality of data by directly making users, i.e., those who define the data, aware of the possible errors they can introduce into database instances. For this reason, we propose a tool that can be properly integrated into a DBMS. In this way, users and/or domain experts can approach these concepts by simply using the environment they are familiar with. In other words, inditio aims to include data profiling discovery results within DBMSs, facilitating their interpretation through a plugin that also includes several visual components. To the best of our knowledge, no other DBMS-integrated plugin in the literature considers the interaction between data definition statements and metadata.

3 DATA PROFILING
Collecting metadata from big datasets is the goal of the data profiling research area [19]. Profiling metadata refer to many kinds of properties, ranging from statistics, domain cardinalities, and frequent patterns to clusters, outliers, and data dependencies. All of them might be exploited in several advanced database operations, such as query optimization, data cleaning, and so forth. The proposed plugin considers two specific types of metadata, i.e., Unique Column Combinations (uccs) and Functional Dependencies (fds), which are described in what follows, after a brief introduction to relational databases.

A relational database schema R is defined as a collection of relation schemas (𝑅1, …, 𝑅𝑛), where each 𝑅𝑖 is defined over a set 𝑎𝑡𝑡𝑟(𝑅𝑖) of attributes (𝐴1, …, 𝐴𝑚). Each attribute 𝐴𝑘 has an associated domain 𝑑𝑜𝑚(𝐴𝑘), which can be finite or infinite. A relation instance (or simply a relation) 𝑟𝑖 of 𝑅𝑖 is a set of tuples (𝑡1, …, 𝑡𝑝) such that ∀𝐴𝑘 ∈ 𝑎𝑡𝑡𝑟(𝑅𝑖), 𝑡𝑗[𝐴𝑘] ∈ 𝑑𝑜𝑚(𝐴𝑘), where 𝑡𝑗[𝐴𝑘] denotes the projection of 𝑡𝑗 onto 𝐴𝑘. A database instance 𝑟 of R is a collection of relations (𝑟1, …, 𝑟𝑛), where 𝑟𝑖 is a relation instance of 𝑅𝑖, for 𝑖 ∈ [1, 𝑛].

In the context of relational databases, one of the main properties is represented by candidate keys. They define possible tuple identifiers of a relation instance, since no repetitions in their value combinations are allowed. They can be effectively identified by exploiting unique column combinations (uccs).

ucc definition. A ucc over a relation schema 𝑅 is a set of attributes 𝐾 ⊆ 𝑎𝑡𝑡𝑟(𝑅) such that, given an instance 𝑟 of 𝑅, for every pair of distinct tuples (𝑡1, 𝑡2) in 𝑟, 𝑡1[𝐾] ≠ 𝑡2[𝐾].

Another relevant property of a relational database is represented by functional dependencies (fds), which are used to improve the quality of database schemas and to reduce manipulation anomalies.

fd definition. An fd over a database schema R is a statement 𝑋 → 𝑌 (𝑋 implies 𝑌) defined between two sets of attributes 𝑋, 𝑌 ⊆ 𝑎𝑡𝑡𝑟(R), such that, given an instance 𝑟 of R, 𝑋 → 𝑌 is satisfied in 𝑟 if and only if for every pair of tuples (𝑡1, 𝑡2) in 𝑟, whenever 𝑡1[𝑋] = 𝑡2[𝑋], then 𝑡1[𝑌] = 𝑡2[𝑌]. 𝑋 and 𝑌 represent the Left-Hand Side (LHS) and Right-Hand Side (RHS) of the fd, respectively.

In general, an fd is said to be non-trivial if and only if 𝑋 ∩ 𝑌 = ∅. Moreover, an fd is said to be minimal if and only if there is no attribute 𝐵 ∈ 𝑋 such that 𝑋\𝐵 → 𝑌 holds on 𝑟.

4 INDITIO: A MYSQL WORKBENCH PLUGIN
In this section we describe inditio, a MySQL Workbench plugin capable of validating profiling metadata upon the insertion of new tuples in a database. MySQL Workbench¹ is the official graphical client of MySQL [3], which covers many functionalities, among which the most important are the possibility 𝑖) to graphically create models of database schemas, 𝑖𝑖) to edit tables, columns, indices, and so forth, and 𝑖𝑖𝑖) to create and manage connections to database servers, along with the capability to execute SQL queries using the built-in SQL Editor (see Figure 1).

inditio has been implemented by exploiting the libraries exposed by Oracle to directly interact with the main components of the software. Thanks to the most recent versions of the MySQL Workbench source packages, it was possible to develop the proposed plugin using Python 2.7. Although this is not the latest version of the language, it is the only one supported by MySQL Workbench. In fact, it is important to notice that MySQL Workbench has been developed by exploiting C++ libraries, and supports Python libraries only by means of a wrapper that translates Python code into C++ code.
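As a concrete illustration, the validation semantics the plugin implements follows directly from the ucc and fd definitions of Section 3. The following is a minimal sketch in plain Python (the function names are ours, and the plugin's actual Python 2.7 Workbench code is more involved); tuples are represented as dictionaries from attribute names to values:

```python
def is_ucc(rows, key_attrs):
    # K is a ucc if no two tuples agree on all attributes in K.
    seen = set()
    for t in rows:
        proj = tuple(t[a] for a in key_attrs)
        if proj in seen:
            return False
        seen.add(proj)
    return True

def holds_fd(rows, lhs, rhs):
    # X -> Y holds if tuples agreeing on X also agree on Y.
    rhs_of = {}
    for t in rows:
        x = tuple(t[a] for a in lhs)
        y = tuple(t[a] for a in rhs)
        if rhs_of.setdefault(x, y) != y:
            return False
    return True

# Two tuples sharing the same ram but different display_ppi values:
rows = [{"ram": "2Gb", "display_ppi": 401},
        {"ram": "2Gb", "display_ppi": 326}]
print(holds_fd(rows, ["ram"], ["display_ppi"]))  # False: 2Gb implies two RHS values
print(is_ucc(rows, ["ram", "display_ppi"]))      # True: no duplicated combination
```

Both checks run in a single pass over the instance, which is what makes re-validation on every edited INSERT statement feasible.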
This wrapper-based design makes MySQL Workbench a real challenge for external developers looking to extend its functionalities, since it is not always possible to install new modules. For this reason, the most popular currently available plugins created by external developers² fall into two categories: 𝑖) MySQL database documentation generation; and 𝑖𝑖) MySQL Workbench export. Both types of plugins work in batch mode and are bound to execute a single operation. On the contrary, inditio fully exploits the functionalities of MySQL Workbench, as described in the following section.

Figure 1: The MySQL Workbench SQL Editor.

4.1 Interacting with metadata upon data insertions
inditio is able to intercept the data insertion queries that users submit through the SQL Editor. Due to the possibly large number of errors that can be introduced when inserting new tuples, DBMSs should enable users to visualize the profiling metadata that the newly inserted data could invalidate, also giving them the possibility to interact with such metadata. For this reason, such metadata should be evaluated upon data insertion operations. To this end, inditio extends MySQL Workbench by enabling users to validate uccs and fds upon the insertion of new tuples.

Figure 2 shows the general visual interface of inditio. It permits evaluating the impact of new data on a set of holding metadata. Thus, it enables users to visualize the new tuples being inserted (Figure 2(a)), the metadata to validate (Figure 2(b-d)), and the results of the metadata validation process (Figure 2(e-h)). Moreover, inditio provides several functionalities enabling users to interact with both SQL statements and metadata, as described in the following. A demonstration video of inditio is available on YouTube³.

Figure 2: The inditio visual interface.

4.1.1 Analysis of metadata. The main novelty introduced by inditio is the possibility to evaluate metadata directly within MySQL Workbench. A user can upload uccs or fds by selecting the type of metadata s/he plans to monitor. The metadata are initially loaded from a default file, but they can be changed by selecting a new JSON file (Figure 2(b)). The uploaded metadata are shown in the middle form (Figure 2(d)), which is customized according to the type of metadata the user selects. For instance, fds are divided into LHS and RHS, each containing some attributes, in order to graphically visualize the implication property. Instead, the customized ucc form visualizes each metadata as a single group of attributes (Figure 3). Aside from the metadata uploaded via file, a user can always add new metadata. Moreover, through this form it is possible to select which metadata must be considered during the validation. Indeed, each metadata can be selected by means of its check box, and/or by using the “Select All” or “Unselect All” buttons. Finally, it is always possible to “Copy Selected Metadata” as text by means of a specific button.

Figure 3: Visualization of uccs.

Notice that attributes are identified within metadata through letters or numbers, i.e., aliases. This helps users focus on attributes and/or define new metadata. In fact, possibly long (or inappropriate) attribute names could confuse the user. However, inditio provides a suitable form showing the mapping between attribute names and their associated aliases (see Figure 2(c)).

All selected metadata can be validated by clicking on “Run Validation”, which triggers the execution of a validation module whose aim is to check whether the new tuples violate the selected metadata. According to the validation process, each metadata can be classified into one of the following categories:
• Valid Metadata, when the new tuples do not produce any violation;
• Not Valid Metadata, when the new tuples entail at least one violation; or
• Impossible to validate, when the metadata cannot be validated. This occurs when the user introduces errors in the metadata, such as when the attribute names do not exist in the considered database.

Example 4.1. Let us consider a database storing smartphone characteristics. Figures 2(e-g) show the validation results of the considered fds (Figure 2(d)) according to the new tuples the user is planning to insert (see Figure 2(a)). In particular, three out of six fds are valid, two are invalid, and one cannot be validated. In fact, K,M → D includes attribute M, which does not appear in the considered database. Instead, C → E, i.e., ram → display_ppi, is invalidated if the new tuples are inserted.

In general, the impact of the new tuples on the considered metadata is summarized by inditio in a new form, named report form, shown in Figure 4. This form graphically shows the percentage of validations/invalidations produced on the selected metadata by the tuples the user is planning to insert. Moreover, the report form ranks database attributes in descending order according to the number of invalidated metadata containing them. More specifically, the form presents this information according to the type of metadata, i.e., by splitting the invalidation information between LHS and RHS when considering fds (see Figure 4(a)).

Example 4.2. Figure 4(a) shows the fd validation report for the validation results represented in Figure 2. In particular, the form shows that invalidations affect 33% of the analyzed metadata. Moreover, among the attributes involved in the invalidated metadata, attribute C (i.e., ram) is involved in two invalidated fds, whereas attributes E (i.e., cpu) and D (i.e., display_ppi) are each involved in one invalidated fd. This could suggest verifying the values of attributes ram, cpu, and display_ppi in the new tuples.

4.1.2 Interacting with data insertions. inditio not only enables users to visualize the impact of new tuples on holding metadata, but it also permits them to interact with the new data. First of all, INSERT INTO statements can always be modified within the inditio interface (see Figure 2(a)), triggering subsequent validation processes with the modified tuples. As said above, inditio freezes the execution of INSERT INTO statements while verifying the possibility to correct the values being inserted so as not to invalidate holding metadata. Nevertheless, inditio always gives users the possibility to overlook possible violations of metadata and to force the execution of data insertion operations by means of the “Execute Statements” button (see Figure 2(a)). On the other hand, one of the main goals of inditio is to help users in correcting possible errors.

¹https://www.mysql.com/it/products/workbench/
²https://github.com/cciro94/MySQL-Workbench-Plugins
³https://youtu.be/u03Vftge8pA
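Putting the pieces together, the three-way classification computed when “Run Validation” is clicked can be sketched as follows (an illustrative Python sketch: the metadata dictionary layout and the function name are our assumptions, not inditio's actual API):

```python
def classify(schema_attrs, current_rows, new_rows, md):
    """Classify one metadata as Valid / Not Valid / Impossible to validate.

    `md` is a dict such as {"lhs": ["C"], "rhs": ["E"]} for an fd,
    or {"lhs": ["K", "E", "L"]} for a ucc (no "rhs" key).
    """
    referenced = set(md["lhs"]) | set(md.get("rhs", []))
    if not referenced <= set(schema_attrs):
        return "Impossible to validate"   # metadata mentions unknown attributes
    seen = {}
    for t in current_rows + new_rows:     # validate the would-be instance
        x = tuple(t[a] for a in md["lhs"])
        if "rhs" in md:                   # fd: same LHS must imply same RHS
            y = tuple(t[a] for a in md["rhs"])
            if seen.setdefault(x, y) != y:
                return "Not Valid Metadata"
        elif x in seen:                   # ucc: LHS combination must be unique
            return "Not Valid Metadata"
        else:
            seen[x] = True
    return "Valid Metadata"
```

On the smartphone example, a dependency mentioning a non-existent attribute such as M is reported as impossible to validate before any tuple is inspected, matching the third category above.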
To this end, after a validation process (i.e., by clicking “Run Validation”), a user can visualize the data yielding violations by interacting with the “Not Valid Metadata” form (see Figure 2(f)). More specifically, by clicking on any metadata in this form, inditio shows a violation detail form, as shown in Figure 5. In particular, concerning fds, apart from the details of the selected metadata, the violation form describes, for each LHS value combination involved in a violation: 𝑖) the value combination of the LHS, 𝑖𝑖) the corresponding distinct values found on the RHS, and 𝑖𝑖𝑖) their total number (see Figure 5(a)). Instead, concerning uccs, apart from the details of the selected metadata, the violation form describes, for each value combination involved in a value duplication: 𝑖) the value combination involved in the duplication, and 𝑖𝑖) the number of duplications (see Figure 5(b)). The latter should represent the functionality that drives users in accomplishing the best possible correction of errors.

Example 4.3. Figure 5(a) shows the violation details of the fd C → E (i.e., ram → display_ppi) according to the validation results represented in Figure 2. In particular, the form shows that four specific values of attribute C (i.e., ram), namely 1Gb, 2Gb, 3Gb, and 512mb, each imply different values of attribute E (i.e., display_ppi). Moreover, it is also possible to see that the value 2Gb is the one implying the highest number of distinct values. Instead, Figure 5(b) shows that for the ucc K,E,L (i.e., id_brand, display_ppi, id_os) there are six specific value combinations inducing duplicate values. In general, this form could suggest correcting the values of the new tuples on the attributes involved in the violated metadata, or excluding such metadata from the validation process.

Figure 4: Validation statistics provided by inditio. (a) A form showing validation statistics after the fd validation process. (b) A form showing validation statistics after the ucc validation process.

Figure 5: Violation details of inditio. (a) A form showing values that invalidate metadata after the fd validation process. (b) A form showing values that invalidate metadata after the ucc validation process.
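The bookkeeping behind these violation-detail views can be sketched as follows (an illustrative Python sketch; the function names are ours, not the plugin's):

```python
from collections import defaultdict

def fd_violation_details(rows, lhs, rhs):
    # For each LHS value combination, collect the distinct RHS values it implies;
    # combinations mapping to more than one RHS value are the violations shown.
    implied = defaultdict(set)
    for t in rows:
        implied[tuple(t[a] for a in lhs)].add(tuple(t[a] for a in rhs))
    return {x: ys for x, ys in implied.items() if len(ys) > 1}

def ucc_violation_details(rows, key_attrs):
    # Value combinations occurring more than once, with their duplication counts.
    counts = defaultdict(int)
    for t in rows:
        counts[tuple(t[a] for a in key_attrs)] += 1
    return {k: n for k, n in counts.items() if n > 1}

rows = [{"ram": "2Gb", "display_ppi": 401},
        {"ram": "2Gb", "display_ppi": 326},
        {"ram": "3Gb", "display_ppi": 326}]
# For ram -> display_ppi, only the LHS value 2Gb implies two distinct RHS values.
print(fd_violation_details(rows, ["ram"], ["display_ppi"]))
```

Ranking the attributes of the returned combinations by how many invalidated metadata contain them gives the ordering used by the report form.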
Moreover, questions from Q4 to Q20 are quantitative, and they have been measured through a Likert scale, ranging from 1, High School Diploma Bachelor Degree Master Degree Ph.D. mapping “Strongly disagree” response, to 5, mapping “Strongly Q2 81 5 4 1 agree” response. Finally, to further evaluate the effectiveness of both processes (with and without inditio) we measured the time required for completing the task and the number of errors. 1 = Sufficient 2 3 4 5 = Very Good Q3 5 8 49 23 5 5.2 Results and Discussion Q4 5 16 47 19 3 Figure 8 shows the results achieved from each participant while executing the assigned tasks with both the compared scenarios (without and with the tool). The results achieved from the tasks performed without the tool have been considered as the compar- Figure 6: Statistics concerning involved participants. ative baseline (purple line), while the results obtained with the Survey Alias Question Gender Background Q1 Q2 Qualification Q3 Level of knowledge of MySQL Q4 Level of knowledge of MySQL Workbench I completed the tasks quickly and easily Comparative Q5 Q6 The instructions for completing the tasks are clear and easy to read Q7 The metadata validation process has been simple Q8 The values that invalidated the metadata have been easy to find Q9 The tool is simple to use Q10 The tool is simple to learn Q11 I was able to retrieve back the process, whenever I made a mistake Q12 The tool shows the information very clearly Q13 The tool is pleasant to use Q14 The user interface is pleasant and informative Q15 The tool made transparent the execution of the underlying validation processes It was simple to understand the metadata syntax Final Q16 Q17 The tool presents all the features I expected Q18 I am in general satisfied about the tool Q19 In the future, I would like to use the tool Q20 The tool simplified the validation process with respect to the manual process Q21 What is your general impression about the tool? 
Q22 Do you have any improvements to suggest? Q23 Which feature of the tool did you like the least? Q24 Which feature of the tool did you like the most? Table 1: Questions proposed to participants. proposed tool were described by the bars. In particular, the plot simple to learn, and pleasant to use (see Figure 9: answers for Q10 highlights the improvement obtained by using inditio, e.g., a and Q13). Nevertheless, some work should be made to further value of 2 indicates that the results achieved with inditio are 2 improve the general usability of the plugin and the transparency times better than those achieved without it, whereas a value less of the validation process, according to answers to questions to than 1 indicates the opposite case. In general, it can be observed Q11, Q12, and Q15. that most of the participants performed better with inditio, even The open questions in the final questionnaire (see Q21-Q24 if satisfactory results have been achieved also without the use in Table 1) aimed at highlighting the strengths and weakness of the plugin. Concerning the time employed to complete the of inditio. In particular, concerning the general impressions assigned tasks, on average participants took thirty minutes with about the plugin (see Q21 in Table 1), many participants said that inditio, ranging from five to sixty-seven minutes, and forty-five inditio shows a simple and intuitive interface, in which the com- minutes without it, ranging from five to eighty-four minutes. In ponents appear well organized in the frame. Another part of them general, we noticed that the manual validation task times were expressed their opinion on the usefulness and efficiency of the particularly long, especially for users with less knowledge of MySQL. On the contrary, with inditio almost all users have reduced the time of the validation processes by more than 50%. 
Figure 7 depicts the box plots derived from the answers (on a Likert scale from 1 to 5) to questions Q5 to Q8. In particular, users answered the same questions after performing the tasks of each considered scenario (with and without the tool). A box plot shows the median (horizontal line), the interquartile range (box), and the largest and smallest observations (whiskers). By comparing the results achieved without the plugin (the first four box plots in Figure 7) against those achieved with the plugin (the remaining box plots), we can conclude that participants felt more comfortable and effective when working with inditio.

Figure 7: Comparative box plots showing the distribution of user answers (Q5–Q8) to the quantitative questionnaire, without and with inditio.

Figure 8: Distribution of the scores achieved by participants for each analyzed scenario (with and without inditio); y-axis: Improvement, x-axis: Person ID.

Concerning the quantitative questions in the final questionnaire, inditio obtained the general agreement of participants evaluating its usability and effectiveness (see Figure 9). In particular, according to the answers to Q20, the capability of simplifying the metadata validation process has been widely recognized for inditio. The latter has turned out to be comfortable and useful according to the answers to questions Q18 and Q19, and participants commented positively on the plugin, claiming that "The tool has certainly been successful in its intent, significantly speeding up the time for validating metadata".

Moreover, some of the most interesting comments have been provided by users that are less familiar with the research context and with MySQL Workbench. In fact, they reported that inditio improves the understanding of MySQL Workbench and shows that metadata can potentially allow them to extract further knowledge from data, which is often not clearly visible. Only a small part of the inexperienced users asserted that the plugin interface might initially be unclear, claiming that "Initially, the tool seems difficult to understand. Then, once I understand how to use it, it is very useful for checking the validity of fd and ucc". However, by practising with it, inditio allowed them to easily understand its features and to become familiar with its environment.

In addition, concerning the specific features that participants liked the most and the least (see Q22–Q23 in Table 1), experienced users greatly appreciated the functionality of identifying the values yielding the invalidation of metadata. In fact, some of them claimed that "One of the most interesting features is identifying the values that invalidate the metadata. This functionality could be directly integrated into the MySQL Workbench suite." Instead, inexperienced users showed interest in the report forms (see Figure 5) and in the simplicity with which the plugin integrates within the MySQL Workbench suite.

Although most of the comments were positive, we also investigated the features participants liked the least. Among them, many users highlighted that the Query Editor component appears small and does not clearly show statements. However, this is constrained by the graphical components available in MySQL Workbench. For this reason, we allow users to directly interact with the SQL editor of MySQL Workbench and to import their statements. Other participants suggested adding further reports to the interface in order to enhance their understanding of how data insertion statements affect metadata. Only a few users proposed extending the interface of inditio with new graphical components in order to improve the interaction with both the plugin and MySQL Workbench.

Finally, we asked users for suggestions on enhancing inditio (see Q24 in Table 1). To this end, some users suggested integrating new types of metadata, also allowing them to simultaneously validate multiple metadata. Other users suggested improving the integration with Linux-based systems. In fact, we found that some users of these operating systems tend to view some reports differently from users of Windows systems. However, this is due to compatibility problems between the technologies underlying MySQL Workbench and the different operating systems. In the future, these compatibility issues might be solved by new software versions.

In summary, the four open questions of the final questionnaire revealed that some participants remarked on some limitations of the inditio user interface. Moreover, they would like to receive more hints during the statement modification process, according to the validation results. Conversely, they positively judged the intuitiveness of the metadata validation and error detection processes. Moreover, they welcomed the tool and recognized its usefulness.

6 CONCLUSION

In this paper we presented inditio, a MySQL Workbench plugin enabling users to assess the quality of the data they plan to insert into a database. In particular, the plugin enables a user to observe how new tuples can affect the validation of some metadata, e.g., fds and uccs, assisting him/her in correcting them, if needed. We evaluated inditio by involving more than eighty participants in a user study, which demonstrated its usefulness and gave us the possibility to detect the main characteristics of the tool to be improved, mainly focusing on the improvement and/or extension of some visual components.
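The validation behaviour described above, i.e., checking whether a new tuple would invalidate an fd or a ucc over the current table instance, can be sketched in a few lines of Python. The snippet below is purely illustrative: the toy table, column names, and function signatures are our own assumptions, not inditio's actual implementation, which intercepts INSERT statements inside MySQL Workbench.

```python
def violates_ucc(rows, new_row, ucc):
    """A UCC is violated if an existing row shares all values on its columns."""
    key = tuple(new_row[c] for c in ucc)
    return any(tuple(r[c] for c in ucc) == key for r in rows)


def violates_fd(rows, new_row, lhs, rhs):
    """An FD lhs -> rhs is violated if an existing row agrees with the
    new row on every lhs column but differs on the rhs column."""
    lhs_key = tuple(new_row[c] for c in lhs)
    return any(
        tuple(r[c] for c in lhs) == lhs_key and r[rhs] != new_row[rhs]
        for r in rows
    )


# Toy table instance: rows represented as dicts keyed by column name.
table = [
    {"emp_id": 1, "zip": "84084", "city": "Fisciano"},
    {"emp_id": 2, "zip": "84131", "city": "Salerno"},
]

# A tuple the user is about to insert.
pending = {"emp_id": 3, "zip": "84084", "city": "Salerno"}

# The UCC {emp_id} still holds, but the FD zip -> city would be invalidated,
# since zip 84084 is already associated with a different city.
print(violates_ucc(table, pending, ["emp_id"]))      # False
print(violates_fd(table, pending, ["zip"], "city"))  # True
```

A check of this kind is what allows the tool to report, before the insertion is committed, the specific values that would invalidate a given dependency, so that the user can either fix the statement or force the insertion anyway.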
The potentiality of inditio opens up many possible future directions. In particular, we are currently working on enriching the set of metadata that can be taken into consideration, with particular emphasis on relaxed fds (rfds) [8]. Moreover, we would like to directly connect the plugin to incremental or continuous discovery algorithms [7], so that when the user decides to force insertions even in the presence of invalidated metadata, such algorithms could automatically update the set of holding metadata. This would implicitly represent a means to automatically analyze the significance of metadata while discovering them from actual data.

Figure 9: Distribution of participant answers to the quantitative questions (Q9–Q20) of the final questionnaire, on a scale from 1 (strongly disagree) to 5 (strongly agree).

REFERENCES
[1] Ziawasch Abedjan, Lukasz Golab, Felix Naumann, and Thorsten Papenbrock. 2018. Data Profiling. Synthesis Lectures on Data Management 10, 4 (2018), 1–154.
[2] Ibrahim Ahmed Al-Baltah, Abdul Azim Abd Ghani, Ghilan Mohammed Al-Gomaei, Fua'ad Hassan Abdulrazzak, and Abdulmonem Ali Al Kharusi. 2020. A scalable semantic data fusion framework for heterogeneous sensors data. Journal of Ambient Intelligence and Humanized Computing (2020), 1–20.
[3] Oracle Corporation and/or its affiliates. 2021. MySQL™ Workbench Reference Manual. https://dev.mysql.com/doc/workbench/en/. Last accessed: Jan 13th, 2021.
[4] Tobias Bleifuß, Sebastian Kruse, and Felix Naumann. 2017. Efficient denial constraint discovery with Hydra. Proceedings of the VLDB Endowment 11, 3 (2017), 311–323.
[5] Jens Bleiholder and Felix Naumann. 2009. Data fusion. ACM Computing Surveys (CSUR) 41, 1 (2009), 1–41.
[6] Bernardo Breve, Loredana Caruccio, Stefano Cirillo, Vincenzo Deufemia, and Giuseppe Polese. 2020. Visualizing dependencies during incremental discovery processes. In Proceedings of the Workshops of the EDBT/ICDT 2020 Joint Conference (CEUR Workshop Proceedings), Vol. 2578. CEUR-WS.org, Aachen, 1–8.
[7] Loredana Caruccio and Stefano Cirillo. 2020. Incremental discovery of imprecise functional dependencies. Journal of Data and Information Quality (JDIQ) 12, 4 (2020), 1–25.
[8] Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2016. Relaxed functional dependencies – A survey of approaches. IEEE Transactions on Knowledge and Data Engineering 28, 1 (2016), 147–165.
[9] Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2019. Visualization of (multimedia) dependencies from big data. Multimedia Tools and Applications 78, 23 (2019), 33151–33167.
[10] Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2020. Mining relaxed functional dependencies from data. Data Mining and Knowledge Discovery 34, 2 (2020), 443–477.
[11] IBM Corporation. 2021. IBM InfoSphere Information Server. https://www.ibm.com/analytics/information-server/. Last accessed: Feb 23rd, 2021.
[12] Wei Dai, Isaac Wardlaw, Yu Cui, Kashif Mehdi, Yanyan Li, and Jun Long. 2016. Data profiling technology of data governance regarding big data: review and rethinking. In Information Technology: New Generations. Springer International Publishing, Cham, 439–450.
[13] Wenfei Fan, Floris Geerts, and Xibei Jia. 2008. Semandaq: a data quality system based on conditional functional dependencies. Proceedings of the VLDB Endowment 1, 2 (2008), 1460–1463.
[14] Floris Geerts, Giansalvatore Mecca, Paolo Papotti, and Donatello Santoro. 2013. The LLUNATIC data-cleaning framework. Proceedings of the VLDB Endowment 6, 9 (2013), 625–636.
[15] Mazhar Hameed and Felix Naumann. 2020. Data preparation: A survey of commercial tools. SIGMOD Record 49, 3 (2020), 18–29.
[16] Ihab F. Ilyas and Xu Chu. 2019. Data Cleaning. ACM, New York, NY, United States.
[17] Sebastian Kruse, David Hahn, Marius Walter, and Felix Naumann. 2017. Metacrate: Organize and analyze millions of data profiles. In Proceedings of the 2017 ACM Conference on Information and Knowledge Management. Association for Computing Machinery, New York, NY, USA, 2483–2486.
[18] Jay Lee, Hung-An Kao, Shanhu Yang, et al. 2014. Service innovation and smart analytics for Industry 4.0 and big data environment. Procedia CIRP 16, 1 (2014), 3–8.
[19] Felix Naumann. 2014. Data profiling revisited. ACM SIGMOD Record 42, 4 (2014), 40–49.
[20] Thorsten Papenbrock, Tanja Bergmann, Moritz Finke, Jakob Zwiener, and Felix Naumann. 2015. Data profiling with Metanome. Proceedings of the VLDB Endowment 8, 12 (2015), 1860–1863.
[21] Thorsten Papenbrock and Felix Naumann. 2016. A hybrid approach to functional dependency discovery. In Proceedings of the 2016 International Conference on Management of Data. Association for Computing Machinery, New York, NY, USA, 821–833.
[22] Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. 2017. HoloClean: Holistic data repairs with probabilistic inference. Proceedings of the VLDB Endowment 10, 11 (2017), 1190–1201.
[23] Talend. 2021. Talend Studio. https://www.talend.com/products/data-quality/. Last accessed: Feb 23rd, 2021.