Imputation of Missing Values through Profiling Metadata

Bernardo Breve, Loredana Caruccio, Vincenzo Deufemia, Giuseppe Polese

University of Salerno, via Giovanni Paolo II, 132, Fisciano (SA), 84084, Italy

Italian Symposium on Advanced Database Systems, June 19-22, 2022

Among the several problems related to the management of database instances, missing values represent a crucial factor that can severely compromise the integrity and the meaningfulness of the data. Thus, the data imputation research field focuses its efforts on solutions for filling missing values with plausible candidates, while still preserving the overall semantic integrity that characterizes the database instance. To keep imputation times low while retaining high accuracy, the employment of metadata has made its way into research proposals. This discussion paper presents our effort in the definition of RENUVER, a novel data imputation algorithm relying on Relaxed Functional Dependencies (rfds) for identifying the value candidates that best guarantee the semantic integrity of data. Experimental results on real-world datasets highlighted the effectiveness of RENUVER in terms of both filling accuracy and imputation times, also compared to other well-known approaches.

Keywords: Data imputation, Profiling metadata, Relaxed Functional Dependencies, Data quality

1. Introduction

attributes, yielding an accurate and somewhat fast solution for the imputation of missing values within relational database instances. In fact, rfds are still widely considered for detecting and repairing many types of errors, such as duplicates, outliers, and constraint violations [4]. Thus, we made use of them for identifying suitable candidate values for replacing missing ones in the data imputation process. RENUVER exploits rfds for: i) identifying the candidate tuples useful for the imputation of missing values, ii) ranking candidate tuples based on their similarity with respect to the tuples containing missing values, and iii) evaluating each imputation to guarantee the semantic consistency of the whole dataset.

In particular, RENUVER generates candidate tuples and ranks them according to the rfds implying the attribute on which a value is missing. Moreover, the imputation strategy of RENUVER does not alter value consistency with respect to the values in the original dataset. Finally, RENUVER also exploits rfds to judge whether it is possible to impute a missing value, in order to preserve the integrity of data and to avoid the insertion of inconsistent information.

The effectiveness of RENUVER has been evaluated on real-world datasets¹ in terms of accuracy and execution time. In order to extract rfds, we relied on an existing rfd discovery algorithm [5], since the problem of discovering rfds is out of the scope of this paper. Moreover, we introduce a novel method for the automatic evaluation of data imputation results, which makes it possible to judge the imputed values even when they have different syntactic representations. Evaluation results demonstrate that RENUVER outperforms other data imputation approaches [6, 7, 8].

The paper is organized as follows: Section 2 provides preliminary notions on rfds. Section 3 introduces RENUVER's logic through the employment of rfds in the data imputation problem. An experimental evaluation measuring the effectiveness of RENUVER is presented in Section 4. Finally, conclusions and further research are reported in Section 5.

2. Preliminaries

Before describing how we approached the imputation problem through the employment of rfds, let us introduce some preliminary notions underlying our methodology.

Functional Dependency. Given a relational database schema $\mathcal{R}$, and $R = \{A_1, \ldots, A_m\}$ one of its relation schemas, and a tuple $t \in r$, we use $t[A_i]$, with $1 \le i \le m$, to denote the projection of $t$ onto $A_i$; similarly, for a set of attributes $X = \{A_{i_1}, \ldots, A_{i_k}\}$, with $1 \le k \le m$, $t[X] \in dom(A_{i_1}) \times \ldots \times dom(A_{i_k})$ represents the projection of $t$ onto $X$, also denoted with $\Pi_X(t)$. An fd on $\mathcal{R}$ is a statement $X \rightarrow Y$ ($X$ implies $Y$), with $X, Y \subseteq attr(R)$, such that, given an instance $r$ of $R$, $X \rightarrow Y$ is satisfied in $r$ if and only if for each pair of tuples $(t_1, t_2)$ in $r$, whenever $t_1[X] = t_2[X]$, then $t_1[Y] = t_2[Y]$. The sets of attributes $X$ and $Y$ are named Left-Hand-Side (LHS) and Right-Hand-Side (RHS) of the fd, respectively.

With respect to the fd definition, an rfd generalizes the comparison paradigm by including similarity/distance-based comparisons between tuple projections, and also admits the possibility for a dependency to hold only on a subset of tuples. The latter can be defined through either a coverage measure, quantifying the portion of the dataset on which a dependency holds, or a condition restricting the domain on which a dependency can hold [9]. Since the proposed approach exploits only rfds relying on a similarity/distance-based tuple comparison method, in what follows we provide only the definition of this type of rfds, known as rfdc. For a more general definition of rfd, see [9].

¹ https://github.com/DastLab/RENUVER-evaluation-datasets

rfdc. Given a relational database schema $\mathcal{R}$, and $R = \{A_1, \ldots, A_m\}$ one of its relation schemas, an rfdc $\varphi$ on $\mathcal{R}$ is denoted by

\[ X_{\Phi_1} \rightarrow Y_{\Phi_2} \tag{1} \]

where

• $X, Y \subseteq attr(R)$;

• $\Phi_1$ contains (for each attribute $A \in X$) a constraint $\phi[A]$ that can be used to determine whether pairs of tuples with values in $dom(A)$ are "similar" enough (likewise for each attribute $B \in Y$ with $\phi[B] \in \Phi_2$). More specifically, each $\phi[A]$ ($\phi[B]$, resp.) requires the specification of a similarity/distance function defined on the domain of $A$ ($B$, resp.), an operator, and a threshold setting the boundaries for the satisfaction of the constraint.

$\varphi$ holds on a relation instance $r$ (denoted by $r \models \varphi$) if and only if for each pair of tuples $(t_1, t_2) \in r$ for which $t_1[A]$ and $t_2[A]$ satisfy the constraint $\phi[A]$ for each $A \in X$, then $t_1[B]$ and $t_2[B]$ satisfy the constraint $\phi[B]$ for each $B \in Y$.

For the sake of simplicity, in the following, we apply a more compact notation for the constraints, showing only the operator and the numeric threshold associated with each attribute.

Example. Let us consider the sample relation shown in Table 1, derived from a database of restaurants in the USA. Within this database, each tuple represents a restaurant, providing information about its name, address, city, phone number, type of cuisine, and class. The latter is a numeric id associated with the type of cuisine. On this dataset, the following rfdc holds: Name(≤ 4) → Phone(≤ 1), which states that, if two restaurants have a similar name, then they also have a similar phone number. This should be true despite the names and/or the phone numbers of restaurants being written in different ways or using different abbreviations.
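To make the rfdc semantics concrete, the following minimal Python sketch checks whether a dependency such as Name(≤ 4) → Phone(≤ 1) holds on a small set of tuples. It assumes edit distance as the only distance function and ≤ as the only operator; the function names and the toy tuples are hypothetical and are not taken from Table 1 or from the RENUVER implementation.

```python
from itertools import combinations

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def similar(t1, t2, constraints):
    """True if t1 and t2 are within the threshold on every constrained attribute."""
    return all(edit_distance(t1[a], t2[a]) <= thr for a, thr in constraints.items())

def holds(rows, lhs, rhs):
    """Check an rfdc given as two {attribute: threshold} dicts (operator fixed to <=).

    The dependency holds if every pair of tuples that is similar on the LHS
    is also similar on the RHS.
    """
    return all(
        similar(t1, t2, rhs)
        for t1, t2 in combinations(rows, 2)
        if similar(t1, t2, lhs)
    )

# Hypothetical toy tuples (illustrative only, not the paper's data).
rows = [
    {"Name": "Citrus", "Phone": "213-857-0034"},
    {"Name": "Citrus.", "Phone": "213/857-0034"},
    {"Name": "Fenix", "Phone": "213-848-6677"},
]

print(holds(rows, {"Name": 4}, {"Phone": 1}))  # Name(<= 4) -> Phone(<= 1)
print(holds(rows, {"Name": 4}, {"Phone": 0}))  # stricter RHS threshold
```

The check is quadratic in the number of tuples, which is acceptable for an illustration but not for large instances.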

From a theoretical point of view, rfdcs permit the use of any type of similarity/distance function, e.g., edit distance, absolute difference, and so forth. However, these functions are usually inherited from those involved in the automatic rfdc discovery process [5]. For the scope of this proposal, without loss of generality, we can consider rfdcs with a single attribute on the RHS, and the associated constraint $\phi_2$. In particular, we considered $\phi_2$ composed of a distance function, the operator ≤, and a distance threshold.

A particular type of rfdc is the key-rfdc, which is defined in the following.

Key rfdc. Given a relation schema $R$, and an instance $r$ of $R$, an rfdc $\varphi: X_{\Phi_1} \rightarrow A_{\phi_2}$ is said to be key if and only if $\varphi$ holds on $r$ ($r \models \varphi$), but there is no pair of distinct tuples $(t_1, t_2) \in r$ for which $t_1[X]$ and $t_2[X]$ satisfy all the constraints in $\Phi_1$.
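As an illustration of the definition above, the sketch below tests whether an rfdc is key by looking only at its LHS: if no pair of distinct tuples is similar on the LHS, the dependency holds vacuously. It assumes numeric attributes compared by absolute difference; the function names, the `Zip` attribute, and the sample tuples are hypothetical.

```python
from itertools import combinations

def abs_diff(x, y):
    """Distance between two numeric values."""
    return abs(x - y)

def lhs_similar(t1, t2, lhs, dist=abs_diff):
    """True if the two tuples satisfy every LHS constraint (operator fixed to <=)."""
    return all(dist(t1[a], t2[a]) <= thr for a, thr in lhs.items())

def is_key(rows, lhs, dist=abs_diff):
    """An rfdc is key when no pair of distinct tuples is similar on its LHS."""
    return not any(
        lhs_similar(t1, t2, lhs, dist) for t1, t2 in combinations(rows, 2)
    )

# Hypothetical numeric tuples (illustrative only).
rows = [
    {"Class": 1, "Zip": 90001},
    {"Class": 6, "Zip": 90210},
    {"Class": 7, "Zip": 94105},
]

print(is_key(rows, {"Zip": 10}))   # no two Zip values are within 10 of each other
print(is_key(rows, {"Class": 1}))  # Class values 6 and 7 are within 1
```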

3. The RENUVER imputation approach

In this section, we formalize the data imputation problem by defining some of its underlying concepts, and then describe the basics of the proposed imputation approach. Let us start by defining the concept of missing value.

Missing value. Given a relation schema $R$, defined over a set of attributes $attr(R)$, an instance $r$ of $R$, an attribute $A \in attr(R)$, and a tuple $t \in r$, a missing value of tuple $t$ on the attribute $A$, denoted as $t[A] = \_$, is such that $t[A]$ is null.

Here, $r$ is said to be an incomplete instance, and $\hat{r} \subseteq r$ contains only incomplete tuples.

The general missing value imputation problem is formally defined as follows.

Missing value imputation problem. Given a relation schema $R$, and an instance $r$ of $R$, for every tuple $t \in r$ and every attribute $A \in attr(R)$ for which $t[A] = \_$, the imputation problem consists of finding a plausible value $v \in dom(A)$, such that the database instance $r'$ resulting from the imputation process does not contain inconsistent values.

A missing value imputation approach also requires the application of constraints for evaluating the consistency of values at the end of the imputation process. The proposed approach exploits rfds both to guarantee the verification of semantic consistency and to drive the search for meaningful candidates for all missing values.

Semantically consistent imputation. Given a relation schema $R$, defined over a set of attributes $attr(R)$, an instance $r$ of $R$, and a set of rfdcs, $\Sigma$, holding on $r$ ($r \models \Sigma$), an instance $r'$ of $R$ resulting from an imputation process over the instance $r$ is semantically consistent if $r' \models \Sigma$.

One of the possible strategies that could guarantee the semantic consistency of the imputation process is to find candidate values for $t[A] = \_$ by considering a set $T \subseteq r$ of plausible candidate tuples for imputing $t[A]$, such that $\forall t' \in T$, $t'[A] \neq \_$ and $t'$ is similar to $t$ on some attributes beyond $A$.

In what follows, we define the criteria used by RENUVER for deciding when a tuple can be considered a plausible candidate, which are based on rfdcs.

Plausible candidate tuple. Given a missing value $t[A] = \_$ over a database instance $r$ of a relation schema $R$, and an rfdc $\varphi: X_{\Phi_1} \rightarrow A_{\phi_2}$ holding on $r$, a tuple $t' \in r$ can be considered as a plausible candidate tuple for imputing $t[A]$ according to $\varphi$ if $t$ and $t'$ are similar according to the constraints in $\Phi_1$.
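The definition above can be sketched in Python as a filter over the instance. This is only an illustration under stated assumptions: missing values are represented by `None`, candidates whose own value on the target attribute is missing are discarded (as required for the candidate set in the preceding paragraph), and tuples with a missing LHS value are treated as not comparable, which is an assumption of this sketch. All names and sample tuples are hypothetical.

```python
MISSING = None  # stand-in for the "_" marker used in the text

def abs_diff(x, y):
    """Distance between two numeric values."""
    return abs(x - y)

def plausible_candidates(rows, target_idx, attr, lhs, dist=abs_diff):
    """Return the indexes of tuples that can supply a value for rows[target_idx][attr].

    lhs maps each LHS attribute of the rfdc to its distance threshold
    (operator fixed to <=).
    """
    target = rows[target_idx]
    if any(target[a] is MISSING for a in lhs):
        return []  # the target cannot be compared on this LHS (assumption)
    candidates = []
    for i, cand in enumerate(rows):
        if i == target_idx or cand[attr] is MISSING:
            continue  # a candidate must carry a value for the missing attribute
        if any(cand[a] is MISSING for a in lhs):
            continue
        if all(dist(target[a], cand[a]) <= thr for a, thr in lhs.items()):
            candidates.append(i)
    return candidates

# Hypothetical tuples; the rfdc used is Class(<= 0) -> Type(<= 5).
rows = [
    {"Class": 3, "Type": "american"},
    {"Class": 3, "Type": MISSING},
    {"Class": 5, "Type": "italian"},
    {"Class": 3, "Type": MISSING},
]

print(plausible_candidates(rows, 1, "Type", {"Class": 0}))
```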

The candidate tuple generation process performed according to the definition presented above has to be generalized in order to perform the imputation process on tuples containing more than one missing value, and for each $t \in \hat{r}$.

Missing value imputation for a tuple. Let $R$ be a relational schema defined over a set of attributes $attr(R)$, $r$ an instance of $R$, $t$ a tuple of $r$, $Z \subset attr(R)$ a set of attributes such that for each $A \in Z$, $t[A] = \_$, and $\Sigma$ a set of rfdcs holding on $r$. An imputation process for $t$ consists of selecting a plausible candidate tuple $t'$ for each $A \in Z$ such that $t[A] = \_$, so that $t[A]$ can be set equal to $t'[A]$.

However, when for a $t[A] = \_$ it is not possible to identify a plausible candidate tuple guaranteeing a semantically consistent imputation, it is better to leave $t[A]$ unimputed. Although this strategy has been widely applied in other approaches [7], it leads to another important issue that RENUVER deals with, i.e., minimizing the number of non-imputed values.

[Figure 1. Panels: a) Data pre-processing; b) RFDc selection; c) Imputing missing values.
a) rfdcs shown: Name(≤ 8), Phone(≤ 0), Class(≤ 1) → Type(≤ 0); Class(≤ 0) → Type(≤ 5); City(≤ 2) → Phone(≤ 2); Name(≤ 4) → Phone(≤ 1); Name(≤ 8), Phone(≤ 0) → City(≤ 9); Name(≤ 6), City(≤ 9) → Phone(≤ 0); Phone(≤ 1) → Class(≤ 0); ...
b) rfdcs selected for Phone: Name(≤ 6), City(≤ 9) → Phone(≤ 0); Name(≤ 4) → Phone(≤ 1); City(≤ 2) → Phone(≤ 2).
c) Candidate imputations for Phone obtained from the selected rfdcs, each checked against Phone(≤ 1) → Class(≤ 0) (violated / not violated).
Tuples shown (Name, City): t1 Granita, Malibu; t2 Chinos Main, LA; t3 Citrus, Los Angeles; t4 Citrus, Los Angeles; t5 Fenix, Hollywood; t6 Fenix Argyle, _; t7 C. Main, Los Angeles.]

² A deep overview of RENUVER, together with a more exhaustive evaluation, has been carried out in [3].

We show how the aforesaid definitions empower the imputation of a missing value in the Restaurant dataset, previously introduced. In detail, we can identify three major phases leading to the imputation of a given missing value, namely:

• Pre-processing: during this phase, missing values within a database instance are identified and isolated. Furthermore, RENUVER excludes all key-rfdcs from the set of rfdcs that can be employed for the imputation of any missing value (see Figure 1.a).

• rfdc selection: following the selection of a missing value to impute, during this phase RENUVER identifies all the rfdcs that can be useful for its imputation. These rfdcs are then organized into a set of clusters according to their threshold on the RHS (see Figure 1.b).

• Imputing missing values: during this phase, RENUVER performs a series of operations leading to the imputation of a missing value by retrieving the value from a set of plausible candidate tuples belonging to the same database instance (see Figure 1.c). In particular, RENUVER iteratively performs the following operations:

– generates a set of plausible candidate tuples that satisfy the LHS constraints of an rfdc belonging to one of the clusters previously generated;

– computes a distance value for each plausible candidate tuple with respect to the tuple having the missing value. The evaluation is performed by considering the LHS attributes of the selected rfdcs. Finally, the candidate tuple having the minimum distance is exploited for the imputation of the missing value;

– verifies whether the imputed value causes a violation of the holding rfdcs. In this case, RENUVER selects the next plausible candidate tuple with the lowest distance value.

These operations are repeated for each cluster as long as the imputation is not successful.
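The three phases described above can be sketched in Python as follows. This is a simplified illustration and not the RENUVER implementation: it assumes that clusters are visited from the tightest RHS threshold upward, that candidates are ranked by the mean distance over the LHS attributes, that pairs with a missing value are skipped during verification, and that key-rfdcs have already been filtered out. Every function name and every sample tuple is hypothetical.

```python
from itertools import combinations

MISSING = None  # stand-in for the "_" marker

def dist(x, y):
    """Absolute difference for numbers, Levenshtein distance for strings."""
    if isinstance(x, (int, float)) and isinstance(y, (int, float)):
        return abs(x - y)
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cx != cy)))
        prev = curr
    return prev[-1]

def within(t1, t2, constraints):
    """Thresholds met on every attribute; a missing value never matches (assumption)."""
    return all(
        t1[a] is not MISSING and t2[a] is not MISSING and dist(t1[a], t2[a]) <= thr
        for a, thr in constraints.items()
    )

def violates(rows, rfdcs):
    """True if some pair is similar on an LHS but not on the corresponding RHS."""
    for lhs, attr, thr in rfdcs:
        for t1, t2 in combinations(rows, 2):
            if MISSING in (t1[attr], t2[attr]):
                continue  # pairs with an unknown RHS value are not checked (assumption)
            if within(t1, t2, lhs) and dist(t1[attr], t2[attr]) > thr:
                return True
    return False

def impute(rows, idx, attr, rfdcs):
    """Try to fill rows[idx][attr]; return True on success, False if left missing."""
    usable = [r for r in rfdcs if r[1] == attr]  # rfdc selection
    # One cluster per RHS threshold, visited from the tightest one (assumed order).
    for thr in sorted({r[2] for r in usable}):
        ranked = []
        for lhs, _, _ in (r for r in usable if r[2] == thr):
            for j, cand in enumerate(rows):
                if j != idx and cand[attr] is not MISSING and within(rows[idx], cand, lhs):
                    score = sum(dist(rows[idx][a], cand[a]) for a in lhs) / len(lhs)
                    ranked.append((score, j))
        for _, j in sorted(ranked):  # closest candidate first
            rows[idx][attr] = rows[j][attr]
            if not violates(rows, rfdcs):
                return True
            rows[idx][attr] = MISSING  # roll back and try the next candidate
    return False

# Each rfdc is (LHS {attribute: threshold}, RHS attribute, RHS threshold).
rfdcs = [
    ({"Name": 4}, "Phone", 1),
    ({"City": 2}, "Phone", 2),
    ({"Phone": 1}, "Class", 0),
]
# Hypothetical tuples (illustrative only).
rows = [
    {"Name": "Fenix", "City": "Hollywood", "Phone": "213-848-6677", "Class": 5},
    {"Name": "Fenix Bar", "City": "Hollywood", "Phone": MISSING, "Class": 5},
    {"Name": "Citrus", "City": "Los Angeles", "Phone": "213-857-0034", "Class": 6},
]

print(impute(rows, 1, "Phone", rfdcs), rows[1]["Phone"])
```

If the first tuple had a different `Class` from the second one, copying its phone number would violate Phone(≤ 1) → Class(≤ 0); the sketch would then roll back and, having no other candidate, leave the value missing.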

4. Experimental Evaluation

In this section, we present a comparative evaluation of RENUVER w.r.t. other approaches exploiting different imputation strategies. In particular, we benchmarked RENUVER against a holistic machine learning-based approach, namely Holoclean [6] (considering its attention-based expansion module AimNet [10]), and a differential dependency-guided approach [7] named Derand, for which we employed the same rfdcs as RENUVER. All evaluations were performed under the same conditions on an iMac Pro with an 8-core CPU and 32GB RAM.

Datasets. The considered algorithms have been evaluated on two real-world datasets² in order to perform a stress test on RENUVER and all compared imputation approaches, aiming to determine their time and memory requirements. To this end, we stopped the executions exceeding 48 hours of execution time and/or 30GB of memory consumption.

Furthermore, in order to obtain an accurate comparison between the imputed values and the expected ones, missing values have been artificially injected at random. Moreover, to avoid an arrangement of missing values favoring one algorithm, for each missing rate we produced five different datasets, yielding a total of twenty-five variants of the same dataset. The metrics adopted for the comparison are then averaged over each missing rate.

Evaluation metrics. The effectiveness of the data imputation approaches has been evaluated by considering three different metrics: precision, recall, and F1-measure, which can be formally defined as:

\[ \text{precision} = \frac{|\mathit{true} \cap \mathit{imputed}|}{|\mathit{imputed}|} \qquad \text{recall} = \frac{|\mathit{true} \cap \mathit{missing}|}{|\mathit{missing}|} \qquad \text{F1-measure} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \]

where $\mathit{true}$ represents the correctly imputed missing values at the end of the imputation process, $\mathit{imputed}$ represents all the imputed missing values, and $\mathit{missing}$ the missing values in the dataset.

registered the best performances on all the considered qualitative metrics.
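As a small illustration of the precision, recall, and F1-measure defined in the Evaluation metrics paragraph, the following Python sketch computes them from the injected cells and the cells an algorithm filled. It assumes that an imputed value is correct only when it is exactly equal to the expected one; it does not reproduce the evaluation method that tolerates different syntactic representations. The function name and the sample cells are hypothetical.

```python
def imputation_scores(truth, imputed):
    """Compute (precision, recall, F1-measure) for one imputation run.

    truth:   {cell: expected value} for every artificially injected missing cell
    imputed: {cell: value} for the cells the algorithm actually filled
    A cell is any hashable identifier, e.g. a (row index, attribute) pair.
    """
    # Exact equality is an assumption of this sketch.
    correct = sum(1 for cell, value in imputed.items() if truth.get(cell) == value)
    precision = correct / len(imputed) if imputed else 0.0
    recall = correct / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical run: four injected cells, three imputed, two of them correctly.
truth = {(0, "Phone"): "111", (1, "City"): "LA", (2, "Type"): "thai", (3, "Class"): 4}
imputed = {(0, "Phone"): "111", (1, "City"): "LA", (2, "Type"): "french"}

precision, recall, f1 = imputation_scores(truth, imputed)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Leaving a value unimputed lowers recall but not precision, which is why the two metrics are reported together with their harmonic mean.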

The second evaluation session focuses on the Physician dataset, fixing the missing rate and varying the number of tuples to be considered. This dataset is particularly complex to analyze, since it also contains a high number of attributes (i.e., 13 attributes). In fact, this dataset allowed us to reach a time and/or memory limit for all considered approaches (i.e., RENUVER, Derand, and Holoclean), as shown in Table 2. In particular, we can notice that, on average, both RENUVER and Holoclean registered faster execution times than Derand. In fact, the latter exceeds the time limit of 48h on the datasets having 2072 and 10359 tuples. On the other hand, Holoclean manages to achieve reasonable execution times, but the huge amount of consumed memory makes it exceed the 30GB memory limit on the dataset having 10359 tuples. Finally, RENUVER also exceeds the time limit on the largest dataset, despite a more reasonable memory consumption. This evaluation session proved the capability of RENUVER to outperform the compared approaches on the considered qualitative metrics. It also emphasized that Derand's execution times are strongly dependent on the number of missing values, whereas Holoclean, although it provided overall faster execution times, proved to be heavily memory-consuming.

5. Conclusion

In this paper, we proposed RENUVER, a data imputation algorithm that exploits relaxed functional dependencies. The latter enable RENUVER to select and evaluate tuple candidates to be used during the imputation process. The whole imputation process preserves the semantic consistency of the data, by guaranteeing that no imputation can violate any rfdc. Evaluation results demonstrated that RENUVER outperforms recent approaches using different imputation strategies: machine learning-based (Holoclean) and dependency-based (Derand).

In the future, we would like to extend RENUVER with the possibility of selecting plausible candidate tuples among multiple datasets. Finally, we would like to study the applicability of RENUVER to incremental scenarios, such as those related to the imputation of time series [11], which would require the usage of incremental rfdc discovery algorithms [12, 13].

References

[1] M. V. Martinez, Molinaro, Grant, Subrahmanian, Customized policies for handling partial information in relational databases, IEEE Transactions on Knowledge and Data Engineering 25 (2012) 1254-1271.

[2] Montesdeoca, Luengo, Maillo, García-Gil, García, Herrera, A first approach on big data missing values imputation, in: Proceedings of 5th International Conference on Internet of Things, Big Data and Security (IoTBDS), SciTePress, 2019, pp. 315-323.

[3] Breve, Caruccio, Deufemia, G. Polese, RENUVER: A missing value imputation algorithm based on relaxed functional dependencies, in: To appear in Proceedings of the 25th International Conference on Extending Database Technology (EDBT), OpenProceedings.org, 2022.

[4] I. F. Ilyas, Chu, et al., Trends in cleaning relational data: consistency and deduplication, Foundations and Trends® in Databases 5 (2015) 281-393.

[5] Caruccio, Deufemia, Naumann, G. Polese, Discovering relaxed functional dependencies based on multi-attribute dominance, IEEE Transactions on Knowledge and Data Engineering 33 (2021) 3212-3228.

[6] Rekatsinas, Chu, I. F. Ilyas, Ré, Holoclean: holistic data repairs with probabilistic inference, Proceedings of VLDB Endowment 10 (2017) 1190-1201.

[7] Song, Sun, Zhang, L. Chen, Wang, Enriching data imputation under similarity rule constraints, IEEE Transactions on Knowledge and Data Engineering 32 (2020) 275-287.

[8] C.-C. Huang, H.-M. Lee, A grey-based nearest neighbor approach for missing attribute value prediction, Applied Intelligence 20 (2004) 239-252.

[9] Caruccio, Deufemia, G. Polese, Relaxed functional dependencies-A survey of approaches, IEEE Transactions on Knowledge and Data Engineering 28 (2016) 147-165.

[10] Wu, Zhang, I. Ilyas, T. Rekatsinas, Attention-based learning for missing data imputation in holoclean, Proceedings of Machine Learning and Systems 2 (2020) 307-325.

[11] Khayati, Lerner, Tymchenko, Cudré-Mauroux, Mind the gap: An experimental evaluation of imputation of missing values techniques in time series, Proceedings VLDB Endowment 13 (2020) 768-782.

[12] Caruccio, Cirillo, Deufemia, G. Polese, Incremental discovery of functional dependencies with a bit-vector algorithm, in: Proceedings of Italian Symposium on Advanced Database Systems, volume 2400 of SEBD '19, CEUR-WS.org, 2019, pp. 1-12.

[13] Caruccio, Cirillo, Incremental discovery of imprecise functional dependencies, Journal of Data and Information Quality (JDIQ) 12 (2020) 1-25.