Imputation of Missing Values through Profiling Metadata

Bernardo Breve, Loredana Caruccio, Vincenzo Deufemia, Giuseppe Polese

University of Salerno, via Giovanni Paolo II, 132, Fisciano (SA), 84084, Italy

Keywords: Data imputation, Profiling metadata, Relaxed Functional Dependencies, Data quality

Among the several problems related to the management of database instances, missing values represent a crucial factor that could severely compromise the integrity and the meaningfulness of such data representations. Thus, the data imputation research field focuses its efforts on solutions for filling missing values with plausible candidates, while still preserving the overall semantic integrity of the database instance. To keep imputation times low while maintaining high accuracy, the use of metadata has made its way into research proposals. This discussion paper presents our effort in the definition of RENUVER, a novel data imputation algorithm relying on Relaxed Functional Dependencies (rfds) for identifying the value candidates that best guarantee the semantic integrity of data. Experimental results on real-world datasets highlighted the effectiveness of RENUVER in terms of both filling accuracy and imputation times, also compared to other well-known approaches.

Introduction

With the advent of big data, the presence of missing values inside database instances has been widely recognized as a complex problem to handle, especially for Relational Database Management Systems [1]. Moreover, several application contexts might require the absence of this data quality issue inside their datasets. For instance, machine learning processes could not provide good accuracy scores if trained on data with many missing values. In general, it is not possible to infer reliable knowledge using datasets with incomplete information [2].

The identification of the best values in a dataset to impute the missing ones is an extremely complex task, since it entails the evaluation of all possible combinations in the value distribution. Most of the approaches proposed in the literature focus on maximizing the number of imputed values, overshadowing the accuracy of single imputations. This discussion paper presents the data imputation algorithm proposed in [3], namely RENUVER (Rfd basEd NUll ValuE Repairer), which relies on Relaxed Functional Dependencies (rfds) for imputing missing values within a relational database instance. By adopting rfds as metadata for supporting the imputation process, we are able to perform a broader analysis of the correlations among attributes, yielding an accurate and reasonably fast solution for the imputation of missing values within relational database instances. In fact, rfds are widely considered for detecting and repairing many types of errors, such as duplicates, outliers, and constraint violations [4]. Thus, we made use of them for identifying suitable candidate values for replacing missing ones in the data imputation process. RENUVER exploits rfds for: i) identifying the candidate tuples useful for the imputation of missing values, ii) ranking candidate tuples based on their similarity with respect to the tuples containing missing values, and iii) evaluating each imputation to guarantee the semantic consistency of the whole dataset.

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy. Contact: bbreve@unisa.it (B. Breve); lcaruccio@unisa.it (L. Caruccio); deufemia@unisa.it (V. Deufemia); gpolese@unisa.it (G. Polese). ORCID: 0000-0002-3898-7512 (B. Breve); 0000-0002-2418-1606 (L. Caruccio); 0000-0002-6711-3590 (V. Deufemia); 0000-0002-8496-2658 (G. Polese).

In particular, RENUVER generates candidate tuples and ranks them according to the rfds implying the attribute on which a value is missing. Moreover, the imputation strategy of RENUVER does not alter value consistency with respect to the values in the original dataset. Finally, RENUVER also exploits rfds to judge whether it is possible to impute a missing value at all, in order to preserve the integrity of data and to avoid the insertion of inconsistent information.

The effectiveness of RENUVER has been evaluated on real-world datasets¹ in terms of accuracy and execution time. In order to extract rfds, we relied on an existing rfd discovery algorithm [5], since the problem of discovering rfds is out of the scope of this paper. Moreover, we introduce a novel method for the automatic evaluation of data imputation results, which makes it possible to judge the imputed values even when they have different syntactical representations. Evaluation results demonstrate that RENUVER outperforms other data imputation approaches [6,7,8].

The paper is organized as follows: Section 2 provides preliminary notions on rfds. Section 3 introduces RENUVER's logic through the employment of rfds in the data imputation problem. An experimental evaluation measuring the effectiveness of RENUVER is presented in Section 4. Finally, conclusions and further research are reported in Section 5.

Preliminaries

Before describing how we approached the imputation problem through the employment of rfds, let us introduce some notions that are preliminary to our methodology.

Functional Dependency. Given a relational database schema ℛ, a relation schema 𝑅 = {𝐴₁, . . . , 𝐴ₘ} of ℛ, an instance 𝑟 of 𝑅, and a tuple 𝑡 ∈ 𝑟, we use 𝑡[𝐴ᵢ], with 1 ≤ 𝑖 ≤ 𝑚, to denote the projection of 𝑡 onto 𝐴ᵢ; similarly, for a set of attributes 𝑋 = {𝐴_{𝑖₁}, . . . , 𝐴_{𝑖ₖ}}, with 1 ≤ 𝑘 ≤ 𝑚, 𝑡[𝑋] ∈ 𝑑𝑜𝑚(𝐴_{𝑖₁}) × . . . × 𝑑𝑜𝑚(𝐴_{𝑖ₖ}) represents the projection of 𝑡 onto 𝑋, also denoted with Π_𝑋(𝑡). An fd on ℛ is a statement 𝑋 → 𝑌 (𝑋 implies 𝑌), with 𝑋, 𝑌 ⊆ 𝑎𝑡𝑡𝑟(𝑅), such that 𝑋 → 𝑌 is satisfied in 𝑟 if and only if for each pair of tuples (𝑡₁, 𝑡₂) in 𝑟, whenever 𝑡₁[𝑋] = 𝑡₂[𝑋], then 𝑡₁[𝑌] = 𝑡₂[𝑌].

The sets of attributes 𝑋 and 𝑌 are named Left-Hand-Side (LHS) and Right-Hand-Side (RHS) of the fd, respectively.
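The fd definition above can be checked mechanically by comparing every pair of tuples. The following is a minimal sketch in Python, not part of RENUVER itself; the function name `fd_holds` and the three sample rows (loosely based on the Restaurant example used later in the paper) are illustrative assumptions.

```python
from itertools import combinations

def fd_holds(rows, lhs, rhs):
    """Return True if the fd lhs -> rhs is satisfied by every pair of rows."""
    for t1, t2 in combinations(rows, 2):
        same_lhs = all(t1[a] == t2[a] for a in lhs)
        same_rhs = all(t1[a] == t2[a] for a in rhs)
        if same_lhs and not same_rhs:
            return False  # equal on X but different on Y: the fd is violated
    return True

# Illustrative rows (assumption: a small, complete sample).
rows = [
    {"Name": "Granita", "Type": "Californian", "Class": 6},
    {"Name": "Citrus", "Type": "Californian", "Class": 6},
    {"Name": "Chinois Main", "Type": "French", "Class": 5},
]

type_implies_class = fd_holds(rows, ["Type"], ["Class"])
class_implies_name = fd_holds(rows, ["Class"], ["Name"])
print(type_implies_class, class_implies_name)
```

On this sample, Type → Class is satisfied, whereas Class → Name is not, because two restaurants share the same class but have different names.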

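With respect to the fd definition, an rfd generalizes the comparison paradigm by including similarity/distance-based comparisons between tuple projections, and by admitting the possibility for a dependency to hold only on a subset of tuples. The latter can be defined through either a coverage measure, quantifying the portion of the dataset on which a dependency holds, or a condition restricting the domain on which a dependency can hold [9]. Since the proposed approach exploits only rfds relying on a similarity/distance-based tuple comparison method, in what follows we provide only the definition of this type of rfd, known as rfd_c. For a more general definition of rfd, see [9].

rfd_c. Given a relational database schema ℛ and one of its relation schemas 𝑅 = {𝐴₁, . . . , 𝐴ₘ}, an rfd_c 𝜙 on ℛ

𝑋_Φ₁ → 𝑌_Φ₂    (1)

where

• 𝑋, 𝑌 ⊆ 𝑎𝑡𝑡𝑟(𝑅);

• Φ₁ contains (for each attribute 𝑋ᵢ ∈ 𝑋) a constraint 𝜑ᵢ[𝑋ᵢ] that can be used to determine whether pairs of tuples with values in 𝑑𝑜𝑚(𝑋ᵢ) are "similar" enough (likewise for each attribute 𝑌ⱼ ∈ 𝑌 with 𝜑ⱼ[𝑌ⱼ] ∈ Φ₂). More specifically, each 𝜑ᵢ[𝑋ᵢ] (𝜑ⱼ[𝑌ⱼ], resp.) requires the specification of a similarity/distance function defined on the domain of 𝑋ᵢ (𝑌ⱼ, resp.), an operator, and a threshold setting the boundaries for the satisfaction of the constraint;

holds on a relation instance 𝑟 (denoted by 𝑟 |= 𝜙) if and only if for each pair of tuples (𝑡₁, 𝑡₂) ∈ 𝑟 for which 𝑡₁[𝑋] and 𝑡₂[𝑋] satisfy the constraint 𝜑ᵢ[𝑋ᵢ] for each 𝑋ᵢ ∈ 𝑋, 𝑡₁[𝑌] and 𝑡₂[𝑌] satisfy the constraint 𝜑ⱼ[𝑌ⱼ] for each 𝑌ⱼ ∈ 𝑌.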

For the sake of simplicity, in the following we apply a more compact notation for the constraints, showing only the operator and the numeric threshold associated with each attribute.

Example. Let us consider the sample relation shown in Table 1, derived from a database of restaurants in the USA. Within this database, each tuple represents a restaurant, providing information about its name, address, city, phone number, type of cuisine, and class. The latter is a numeric id associated with the type of cuisine. On this dataset, the following rfd_c holds: Name(≤4) → Phone(≤1), which states that, if two restaurants have similar names, then they also have similar phone numbers. This should be true despite the names and/or the phone numbers of restaurants being written in different ways or using different abbreviations.
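To make the pairwise semantics of a distance-based dependency concrete, the following Python sketch checks whether such a dependency holds on the Table 1 sample. It rests on assumptions that the paper does not fix: edit distance is used for strings and absolute difference for numbers, and pairs in which a compared value is missing are simply skipped. The helper names (`edit_distance`, `rfd_holds`) are hypothetical.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance, computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def distance(a, b):
    """Assumption: edit distance for strings, absolute difference for numbers."""
    return edit_distance(a, b) if isinstance(a, str) else abs(a - b)

def satisfies(t1, t2, constraints):
    """True if the pair meets every 'distance <= threshold' constraint."""
    return all(distance(t1[a], t2[a]) <= thr for a, thr in constraints.items())

def rfd_holds(rows, lhs, rhs):
    """lhs and rhs map attribute names to distance thresholds."""
    attrs = list(lhs) + list(rhs)
    for t1, t2 in combinations(rows, 2):
        if any(t1[a] is None or t2[a] is None for a in attrs):
            continue  # assumption: pairs with missing values are not compared
        if satisfies(t1, t2, lhs) and not satisfies(t1, t2, rhs):
            return False
    return True

# Phone, Type and Class columns of the Table 1 sample (None marks a missing value).
rows = [
    {"Phone": "310/456-0488", "Type": "Californian", "Class": 6},
    {"Phone": "310-392-9025", "Type": "French", "Class": 5},
    {"Phone": "213/857-0034", "Type": "Californian", "Class": 6},
    {"Phone": None, "Type": "Californian", "Class": 6},
    {"Phone": "213/848-6677", "Type": None, "Class": 5},
    {"Phone": "213/848-6677", "Type": "French (new)", "Class": 5},
    {"Phone": None, "Type": "French", "Class": 5},
]

phone_class = rfd_holds(rows, {"Phone": 1}, {"Class": 0})  # Phone(<=1) -> Class(<=0)
class_type = rfd_holds(rows, {"Class": 0}, {"Type": 0})    # Class(<=0) -> Type(<=0)
print(phone_class, class_type)
```

Under these assumptions, Phone(≤1) → Class(≤0) holds on the sample, while the stricter Class(≤0) → Type(≤0) does not, since "French" and "French (new)" share the same class but differ in type. Relaxing the RHS threshold is exactly what allows a dependency to tolerate such syntactic variations.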

From a theoretical point of view, rfd_cs allow the use of any type of similarity/distance function, e.g., edit distance, absolute difference, and so forth. However, these functions are usually inherited from those involved in the automatic rfd_c discovery process [5]. For the scope of this proposal, without loss of generality, we consider rfd_cs with a single attribute on the RHS and the associated constraint 𝜑₂. In particular, we consider 𝜑₂ to be composed of a distance function, the operator ≤, and a distance threshold.

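A particular type of rfd_c is the key-rfd_c, which is defined in the following.

Key rfd_c. Given a relation schema 𝑅, and an instance 𝑟 of 𝑅, an rfd_c 𝜙 : 𝑋_Φ₁ → 𝐴_𝜑₂ is said to be key if and only if 𝜙 holds on 𝑟 (𝑟 |= 𝜙), but there is no pair of distinct tuples (𝑡₁, 𝑡₂) ∈ 𝑟 for which 𝑡₁[𝑋] and 𝑡₂[𝑋] satisfy all the constraints in Φ₁[𝑋].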
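A key rfd_c is one whose LHS constraints are satisfied by no pair of distinct tuples, so it holds only vacuously. The check can be sketched as follows; this is an illustration under the assumption that all LHS attributes are strings compared by edit distance, and the helper name `is_key` is hypothetical.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance, computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def is_key(rows, lhs):
    """True if no pair of distinct tuples satisfies all LHS constraints.

    lhs maps attribute names to distance thresholds; tuples with a missing
    LHS value are treated as not similar (an assumption of this sketch).
    """
    for t1, t2 in combinations(rows, 2):
        if all(t1[a] is not None and t2[a] is not None
               and edit_distance(t1[a], t2[a]) <= thr
               for a, thr in lhs.items()):
            return False
    return True

# Names of tuples t1, t2, t3 and t5 of the Table 1 sample.
rows = [{"Name": n} for n in ["Granita", "Chinois Main", "Citrus", "Fenix"]]
key_before = is_key(rows, {"Name": 4})

# Adding t4, a second "Citrus", creates a pair that is similar on Name.
rows.append({"Name": "Citrus"})
key_after = is_key(rows, {"Name": 4})
print(key_before, key_after)
```

On the four distinct names, any dependency with LHS Name(≤4) would be key; once the second "Citrus" tuple is added, it no longer is.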

The RENUVER imputation approach

In this section, we formalize the data imputation problem by defining some of its underlying concepts, and then describe the basics of the proposed imputation approach. Let us start by defining the concept of missing value.

Missing value. Given a relation schema 𝑅 defined over a set of attributes 𝑎𝑡𝑡𝑟(𝑅), an instance 𝑟 of 𝑅, an attribute 𝐴 ∈ 𝑎𝑡𝑡𝑟(𝑅), and a tuple 𝑡 ∈ 𝑟, a missing value of tuple 𝑡 on the attribute 𝐴, denoted as 𝑡[𝐴] = _, is such that 𝑡[𝐴] is null.

Here, 𝑟 is said to be an incomplete instance, and 𝑟̂ ⊆ 𝑟 contains only incomplete tuples. The general missing value imputation problem is formally defined as follows.

Missing value imputation problem. Given a relation schema 𝑅, and an instance 𝑟 of 𝑅, for every tuple 𝑡 ∈ 𝑟 and every attribute 𝐴 ∈ 𝑎𝑡𝑡𝑟(𝑅) for which 𝑡[𝐴] = _, the imputation problem consists of finding a plausible value 𝑎 ∈ 𝑑𝑜𝑚(𝐴), such that the database instance 𝑟 ′ resulting from the imputation process does not contain inconsistent values.

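A missing value imputation approach also requires the application of constraints for evaluating the consistency of values at the end of the imputation process. The proposed approach exploits rfds both to verify semantic consistency and to drive the search for meaningful candidates for all missing values.

Semantically consistent imputation. Given a relation schema 𝑅 defined over a set of attributes 𝑎𝑡𝑡𝑟(𝑅), an instance 𝑟 of 𝑅, and a set of rfd_cs, Σ, holding on 𝑟 (𝑟 |= Σ), an instance 𝑟′ of 𝑅 resulting from an imputation process 𝐼 over the instance 𝑟, denoted as 𝑟′ = 𝐼(𝑟), is semantically consistent iff 𝑟′ |= Σ.

One of the possible strategies that could guarantee the semantic consistency of the imputation process is to find candidate values for 𝑡[𝐴] = _ by considering a set 𝑇_candidate ⊆ 𝑟 of plausible candidate tuples for imputing 𝑡[𝐴], such that ∀𝑡ₖ ∈ 𝑇_candidate, 𝑡ₖ[𝐴] ≠ _ and 𝑡ₖ is similar to 𝑡 on some attributes beyond 𝐴. In what follows, we define the criterion used by RENUVER for deciding when a tuple can be considered a plausible candidate, which is based on rfd_cs.

Plausible candidate tuple. Given a missing value 𝑡[𝐴] = _ over a database instance 𝑟 of a relation schema 𝑅, and an rfd_c 𝜙 : 𝑋_Φ₁ → 𝐴_𝜑₂ holding on 𝑟, a tuple 𝑡′ ∈ 𝑟 can be considered a plausible candidate tuple for imputing 𝑡[𝐴] according to 𝜙 iff 𝑡 and 𝑡′ are similar according to the constraints in Φ₁.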
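The selection of plausible candidate tuples can be sketched as a filter over the instance: a tuple qualifies if it has a value on the target attribute and is similar to the incomplete tuple on every LHS attribute of the chosen dependency. The Python sketch below is illustrative only; it assumes edit distance on string attributes, uses the Table 1 sample, and borrows two dependencies listed in Figure 1. The function name `plausible_candidates` is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def plausible_candidates(rows, t, attr, lhs):
    """Tuples with a value on attr that are similar to t on every LHS attribute.

    lhs maps attribute names to distance thresholds. Tuples with a missing
    LHS value are never considered similar (an assumption of this sketch).
    """
    result = []
    for c in rows:
        if c is t or c[attr] is None:
            continue
        if all(t[a] is not None and c[a] is not None
               and edit_distance(t[a], c[a]) <= thr
               for a, thr in lhs.items()):
            result.append(c)
    return result

# Name, City and Phone columns of the Table 1 sample (None marks a missing value).
rows = [
    {"id": "t1", "Name": "Granita", "City": "Malibu", "Phone": "310/456-0488"},
    {"id": "t2", "Name": "Chinois Main", "City": "LA", "Phone": "310-392-9025"},
    {"id": "t3", "Name": "Citrus", "City": "Los Angeles", "Phone": "213/857-0034"},
    {"id": "t4", "Name": "Citrus", "City": "Los Angeles", "Phone": None},
    {"id": "t5", "Name": "Fenix", "City": "Hollywood", "Phone": "213/848-6677"},
    {"id": "t6", "Name": "Fenix Argyle", "City": None, "Phone": "213/848-6677"},
    {"id": "t7", "Name": "C. Main", "City": "Los Angeles", "Phone": None},
]
t7 = rows[6]

# City(<=2) -> Phone(<=2)
by_city = [c["id"] for c in plausible_candidates(rows, t7, "Phone", {"City": 2})]
# Name(<=6), City(<=9) -> Phone(<=0)
by_name_city = [c["id"] for c in
                plausible_candidates(rows, t7, "Phone", {"Name": 6, "City": 9})]
print(by_city, by_name_city)
```

With City(≤2) → Phone(≤2), only 𝑡₃ qualifies for 𝑡₇[Phone], because 𝑡₄ shares the city but has no phone number. With the looser LHS of Name(≤6), City(≤9) → Phone(≤0), 𝑡₂ also becomes a candidate.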

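The candidate tuple generation process, performed according to the definition presented above, has to be generalized in order to perform the imputation process on tuples containing more than one missing value, and for each 𝑡 ∈ 𝑟̂.

Missing value imputation for a tuple. Let 𝑅 be a relation schema defined over a set of attributes 𝑎𝑡𝑡𝑟(𝑅), 𝑟 an instance of 𝑅, 𝑡 a tuple of 𝑟, 𝑍 ⊂ 𝑎𝑡𝑡𝑟(𝑅) a set of attributes such that 𝑡[𝐴] = _ for each 𝐴 ∈ 𝑍, and Σ a set of rfd_cs holding on 𝑟. An imputation process for 𝑡 consists of selecting a plausible candidate tuple 𝑡ⱼ for each 𝐴 ∈ 𝑍, so that 𝑡[𝐴] can be set equal to 𝑡ⱼ[𝐴].

However, when it is not possible to identify a plausible candidate tuple guaranteeing a semantically consistent imputation for some 𝑡[𝐴] = _, it is better to leave 𝑡[𝐴] unimputed. Although this strategy has been widely applied in other approaches [7], it leads to another important issue that RENUVER deals with, i.e., minimizing the number of non-imputed values. Figure 1 summarizes the imputation logic of RENUVER² through an example. In particular, we show how the aforesaid definitions enable the imputation of a missing value in the Restaurant dataset introduced previously. In detail, we can identify three major phases leading to the imputation of a given missing value: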

• Pre-processing: during this phase, missing values within a database instance are identified and isolated. Furthermore, RENUVER excludes all key-rfd_cs from the set of rfd_cs that can be employed for the imputation of any missing value (see Figure 1.a).

• rfd_c selection: following the selection of a missing value to impute, during this phase RENUVER identifies all the rfd_cs that can be useful for its imputation. The rfd_cs are then organized into a set of clusters according to their threshold on the RHS (see Figure 1.b).

• Imputing missing values: during this phase, RENUVER performs a series of operations leading to the imputation of a missing value by retrieving the value from a set of plausible candidate tuples belonging to the same database instance (see Figure 1.c). In particular, RENUVER iteratively performs the following operations:

  1. It generates a set of plausible candidate tuples that satisfy the LHS constraints of an rfd_c belonging to one of the clusters previously generated.

  2. It computes a distance value for each plausible candidate tuple with respect to the tuple having the missing value. The evaluation is performed by considering the LHS attributes of the selected rfd_c. The candidate tuple having the minimum distance is then exploited for the imputation of the missing value.

  3. It verifies whether the imputed value causes a violation of any holding rfd_c. In this case, RENUVER selects the next plausible candidate tuple with the lowest distance value.

  These operations are repeated for each cluster as long as the imputation is not successful.
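The three phases above can be sketched as a single loop. The following Python code is a simplified illustration under stated assumptions, not the authors' implementation: clusters are visited in ascending order of RHS threshold, the distance of a candidate is the mean of its per-attribute distances on the LHS, edit distance is used for strings and absolute difference for numbers, and all names (`impute`, `violates`, `similar`) are hypothetical. Key rfd_cs need no explicit filtering here because they never produce candidates.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance, computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def distance(a, b):
    """Assumption: edit distance for strings, absolute difference for numbers."""
    return edit_distance(a, b) if isinstance(a, str) else abs(a - b)

def similar(t1, t2, constraints):
    """True if both tuples have values and meet every threshold in constraints."""
    return all(t1[a] is not None and t2[a] is not None
               and distance(t1[a], t2[a]) <= thr
               for a, thr in constraints.items())

def violates(rows, rfd):
    """True if some pair is similar on the LHS but too distant on the RHS."""
    lhs, rhs = rfd
    for t1, t2 in combinations(rows, 2):
        if not similar(t1, t2, lhs):
            continue
        for a, thr in rhs.items():
            if t1[a] is not None and t2[a] is not None and distance(t1[a], t2[a]) > thr:
                return True
    return False

def impute(rows, t, attr, rfds):
    """Try to fill t[attr] in place; return False if it has to stay missing."""
    usable = [r for r in rfds if attr in r[1]]            # rfd_c selection
    for thr in sorted({r[1][attr] for r in usable}):      # one cluster per RHS threshold
        ranked = []
        for lhs, rhs in usable:
            if rhs[attr] != thr:
                continue
            for c in rows:                                # candidate generation
                if c is not t and c[attr] is not None and similar(t, c, lhs):
                    score = sum(distance(t[a], c[a]) for a in lhs) / len(lhs)
                    ranked.append((score, c[attr]))
        for _, value in sorted(ranked, key=lambda p: p[0]):   # closest candidate first
            t[attr] = value
            if not any(violates(rows, r) for r in rfds):  # consistency check
                return True
            t[attr] = None                                # roll back, try the next one
    return False

# Tuples t3, t4, t5 and t7 of the Table 1 sample.
rows = [
    {"Name": "Citrus", "City": "Los Angeles", "Phone": "213/857-0034", "Class": 6},
    {"Name": "Citrus", "City": "Los Angeles", "Phone": None, "Class": 6},
    {"Name": "Fenix", "City": "Hollywood", "Phone": "213/848-6677", "Class": 5},
    {"Name": "C. Main", "City": "Los Angeles", "Phone": None, "Class": 5},
]
# Three of the dependencies listed in Figure 1, as (LHS, RHS) threshold maps.
rfds = [
    ({"Name": 4}, {"Phone": 1}),
    ({"City": 2}, {"Phone": 2}),
    ({"Phone": 1}, {"Class": 0}),
]
filled_t4 = impute(rows, rows[1], "Phone", rfds)
filled_t7 = impute(rows, rows[3], "Phone", rfds)
print(filled_t4, rows[1]["Phone"], filled_t7, rows[3]["Phone"])
```

On this reduced sample, 𝑡₄[Phone] is filled from 𝑡₃ through Name(≤4) → Phone(≤1). For 𝑡₇, the only candidates come from City(≤2) → Phone(≤2), but copying the phone number of a class-6 restaurant into a class-5 tuple would violate Phone(≤1) → Class(≤0), so the value is rolled back and left missing. This mirrors the rejected imputation of 𝑡₇[Phone] with 𝑡₃[Phone] in Figure 1; in the full example, a further dependency then supplies 𝑡₂ as a consistent candidate.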

Experimental Evaluation

In this section, we present a comparative evaluation of RENUVER w.r.t. other approaches exploiting different imputation strategies. In particular, we benchmarked RENUVER against a holistic machine learning-based approach, namely Holoclean [6] (considering its attention-based expansion module AimNet [10]), and a differential dependency-guided approach [7] named Derand, for which we employed the same rfd_cs as RENUVER. All evaluations were performed under the same conditions on an iMac Pro with an 8-core CPU and 32GB RAM.

Datasets. The considered algorithms have been evaluated on two real-world datasets (Restaurant and Physician; see Table 2) in order to perform a stress test on RENUVER and all compared imputation approaches, aiming to determine their time and memory requirements. To this end, we stopped the executions exceeding 48 hours of execution time and/or 30GB of memory consumption. Furthermore, in order to obtain an accurate comparison between the imputed values and the expected ones, missing values have been artificially injected in a random manner. Moreover, to avoid a placement of missing values that favors one algorithm, for each missing rate we produced five different datasets, yielding a total of twenty-five variants of the same dataset. The metrics adopted for the comparison are then averaged over each missing rate.
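The injection protocol described above can be sketched as follows. This is a hedged illustration rather than the procedure actually used by the authors: it assumes that missing cells are drawn uniformly at random over all cells, that the number of injected cells is the rounded product of rate and cell count, and that one seed is used per variant. The function name `inject_missing` and the placeholder cell values are hypothetical.

```python
import random

def inject_missing(rows, rate, seed):
    """Return a copy of rows in which round(rate * #cells) values are set to None."""
    rng = random.Random(seed)
    copy = [dict(r) for r in rows]
    cells = [(i, a) for i, r in enumerate(copy) for a in r]
    for i, a in rng.sample(cells, round(rate * len(cells))):
        copy[i][a] = None
    return copy

# Placeholder instance with the shape of the Restaurant dataset (864 tuples, 6 attributes).
attributes = ["Name", "Address", "City", "Phone", "Type", "Class"]
rows = [{a: f"{a}-{i}" for a in attributes} for i in range(864)]

# Five variants for each of the five missing rates, i.e. twenty-five datasets.
rates = [0.05, 0.10, 0.20, 0.30, 0.40]
variants = {rate: [inject_missing(rows, rate, seed) for seed in range(5)]
            for rate in rates}

# Number of injected missing values in the first variant of each rate.
counts = [sum(v is None for r in variants[rate][0] for v in r.values())
          for rate in rates]
print(counts)
```

With 864 tuples and 6 attributes, rounding gives 259, 518, 1037, 1555 and 2074 injected values for the five rates, which are the counts reported for the Restaurant dataset in Table 2. Averaging the metrics over the five variants of a rate then reduces the effect of any single random placement.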

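Evaluation metrics. The effectiveness of the data imputation approaches has been evaluated by considering three different metrics: precision, recall, and F1-measure, which can be formally defined as:

\[
\text{precision} = \frac{|\mathit{true}|}{|\mathit{imputed}|}, \qquad
\text{recall} = \frac{|\mathit{true}|}{|\mathit{missing}|}, \qquad
\text{F1-measure} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}
\]

where true represents the correctly imputed missing values at the end of the imputation process, imputed represents all the imputed missing values, and missing represents the missing values in the dataset.

Results. The first evaluation session is focused on the Restaurant dataset, considering the following missing rates: 5%, 10%, 20%, 30%, and 40% (see Table 2). We can notice that the fastest approach is Holoclean, whereas Derand registered much higher execution times, exceeding the 48h time limit starting from the 10% missing rate. The faster execution times of Holoclean can be explained by the considerably lower amount of metadata to be processed during the imputation process, i.e., 9 Denial Constraints (DCs), compared to 1961 rfd_cs. Nevertheless, RENUVER registered the best performance on all the considered qualitative metrics.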

The second evaluation session is focused on the Physician dataset, fixing the missing rate and varying the number of tuples considered. This dataset is particularly complex to analyze, since it also contains a high number of attributes (i.e., 13 attributes). In fact, this dataset allowed us to reach a time and/or memory limit for all considered approaches (i.e., RENUVER, Derand, and Holoclean), as shown in Table 2. In particular, we can notice that, on average, both RENUVER and Holoclean registered faster execution times than Derand. The latter exceeds the time limit of 48h on the datasets having 2072 and 10359 tuples. On the other hand, Holoclean manages to achieve reasonable execution times, but the huge amount of consumed memory makes it exceed the 30GB memory limit on the dataset having 10359 tuples. Finally, RENUVER also exceeds the time limit on the largest dataset, despite a more reasonable memory consumption. This evaluation session showed the capability of RENUVER to outperform the compared approaches on the considered qualitative metrics. It also emphasized that Derand's execution times are strongly dependent on the number of missing values, whereas Holoclean, although providing overall faster execution times, proved heavily memory-consuming.
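As an illustration of the evaluation metrics used in this section, precision, recall and F1-measure can be computed from the set of injected cells and the set of cells an algorithm actually filled. The sketch below makes a simplifying assumption: an imputed value counts as correct only if it is exactly equal to the expected one, whereas the evaluation method mentioned in the introduction can also accept different syntactical representations. The function name `evaluate` and the toy cells are hypothetical.

```python
def evaluate(truth, imputed):
    """Precision, recall and F1-measure of an imputation run.

    truth:   {cell: expected value} for every injected missing cell
    imputed: {cell: value} for the cells the algorithm actually filled
    """
    true = sum(1 for cell, value in imputed.items() if truth.get(cell) == value)
    precision = true / len(imputed) if imputed else 0.0
    recall = true / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy run: four injected cells, three imputed, two of them correctly.
truth = {("t4", "Phone"): "213/857-0034", ("t5", "Type"): "French",
         ("t6", "City"): "Hollywood", ("t7", "Phone"): "310-392-9025"}
imputed = {("t4", "Phone"): "213/857-0034", ("t6", "City"): "Hollywood",
           ("t7", "Phone"): "213/857-0034"}

precision, recall, f1 = evaluate(truth, imputed)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Leaving a value unimputed lowers recall but not precision, which is why an approach that refuses inconsistent imputations can show high precision together with moderate recall.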

Conclusion

In this paper, we proposed RENUVER, a data imputation algorithm that exploits relaxed functional dependencies. The latter enable RENUVER to select and evaluate the candidate tuples to be used during the imputation process. The whole imputation process preserves the semantic consistency of the data, by guaranteeing that no imputation can violate any rfd_c. Evaluation results demonstrated that RENUVER outperforms recent approaches using different imputation strategies: machine learning-based (Holoclean) and dependency-based (Derand).

In the future, we would like to extend RENUVER with the possibility of selecting plausible candidate tuples among multiple datasets. Finally, we would like to study the applicability of RENUVER to incremental scenarios, such as those related to the imputation of time series [11], which would require the usage of incremental rfd_c discovery algorithms [12,13].

Table 1: A sample of the Restaurant dataset.

| Tuple | Name | City | Phone | Type | Class |
|---|---|---|---|---|---|
| 𝑡₁ | Granita | Malibu | 310/456-0488 | Californian | 6 |
| 𝑡₂ | Chinois Main | LA | 310-392-9025 | French | 5 |
| 𝑡₃ | Citrus | Los Angeles | 213/857-0034 | Californian | 6 |
| 𝑡₄ | Citrus | Los Angeles | _ | Californian | 6 |
| 𝑡₅ | Fenix | Hollywood | 213/848-6677 | _ | 5 |
| 𝑡₆ | Fenix Argyle | _ | 213/848-6677 | French (new) | 5 |
| 𝑡₇ | C. Main | Los Angeles | _ | French | 5 |

Figure 1: An example of RENUVER imputation on the Restaurant dataset of Table 1. Panels: a) Pre-processing; b) rfd_c selection; c) Imputing missing values. For the missing value 𝑡₇[Phone], panel b) selects the rfd_cs Name(≤6), City(≤9) → Phone(≤0), Name(≤4) → Phone(≤1), and City(≤2) → Phone(≤2). In panel c), imputing 𝑡₇[Phone] with 𝑡₃[Phone] causes a violation, so 𝑡₇[Phone] is imputed with 𝑡₂[Phone].

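Table 2: Comparative evaluation of RENUVER on the Restaurant and Physician datasets.

Dataset characteristics:

| Dataset | #Tuples | #Attributes | #Missing val. | #rfd_cs | #DCs |
|---|---|---|---|---|---|
| Restaurant | 864 | 6 | 259 (5%) | 1961 | 9 |
| Restaurant | 864 | 6 | 518 (10%) | 1961 | 9 |
| Restaurant | 864 | 6 | 1037 (20%) | 1961 | 9 |
| Restaurant | 864 | 6 | 1555 (30%) | 1961 | 9 |
| Restaurant | 864 | 6 | 2074 (40%) | 1961 | 9 |
| Physician | 104 (0.05%) | 13 | 13 (1%) | 1430 | 74 |
| Physician | 208 (0.1%) | 13 | 27 (1%) | 2553 | 74 |
| Physician | 1036 (0.5%) | 13 | 135 (1%) | 3895 | 74 |
| Physician | 2072 (1%) | 13 | 269 (1%) | 5708 | 74 |
| Physician | 10359 (5%) | 13 | 1319 (1%) | 6137 | 74 |

Restaurant (varying the missing rate):

| Approach | Missing rate | Recall | Precision | F1-Meas. | Time | Mem. |
|---|---|---|---|---|---|---|
| RENUVER | 5% | 0.329 | 0.864 | 0.476 | 14m 29s | 1.38 GB |
| RENUVER | 10% | 0.296 | 0.832 | 0.437 | 23m 21s | 1.31 GB |
| RENUVER | 20% | 0.294 | 0.845 | 0.436 | 33m 20s | 1.36 GB |
| RENUVER | 30% | 0.258 | 0.828 | 0.394 | 36m 37s | 1.37 GB |
| RENUVER | 40% | 0.232 | 0.726 | 0.349 | 30m 23s | 1.38 GB |
| Derand | 5% | 0.295 | 0.419 | 0.345 | 47h 13m | 7.21 GB |
| Derand | 10% | - | - | - | TL | - |
| Derand | 20% | - | - | - | TL | - |
| Derand | 30% | - | - | - | TL | - |
| Derand | 40% | - | - | - | TL | - |
| Holoclean | 5% | 0.275 | 0.544 | 0.362 | 14s | 0.99 GB |
| Holoclean | 10% | 0.099 | 0.218 | 0.131 | 15s | 0.99 GB |
| Holoclean | 20% | 0.071 | 0.153 | 0.095 | 14s | 0.99 GB |
| Holoclean | 30% | 0.064 | 0.192 | 0.095 | 11s | 0.78 GB |
| Holoclean | 40% | 0.165 | 0.419 | 0.237 | 10s | 0.79 GB |

Physician (varying the number of tuples):

| Approach | #Tuples | Recall | Precision | F1-Meas. | Time | Mem. |
|---|---|---|---|---|---|---|
| RENUVER | 104 | 0.338 | 1 | 0.505 | 470ms | 1.48 GB |
| RENUVER | 208 | 0.328 | 0.547 | 0.410 | 3s | 1.79 GB |
| RENUVER | 1036 | 0.326 | 0.607 | 0.424 | 1m 19s | 0.71 GB |
| RENUVER | 2072 | 0.254 | 0.483 | 0.333 | 15m 1s | 1.30 GB |
| RENUVER | 10359 | - | - | - | TL | - |
| Derand | 104 | 0.121 | 0.210 | 0.151 | 1h 10s | 1.25 GB |
| Derand | 208 | 0.125 | 0.190 | 0.150 | 9h 49m | 3.32 GB |
| Derand | 1036 | 0.110 | 0.121 | 0.115 | 25h 40m | 8.21 GB |
| Derand | 2072 | - | - | - | TL | - |
| Derand | 10359 | - | - | - | TL | - |
| Holoclean | 104 | 0.230 | 0.300 | 0.599 | 7s | 3.95 GB |
| Holoclean | 208 | 0.115 | 0.120 | 0.117 | 12s | 5.15 GB |
| Holoclean | 1036 | 0.097 | 0.114 | 0.104 | 1m 8s | 6.16 GB |
| Holoclean | 2072 | 0.156 | 0.167 | 0.161 | 8m 21s | 26.89 GB |
| Holoclean | 10359 | - | - | - | - | ML |

TL: time limit of 48 hours exceeded; ML: memory limit of 30 GB exceeded.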

¹ https://github.com/DastLab/RENUVER-evaluation-datasets

² A deep overview of RENUVER, together with a more exhaustive evaluation, has been carried out in [3].

References

[1] M. V. Martinez, C. Molinaro, J. Grant, V. Subrahmanian, Customized policies for handling partial information in relational databases, IEEE Transactions on Knowledge and Data Engineering 25 (2012).

[2] B. Montesdeoca, J. Luengo, J. Maillo, D. García-Gil, S. García, F. Herrera, A first approach on big data missing values imputation, in: Proceedings of the 5th International Conference on Internet of Things, Big Data and Security (IoTBDS), SciTePress, 2019.

[3] B. Breve, L. Caruccio, V. Deufemia, G. Polese, RENUVER: A missing value imputation algorithm based on relaxed functional dependencies, in: Proceedings of the 25th International Conference on Extending Database Technology, OpenProceedings, 2022.

[4] I. F. Ilyas, X. Chu, Trends in cleaning relational data: consistency and deduplication, Foundations and Trends® in Databases 5 (2015).

[5] L. Caruccio, V. Deufemia, F. Naumann, G. Polese, Discovering relaxed functional dependencies based on multi-attribute dominance, IEEE Transactions on Knowledge and Data Engineering 33 (2021).

[6] T. Rekatsinas, X. Chu, I. F. Ilyas, C. Ré, Holoclean: holistic data repairs with probabilistic inference, Proceedings of VLDB Endowment 10 (2017).

[7] S. Song, Y. Sun, A. Zhang, L. Chen, J. Wang, Enriching data imputation under similarity rule constraints, IEEE Transactions on Knowledge and Data Engineering 32 (2020).

[8] C.-C. Huang, H.-M. Lee, A grey-based nearest neighbor approach for missing attribute value prediction, Applied Intelligence 20 (2004).

[9] L. Caruccio, V. Deufemia, G. Polese, Relaxed functional dependencies - A survey of approaches, IEEE Transactions on Knowledge and Data Engineering 28 (2016).

[10] R. Wu, A. Zhang, I. Ilyas, T. Rekatsinas, Attention-based learning for missing data imputation in holoclean, Proceedings of Machine Learning and Systems 2 (2020).

[11] M. Khayati, A. Lerner, Z. Tymchenko, P. Cudré-Mauroux, Mind the gap: An experimental evaluation of imputation of missing values techniques in time series, Proceedings VLDB Endowment 13 (2020).

[12] L. Caruccio, S. Cirillo, V. Deufemia, G. Polese, Incremental discovery of functional dependencies with a bit-vector algorithm, in: Proceedings of Italian Symposium on Advanced Database Systems (SEBD '19), volume 2400, CEUR-WS.org, 2019.

[13] L. Caruccio, S. Cirillo, Incremental discovery of imprecise functional dependencies, Journal of Data and Information Quality (JDIQ) 12 (2020).