Measuring discord among multidimensional data sources

Alberto Abelló (Universitat Politècnica de Catalunya, Barcelona, Spain) aabello@essi.upc.edu
James Cheney (University of Edinburgh, Edinburgh, Scotland) jcheney@inf.ed.ac.uk

ABSTRACT
Data integration is a classical problem in databases, typically decomposed into schema matching, entity matching and record merging. To solve the latter, it is mostly assumed that ground truth can be determined, either as master data or from user feedback. However, in many cases this is not possible, firstly because the merging processes cannot be accurate enough, and also because the data gathering processes in the different sources are simply imperfect and cannot provide high-quality data. Instead of enforcing consistency, we propose to evaluate how concordant or discordant sources are, as a measure of trustworthiness (the more discordant the sources are, the less we can trust their data). Thus, we define the discord measurement problem in which, given a set of uncertain raw observations or aggregate results (such as case/hospitalization/death data relevant to COVID-19) and information on the alignment of different data (for example, cases and deaths), we wish to assess whether the different sources are concordant, or if not, measure how discordant they are.

1 INTRODUCTION
Scientists often analyse data by placing different indicators (e.g., number of patients or number of deaths) in a multidimensional space (e.g., geography and time). The multidimensional model and OLAP tools [2] have been used for this purpose over the last 30 years, as a more powerful and structured alternative to spreadsheets. However, despite these being mature technologies, problems like managing missing and contradictory information are still not solved.

OLAP tools are typically used in data warehousing environments, where consistent and well-known data go through a well-structured cleaning and integration process merging different sources. Nevertheless, in the wild, sources are typically incomplete and not well aligned, and such data cleaning and integration processes are far from trivial, resulting in imperfect comparisons. For example, different actors often report measures at different granularities that can only be compared after aggregation or cleaning. On doing this, even if the aggregation performed is correct, due to reporting mistakes, mereological discrepancies, or incompleteness, it could happen that the indicator of the whole (e.g., cases at country level) differs from the aggregation of the indicators of its parts (e.g., states, districts), if these come from a different source (or even from the same source). Like in the parable of the blind men describing an elephant after touching different parts of its body (i.e., touching the trunk, it is like a thick snake; the leg, like a tree stump; the ear, like a sheath of leather; the tail tip, like a furry mouse; etc.), in many areas like epidemiology, different data sources reflect the same reality in slightly different and partial ways. This challenge is well illustrated by COVID-19 data, where missing, incomplete, or inconsistent numbers have been blamed for bad decision making and unnecessary suffering [22].

Thus, in such a complex context, it is necessary to have a tool that precisely measures discrepancies for the available data. Indeed, [6] and [16] measure the differences in the descriptive multidimensional data and their structure. Instead, we aim at evaluating the reliability of the numerical indicators, given some required alignment declaration (e.g., aggregation or scale correction). At this point, it is important to highlight that, even if some work like [24] proposes to treat textual data as indicators (allowing them to be aggregated too), we restrict ourselves to numerical measures, whose discrepancies cannot be evaluated using string similarity metrics like the ones surveyed in [27]. These would rather be part of a preliminary step of entity matching over dimensional descriptors.

Contributions. Incomplete information is typically handled in relational databases by using NULL values. However, it is well known that NULLs are overloaded with different meanings such as nonexisting, unknown or no-information [3]. Thus, we propose to use NULL only for nonexisting or no-information, and enrich the data model with symbolic variables that allow us to represent the partial knowledge we might have about unknown numerical values. While using symbolic variables for NULLs is not a new idea, introduced for example in classical models for incomplete information such as c-tables and v-tables [18] and more recently in data cleaning systems such as LLUNATIC [15], our approach generalizes unknowns to arbitrary (linear) expressions that in the end define a setting for evaluating the trustworthiness of different sources of multidimensional data based on their concordancy/discordancy using standard linear or quadratic programming solvers. More concretely, in this paper we contribute by defining the problem of discord measurement of databases under some merging processes.

Organization. Section 2 presents a motivational example that helps to identify the problem defined in Section 3, whose solution is then exemplified in Section 4. The paper concludes with related work and conclusions in Sections 5 and 6.

2 RUNNING EXAMPLE
A real application of our approach to COVID-19 is available in an extended version of this paper [1], but for illustration purposes we provide here a fictitious running example of discordance evaluation. Let's consider a scenario where a network of actors (i.e., governmental institutions) take primary measurements of COVID-19 cases and derive some aggregates from those. We illustrate how to model this scenario using database schemas and
This challenge is well-illustrated by views, and describe the different problems we need to solve. Example 2.1. The statistical institute of Panem (a fictional Copyright © 2022 for this paper by its authors. country) generates census reports on the weekly excess of deaths Use permitted under Creative Commons License Attribution 4.0 International (CC (assumed attributable to COVID-19) in the country. Since we are BY 4.0). just considering Panem, we can model this information using a the number of deaths based on the number of reported cases in the country. CREATE VIEW I n f e r r e d AS SELECT week , 0 . 0 1 5 ∗ c a s e s AS d e a t h s FROM R e p o r t e d C o u n t r y ; Example 2.4. Ideally, if all COVID-19 cases were detected, and we knew the exact CFR as well as the effects of the pandemic in other causes of death, the week should unambiguously determine the number of cases and deaths (i.e., information derived from reported cases, both at district and country levels, and mortality in the census must coincide). In terms of SQL, these constraints could be checked using assertions like the following. CREATE ASSERTION SumOfCases CHECK (NOT EXISTS ( SELECT ∗ FROM R e p o r t e d C o u n t r y r JOIN A g g R e p o r t e d a Figure 1: Panem’s map with epidemiological information ON r . week=a . week WHERE r . c a s e s <>a . c a s e s ) ) ; CREATE ASSERTION NumberOfDeaths CHECK (NOT EXISTS ( SELECT ∗ FROM Census c JOIN I n f e r r e d i table (primary key underlined) indicating the number declared ON c . week= i . week WHERE c . d e a t h s <> i . d e a t h s ) ) ; per week: Census ( week , d e a t h s ) Thus, we see that SQL already provides the required mecha- nisms to freely align the different sources and impose the coin- We would have greater trust in multiple consistent reports cidence of values. Nevertheless, as explained above, achieving of the same quantity if they had been obtained independently; exact consistency seems unlikely in any real setting. 
Indeed, using existing techniques it is possible to check consistency among data sources when there is no uncertainty, but it is not straightforward, in the presence of unknown NULL values or suspected error in reported values, to determine whether the various data sources are consistent with the expected relationships.

Example 2.5. It is easy to see that the following database is not consistent with our view specification, in part because the cases of a district (i.e., D_XIII) are not reported, but also because the second assertion is violated (i.e., too many people died (20) compared to the number inferred from the reported cases and the CFR (only 15)).

ReportedDistrict("I", "2110W25", 75)
...
ReportedDistrict("XII", "2110W25", 75)
AggReported("2110W25", 900)
ReportedCountry("2110W25", 1000)

Example 2.2. Suppose that Panem, as depicted in Fig. 1, comprises thirteen districts (D_I, ..., D_XIII). In each district there are several hospitals, and a person living in D_i is monitored by at most one hospital. Hospitals report their number of cases of COVID-19 to their district governments, and each district government reports to the Ministry of Health (MoH). Given their management autonomy, the different districts in Panem use different and imperfect monitoring mechanisms and report separately the COVID-19 cases they detect every week. Despite these being gathered at health facilities, Panem is only reporting to the Centre for Disease Prevention and Control (CDC) partial information at the district level and the overall information of the country.
We can model this using relational tables with the weekly district and country information:

ReportedDistrict(district, week, cases)
ReportedCountry(week, cases)

In an idealized setting, we would expect to know all the relationships and have consistent measurements for each primary attribute, and each derived result would be computed exactly with no error. However, some relationships may be unknown, and both primary and derived attributes can be noisy, biased, unknown or otherwise imperfect.

Example 2.3. The following view aggregates the district-level cases for each week, which should coincide with the values per country:

CREATE VIEW AggReported AS
SELECT week, SUM(cases) AS cases
FROM ReportedDistrict GROUP BY week;

Moreover, it is already known that COVID-19 mortality depends on the age distribution and vaccination status of the pop-

Inferred("2110W25", 15)
Census("2110W25", 20)

Indeed, using existing mechanisms, we can easily detect the problem (i.e., assertion violations). However, we cannot measure how far the data are from really being consistent. For example, the country reporting a thousand cases would violate as many assertions as if it reported one million, but its degree of inconsistency with the other sources is absolutely different.

3 PROBLEM FORMULATION
We aim at extending DBMS functionalities for accurate concordancy evaluation in the presence of overlapping sources for the same numerical data. Given on the one hand the queries and views specifying the expected behavior (a.k.a. alignment of sources), and on the other the data corresponding to observations of some of the inputs, intermediate results, or (expected) outputs, is the observed numeric data complete and concordant considering the alignment specification?
If there is missing data, can the existing datasets be extended to some complete instance that is concordant? Finally, how far from being fully consistent are the numerical data?

ulation, but let us assume an average Case-Fatality Ratio (CFR) of 1.5%, which is reasonable for an unvaccinated population. In terms of SQL, we would have the following view which estimates

different justifications based on the application domain. For example, we can replace uncertain values v with "v · (1 + x)" (or simply x if v = 0), where x is an error variable. On the other hand, to handle NULLs in s-tables we simply replaced each NULL with a distinct variable.

Example 4.2. It is easy to see that there are many possibilities of assigning cases of COVID-19 to the different districts of Panem that add up to 1,000 per week, and consequently improve the consistency of our database, which may be easily represented by replacing constants by symbolic expressions "75(1 + x_i)", where x_i is an error parameter representing that cases may be missed or overreported in every district. The cases for district D_XIII, which were not reported at all, could then be simply represented

Given such an idealized scenario (specified by its schema and views) and a collection of actual observations (both primary and derived), we can still consider two different problems:
(A) Value estimation: estimate the values of numerical attributes of interest (e.g., the number of cases and deaths across Panem) that make the system consistent.
(B) Discord evaluation: evaluate how far the actual, discordant dataset is from an idealized concordant one.
Problem (A) is the well-studied statistical estimation problem. However, many sources behave as black boxes, and it can be very difficult to precisely quantify the uncertainty and underlying assumptions in many situations, especially where the interrelationships among different data sources are complex. Instead, we consider problem (B).
Given a (probably incomplete but overlapping) set of instances, we assume only a merging process specification in the form of expectations about their alignment, expressed using database queries and views. Our goal in this paper is not to find a realistic estimate of the true values of unknown or uncertain data, but instead to quantify how close the data are to our expectations under the given alignment. It is important to clarify that, while the approach we will adopt does produce estimates for the uncertain values as a side-effect, they are not guaranteed to have any statistical validity unless additional work is done to characterize the sources of uncertainty, which we see as a separate problem.

by a variable x_XIII. On the other hand, we also know that attributing all the excess deaths to COVID-19 involves some imprecision, so we should apply some error term "(1 + z)" to the numbers coming from the census, too. Nevertheless, this may not completely explain the mismatch between cases reported at the country level and deaths, and there might also be some doubly-counted or hidden cases in Panem (for example in the Capitol, which is assumed not to have any cases), which we represent by variable "(1 + y)". Therefore, s-tables ⓈReportedDistrict: {district, week} ⊲ {cases}, ⓈReportedCountry: {week} ⊲ {cases} and ⓈCensus: {week} ⊲ {deaths} would contain:

ⓈReportedDistrict("I", "2110W25", 75 * (1 + x_I))
...
ⓈReportedDistrict("XII", "2110W25", 75 * (1 + x_XII))
ⓈReportedDistrict("XIII", "2110W25", x_XIII)
ⓈReportedCountry("2110W25", 1000 * (1 + y))

Therefore, the key contribution of this paper is that both checking concordance and measuring discord can be done by augmenting the data model with symbolic expressions, and this in turn can be done consistently and efficiently in an RDBMS with the right set of algebraic operations.
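To give some intuition, symbolic values such as 75 * (1 + x_I) are just linear expressions over error variables, and aggregating them is mechanical. A small illustrative sketch in plain Python follows; the representation (a pair of a constant and a coefficient map) and the variable names x_1, ..., x_13 are our own choices, not the paper's implementation.

```python
from collections import defaultdict

def lin(const=0.0, **coeffs):
    """A linear expression const + sum(coeff * var), as (const, {var: coeff})."""
    return (float(const), dict(coeffs))

def add(e1, e2):
    """Sum of two linear expressions."""
    c = defaultdict(float, e1[1])
    for var, coeff in e2[1].items():
        c[var] += coeff
    return (e1[0] + e2[0], dict(c))

# s-table ReportedDistrict: district i reports 75*(1 + x_i) = 75 + 75*x_i cases;
# the unreported district XIII is just the bare variable x_13.
rows = [lin(75, **{f"x_{i}": 75}) for i in range(1, 13)] + [lin(0, x_13=1)]

# The gamma (aggregation) operator sums the symbolic expressions per week.
agg = rows[0]
for r in rows[1:]:
    agg = add(agg, r)
print(agg)  # constant 900.0 plus 75*x_1 ... 75*x_12 + 1*x_13
```

Equating this symbolic aggregate with the country-level expression 1000 * (1 + y) is what produces the linear constraints handed to the solver below.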
Indeed, we define and measure the degree of discordance of different data sources with complementary multidimensional information, where uncertainty may arise from NULLs standing for unknown values, or from reported measurements that have some unknown error. To do so, we need to (1) define a variant of relational algebra for queries over (sets of) finite maps represented as symbolic tables, (2) formally define the concordance and discordance problems, and (3) show that they can be solved by reduction to linear or quadratic programming, respectively.

ⓈCensus("2110W25", 20 * (1 + z))

4 PROPOSED SOLUTION BY EXAMPLE
The basic idea in this paper is to represent unknown real values with variables, which can occur multiple times in a table, or in different tables, representing the same unknown value; more generally, unknown values can be represented by symbolic (linear) expressions in R[X]. However, key values used in key fields are required to be known. This reflects the assumption that

Example 4.3. Given the s-tables in Example 4.2, the SQL views in Section 2 can be algebraically expressed as:

ⓈAggReported := γ_{week};{cases}(ⓈReportedDistrict)
ⓈInferred := ε_{deaths := 0.015 * cases}(ⓈReportedCountry)

where γ_{K;V} is an aggregation operation that sums V grouping by K, and ε_{Vnew := f(Vold)} derives a new value attribute as a (linear) function of the pre-existing ones.

We represent the expected relationships between source and derived data using a generalization of view specification called alignment specifications. Alignment specifications may define derived s-tables as the fusion of multiple views, written ⓈR ⊔ ⓈT. The fusion operator combines the information from two s-tables, resulting in an s-table with constraints that ensure that the values reported for common keys in ⓈR and ⓈT are equal.

Example 4.4.
Given the s-tables in Example 4.2 and queries in Example 4.3, the SQL assertions in Example 2.4 can be specified as:

ⓈSumOfCases := ⓈReportedCountry ⊔ ⓈAggReported
ⓈNumberOfDeaths := ⓈCensus ⊔ ⓈInferred

the source database is partially closed [14], that is, we assume the existence of master data for the keys (i.e., all potential keys are coincident and known).

Definition 4.1. A symbolic table, or s-table, R : K ⊲ V is a table (with the name prepended with Ⓢ) in which key attributes K are mapped to discrete non-null values and value attributes V are mapped to symbolic expressions in R[X].

Suppose we are given an ordinary database instance, which may have missing values (i.e., NULLs) and uncertain values (i.e., reported values which we do not believe to be exactly correct). To allow for such a situation, we replace values with symbolic expressions containing variables. This can be done in many ways, with

The discord is, intuitively, the shortest distance between the actual observed, uncertain data and a hypothetical concordant database instance that is consistent given the constraints introduced by the alignment specification. The more distant from any such concordant instance, the more discordant our data are. Then, the degree of discordance of our database, given an alignment and according to a distance metric (a.k.a. cost function), equals the solution to the quadratic programming problem formed by minimizing the metric subject to the constraints introduced by coincident instances in different sources found on fusing them with the ⊔ operation.

Another known result in the area of DFP is that prioritizing the repairs by considering preferences or priorities (like the data sources in our case) just increases complexity. An already explored idea is the use of where-provenance in the justification

Example 4.5.
From the specification in Example 4.4, we get the constraints represented by the following system of equations:

1000(1 + y) = 75(1 + x_1) + · · · + 75(1 + x_12) + x_13
0.015 · 1000(1 + y) = 20 · (1 + z)

Obviously, even considering only positive values for the different variables, that system has many solutions. One solution S_1 consists of taking all x_i to be zero, y = −0.1 and z = −0.325. This corresponds to assuming there is no error in the twelve districts' reports and there are no cases in District XIII. Another solution S_2 sets x_I = ... = x_XII = 0 and x_XIII = 100, then y = 0 and z = −0.25, which corresponds to assuming D_XIII has all of the missing cases. Of course, whether S_1 or S_2 (or some other solution) is more plausible depends strongly on domain-specific knowledge. Nevertheless, given a cost function assigning a cost to each solution, we can compare different solutions in terms of how much correction is needed (or discord exists). For example,

of the new value [15], but with pure direct value imputation (without any data transformation). In contrast, we consider that there is not any master data, but multiple contradictory sources, and we allow aggregates, while [15] only uses pure equalities (neither aggregation nor any real arithmetic) between master and target DBs.

From another perspective, our work is related to incompleteness in multidimensional databases, which has typically focused on the problems generated by imprecision in hierarchical information [13], [6] and [16]. Only more recently has attention shifted to missing values in the measures. Bimonte et al. [8] present a linear programming-based framework that imputes missing values under some constraints generated by sibling data at the same aggregation level, as well as parent data in higher levels. We could consider this a special case of our approach, where there is a single data source and alignment is predefined.
we might consider a cost function that simply takes the sum of the squares of the variables:

c_1(x, y, z) = (Σ_{i ∈ {I,...,XIII}} x_i^2) + y^2 + z^2

Using this cost function, S_1 has cost ≈ 0.116 while S_2 has cost 10000.0625; the first solution is much closer to being concordant, because a large change to x_XIII is not needed.

The setting we have described shares many motivations in common with previous work on provenance. The semiring provenance model [17] is particularly related, explaining why why-provenance [10] is not enough (e.g., in the case of alternative sources for the same data) and we need how-provenance to really understand how different inputs contribute to the result. They propose the use of polynomials to capture such kind of provenance. Further, Amsterdamer et al. [4] extended the semiring provenance model to aggregations by mixing together annotations and values, but the fine-grained provenance information may become prohibitively large. However, to the best of our knowledge no practical implementations exist. As noted earlier, our s-tables are similar in some respects to c-tables studied in incomplete information databases [18]. Our data model and queries are more restricted in some ways, due to the restriction to finite maps, and the fact that we do not allow for conditions affecting

Alternatively, we might give the unknown number of cases in D_XIII no weight, reflecting that we have no knowledge about what it might be, corresponding to the cost function

c_2(x, y, z) = (Σ_{i ∈ {I,...,XII}} x_i^2) + y^2 + z^2

which assigns the same cost to S_1 but assigns cost 0.0625 to S_2, indicating that if we are free to assign all unaccounted cases to x_XIII then the second solution is closer to concordance.
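These numbers can be reproduced directly. The following short sketch (plain Python; the list-based encoding of the solutions is ours) checks both candidate solutions against the constraints of Example 4.5 and evaluates both cost functions.

```python
def costs(xs, y, z):
    """c1 sums the squares of all 13 district variables; c2 ignores x_XIII (xs[-1])."""
    c1 = sum(x * x for x in xs) + y * y + z * z
    c2 = sum(x * x for x in xs[:-1]) + y * y + z * z
    return c1, c2

def check(xs, y, z):
    """Both alignment constraints of Example 4.5 must hold (up to rounding)."""
    assert abs(1000 * (1 + y) - (sum(75 * (1 + x) for x in xs[:-1]) + xs[-1])) < 1e-9
    assert abs(0.015 * 1000 * (1 + y) - 20 * (1 + z)) < 1e-9

s1 = ([0.0] * 13, -0.1, -0.325)          # no district error, no cases in D_XIII
s2 = ([0.0] * 12 + [100.0], 0.0, -0.25)  # all the missing cases in D_XIII
check(*s1)
check(*s2)
print(costs(*s1))  # c1 = c2 ≈ 0.116 (= 0.115625)
print(costs(*s2))  # c1 = 10000.0625, c2 = 0.0625
```

In the general case these evaluations become the objective of a quadratic program (minimizing the cost subject to the alignment constraints), which is what an off-the-shelf solver such as OSQP is used for.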
the presence of entire rows, but our approach supports aggregation, which is critical for our application area and which was not handled in the original work on c-tables.

Besides alternatives in the cost function, we could weight variables considering the reliability of the different districts as well as the central government, and the historical information of the census. However, these values depend on knowledge of the domain, and we will leave exploration of more sophisticated cost functions to future work.

5 RELATED WORK
The problems described above are related to Consistent Query Answering (CQA) [12], which tries to identify the subset of a database that fulfills some integrity constraints, and corresponds to the problem of identifying certain answers under the open world assumption [5]. In CQA, the distance between two database instances is captured by the symmetric difference of tuples. However, in our case, the effects of an alignment are not only reflected in the presence/absence of a tuple, but also in the values it contains.

There have been implementations of semiring provenance or c-tables in systems such as Orchestra [19], ProQL [20], ProvSQL [25], and Mimir [23], respectively. In Orchestra, provenance annotations were used for update propagation in a distributed data integration setting. ProQL and ProvSQL implement the semiring model but do not allow for symbolic expressions in data or support aggregation. Mimir is a system for querying uncertain and probabilistic data based on c-tables; however, in Mimir symbolic expressions and conditions are not actually materialized as results; instead, the system fills in their values with guesses in order to make queries executable on standard RDBMSs. Thus, Mimir's approach to c-tables would not suffice for our needs, since we need to generate the symbolic constraints for the QP solver to solve.
On the other hand, our work shows how some of the symbolic computation involved in c-tables can be implemented in-database.

This leads to the much closer Database Fix Problem (DFP) [7, 9], which aims at determining the existence of a fix at a bounded distance, measuring variations in the numerical values. Both DFP and CQA become undecidable in the presence of aggregation constraints. Nonetheless, these have been used to drive deduplication [11]. However, our case is different, since we are not questioning correspondences between entities to create aggregation groups, but instead trying to quantify their (in)consistency in the presence of complex transformations.

We have reduced the concordancy evaluation problem to quadratic programming, a well-studied optimization problem. Solvers such as OSQP [26] can handle systems with thousands of equations and variables. However, we have not made full use of the power of linear/quadratic programming. For example, we could impose additional linear inequalities on unknowns to constrain that certain error or null values have to be positive or within some range. Likewise, we have defined the cost function in one specific way, but quadratic programming permits many other cost functions to be defined, for example with different weights for each variable or with additional linear cost factors.

As noted in Section 2, we have focused on the problem of

Data Knowl. Eng. 128 (2020), 101832. https://doi.org/10.1016/j.datak.2020.101832
[9] Philip Bohannon, Michael Flaster, Wenfei Fan, and Rajeev Rastogi. 2005. A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification. In ACM SIGMOD International Conference on Management of Data. ACM, 143–154. https://doi.org/10.1145/1066157.1066175
[10] Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan. 2001. Why and Where: A Characterization of Data Provenance.
In 8th International Conference on Database Theory (ICDT) (LNCS, Vol. 1973), Jan Van den Bussche and Victor Vianu (Eds.). Springer, 316–330. https://doi.org/10.1007/3-540-44503-X_20

evaluating concord/discord among data sources, and not on using the different data sources to estimate the actual values being measured. It would be interesting to extend our framework by augmenting symbolic tables and queries with a probabilistic interpretation, so that the optimal solution found by quadratic programming produces statistically meaningful consensus values (similarly to the work of Mayfield et al. [21]).

6 CONCLUSIONS
In many real settings, such as epidemiological surveillance, ground truth is not known or knowable, and we still need to integrate discordant data sources with different levels of trustworthiness, completeness and self-consistency. In this setting without any

[11] Surajit Chaudhuri, Anish Das Sarma, Venkatesh Ganti, and Raghav Kaushik. 2007. Leveraging aggregate constraints for deduplication. In ACM SIGMOD International Conference on Management of Data. ACM, 437–448. https://doi.org/10.1145/1247480.1247530
[12] Jan Chomicki. 2007. Consistent Query Answering: Five Easy Pieces. In 11th International Conference on Database Theory (ICDT) (LNCS, Vol. 4353). Springer, 1–17. https://doi.org/10.1007/11965893_1
[13] Curtis E. Dyreson, Torben Bach Pedersen, and Christian S. Jensen. 2003. Incomplete Information in Multidimensional Databases. In Multidimensional Databases: Problems and Solutions, Maurizio Rafanelli (Ed.). Idea Group, 282–309. https://doi.org/10.4018/978-1-59140-053-0.ch010
[14] Wenfei Fan and Floris Geerts. 2009. Relative information completeness. In Proceedings of the Twenty-Eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS). ACM, 97–106. https://doi.org/10.1145/1559795.1559811
[15] Floris Geerts, Giansalvatore Mecca, Paolo Papotti, and Donatello Santoro. 2013. The LLUNATIC Data-Cleaning Framework.
PVLDB 6, 9 (2013), 625–636. https://doi.org/10.14778/2536360.2536363

master data, we still would like to be able to measure how close the observed data are to our idealized expectations. Thus, we proposed definitions of concordance and discordance capturing, respectively, when data sources we wish to fuse are compatible with one another, and measuring how far away the observed data are from being concordant. Consequently, we can compare discordance measurements over time to understand whether the different sources are becoming more or less consistent with one another.

Our approach to symbolic evaluation of multidimensional queries appears to have further applications which we plan to explore next, such as supporting other forms of uncertainty expressible as linear constraints, and adapting our approach to produce

[16] Matteo Golfarelli and Elisa Turricchia. 2014. A characterization of hierarchical computable distance functions for data warehouse systems. Decis. Support Syst. 62 (2014), 144–157. https://doi.org/10.1016/j.dss.2014.03.011
[17] Todd J. Green, Gregory Karvounarakis, and Val Tannen. 2007. Provenance semirings. In ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS). ACM, 31–40. https://doi.org/10.1145/1265530.1265535
[18] Tomasz Imielinski and Witold Lipski Jr. 1984. Incomplete Information in Relational Databases. J. ACM 31, 4 (1984), 761–791. https://doi.org/10.1145/1634.1886
[19] Zachary G. Ives, Todd J. Green, Grigoris Karvounarakis, Nicholas E. Taylor, Val Tannen, Partha Pratim Talukdar, Marie Jacob, and Fernando C. N. Pereira. 2008. The ORCHESTRA Collaborative Data Sharing System. SIGMOD Rec. 37, 3 (2008), 26–32. https://doi.org/10.1145/1462571.1462577
[20] Grigoris Karvounarakis, Zachary G. Ives, and Val Tannen. 2010. Querying data provenance. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010, Ahmed K.
Elmagarmid and Divyakant Agrawal (Eds.). ACM, 951–962. https://doi.org/10.1145/1807167.1807269

statistically meaningful estimates of the consensus values.

ACKNOWLEDGMENTS
The work of Alberto Abelló has been done under project PID2020-117191RB-I00 funded by MCIN/AEI/10.13039/501100011033. The work of James Cheney was supported by ERC Consolidator Grant Skye (grant number 682315).

REFERENCES
[1] Alberto Abelló and James Cheney. 2022. Eris: Measuring discord among multidimensional data sources (extended version). arXiv:2201.13302 [cs.DB]
[2] Alberto Abelló and Oscar Romero. 2018. Online Analytical Processing. In Encyclopedia of Database Systems, Second Edition, Ling Liu and M. Tamer Özsu (Eds.). Springer. https://doi.org/10.1007/978-1-4614-8265-9_252
[21] Chris Mayfield, Jennifer Neville, and Sunil Prabhakar. 2010. ERACER: a database approach for statistical inference and data cleaning. In ACM SIGMOD International Conference on Management of Data. ACM, 75–86. https://doi.org/10.1145/1807167.1807178
[22] Robinson Meyer and Alexis C. Madrigal. 2021. Why the Pandemic Experts Failed. https://www.theatlantic.com/science/archive/2021/03/americas-coronavirus-catastrophe-began-with-data/618287
[23] Arindam Nandi, Ying Yang, Oliver Kennedy, Boris Glavic, Ronny Fehling, Zhen Hua Liu, and Dieter Gawlick. 2016. Mimir: Bringing CTables into Practice. CoRR abs/1601.00073 (2016). arXiv:1601.00073 http://arxiv.org/abs/1601.00073
[24] Lamia Oukid, Omar Boussaid, Nadjia Benblidia, and Fadila Bentayeb. 2016. TLabel: A New OLAP Aggregation Operator in Text Cubes. Int. J. Data Warehousing and Mining 12, 4 (2016), 54–74. https://doi.org/10.4018/IJDWM.2016100103
[25] Pierre Senellart, Louis Jachiet, Silviu Maniu, and Yann Ramusat. 2018. ProvSQL: Provenance and Probability Management in PostgreSQL. PVLDB 11, 12 (2018), 2034–2037. https://doi.org/10.14778/3229863.3236253
[3] Serge Abiteboul, Richard Hull, and Victor Vianu. 1995. Foundations of
Foundations of [26] Bartolomeo Stellato, Goran Banjac, Paul Goulart, Alberto Bemporad, and Databases. Addison-Wesley. http://webdam.inria.fr/Alice Stephen P. Boyd. 2020. OSQP: an operator splitting solver for quadratic [4] Yael Amsterdamer, Daniel Deutch, and Val Tannen. 2011. Provenance for programs. Mathematical Programming Computation 12, 4 (2020), 637–672. aggregate queries. In ACM SIGMOD-SIGACT-SIGART Symposium on Principles https://doi.org/10.1007/s12532-020-00179-2 of Database Systems (PODS). ACM, 153–164. https://doi.org/10.1145/1989284. [27] Minghe Yu, Guoliang Li, Dong Deng, and Jianhua Feng. 2016. String similarity 1989302 search and join: a survey. Frontiers Comput. Sci. 10, 3 (2016), 399–417. https: [5] Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and //doi.org/10.1007/s11704-015-5900-5 Peter F. Patel-Schneider (Eds.). 2003. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press. [6] Eftychia Baikousi, Georgios Rogkakos, and Panos Vassiliadis. 2011. Similarity measures for multidimensional data. In Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Ger- many, Serge Abiteboul, Klemens Böhm, Christoph Koch, and Kian-Lee Tan (Eds.). IEEE Computer Society, 171–182. https://doi.org/10.1109/ICDE.2011. 5767869 [7] Leopoldo E. Bertossi, Loreto Bravo, Enrico Franconi, and Andrei Lopatenko. 2005. Complexity and Approximation of Fixing Numerical Attributes in Databases Under Integrity Constraints. In 10th International Symposium on Database Programming Languages (DBPL) (LNCS, Vol. 3774). Springer, 262–278. https://doi.org/10.1007/11601524_17 [8] Sandro Bimonte, Libo Ren, and Nestor Koueya. 2020. A linear programming- based framework for handling missing data in multi-granular data warehouses.