Measuring discord among multidimensional data sources
                             Alberto Abelló                                                           James Cheney
                 Universitat Politècnica de Catalunya                                            University of Edinburgh
                           Barcelona, Spain                                                        Edinburgh, Scotland
                        aabello@essi.upc.edu                                                      jcheney@inf.ed.ac.uk

ABSTRACT                                                                         COVID-19 data where missing, incomplete, or inconsistent num-
Data integration is a classical problem in databases, typically                  bers have been blamed for bad decision making and unnecessary
decomposed into schema matching, entity matching and record                      suffering [22].
merging. To solve the latter, it is mostly assumed that ground                       Thus, in such a complex context, it is necessary to have a
truth can be determined, either as master data or from user feed-                tool that precisely measures discrepancies for the available data.
back. However, in many cases, this is not the case because firstly               Indeed, [6] and [16] measure the differences in the descriptive
the merging processes cannot be accurate enough, and also the                    multidimensional data and their structure. Instead, we aim at
data gathering processes in the different sources are simply im-                 evaluating the reliability of the numerical indicators, given some
perfect and cannot provide high quality data. Instead of enforcing               required alignment declaration (e.g., aggregation or scale cor-
consistency, we propose to evaluate how concordant or discordant                 rection). At this point, it is important to highlight that, even if
sources are as a measure of trustworthiness (the more discordant                 some work like [24] proposes to treat textual data as indicators
are the sources, the less we can trust their data).                              (allowing to aggregate them too), we restrict ourselves to numer-
   Thus, we define the discord measurement problem in which                      ical measures, whose discrepancies cannot be evaluated using
given a set of uncertain raw observations or aggregate results                   string similarity metrics like the ones surveyed in [27]. These
(such as case/hospitalization/death data relevant to COVID-19)                   would rather be part of a preliminary step of entity matching
and information on the alignment of different data (for example,                 over dimensional descriptors.
cases and deaths), we wish to assess whether the different sources
                                                                                    Contributions. Incomplete information is typically handled
are concordant, or if not, measure how discordant they are.
                                                                                 in relational databases by using NULL values. However, it is
                                                                                 well known that NULLs are overloaded with different meanings
                                                                                 such as nonexisting, unknown or no-information [3]. Thus, we
1    INTRODUCTION                                                                propose to use NULL only for nonexisting or no-information,
Scientists often analyse data by placing different indicators (e.g.,             and enrich the data model with symbolic variables that allow to
number of patients or number of deaths) in a multidimensional                    represent the partial knowledge we might have about unknown
space (e.g., geography and time). The multidimensional model                     numerical values. While using symbolic variables for NULLs is
and OLAP tools [2] have been used in the last 30 years with                      not a new idea, introduced for example in classical models for
this purpose, as a more powerful and structured alternative to                   incomplete information such as c-tables and v-tables [18] and
spreadsheets. However, despite these being mature technologies,                  more recently in data cleaning systems such as LLUNATIC [15],
problems like managing missing and contradictory information                     our approach generalizes unknowns to be arbitrary (linear) ex-
are still not solved.                                                            pressions that in the end define a setting for the evaluation of
   OLAP tools are typically used in data warehousing environ-                    the trustworthiness of different sources of multidimensional data
ments where consistent and well known data go through a well                     based on their concordancy/discordancy using standard linear or
structured cleaning and integration process merging different                    quadratic programming solvers. More concretely, in this paper
sources. Nevertheless, in the wild, sources are typically incom-                 we contribute by defining the problem of discord measurement of
plete and not well aligned, and such data cleaning and integration               databases under some merging processes.
processes are far from trivial, resulting in imperfect comparisons.                  Organization. Section 2 presents a motivational example that
For example, different actors often report measures at different                 helps to identify the problem defined in Section 3, whose solution
granularities that can only be compared after aggregation or                     is then exemplified in Section 4. The paper concludes with the
cleaning. On doing this, even if the aggregation performed is                    related work and conclusions in Sections 5 and 6.
correct, due to reporting mistakes, mereological discrepancies, or
incompleteness, it could happen that the indicator of the whole
                                                                                 2   RUNNING EXAMPLE
(e.g., cases at country level) is different from the aggregation of
indicators of its parts (e.g., states, districts), if these come from            A real application of our approach to COVID-19 is available in an
a different source (or even from the same source). Like in the                   extended version of this paper [1], but for illustration purposes,
parable of the blind men describing an elephant after touching                   we provide here a fictitious running example of discordance
different parts of its body (i.e., touching the trunk, it is like a thick        evaluation. Let’s consider a scenario where a network of actors
snake; the leg, like a tree stump; the ear, like a sheath of leather;            (i.e., governmental institutions) take primary measurements of
the tail tip, like a furry mouse; etc.), in many areas like epidemi-             COVID-19 cases and derive some aggregates from those. We
ology, different data sources reflect the same reality in slightly               illustrate how to model this scenario using database schemas and
different and partial ways. This challenge is well-illustrated by                views, and describe the different problems we need to solve.

                                                                                    Example 2.1. The statistical institute of Panem (a fictional
Copyright © 2022 for this paper by its authors.
                                                                                 country) generates census reports on the weekly excess of deaths
Use permitted under Creative Commons License Attribution 4.0 International (CC   (assumed attributable to COVID-19) in the country. Since we are
BY 4.0).                                                                         just considering Panem, we can model this information using a
                                                                                   the number of deaths based on the number of reported cases in
                                                                                   the country.
                                                                                   CREATE VIEW I n f e r r e d AS
                                                                                   SELECT week , 0 . 0 1 5 ∗ c a s e s AS d e a t h s
                                                                                   FROM R e p o r t e d C o u n t r y ;

                                                                                      Example 2.4. Ideally, if all COVID-19 cases were detected, and
                                                                                   we knew the exact CFR as well as the effects of the pandemic in
                                                                                   other causes of death, the week should unambiguously determine
                                                                                   the number of cases and deaths (i.e., information derived from
                                                                                   reported cases, both at district and country levels, and mortality
                                                                                   in the census must coincide). In terms of SQL, these constraints
                                                                                   could be checked using assertions like the following.
                                                                                   CREATE ASSERTION SumOfCases CHECK (NOT EXISTS
                                                                                   ( SELECT ∗ FROM R e p o r t e d C o u n t r y r JOIN A g g R e p o r t e d a
Figure 1: Panem’s map with epidemiological information                             ON r . week=a . week WHERE r . c a s e s <>a . c a s e s ) ) ;
                                                                                   CREATE ASSERTION NumberOfDeaths CHECK (NOT EXISTS
                                                                                   ( SELECT ∗ FROM Census c JOIN I n f e r r e d i
table (primary key underlined) indicating the number declared                      ON c . week= i . week WHERE c . d e a t h s <> i . d e a t h s ) ) ;
per week:
Census ( week , d e a t h s )                                                         Thus, we see that SQL already provides the required mecha-
                                                                                   nisms to freely align the different sources and impose the coin-
    We would have greater trust in multiple consistent reports                     cidence of values. Nevertheless, as explained above, achieving
of the same quantity if they had been obtained independently;                      exact consistency seems unlikely in any real setting. Indeed, us-
if they all came from a single source, we would be more skep-                      ing existing techniques it is possible to check consistency among
tical. Therefore, in an epidemiological verification process, we                   data sources when there is no uncertainty, but it is not straight-
gather different complementary sources of information providing                    forward, in the presence of unknown NULL values or suspected
surrogates or approximations to the desired measurements.                          error in reported values, to determine whether the various data
                                                                                   sources are consistent with the expected relationships.
   Example 2.2. Suppose that Panem, as depicted in Fig. 1, com-
prises thirteen districts (𝐷 𝐼 , . . . , 𝐷𝑋 𝐼 𝐼 𝐼 ). In each district, there are      Example 2.5. It is easy to see that the following database is not
several hospitals and a person living in 𝐷𝑖 is monitored by at most                consistent with our view specification, in part because the cases
one hospital. Hospitals report their number of cases of COVID-                     of a district (i.e., 𝐷𝑋 𝐼 𝐼 𝐼 ) are not reported, but also the second
19 to their district governments, and each district government                     assertion is violated (i.e., too many people died—20—compared
                                                                                   to the inferred number based on the cases reported and the CFR—
reports to the Ministry of Health (MoH).
                                                                                   only 15).
   Given their management autonomy, the different districts in
Panem use different and imperfect monitoring mechanisms and                        R e p o r t e d D i s t r i c t ( " I " , " 2 1 1 0W25 " , 7 5 )
report separately the COVID-19 cases they detect every week. De-                   ...
spite being gathered at health facilities, 𝑃𝑎𝑛𝑒𝑚 is only reporting                 R e p o r t e d D i s t r i c t ( " X I I " , " 2 1 1 0W25 " , 7 5 )
to the Centre for Disease Prevention and Control (CDC) partial                     A g g R e p o r t e d ( " 2 1 1 0W25 " , 9 0 0 )
information at the district level and the overall information of                   R e p o r t e d C o u n t r y ( " 2 1 1 0W25 " , 1 0 0 0 )
the country. We can model this using relational tables with the                    I n f e r r e d ( " 2 1 1 0W25 " , 1 5 )
weekly district and country information.                                           Census ( " 2 1 1 0W25 " , 2 0 )
R e p o r t e d D i s t r i c t ( district, week , c a s e s )
R e p o r t e d C o u n t r y ( week , c a s e s )                                    Indeed, using existing mechanisms, we can easily detect the
                                                                                   problem (i.e., assertion violations). However, we cannot measure
   In an idealized setting, we would expect to know all the rela-                  how far are the data from really being consistent. For example,
tionships and have consistent measurements for each primary                        the country reporting a thousand cases would violate as many as-
attribute, and each derived result would be computed exactly                       sertions as if reporting one million, but its degree of inconsistency
with no error. However, some relationships may be unknown                          with the other sources is absolutely different.
and both primary and derived attributes can be noisy, biased,
unknown or otherwise imperfect.                                                    3     PROBLEM FORMULATION
   Example 2.3. The following view aggregates the district-level                   We aim at extending DBMS functionalities for accurate concor-
for each week, which should coincide with the values per country:                  dancy evaluation in the presence of overlapping sources for
                                                                                   the same numerical data. Given on the one hand the queries
CREATE VIEW A g g R e p o r t e d AS
                                                                                   and views specifying the expected behavior (a.k.a. alignment of
SELECT week , SUM( c a s e s ) AS c a s e s
                                                                                   sources), and on the other the data corresponding to observa-
FROM R e p o r t e d D i s t r i c t GROUP BY week ;
                                                                                   tions of some of the inputs, intermediate results, or (expected)
   Moreover, it is already known that COVID-19 mortality de-                       outputs, is the observed numeric data complete and concordant
pends on the age distribution and vaccination status of the pop-                   considering the alignment specification? If there is missing data,
ulation, but let us assume an average Case-Fatality Ratio (CFR)                    can the existing datasets be extended to some complete instance
of 1.5% which is reasonable for an unvaccinated population. In                     that is concordant? Finally, how far from being fully consistent
terms of SQL, we would have the following view which estimates                     are the numerical data?
   Given such an idealized scenario (specified by its schema and        different justifications based on the application domain. For ex-
views) and a collection of actual observations (both primary and        ample, we can replace uncertain values 𝑣 with “𝑣 · (1 + 𝑥)” (or
derived), we can still consider two different problems:                 simply 𝑥 if 𝑣 = 0) where 𝑥 is an error variable. On the other hand,
                                                                        to handle NULLs in s-tables we simply replaced each NULL with
    (A) Value estimation: estimate the values of numerical attributes
                                                                        a distinct variable.
        of interest (e.g., the number of cases and deaths across
        Panem) that make the system consistent.                            Example 4.2. It is easy to see that there are many possibilities
    (B) Discord evaluation: Evaluate how far is the actual, discor-     of assigning cases of COVID-19 to the different districts of 𝑃𝑎𝑛𝑒𝑚
        dant dataset from an idealized concordant one.                  that add up to 1,000 per week, and consequently improve the
   Problem (A) is the well-studied statistical estimation prob-         consistency of our database, which may be easily represented by
lem. However, many sources behave as black boxes, and it can            replacing constants by symbolic expressions “75(1 + 𝑥𝑖 )”, where
be very difficult to precisely quantify the uncertainty and un-         𝑥𝑖 is an error parameter representing that cases may be missed
derlying assumptions in many situations, especially where the           or overreported in every district. The cases for district 𝐷𝑋 𝐼 𝐼 𝐼 ,
interrelationships among different data sources are complex. In-        that were not reported at all, could then be simply represented
stead, we consider problem (B). Given a (probably incomplete but        by a variable 𝑥𝑋 𝐼 𝐼 𝐼 . On the other hand, we also know that at-
overlapping) set of instances, we assume only a merging process         tributing all the excess deaths to COVID-19 involves some im-
specification in the form of expectations about their alignment, ex-    precision, so we should apply some error term “(1 + 𝑧)” to the
pressed using database queries and views. Our goal in this paper        numbers coming from the census, too. Nevertheless, this may
is not to find a realistic estimate of the true values of unknown       not completely explain the mismatch between cases reported
or uncertain data, but instead to quantify how close the data are       at the country level and deaths, and there might also be some
to our expectations under the given alignment. It is important          doubly-counted or hidden cases in 𝑃𝑎𝑛𝑒𝑚 (for example in the
to clarify that while the approach we will adopt does produce           Capitol which is assumed not to have any cases), which we repre-
estimates for the uncertain values as a side-effect, they are not       sent by variable “(1+𝑦)”. Therefore, s-tables Ⓢ𝑅𝑒𝑝𝑜𝑟𝑡𝑒𝑑𝐷𝑖𝑠𝑡𝑟𝑖𝑐𝑡 :
guaranteed to have any statistical validity unless additional work      {𝑑𝑖𝑠𝑡𝑟𝑖𝑐𝑡, 𝑤𝑒𝑒𝑘 }⊲{𝑐𝑎𝑠𝑒𝑠}, Ⓢ𝑅𝑒𝑝𝑜𝑟𝑡𝑒𝑑𝐶𝑜𝑢𝑛𝑡𝑟𝑦 : {𝑤𝑒𝑒𝑘 }⊲{𝑐𝑎𝑠𝑒𝑠}
is done to characterize the sources of uncertainty, which we see        and Ⓢ𝐶𝑒𝑛𝑠𝑢𝑠 : {𝑤𝑒𝑒𝑘 } ⊲ {𝑑𝑒𝑎𝑡ℎ𝑠} would contain:
as a separate problem.                                                  Ⓢ R e p o r t e d D i s t r i c t ( " I " , " 2 1 1 0W25 " , 75 ∗ (1 + 𝑥 𝐼 ) )
   Therefore, the key contribution of this paper is that both check-    ...
ing concordance and measuring discord can be done by augment-           Ⓢ R e p o r t e d D i s t r i c t ( " X I I " , " 2 1 1 0W25 " , 75 ∗ (1 + 𝑥𝑋 𝐼 𝐼 ) )
ing the data model with symbolic expressions, and this in turn can      Ⓢ R e p o r t e d D i s t r i c t ( " X I I I " , " 2 1 1 0W25 " ,𝑥𝑋 𝐼 𝐼 𝐼 )
be done consistently and efficiently in a RDBMS with the right          Ⓢ R e p o r t e d C o u n t r y ( " 2 1 1 0W25 " , 1000 ∗ (1 + 𝑦) )
set of algebraic operations. Indeed, we define and measure the          Ⓢ Census ( " 2 1 1 0W25 " , 20 ∗ (1 + 𝑧 ) )
degree of discordance of different data sources with complemen-
tary multidimensional information, where uncertainty may arise             Example 4.3. Given the s-tables in Example 4.2, the SQL views
from NULLs standing for unknown values, or reported measure-            in Section 2 can be algebraically expressed as:
ments that have some unknown error. To do so, we need to (1)            Ⓢ𝐴𝑔𝑔𝑅𝑒𝑝𝑜𝑟𝑡𝑒𝑑             :=     𝛾 {𝑤𝑒𝑒𝑘 };{𝑐𝑎𝑠𝑒𝑠 } (Ⓢ𝑅𝑒𝑝𝑜𝑟𝑡𝑒𝑑𝐷𝑖𝑠𝑡𝑟𝑖𝑐𝑡)
define a variant of relational algebra for queries over (sets of)
                                                                             Ⓢ𝐼𝑛𝑓 𝑒𝑟𝑟𝑒𝑑          :=     𝜀𝑑𝑒𝑎𝑡ℎ𝑠:=0.015∗𝑐𝑎𝑠𝑒𝑠 (Ⓢ𝑅𝑒𝑝𝑜𝑟𝑡𝑒𝑑𝐶𝑜𝑢𝑛𝑡𝑟𝑦)
finite maps represented as symbolic tables, (2) formally define the
concordance and discordance problems, and (3) show that they            Where 𝛾𝐾;𝑉 is an aggregation operation that sums 𝑉 grouping by
can be solved by reduction to linear or quadratic programming,          𝐾; and 𝜀𝑉𝑛𝑒𝑤 :=𝑓 (𝑉𝑜𝑙𝑑 ) derives a new value attribute as a (linear)
respectively.                                                           function of the pre-existing ones.
                                                                           We represent the expected relationships between source and
4     PROPOSED SOLUTION BY EXAMPLE                                      derived data using a generalization of view specification called
The basic idea in this paper is to represent unknown real values        alignment specifications. Alignment specifications may define
with variables, which can occur multiple times in a table, or           derived s-tables as the fusion of multiple views, written Ⓢ𝑅 ⊔ Ⓢ𝑇 .
in different tables, representing the same unknown value, and           The fusion operator combines the information from two s-tables,
more generally unknown values can be represented by symbolic            resulting in an s-table with constraints that ensure that the values
(linear) expressions in R[𝑋 ]. However, key values used in key-         reported for common keys in Ⓢ𝑅 and Ⓢ𝑇 are equal.
fields are required to be known. This reflects the assumption that
                                                                            Example 4.4. Given the s-tables in Example 4.2 and queries in
the source database is partially closed [14], that is, we assume the
                                                                        Example 4.3, the SQL assertions in Example 2.4 can be specified
existence of master data for the keys (i.e., all potential keys are
                                                                        as:
coincident and known).
                                                                               Ⓢ𝑆𝑢𝑚𝑂 𝑓 𝐶𝑎𝑠𝑒𝑠             :=     Ⓢ𝑅𝑒𝑝𝑜𝑟𝑡𝑒𝑑𝐶𝑜𝑢𝑛𝑡𝑟𝑦 ⊔ Ⓢ𝐴𝑔𝑔𝑅𝑒𝑝𝑜𝑟𝑡𝑒𝑑
   Definition 4.1. A symbolic table, or s-table 𝑅 : 𝐾 ⊲ 𝑉 is a table    Ⓢ𝑁𝑢𝑚𝑏𝑒𝑟𝑂 𝑓 𝐷𝑒𝑎𝑡ℎ𝑠                :=     Ⓢ𝐶𝑒𝑛𝑠𝑢𝑠 ⊔ Ⓢ𝐼𝑛𝑓 𝑒𝑟𝑟𝑒𝑑
(with the name prepended with Ⓢ) in which key attributes 𝐾 are
mapped to discrete non-null values and value attributes 𝑉 are              The discord is, intuitively, the shortest distance between the
mapped to symbolic expressions in R[𝑋 ].                                actual observed, uncertain data and a hypothetical concordant
                                                                        database instance that is consistent given the constraints intro-
   Suppose we are given an ordinary database instance, which            duced by the alignment specification. The more distant from any
may have missing values (i.e., NULLs) and uncertain values (i.e.,       such concordant instance, the more discordant our data are. Then,
reported values which we do not believe to be exactly correct).         the degree of discordance of our database given an alignment
To allow such situation, we replace values with symbolic expres-        and according to a distance metric (a.k.a., cost function) equals
sions containing variables. This can be done in many ways, with         the solution to the quadratic programming problem formed by
minimizing the metric subject to the constraints introduced by                   Another known result in the area of DFP is that prioritizing
coincident instances in different sources found on fusing them                the repairs by considering preferences or priorities (like the data
with ⊔ operation.                                                             sources in our case) just increases complexity. An already ex-
                                                                              plored idea is the use of where-provenance in the justification
  Example 4.5. From the specification in Example 4.4, we get the
                                                                              of the new value [15], but with pure direct value imputation
constraints represented by the following system of equations:
                                                                              (without any data transformation). In contrast, we consider that
           1000(1 + 𝑦)    =     75(1 + 𝑥 1 ) + · · · + 75(1 + 𝑥 12 ) + 𝑥 13   there is not any master data, but multiple contradictory sources,
    0.015 ∗ 1000(1 + 𝑦)   =     20 ∗ (1 + 𝑧)                                  and we allow aggregates, while [15] only uses pure equalities
    Obviously, even considering only positive values for the dif-             (neither aggregation nor any real arithmetic) between master
ferent variables, that system has many solutions. One solution                and target DBs.
𝑆 1 consists of taking all 𝑥𝑖 to be zero, 𝑦 = −0.1 and 𝑧 = −0.325.               From another perspective, our work is related to incomplete-
This corresponds to assuming there is no error in the twelve                  ness in multidimensional databases, which has been typically
districts’ reports and there are no cases in District XIII. Another           focused on the problems generated by imprecision in hierarchi-
solution 𝑆 2 sets 𝑥 𝐼 = ... = 𝑥𝑋 𝐼 𝐼 = 0 and 𝑥𝑋 𝐼 𝐼 𝐼 = 100, then 𝑦 = 0       cal information [13], [6] and [16]. Only more recently, attention
and 𝑧 = −0.25 which corresponds to assuming 𝐷𝑋 𝐼 𝐼 𝐼 has all of               has shifted to missing values in the measures. Bimonte et al. [8]
the missing cases. Of course, whether 𝑆 1 or 𝑆 2 (or some other               presents a linear programming-based framework that imputes
solution) is more plausible depends strongly on domain-specific               missing values under some constraints generated by sibling data
knowledge. Nevertheless, given a cost function assigning a cost               at the same aggregation level, as well as parent data in higher
to each solution, we can compare different solutions in terms of              levels. We could consider this a special case of our approach,
how much correction is needed (or discord exists). For example,               where there is a single data source and alignment is predefined.
we might consider a cost function that simply takes the sum of                   The setting we have described shares many motivations in
the squares of the variables:                                                 common with previous work on provenance. The semiring prove-
                                    ∑︁                                        nance model [17] is particularly related, explaining why why-
              𝑐 1 (𝑥,
                   ® 𝑦, 𝑧) = (             𝑥𝑖2 ) + 𝑦 2 + 𝑧 2                  provenance [10] is not enough (e.g., in the case of alternative
                              𝑖 ∈ {𝐼,...,𝑋 𝐼 𝐼 𝐼 }                            sources for the same data) and we need how-provenance to really
Using this cost function, 𝑆 1 has cost ≈ 0.116 while 𝑆 2 has cost             understand how different inputs contribute to the result. They
10000.0625, the first solution is much closer to being concordant,            propose the use of polynomials to capture such kind of prove-
because a large change to 𝑥𝑋 𝐼 𝐼 𝐼 is not needed. Alternatively, we           nance. Further, Amsterdamer et al. [4] extended the semiring
might give the unknown number of cases in 𝐷𝑋 𝐼 𝐼 𝐼 no weight,                 provenance model to aggregations by mixing together annota-
reflecting that we have no knowledge about what it might be,                  tions and values, but the fine-grained provenance information
corresponding to the cost function                                            may become prohibitively large. However, to the best of our
                                 ∑︁                                           knowledge no practical implementations exist. As noted earlier,
              𝑐 2 (𝑥,
                   ® 𝑦, 𝑧) = (           𝑥𝑖2 ) + 𝑦 2 + 𝑧 2
                                                                              our s-tables are similar in some respects to c-tables studied in in-
                              𝑖 ∈ {𝐼,...,𝑋 𝐼 𝐼 }
                                                                              complete information databases [18]. Our data model and queries
that assigns the same cost to 𝑆 1 but assigns cost 0.0625 to 𝑆 2 ,            is more restricted in some ways, due to the restriction to finite
indicating that if we are free to assign all unaccounted cases to             maps, and the fact that we do not allow for conditions affecting
𝑥𝑋 𝐼 𝐼 𝐼 then the second solution is closer to concordance.                   the presence of entire rows, but our approach supports aggrega-
   Besides alternatives in the cost function, we could weight                 tion, which is critical for our application area and which was not
variables considering the reliability of the different districts as           handled in the original work on c-tables.
well as the central government, and the historical information of                There have been implementations of semiring provenance or c-
the census. However, these values depend on knowledge of the                  tables in systems such as Orchestra [19], ProQL [20], ProvSQL [25],
domain and we will leave exploration of more sophisticated cost               and Mimir [23], respectively. In Orchestra provenance annota-
functions to future work.                                                     tions were used for update propagation in a distributed data
                                                                              integration setting. ProQL and ProvSQL implement the semir-
5      RELATED WORK                                                           ing model but do not allow for symbolic expressions in data or
The problems described above are related to Consistent Query                  support aggregation. Mimir is a system for querying uncertain
Answering (CQA) [12], which tries to identify the subset of a                 and probabilistic data based on c-tables; however, in Mimir sym-
database that fulfills some integrity constraints, and corresponds            bolic expressions and conditions are not actually materialized
to the problem of identifying certain answers under open world                as results, instead the system fills in their values with guesses
assumption [5]. In CQA, distance between two database instances               in order to make queries executable on standard RDBMSs. Thus,
is captured by symmetric difference of tuples. However, in our                Mimir’s approach to c-tables would not suffice for our needs since
case, the effects of an alignment are not only reflected in the               we need to generate the symbolic constraints for the QP solver
presence/absence of a tuple, but also in the values it contains.              to solve. On the other hand, our work shows how some of the
This leads to the much closer Database Fix Problem (DFP) [7, 9],              symbolic computation involved in c-tables can be implemented
which aims at determining the existence of a fix at a bounded                 in-database.
distance measuring variations in the numerical values.                           We have reduced the concordancy evaluation problem to
   Both DFP as well as CQA become undecidable in the pres-                    quadratic programming, a well-studied optimization problem.
ence of aggregation constraints. Nonetheless, these have been                 Solvers such as OSQP [26] can handle systems of equations with
used to drive deduplication [11]. However, our case is different              thousands of equations and variables. However, we have not
since we are not questioning correspondences between entities                 made full use of the power of linear/quadratic programming. For
to create aggregation groups, but instead trying to quantify their            example, we could impose additional linear inequalities on un-
(in)consistency in the presence of complex transformations.                   knowns to constrain that certain error or null values have to be
positive or within some range. Likewise, we have defined the                               Data Knowl. Eng. 128 (2020), 101832. https://doi.org/10.1016/j.datak.2020.
cost function in one specific way but quadratic programming                                101832
                                                                                       [9] Philip Bohannon, Michael Flaster, Wenfei Fan, and Rajeev Rastogi. 2005. A
permits many other cost functions to be defined, for example                               Cost-Based Model and Effective Heuristic for Repairing Constraints by Value
with different weights for each variable or with additional linear                         Modification. In ACM SIGMOD International Conference on Management of
                                                                                           Data. ACM, 143–154. https://doi.org/10.1145/1066157.1066175
cost factors.                                                                         [10] Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan. 2001. Why and Where:
   As noted in Section 2, we have focused on the problem of                                A Characterization of Data Provenance. In 8th International Conference on
evaluating concord/discord among data sources and not on using                             Database Theory (ICDT) (LNCS, Vol. 1973), Jan Van den Bussche and Victor
                                                                                           Vianu (Eds.). Springer, 316–330. https://doi.org/10.1007/3-540-44503-X_20
the different data sources to estimate the actual values being                        [11] Surajit Chaudhuri, Anish Das Sarma, Venkatesh Ganti, and Raghav Kaushik.
measured. It would be interesting to extend our framework by                               2007. Leveraging aggregate constraints for deduplication. In ACM SIGMOD
augmenting symbolic tables and queries with a probabilistic in-                            International Conference on Management of Data. ACM, 437–448. https:
                                                                                           //doi.org/10.1145/1247480.1247530
terpretation, so that the optimal solution found by quadratic                         [12] Jan Chomicki. 2007. Consistent Query Answering: Five Easy Pieces. In 11th
programming produces statistically meaningful consensus val-                               International Conference on Database Theory (ICDT) (LNCS, Vol. 4353). Springer,
                                                                                           1–17. https://doi.org/10.1007/11965893_1
ues (similarly to the work of Mayfield et al. [21]).                                  [13] Curtis E. Dyreson, Torben Bach Pedersen, and Christian S. Jensen. 2003. In-
                                                                                           complete Information in Multidimensional Databases. In Multidimensional
                                                                                           Databases: Problems and Solutions, Maurizio Rafanelli (Ed.). Idea Group, 282–
6    CONCLUSIONS                                                                           309. https://doi.org/10.4018/978-1-59140-053-0.ch010
In many real settings, such as epidemiological surveillance, ground                   [14] Wenfei Fan and Floris Geerts. 2009. Relative information completeness. In
                                                                                           Proceedings of the Twenty-Eigth ACM SIGMOD-SIGACT-SIGART Symposium
truth is not known or knowable and we still need to integrate                              on Principles of Database Systems (PODS). ACM, 97–106. https://doi.org/10.
discordant data sources with different levels of trustworthiness,                          1145/1559795.1559811
completeness and self-consistency. In this setting without any                        [15] Floris Geerts, Giansalvatore Mecca, Paolo Papotti, and Donatello Santoro.
                                                                                           2013. The LLUNATIC Data-Cleaning Framework. PVLDB 6, 9 (2013), 625–636.
master data, we still would like to be able to measure how close                           https://doi.org/10.14778/2536360.2536363
the observed data is to our idealized expectations. Thus, we pro-                     [16] Matteo Golfarelli and Elisa Turricchia. 2014. A characterization of hierarchical
                                                                                           computable distance functions for data warehouse systems. Decis. Support
posed definitions of concordance and discordance capturing re-                             Syst. 62 (2014), 144–157. https://doi.org/10.1016/j.dss.2014.03.011
spectively when data sources we wish to fuse are compatible                           [17] Todd J. Green, Gregory Karvounarakis, and Val Tannen. 2007. Provenance
with one another, and measuring how far away the observed                                  semirings. In ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Data-
                                                                                           base Systems (PODS). ACM, 31–40. https://doi.org/10.1145/1265530.1265535
data are from being concordant. Consequently, we can compare                          [18] Tomasz Imielinski and Witold Lipski Jr. 1984. Incomplete Information in
discordance measurements over time to understand whether the                               Relational Databases. J. ACM 31, 4 (1984), 761–791. https://doi.org/10.1145/
different sources are becoming more or less consistent with one                            1634.1886
                                                                                      [19] Zachary G. Ives, Todd J. Green, Grigoris Karvounarakis, Nicholas E. Taylor,
another.                                                                                   Val Tannen, Partha Pratim Talukdar, Marie Jacob, and Fernando C. N. Pereira.
   Our approach to symbolic evaluation of multidimensional                                 2008. The ORCHESTRA Collaborative Data Sharing System. SIGMOD Rec. 37,
                                                                                           3 (2008), 26–32. https://doi.org/10.1145/1462571.1462577
queries appears to have further applications which we plan to                         [20] Grigoris Karvounarakis, Zachary G. Ives, and Val Tannen. 2010. Querying
explore next, such supporting other forms of uncertainty express-                          data provenance. In Proceedings of the ACM SIGMOD International Conference
ible as linear constraints, and adapting our approach to produce                           on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10,
                                                                                           2010, Ahmed K. Elmagarmid and Divyakant Agrawal (Eds.). ACM, 951–962.
statistically meaningful estimates of the consensus values.                                https://doi.org/10.1145/1807167.1807269
                                                                                      [21] Chris Mayfield, Jennifer Neville, and Sunil Prabhakar. 2010. ERACER: a
                                                                                           database approach for statistical inference and data cleaning. In ACM SIGMOD
ACKNOWLEDGMENTS                                                                            International Conference on Management of Data. ACM, 75–86. https://doi.
The work of Alberto Abelló has been done under project PID2020-                            org/10.1145/1807167.1807178
                                                                                      [22] Robinson Meyer and Alexis C. Madrigal. 2021. Why the Pandemic
117191RB-I00 funded by MCIN/ AEI /10.13039/501100011033. The                               Experts Failed.        https://www.theatlantic.com/science/archive/2021/03/
work of James Cheney was supported by ERC Consolidator Grant                               americas-coronavirus-catastrophe-began-with-data/618287
Skye (grant number 682315).                                                           [23] Arindam Nandi, Ying Yang, Oliver Kennedy, Boris Glavic, Ronny Fehling,
                                                                                           Zhen Hua Liu, and Dieter Gawlick. 2016. Mimir: Bringing CTables into Practice.
                                                                                           CoRR abs/1601.00073 (2016). arXiv:1601.00073 http://arxiv.org/abs/1601.00073
REFERENCES                                                                            [24] Lamia Oukid, Omar Boussaid, Nadjia Benblidia, and Fadila Bentayeb. 2016.
                                                                                           TLabel: A New OLAP Aggregation Operator in Text Cubes. Int. J. Data
 [1] Alberto Abelló and James Cheney. 2022. Eris: Measuring discord among                  Warehousing and Mining 12, 4 (2016), 54–74. https://doi.org/10.4018/IJDWM.
     multidimensional data sources (extended version). arXiv:2201.13302 [cs.DB]            2016100103
 [2] Alberto Abelló and Oscar Romero. 2018. Online Analytical Processing. In          [25] Pierre Senellart, Louis Jachiet, Silviu Maniu, and Yann Ramusat. 2018. ProvSQL:
     Encyclopedia of Database Systems, Second Edition, Ling Liu and M. Tamer Özsu          Provenance and Probability Management in PostgreSQL. PVLDB 11, 12 (2018),
     (Eds.). Springer. https://doi.org/10.1007/978-1-4614-8265-9_252                       2034–2037. https://doi.org/10.14778/3229863.3236253
 [3] Serge Abiteboul, Richard Hull, and Victor Vianu. 1995. Foundations of            [26] Bartolomeo Stellato, Goran Banjac, Paul Goulart, Alberto Bemporad, and
     Databases. Addison-Wesley. http://webdam.inria.fr/Alice                               Stephen P. Boyd. 2020. OSQP: an operator splitting solver for quadratic
 [4] Yael Amsterdamer, Daniel Deutch, and Val Tannen. 2011. Provenance for                 programs. Mathematical Programming Computation 12, 4 (2020), 637–672.
     aggregate queries. In ACM SIGMOD-SIGACT-SIGART Symposium on Principles                https://doi.org/10.1007/s12532-020-00179-2
     of Database Systems (PODS). ACM, 153–164. https://doi.org/10.1145/1989284.       [27] Minghe Yu, Guoliang Li, Dong Deng, and Jianhua Feng. 2016. String similarity
     1989302                                                                               search and join: a survey. Frontiers Comput. Sci. 10, 3 (2016), 399–417. https:
 [5] Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and              //doi.org/10.1007/s11704-015-5900-5
     Peter F. Patel-Schneider (Eds.). 2003. The Description Logic Handbook: Theory,
     Implementation, and Applications. Cambridge University Press.
 [6] Eftychia Baikousi, Georgios Rogkakos, and Panos Vassiliadis. 2011. Similarity
     measures for multidimensional data. In Proceedings of the 27th International
     Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Ger-
     many, Serge Abiteboul, Klemens Böhm, Christoph Koch, and Kian-Lee Tan
     (Eds.). IEEE Computer Society, 171–182. https://doi.org/10.1109/ICDE.2011.
     5767869
 [7] Leopoldo E. Bertossi, Loreto Bravo, Enrico Franconi, and Andrei Lopatenko.
     2005. Complexity and Approximation of Fixing Numerical Attributes in
     Databases Under Integrity Constraints. In 10th International Symposium on
     Database Programming Languages (DBPL) (LNCS, Vol. 3774). Springer, 262–278.
     https://doi.org/10.1007/11601524_17
 [8] Sandro Bimonte, Libo Ren, and Nestor Koueya. 2020. A linear programming-
     based framework for handling missing data in multi-granular data warehouses.