<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Measuring discord among multidimensional data sources</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alberto Abelló</string-name>
          <email>aabello@essi.upc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>James Cheney</string-name>
          <email>jcheney@inf.ed.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universitat Politècnica de Catalunya</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Edinburgh</institution>
          ,
          <addr-line>Edinburgh, Scotland</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>Data integration is a classical problem in databases, typically decomposed into schema matching, entity matching and record merging. To solve the latter, it is mostly assumed that ground truth can be determined, either as master data or from user feedback. However, in many cases this is not possible, firstly because the merging processes cannot be accurate enough, and secondly because the data gathering processes in the different sources are simply imperfect and cannot provide high quality data. Instead of enforcing consistency, we propose to evaluate how concordant or discordant sources are as a measure of trustworthiness (the more discordant the sources are, the less we can trust their data). Thus, we define the discord measurement problem in which, given a set of uncertain raw observations or aggregate results (such as case/hospitalization/death data relevant to COVID-19) and information on the alignment of different data (for example, cases and deaths), we wish to assess whether the different sources are concordant, or if not, measure how discordant they are.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Scientists often analyse data by placing different indicators (e.g.,
number of patients or number of deaths) in a multidimensional
space (e.g., geography and time). The multidimensional model
and OLAP tools [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] have been used for this purpose over the last 30 years,
as a more powerful and structured alternative to
spreadsheets. However, despite these being mature technologies,
problems like managing missing and contradictory information
are still not solved.
      </p>
      <p>
        OLAP tools are typically used in data warehousing
environments where consistent and well known data go through a well
structured cleaning and integration process merging different
sources. Nevertheless, in the wild, sources are typically
incomplete and not well aligned, and such data cleaning and integration
processes are far from trivial, resulting in imperfect comparisons.
For example, different actors often report measures at different
granularities that can only be compared after aggregation or
cleaning. On doing this, even if the aggregation performed is
correct, due to reporting mistakes, mereological discrepancies, or
incompleteness, it could happen that the indicator of the whole
(e.g., cases at country level) is different from the aggregation of
the indicators of its parts (e.g., states, districts), if these come from
a different source (or even from the same source). Like in the
parable of the blind men describing an elephant after touching
different parts of its body (i.e., touching the trunk, it is like a thick
snake; the leg, like a tree stump; the ear, like a sheath of leather;
the tail tip, like a furry mouse; etc.), in many areas like
epidemiology, different data sources reflect the same reality in slightly
different and partial ways. This challenge is well-illustrated by
COVID-19 data, where missing, incomplete, or inconsistent
numbers have been blamed for bad decision making and unnecessary
suffering [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
      <p>
        Thus, in such a complex context, it is necessary to have a
tool that precisely measures the discrepancies in the available data.
Indeed, [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] measure the differences in the descriptive
multidimensional data and their structure. Instead, we aim at
evaluating the reliability of the numerical indicators, given some
required alignment declaration (e.g., aggregation or scale
correction). At this point, it is important to highlight that, even if
some work like [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] proposes to treat textual data as indicators
(allowing them to be aggregated too), we restrict ourselves to
numerical measures, whose discrepancies cannot be evaluated using
string similarity metrics like the ones surveyed in [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. These
would rather be part of a preliminary step of entity matching
over dimensional descriptors.
      </p>
      <p>
        Contributions. Incomplete information is typically handled
in relational databases by using NULL values. However, it is
well known that NULLs are overloaded with different meanings
such as nonexisting, unknown or no-information [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Thus, we
propose to use NULL only for nonexisting or no-information,
and enrich the data model with symbolic variables that allow us to
represent the partial knowledge we might have about unknown
numerical values. While using symbolic variables for NULLs is
not a new idea, having been introduced for example in classical models for
incomplete information such as c-tables and v-tables [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and
more recently in data cleaning systems such as LLUNATIC [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
our approach generalizes unknowns to be arbitrary (linear)
expressions that in the end define a setting for the evaluation of
the trustworthiness of different sources of multidimensional data
based on their concordancy/discordancy using standard linear or
quadratic programming solvers. More concretely, in this paper
we contribute by defining the problem of discord measurement of
databases under some merging processes.
      </p>
      <p>Organization. Section 2 presents a motivational example that
helps to identify the problem defined in Section 3, whose solution
is then exemplified in Section 4. The paper concludes with the
related work and conclusions in Sections 5 and 6.
</p>
    </sec>
    <sec id="sec-2">
      <title>RUNNING EXAMPLE</title>
      <p>
        A real application of our approach to COVID-19 is available in an
extended version of this paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], but for illustration purposes,
we provide here a fictitious running example of discordance
evaluation. Let’s consider a scenario where a network of actors
(i.e., governmental institutions) takes primary measurements of
COVID-19 cases and derives some aggregates from those. We
illustrate how to model this scenario using database schemas and
views, and describe the different problems we need to solve.
      </p>
      <p>Example 2.1. The statistical institute of Panem (a fictional
country) generates census reports on the weekly excess of deaths
(assumed attributable to COVID-19) in the country. Since we are
just considering Panem, we can model this information using a
single table with the weekly number of deaths.</p>
      <p>Census(week, deaths)</p>
      <p>We would have greater trust in multiple consistent reports
of the same quantity if they had been obtained independently;
if they all came from a single source, we would be more
skeptical. Therefore, in an epidemiological verification process, we
gather different complementary sources of information providing
surrogates or approximations to the desired measurements.</p>
      <p>Example 2.2. Suppose that Panem, as depicted in Fig. 1,
comprises thirteen districts (I, . . . , XIII). In each district, there are
several hospitals, and a person living in a district is monitored by at most
one hospital. Hospitals report their number of cases of
COVID-19 to their district governments, and each district government
reports to the Ministry of Health (MoH).</p>
      <p>Given their management autonomy, the different districts in
Panem use different and imperfect monitoring mechanisms and
report separately the COVID-19 cases they detect every week.
Despite the data being gathered at health facilities, the MoH only reports
to the Centre for Disease Prevention and Control (CDC) partial
information at the district level and the overall information for
the country. We can model this using relational tables with the
weekly district and country information.</p>
      <p>ReportedDistrict(district, week, cases)
ReportedCountry(week, cases)</p>
      <p>In an idealized setting, we would expect to know all the
relationships and have consistent measurements for each primary
attribute, and each derived result would be computed exactly
with no error. However, some relationships may be unknown
and both primary and derived attributes can be noisy, biased,
unknown or otherwise imperfect.</p>
      <p>Example 2.3. The following view aggregates the district-level
cases for each week, which should coincide with the values per country:</p>
      <sec id="sec-2-1">
        <title>CREATE VIEW AggReported AS</title>
        <p>SELECT week , SUM( c a s e s ) AS c a s e s
FROM R e p o r t e d D i s t r i c t GROUP BY week ;</p>
        <p>Moreover, it is already known that COVID-19 mortality
depends on the age distribution and vaccination status of the
population, but let us assume an average Case-Fatality Ratio (CFR)
of 1.5% which is reasonable for an unvaccinated population. In
terms of SQL, we would have the following view which estimates
the number of deaths based on the number of reported cases in
the country.</p>
      </sec>
      <sec id="sec-2-2">
        <title>CREATE VIEW I n f e r r e d AS</title>
        <p>SELECT week , 0 . 0 1 5 ∗ c a s e s AS d e a t h s
FROM R e p o r t e d C o u n t r y ;</p>
        <p>Example 2.4. Ideally, if all COVID-19 cases were detected, and
we knew the exact CFR as well as the efects of the pandemic in
other causes of death, the week should unambiguously determine
the number of cases and deaths (i.e., information derived from
reported cases, both at district and country levels, and mortality
in the census must coincide). In terms of SQL, these constraints
could be checked using assertions like the following.</p>
      </sec>
      <sec id="sec-2-3">
        <title>CREATE ASSERTION SumOfCases CHECK (NOT EXISTS</title>
        <p>( SELECT ∗ FROM R e p o r t e d C o u n t r y r JOIN AggReported a
ON r . week=a . week WHERE r . c a s e s &lt;&gt;a . c a s e s ) ) ;</p>
      </sec>
      <sec id="sec-2-4">
        <title>CREATE ASSERTION NumberOfDeaths CHECK (NOT EXISTS</title>
        <p>( SELECT ∗ FROM Census c JOIN I n f e r r e d i
ON c . week= i . week WHERE c . d e a t h s &lt;&gt; i . d e a t h s ) ) ;</p>
        <p>Thus, we see that SQL already provides the required
mechanisms to freely align the different sources and impose the
coincidence of values. Nevertheless, as explained above, achieving
exact consistency seems unlikely in any real setting. Indeed,
using existing techniques it is possible to check consistency among
data sources when there is no uncertainty, but it is not
straightforward, in the presence of unknown NULL values or suspected
error in reported values, to determine whether the various data
sources are consistent with the expected relationships.</p>
        <p>Example 2.5. It is easy to see that the following database is not
consistent with our view specification, in part because the cases
of a district (i.e., XIII) are not reported, but also because the second
assertion is violated (i.e., too many people died, 20, compared
to the inferred number based on the cases reported and the CFR,
only 15).</p>
        <p>ReportedDistrict("I", "2110W25", 75)
. . .
ReportedDistrict("XII", "2110W25", 75)
AggReported("2110W25", 900)
ReportedCountry("2110W25", 1000)
Inferred("2110W25", 15)
Census("2110W25", 20)</p>
        <p>Indeed, using existing mechanisms, we can easily detect the
problem (i.e., assertion violations). However, we cannot measure
how far the data are from really being consistent. For example,
the country reporting a thousand cases would violate as many
assertions as if reporting one million, but its degree of inconsistency
with the other sources is completely different.</p>
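        <p>To make this discussion concrete, the following Python sketch is purely
illustrative (the names follow the running example, and the encoding is
hypothetical, not that of an actual implementation). It recomputes
AggReported and Inferred from the data of Example 2.5 and reports which
assertions of Example 2.4 are violated; it detects the violations, but says
nothing about their magnitude:</p>
        <p># Illustrative sketch: consistency checks of Example 2.4 over Example 2.5.
CFR = 0.015
districts = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X", "XI", "XII"]
reported_district = {(d, "2110W25"): 75 for d in districts}   # district XIII missing
reported_country = {"2110W25": 1000}
census = {"2110W25": 20}

# AggReported: sum of district-level cases per week
agg_reported = {}
for (district, week), cases in reported_district.items():
    agg_reported[week] = agg_reported.get(week, 0) + cases

# Inferred: deaths estimated from country-level cases via the CFR
inferred = {week: CFR * cases for week, cases in reported_country.items()}

for week, cases in reported_country.items():
    if agg_reported.get(week) != cases:            # 900 vs. 1000
        print("SumOfCases violated for", week)
    if inferred[week] != census.get(week):         # 15.0 vs. 20
        print("NumberOfDeaths violated for", week)</p>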
    </sec>
    <sec id="sec-3">
      <title>PROBLEM FORMULATION</title>
      <p>We aim at extending DBMS functionalities for accurate
concordancy evaluation in the presence of overlapping sources for
the same numerical data. Given on the one hand the queries
and views specifying the expected behavior (a.k.a. alignment of
sources), and on the other the data corresponding to
observations of some of the inputs, intermediate results, or (expected)
outputs, is the observed numeric data complete and concordant
considering the alignment specification? If there is missing data,
can the existing datasets be extended to some complete instance
that is concordant? Finally, how far from being fully consistent
are the numerical data?</p>
      <p>Given such an idealized scenario (specified by its schema and
views) and a collection of actual observations (both primary and
derived), we can still consider two different problems:
(A) Value estimation: estimate the values of numerical attributes
of interest (e.g., the number of cases and deaths across
Panem) that make the system consistent.
(B) Discord evaluation: Evaluate how far the actual,
discordant dataset is from an idealized concordant one.</p>
      <p>Problem (A) is the well-studied statistical estimation
problem. However, many sources behave as black boxes, and it can
be very difficult to precisely quantify the uncertainty and
underlying assumptions in many situations, especially where the
interrelationships among diferent data sources are complex.
Instead, we consider problem (B). Given a (probably incomplete but
overlapping) set of instances, we assume only a merging process
specification in the form of expectations about their alignment,
expressed using database queries and views. Our goal in this paper
is not to find a realistic estimate of the true values of unknown
or uncertain data, but instead to quantify how close the data are
to our expectations under the given alignment. It is important
to clarify that while the approach we will adopt does produce
estimates for the uncertain values as a side-effect, they are not
guaranteed to have any statistical validity unless additional work
is done to characterize the sources of uncertainty, which we see
as a separate problem.</p>
      <p>Therefore, the key contribution of this paper is that both
checking concordance and measuring discord can be done by
augmenting the data model with symbolic expressions, and this in turn can
be done consistently and efficiently in an RDBMS with the right
set of algebraic operations. Indeed, we define and measure the
degree of discordance of different data sources with
complementary multidimensional information, where uncertainty may arise
from NULLs standing for unknown values, or reported
measurements that have some unknown error. To do so, we need to (1)
define a variant of relational algebra for queries over (sets of)
finite maps represented as symbolic tables, (2) formally define the
concordance and discordance problems, and (3) show that they
can be solved by reduction to linear or quadratic programming,
respectively.</p>
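      <p>As a rough illustration of what augmenting the data model with symbolic
expressions can look like, one possible encoding (a sketch used for the
examples in this paper, not the actual implementation) represents a linear
expression as a constant plus a map from variable names to coefficients:</p>
      <p>from dataclasses import dataclass, field

@dataclass
class Lin:
    """Sketch of a linear symbolic expression: const plus sum of coeffs[v] * v."""
    const: float = 0.0
    coeffs: dict = field(default_factory=dict)

    def __add__(self, other):
        out = dict(self.coeffs)
        for v, c in other.coeffs.items():
            out[v] = out.get(v, 0.0) + c
        return Lin(self.const + other.const, out)

    def scale(self, k):
        return Lin(k * self.const, {v: k * c for v, c in self.coeffs.items()})

# 75 * (1 + eps_1) is the linear expression 75 + 75 * eps_1
district_I = Lin(75.0, {"eps_1": 75.0})
country = Lin(1000.0, {"gamma": 1000.0})</p>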
    </sec>
    <sec id="sec-4">
      <title>PROPOSED SOLUTION BY EXAMPLE</title>
      <p>
        The basic idea in this paper is to represent unknown real values
with variables, which can occur multiple times in a table, or
in different tables, representing the same unknown value, and
more generally unknown values can be represented by symbolic
(linear) expressions in R[X], where X is the set of variables. However,
key values used in key fields are required to be known. This reflects the assumption that
the source database is partially closed [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], that is, we assume the
existence of master data for the keys (i.e., all potential keys are
coincident and known).
      </p>
      <p>Definition 4.1. A symbolic table, or s-table, T : K ⊲ V is a table
(with its name prepended with Ⓢ) in which the key attributes K are
mapped to discrete non-null values and the value attributes V are
mapped to symbolic expressions in R[X].</p>
      <p>Suppose we are given an ordinary database instance, which
may have missing values (i.e., NULLs) and uncertain values (i.e.,
reported values which we do not believe to be exactly correct).
To allow for such situations, we replace values with symbolic
expressions containing variables. This can be done in many ways, with
different justifications based on the application domain. For
example, we can replace an uncertain value v with “v · (1 + ε)” (or
simply ε if v = 0), where ε is an error variable. On the other hand,
to handle NULLs in s-tables we simply replace each NULL with
a distinct variable.</p>
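      <p>Under this error model, turning ordinary values into symbolic
expressions could look as follows (an illustrative sketch only; Lin is the
linear-expression type sketched in Section 3, and the variable-naming
scheme is hypothetical):</p>
      <p># Sketch: v becomes v*(1 + eps), 0 becomes eps, and NULL becomes a fresh variable.
def symbolize(value, name):
    if value is None:                             # NULL: unknown value
        return Lin(0.0, {"x_" + name: 1.0})
    if value == 0:                                # 0*(1+eps) would hide the error term
        return Lin(0.0, {"eps_" + name: 1.0})
    return Lin(float(value), {"eps_" + name: float(value)})   # v + v*eps

cases_I    = symbolize(75, "1")      # 75 + 75*eps_1
cases_XIII = symbolize(None, "13")   # x_13</p>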
      <p>Example 4.2. It is easy to see that there are many possibilities
of assigning cases of COVID-19 to the different districts of Panem
that add up to 1,000 per week, and consequently improve the
consistency of our database, which may be easily represented by
replacing constants by symbolic expressions “75(1 + ε_d)”, where
ε_d is an error parameter representing that cases may be missed
or overreported in every district. The cases for district XIII,
which were not reported at all, could then be simply represented
by a variable x_13. On the other hand, we also know that
attributing all the excess deaths to COVID-19 involves some
imprecision, so we should apply some error term “(1 + δ)” to the
numbers coming from the census, too. Nevertheless, this may
not completely explain the mismatch between cases reported
at the country level and deaths, and there might also be some
doubly-counted or hidden cases in Panem (for example in the
Capitol, which is assumed not to have any cases), which we
represent by the error term “(1 + γ)”. Therefore, s-tables ⓈReportedDistrict :
{district, week} ⊲ {cases}, ⓈReportedCountry : {week} ⊲ {cases}
and ⓈCensus : {week} ⊲ {deaths} would contain:
ⓈReportedDistrict("I", "2110W25", 75 ∗ (1 + ε_1))
. . .
ⓈReportedDistrict("XII", "2110W25", 75 ∗ (1 + ε_12))
ⓈReportedDistrict("XIII", "2110W25", x_13)
ⓈReportedCountry("2110W25", 1000 ∗ (1 + γ))
ⓈCensus("2110W25", 20 ∗ (1 + δ))</p>
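      <p>Continuing the illustrative encoding from Section 3, the s-tables of
Example 4.2 can be pictured as finite maps from key tuples to symbolic
expressions (a sketch only; districts II to XI are analogous to I and XII):</p>
      <p># Sketch: s-tables as finite maps from keys to Lin expressions.
s_reported_district = {
    ("I", "2110W25"):    Lin(75.0, {"eps_1": 75.0}),
    # ... districts II to XI analogously ...
    ("XII", "2110W25"):  Lin(75.0, {"eps_12": 75.0}),
    ("XIII", "2110W25"): Lin(0.0, {"x_13": 1.0}),     # unreported cases: fresh variable
}
s_reported_country = {("2110W25",): Lin(1000.0, {"gamma": 1000.0})}
s_census           = {("2110W25",): Lin(20.0, {"delta": 20.0})}</p>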
      <p>Example 4.3. Given the s-tables in Example 4.2, the SQL views
in Section 2 can be algebraically expressed as:
ⓈAggReported := Γ_{week};{cases}(ⓈReportedDistrict)
ⓈInferred := ext_{deaths := 0.015 ∗ cases}(ⓈReportedCountry)
where Γ_{K};{M} is an aggregation operation that sums the measures M
grouping by the keys K, and ext_{A := f(B)} derives a new value
attribute A as a (linear) function f of the pre-existing ones.</p>
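      <p>The two operations used above can be sketched over the illustrative
finite-map encoding: aggregation sums the symbolic expressions of all rows
that share a (projected) key, and derivation applies a linear function to
every row. This is an illustration of the idea, not the formal definition
of the algebra:</p>
      <p># Sketch of aggregation and derivation over symbolic finite maps.
def aggregate(s_table, key_of):
    """Sum the symbolic values of all rows mapped to the same output key."""
    out = {}
    for key, expr in s_table.items():
        k = key_of(key)
        out[k] = out[k] + expr if k in out else expr
    return out

def derive(s_table, f):
    """Apply a linear function to every symbolic value."""
    return {key: f(expr) for key, expr in s_table.items()}

# AggReported: group ReportedDistrict by week; Inferred: deaths := 0.015 * cases
s_agg_reported = aggregate(s_reported_district, key_of=lambda k: (k[1],))
s_inferred     = derive(s_reported_country, lambda e: e.scale(0.015))</p>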
      <p>We represent the expected relationships between source and
derived data using a generalization of view specifications called
alignment specifications. Alignment specifications may define
derived s-tables as the fusion of multiple views, written ⓈR ⊔ ⓈS.
The fusion operator combines the information from two s-tables,
resulting in an s-table with constraints that ensure that the values
reported for common keys in ⓈR and ⓈS are equal.</p>
      <p>Example 4.4. Given the s-tables in Example 4.2 and queries in
Example 4.3, the SQL assertions in Example 2.4 can be specified
as:</p>
      <p>ⓈSumOfCases := ⓈAggReported ⊔ ⓈReportedCountry
ⓈNumberOfDeaths := ⓈInferred ⊔ ⓈCensus</p>
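      <p>Fusion can likewise be sketched over the illustrative encoding as
recording one equality constraint per key that both s-tables define
(a sketch only; with all twelve district rows present, the two recorded
constraints are exactly the system of equations shown in Example 4.5 below):</p>
      <p># Sketch of fusion: values for common keys must be equal.
def fuse(s, t):
    fused, constraints = {}, []
    for key in set(s) | set(t):
        if key in s and key in t:
            constraints.append((s[key], t[key]))   # symbolic equality s[key] = t[key]
            fused[key] = s[key]
        else:
            fused[key] = s[key] if key in s else t[key]
    return fused, constraints

s_sum_of_cases, c_cases      = fuse(s_agg_reported, s_reported_country)
s_number_of_deaths, c_deaths = fuse(s_inferred, s_census)</p>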
      <p>The discord is, intuitively, the shortest distance between the
actual observed, uncertain data and a hypothetical concordant
database instance that is consistent given the constraints
introduced by the alignment specification. The more distant from any
such concordant instance, the more discordant our data are. Then,
the degree of discordance of our database given an alignment
and according to a distance metric (a.k.a., cost function) equals
the solution to the quadratic programming problem formed by
minimizing the metric subject to the constraints introduced by
coincident instances in different sources found on fusing them
with the ⊔ operation.</p>
      <p>Example 4.5. From the specification in Example 4.4, we get the
constraints represented by the following system of equations:
1000(1 + γ) = 75(1 + ε_1) + · · · + 75(1 + ε_12) + x_13
0.015 · 1000(1 + γ) = 20 · (1 + δ)</p>
      <p>Obviously, even considering only positive values for the
different variables, that system has many solutions. One solution
s_1 consists of taking all ε_i to be zero, γ = −0.1 and δ = −0.325.
This corresponds to assuming there is no error in the twelve
districts’ reports and there are no cases in District XIII. Another
solution s_2 sets ε_1 = . . . = ε_12 = 0 and x_13 = 100; then γ = 0
and δ = −0.25, which corresponds to assuming XIII has all of
the missing cases. Of course, whether s_1 or s_2 (or some other
solution) is more plausible depends strongly on domain-specific
knowledge. Nevertheless, given a cost function assigning a cost
to each solution, we can compare different solutions in terms of
how much correction is needed (or discord exists). For example,
we might consider a cost function that simply takes the sum of
the squares of the variables:</p>
      <p>c_1(ε⃗, x_13, γ, δ) = (∑_{i ∈ {1,...,12}} ε_i²) + x_13² + γ² + δ²</p>
      <p>Using this cost function, s_1 has cost ≈ 0.116 while s_2 has cost
10000.0625, so the first solution is much closer to being concordant,
because a large change to x_13 is not needed. Alternatively, we
might give the unknown number of cases in XIII no weight,
reflecting that we have no knowledge about what it might be,
corresponding to the cost function</p>
      <p>c_2(ε⃗, x_13, γ, δ) = (∑_{i ∈ {1,...,12}} ε_i²) + γ² + δ²</p>
      <p>that assigns the same cost to s_1 but assigns cost 0.0625 to s_2,
indicating that if we are free to assign all unaccounted cases to
XIII then the second solution is closer to concordance.</p>
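      <p>Putting the pieces together, measuring discord for Example 4.5 under c_1
amounts to an equality-constrained quadratic program. The following
standalone Python sketch illustrates the reduction using scipy (it is not
the implementation described in the extended version); since s_1 is
feasible, the reported minimum is at most its cost of about 0.116:</p>
      <p># Standalone sketch: Example 4.5 with cost c_1 as a quadratic program.
# Variables: eps_1..eps_12, x_13, gamma, delta (15 in total).
import numpy as np
from scipy.optimize import minimize

EPS, X13, GAMMA, DELTA = slice(0, 12), 12, 13, 14

def cost_c1(v):
    return np.sum(v[EPS] ** 2) + v[X13] ** 2 + v[GAMMA] ** 2 + v[DELTA] ** 2

constraints = [
    # 75*(1 + eps_1) + ... + 75*(1 + eps_12) + x_13 = 1000*(1 + gamma)
    {"type": "eq",
     "fun": lambda v: 75 * np.sum(1 + v[EPS]) + v[X13] - 1000 * (1 + v[GAMMA])},
    # 0.015 * 1000*(1 + gamma) = 20*(1 + delta)
    {"type": "eq",
     "fun": lambda v: 15 * (1 + v[GAMMA]) - 20 * (1 + v[DELTA])},
]

result = minimize(cost_c1, np.zeros(15), constraints=constraints, method="SLSQP")
print(result.x)    # a least-cost concordant repair of the observations
print(result.fun)  # the degree of discordance under c_1 (at most 0.116)</p>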
      <p>Besides alternatives in the cost function, we could weight
variables considering the reliability of the different districts as
well as the central government, and the historical information of
the census. However, these values depend on knowledge of the
domain and we will leave the exploration of more sophisticated cost
functions to future work.</p>
    </sec>
    <sec id="sec-5">
      <title>RELATED WORK</title>
      <p>
        The problems described above are related to Consistent Query
Answering (CQA) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which tries to identify the subset of a
database that fulfills some integrity constraints, and corresponds
to the problem of identifying certain answers under the open world
assumption [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In CQA, the distance between two database instances
is captured by the symmetric difference of tuples. However, in our
case, the effects of an alignment are not only reflected in the
presence/absence of a tuple, but also in the values it contains.
This leads to the much closer Database Fix Problem (DFP) [
        <xref ref-type="bibr" rid="ref7 ref9">7, 9</xref>
        ],
which aims at determining the existence of a fix at a bounded
distance measuring variations in the numerical values.
      </p>
      <p>
        Both DFP and CQA become undecidable in the
presence of aggregation constraints. Nonetheless, these have been
used to drive deduplication [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. However, our case is different
since we are not questioning correspondences between entities
to create aggregation groups, but instead trying to quantify their
(in)consistency in the presence of complex transformations.
      </p>
      <p>
        Another known result in the area of DFP is that prioritizing
the repairs by considering preferences or priorities (like the data
sources in our case) just increases complexity. An already
explored idea is the use of where-provenance in the justification
of the new value [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], but with pure direct value imputation
(without any data transformation). In contrast, we consider that
there is no master data, but multiple contradictory sources,
and we allow aggregates, while [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] only uses pure equalities
(neither aggregation nor any real arithmetic) between master
and target DBs.
      </p>
      <p>
        From another perspective, our work is related to
incompleteness in multidimensional databases, which has been typically
focused on the problems generated by imprecision in
hierarchical information [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Only more recently has attention
shifted to missing values in the measures. Bimonte et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
present a linear programming-based framework that imputes
missing values under some constraints generated by sibling data
at the same aggregation level, as well as parent data in higher
levels. We could consider this a special case of our approach,
where there is a single data source and alignment is predefined.
      </p>
      <p>
        The setting we have described shares many motivations
with previous work on provenance. The semiring
provenance model [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] is particularly related, explaining why
why-provenance [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is not enough (e.g., in the case of alternative
sources for the same data) and we need how-provenance to really
understand how different inputs contribute to the result. They
propose the use of polynomials to capture this kind of
provenance. Further, Amsterdamer et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] extended the semiring
provenance model to aggregations by mixing together
annotations and values, but the fine-grained provenance information
may become prohibitively large. However, to the best of our
knowledge no practical implementations exist. As noted earlier,
our s-tables are similar in some respects to c-tables studied in
incomplete information databases [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Our data model and queries
are more restricted in some ways, due to the restriction to finite
maps and the fact that we do not allow for conditions affecting
the presence of entire rows, but our approach supports
aggregation, which is critical for our application area and which was not
handled in the original work on c-tables.
      </p>
      <p>
        There have been implementations of semiring provenance or
c-tables in systems such as Orchestra [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], ProQL [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], ProvSQL [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ],
and Mimir [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. In Orchestra, provenance
annotations were used for update propagation in a distributed data
integration setting. ProQL and ProvSQL implement the
semiring model but do not allow for symbolic expressions in data or
support aggregation. Mimir is a system for querying uncertain
and probabilistic data based on c-tables; however, in Mimir
symbolic expressions and conditions are not actually materialized
as results; instead, the system fills in their values with guesses
in order to make queries executable on standard RDBMSs. Thus,
Mimir’s approach to c-tables would not suffice for our needs since
we need to generate the symbolic constraints for the QP solver
to solve. On the other hand, our work shows how some of the
symbolic computation involved in c-tables can be implemented
in-database.
      </p>
      <p>
        We have reduced the concordancy evaluation problem to
quadratic programming, a well-studied optimization problem.
Solvers such as OSQP [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] can handle systems with
thousands of equations and variables. However, we have not
made full use of the power of linear/quadratic programming. For
example, we could impose additional linear inequalities on
unknowns to require that certain error or null values be
positive or within some range. Likewise, we have defined the
cost function in one specific way but quadratic programming
permits many other cost functions to be defined, for example
with different weights for each variable or with additional linear
cost factors.
      </p>
      <p>
        As noted in Section 2, we have focused on the problem of
evaluating concord/discord among data sources and not on using
the diferent data sources to estimate the actual values being
measured. It would be interesting to extend our framework by
augmenting symbolic tables and queries with a probabilistic
interpretation, so that the optimal solution found by quadratic
programming produces statistically meaningful consensus
values (similarly to the work of Mayfield et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]).
      </p>
    </sec>
    <sec id="sec-6">
      <title>CONCLUSIONS</title>
      <p>In many real settings, such as epidemiological surveillance, ground
truth is not known or knowable and we still need to integrate
discordant data sources with different levels of trustworthiness,
completeness and self-consistency. In this setting without any
master data, we would still like to be able to measure how close
the observed data are to our idealized expectations. Thus, we
proposed definitions of concordance and discordance capturing,
respectively, when data sources we wish to fuse are compatible
with one another, and measuring how far away the observed
data are from being concordant. Consequently, we can compare
discordance measurements over time to understand whether the
different sources are becoming more or less consistent with one
another.</p>
      <p>Our approach to symbolic evaluation of multidimensional
queries appears to have further applications which we plan to
explore next, such as supporting other forms of uncertainty
expressible as linear constraints, and adapting our approach to produce
statistically meaningful estimates of the consensus values.</p>
    </sec>
    <sec id="sec-7">
      <title>ACKNOWLEDGMENTS</title>
      <p>The work of Alberto Abelló has been done under project
PID2020-117191RB-I00 funded by MCIN/AEI/10.13039/501100011033. The
work of James Cheney was supported by ERC Consolidator Grant
Skye (grant number 682315).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Alberto</given-names>
            <surname>Abelló</surname>
          </string-name>
          and
          <string-name>
            <given-names>James</given-names>
            <surname>Cheney</surname>
          </string-name>
          .
          <year>2022</year>
          .
          <article-title>Eris: Measuring discord among multidimensional data sources (extended version)</article-title>
          .
          <source>arXiv:2201</source>
          .13302 [cs.DB]
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Alberto</given-names>
            <surname>Abelló</surname>
          </string-name>
          and
          <string-name>
            <given-names>Oscar</given-names>
            <surname>Romero</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Online Analytical Processing</article-title>
          .
          <source>In Encyclopedia of Database Systems</source>
          , Second Edition, Ling Liu and
          <string-name>
            <given-names>M.</given-names>
            <surname>Tamer Özsu</surname>
          </string-name>
          (Eds.). Springer. https://doi.org/10.1007/978-1-
          <fpage>4614</fpage>
          -8265-9_
          <fpage>252</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Serge</given-names>
            <surname>Abiteboul</surname>
          </string-name>
          , Richard Hull, and
          <string-name>
            <given-names>Victor</given-names>
            <surname>Vianu</surname>
          </string-name>
          .
          <year>1995</year>
          . Foundations of Databases. Addison-Wesley. http://webdam.inria.fr/Alice
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Yael</given-names>
            <surname>Amsterdamer</surname>
          </string-name>
          , Daniel Deutch, and
          <string-name>
            <given-names>Val</given-names>
            <surname>Tannen</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Provenance for aggregate queries</article-title>
          .
          <source>In ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS)</source>
          .
          <source>ACM</source>
          ,
          <volume>153</volume>
          -
          <fpage>164</fpage>
          . https://doi.org/10.1145/1989284. 1989302
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Franz</given-names>
            <surname>Baader</surname>
          </string-name>
          , Diego Calvanese, Deborah L.
          <string-name>
            <surname>McGuinness</surname>
          </string-name>
          ,
          <string-name>
            <surname>Daniele Nardi</surname>
          </string-name>
          , and
          <string-name>
            <surname>Peter F. Patel-Schneider</surname>
          </string-name>
          (Eds.).
          <year>2003</year>
          .
          <article-title>The Description Logic Handbook: Theory, Implementation, and Applications</article-title>
          . Cambridge University Press.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Eftychia</given-names>
            <surname>Baikousi</surname>
          </string-name>
          , Georgios Rogkakos, and
          <string-name>
            <given-names>Panos</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Similarity measures for multidimensional data</article-title>
          .
          <source>In Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16</source>
          ,
          <year>2011</year>
          , Hannover, Germany, Serge Abiteboul, Klemens Böhm, Christoph Koch, and
          <string-name>
            <surname>Kian-Lee Tan</surname>
          </string-name>
          (Eds.).
          <source>IEEE Computer Society</source>
          ,
          <fpage>171</fpage>
          -
          <lpage>182</lpage>
          . https://doi.org/10.1109/ICDE.
          <year>2011</year>
          . 5767869
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Leopoldo</surname>
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Bertossi</surname>
            , Loreto Bravo, Enrico Franconi, and
            <given-names>Andrei</given-names>
          </string-name>
          <string-name>
            <surname>Lopatenko</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Complexity and Approximation of Fixing Numerical Attributes in Databases Under Integrity Constraints</article-title>
          .
          <source>In 10th International Symposium on Database Programming Languages (DBPL) (LNCS</source>
          , Vol.
          <volume>3774</volume>
          ). Springer,
          <fpage>262</fpage>
          -
          <lpage>278</lpage>
          . https://doi.org/10.1007/11601524_
          <fpage>17</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Sandro</given-names>
            <surname>Bimonte</surname>
          </string-name>
          , Libo Ren, and
          <string-name>
            <given-names>Nestor</given-names>
            <surname>Koueya</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>A linear programming-based framework for handling missing data in multi-granular data warehouses</article-title>
          .
          <source>Data Knowl. Eng</source>
          .
          <volume>128</volume>
          (
          <year>2020</year>
          ),
          <volume>101832</volume>
          . https://doi.org/10.1016/j.datak.
          <year>2020</year>
          . 101832
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Philip</given-names>
            <surname>Bohannon</surname>
          </string-name>
          , Michael Flaster, Wenfei Fan, and
          <string-name>
            <given-names>Rajeev</given-names>
            <surname>Rastogi</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification</article-title>
          .
          <source>In ACM SIGMOD International Conference on Management of Data. ACM</source>
          ,
          <volume>143</volume>
          -
          <fpage>154</fpage>
          . https://doi.org/10.1145/1066157.1066175
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Buneman</surname>
          </string-name>
          , Sanjeev Khanna, and
          <string-name>
            <surname>Wang-Chiew Tan</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Why and Where: A Characterization of Data Provenance</article-title>
          .
          <source>In 8th International Conference on Database Theory (ICDT) (LNCS</source>
          , Vol.
          <year>1973</year>
          ), Jan Van den Bussche and Victor Vianu (Eds.). Springer,
          <fpage>316</fpage>
          -
          <lpage>330</lpage>
          . https://doi.org/10.1007/3-540-44503-X_
          <fpage>20</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Surajit</surname>
            <given-names>Chaudhuri</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anish Das</surname>
            <given-names>Sarma</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Venkatesh</given-names>
            <surname>Ganti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Raghav</given-names>
            <surname>Kaushik</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Leveraging aggregate constraints for deduplication</article-title>
          .
          <source>In ACM SIGMOD International Conference on Management of Data. ACM</source>
          ,
          <volume>437</volume>
          -
          <fpage>448</fpage>
          . https: //doi.org/10.1145/1247480.1247530
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Jan</given-names>
            <surname>Chomicki</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Consistent Query Answering: Five Easy Pieces</article-title>
          .
          <source>In 11th International Conference on Database Theory (ICDT) (LNCS</source>
          , Vol.
          <volume>4353</volume>
          ). Springer,
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          . https://doi.org/10.1007/11965893_
          <fpage>1</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Curtis</surname>
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Dyreson</surname>
          </string-name>
          , Torben Bach Pedersen, and
          <string-name>
            <surname>Christian</surname>
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Jensen</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Incomplete Information in Multidimensional Databases</article-title>
          . In Multidimensional Databases: Problems and Solutions, Maurizio Rafanelli (Ed.). Idea Group,
          <fpage>282</fpage>
          -
          <lpage>309</lpage>
          . https://doi.org/10.4018/978-1-
          <fpage>59140</fpage>
          -053-0.
          <fpage>ch010</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Wenfei</given-names>
            <surname>Fan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Floris</given-names>
            <surname>Geerts</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Relative information completeness</article-title>
          .
          <source>In Proceedings of the Twenty-Eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS)</source>
          .
          <source>ACM</source>
          ,
          <volume>97</volume>
          -
          <fpage>106</fpage>
          . https://doi.org/10. 1145/1559795.1559811
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Floris</surname>
            <given-names>Geerts</given-names>
          </string-name>
          , Giansalvatore Mecca, Paolo Papotti, and
          <string-name>
            <given-names>Donatello</given-names>
            <surname>Santoro</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>The LLUNATIC Data-Cleaning Framework</article-title>
          .
          <source>PVLDB 6</source>
          ,
          <issue>9</issue>
          (
          <year>2013</year>
          ),
          <fpage>625</fpage>
          -
          <lpage>636</lpage>
          . https://doi.org/10.14778/2536360.2536363
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Matteo</given-names>
            <surname>Golfarelli</surname>
          </string-name>
          and
          <string-name>
            <given-names>Elisa</given-names>
            <surname>Turricchia</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>A characterization of hierarchical computable distance functions for data warehouse systems</article-title>
          .
          <source>Decis. Support Syst</source>
          .
          <volume>62</volume>
          (
          <year>2014</year>
          ),
          <fpage>144</fpage>
          -
          <lpage>157</lpage>
          . https://doi.org/10.1016/j.dss.
          <year>2014</year>
          .
          <volume>03</volume>
          .011
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Todd</surname>
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Green</surname>
            ,
            <given-names>Gregory</given-names>
          </string-name>
          <string-name>
            <surname>Karvounarakis</surname>
            , and
            <given-names>Val</given-names>
          </string-name>
          <string-name>
            <surname>Tannen</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Provenance semirings</article-title>
          .
          <source>In ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS)</source>
          .
          <source>ACM</source>
          ,
          <volume>31</volume>
          -
          <fpage>40</fpage>
          . https://doi.org/10.1145/1265530.1265535
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Tomasz</given-names>
            <surname>Imielinski and Witold Lipski</surname>
          </string-name>
          Jr.
          <year>1984</year>
          .
          <article-title>Incomplete Information in Relational Databases</article-title>
          .
          <source>J. ACM</source>
          <volume>31</volume>
          ,
          <issue>4</issue>
          (
          <year>1984</year>
          ),
          <fpage>761</fpage>
          -
          <lpage>791</lpage>
          . https://doi.org/10.1145/ 1634.1886
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Zachary</surname>
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Ives</surname>
          </string-name>
          , Todd J.
          <string-name>
            <surname>Green</surname>
          </string-name>
          , Grigoris Karvounarakis, Nicholas E. Taylor, Val Tannen, Partha Pratim Talukdar, Marie Jacob, and
          <string-name>
            <surname>Fernando</surname>
            <given-names>C. N.</given-names>
          </string-name>
          <string-name>
            <surname>Pereira</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>The ORCHESTRA Collaborative Data Sharing System</article-title>
          .
          <source>SIGMOD Rec</source>
          .
          <volume>37</volume>
          ,
          <issue>3</issue>
          (
          <year>2008</year>
          ),
          <fpage>26</fpage>
          -
          <lpage>32</lpage>
          . https://doi.org/10.1145/1462571.1462577
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Grigoris</surname>
            <given-names>Karvounarakis</given-names>
          </string-name>
          , Zachary G. Ives, and
          <string-name>
            <given-names>Val</given-names>
            <surname>Tannen</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Querying data provenance</article-title>
          .
          <source>In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD</source>
          <year>2010</year>
          , Indianapolis, Indiana, USA, June 6-10,
          <year>2010</year>
          ,
          <string-name>
            <surname>Ahmed</surname>
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Elmagarmid</surname>
          </string-name>
          and Divyakant Agrawal (Eds.). ACM,
          <volume>951</volume>
          -
          <fpage>962</fpage>
          . https://doi.org/10.1145/1807167.1807269
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Chris</surname>
            <given-names>Mayfield</given-names>
          </string-name>
          , Jennifer Neville, and
          <string-name>
            <given-names>Sunil</given-names>
            <surname>Prabhakar</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>ERACER: a database approach for statistical inference and data cleaning</article-title>
          .
          <source>In ACM SIGMOD International Conference on Management of Data. ACM</source>
          ,
          <volume>75</volume>
          -
          <fpage>86</fpage>
          . https://doi. org/10.1145/1807167.1807178
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22] Robinson Meyer and Alexis C. Madrigal.
          <year>2021</year>
          .
          <article-title>Why the Pandemic Experts Failed</article-title>
          . https://www.theatlantic.com/science/archive/2021/03/ americas-coronavirus
          <article-title>-catastrophe-began-with-data/618287</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Arindam</surname>
            <given-names>Nandi</given-names>
          </string-name>
          , Ying Yang, Oliver Kennedy, Boris Glavic, Ronny Fehling, Zhen Hua Liu, and
          <string-name>
            <given-names>Dieter</given-names>
            <surname>Gawlick</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Mimir: Bringing CTables into Practice</article-title>
          .
          <source>CoRR abs/1601</source>
          .00073 (
          <year>2016</year>
          ). arXiv:
          <volume>1601</volume>
          .00073 http://arxiv.org/abs/1601.00073
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Lamia</surname>
            <given-names>Oukid</given-names>
          </string-name>
          , Omar Boussaid, Nadjia Benblidia, and
          <string-name>
            <given-names>Fadila</given-names>
            <surname>Bentayeb</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>TLabel: A New OLAP Aggregation Operator in Text Cubes</article-title>
          .
          <source>Int. J. Data Warehousing and Mining</source>
          <volume>12</volume>
          ,
          <issue>4</issue>
          (
          <year>2016</year>
          ),
          <fpage>54</fpage>
          -
          <lpage>74</lpage>
          . https://doi.org/10.4018/IJDWM. 2016100103
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Pierre</surname>
            <given-names>Senellart</given-names>
          </string-name>
          , Louis Jachiet, Silviu Maniu, and
          <string-name>
            <given-names>Yann</given-names>
            <surname>Ramusat</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>ProvSQL: Provenance and Probability Management in PostgreSQL</article-title>
          .
          <source>PVLDB 11</source>
          ,
          <issue>12</issue>
          (
          <year>2018</year>
          ),
          <fpage>2034</fpage>
          -
          <lpage>2037</lpage>
          . https://doi.org/10.14778/3229863.3236253
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Bartolomeo</surname>
            <given-names>Stellato</given-names>
          </string-name>
          , Goran Banjac, Paul Goulart, Alberto Bemporad, and
          <string-name>
            <given-names>Stephen P.</given-names>
            <surname>Boyd</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>OSQP: an operator splitting solver for quadratic programs</article-title>
          .
          <source>Mathematical Programming Computation</source>
          <volume>12</volume>
          ,
          <issue>4</issue>
          (
          <year>2020</year>
          ),
          <fpage>637</fpage>
          -
          <lpage>672</lpage>
          . https://doi.org/10.1007/s12532-020-00179-2
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Minghe</surname>
            <given-names>Yu</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Guoliang</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dong</given-names>
            <surname>Deng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jianhua</given-names>
            <surname>Feng</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>String similarity search and join: a survey</article-title>
          .
          <source>Frontiers Comput. Sci. 10</source>
          ,
          <issue>3</issue>
          (
          <year>2016</year>
          ),
          <fpage>399</fpage>
          -
          <lpage>417</lpage>
          . https: //doi.org/10.1007/s11704-015-5900-5
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>