Data fusion with source authority
                     and multiple truth
                               (Discussion Paper)
                Fabio Azzalini, Davide Piantella and Letizia Tanca
                Politecnico di Milano, Italy {name.surname}@polimi.it
        Abstract. The abundance of data available on the Web makes more and
        more probable the case of finding that different sources contain (partially
        or completely) different values for the same item. Data Fusion is the rel-
        evant problem of discovering the true values of a data item when two
        entities representing it have been found and their values are different.
        Recent studies have shown that when, for finding the true value of an
        object, we rely only on majority voting, results may be wrong for up to
        30% of the data items, since false values are spread very easily because
        data sources frequently copy from one another. Therefore, the problem
        must be solved by assessing the quality of the sources and giving more
        importance to the values coming from trusted sources. State-of-the-art
        Data Fusion systems define source trustworthiness on the basis of the
        accuracy of the provided values and on the dependence on other sources.
        In this paper we propose an improved algorithm for Data Fusion, that
        extends existing methods based on accuracy and correlation between
        sources by taking into account also source authority, defined on the basis
        of the knowledge of which sources copy from which ones. Our method
        has been designed to work well also in the multi-truth case, that is, when
        a data item can also have multiple true values. Preliminary experimen-
        tal results on a multi-truth real-world dataset show that our algorithm
        outperforms previous state-of-the-art approaches.

1     Introduction
The massive use of user-generated content, the Internet of Things and the ten-
dency to transform every real-world interaction into digital data have lead to
the problem of how to make sense of the huge mass of data available nowadays.
In this context, not only a source can store a previously unimaginable amount
of data, but also the number of sources that can provide information relevant
for a query increases dramatically, even in very specific contexts.
    With all these conflicting data available on the web, discovering their true
values is of primary importance. The solution of this problem is Data Fusion,
where the true value of each data item is decided. Redundancy per se is not
enough, since it has been shown in [3] that, if we rely only on majority vote, we
could get wrong results even in 30% of the times. In order to get more accurate
results we propose a Bayesian approach able to evaluate source quality.
    Data fusion algorithms can be divided into two sub-classes: single-truth and
multi-truth, the latter denoting the case when a data item may have multiple
    Copyright c 2019 for the individual papers by the papers authors. Copying permit-
    ted for private and academic purposes. This volume is published and copyrighted by
    its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy.
true values. Such scenarios are common in everyday life, where many actors can
play in a movie or a book can have many authors, like Alice’s book “Foundations
of Databases”[10] by Serge Abiteboul, Rick Hull and Victor Vianu. We decided
to design our model to work also in the multi-truth case.
    Currently, many single-truth data fusion algorithms exist in literature, and
a few of them exploit Bayesian inference to estimate the veracity of each value
and the trustworthiness of sources. TruthFinder [8] applies Bayesian analysis
to compute the probability of a value being true, conditioned to the observation
of values provided by the sources. Accu[4] applies a Bayesian iterative approach
to compute the veracity of values, assuming uniform distribution of false values
for each data item and source independence. These two assumptions have been
relaxed by PopAccu[9] and AccuCopy[1] respectively.
    Less attention has been devoted to studying the problem of multi-truth find-
ing: to our knowledge, only three algorithms try to solve it. MBM[6] approaches
multi-truth data fusion with a model that focuses on mappings and relations
between sources and sets of provided values, introducing also a copy-detection
phase to discover dependencies between sources. DART[5] computes, for each
source, a domain expertise score relative to the domains of input data. This
score is used in a Bayesian inference process to model source trustworthiness
and value confidence; sources are assumed to be independent. LTM[7] exploits
probabilistic graphical models to find all the true values claimed for each data
item.
    State of the art systems on Data Fusion define source trustworthiness based
on the accuracy of the provided values and on the dependence on other sources.
In this paper we propose an improved algorithm for Data Fusion. Our method
extends existing methods based on accuracy and correlation between sources
taking also into consideration the authority of sources. Authoritative sources
are defined as the ones that have been copied by many sources: the key idea
is that, when source administrators decide to copy data, they will choose the
sources that they perceive as most trustworthy.
    To summarize, in this paper we make the following contributions:
  – We present a new formula for domain-aware copy detection with the goal
    of determine the probability that source Si copies, from source Sj , data
    items belonging to a specific domain. Our copy detection process exploits the
    domain expertise of the sources and can also assign different probabilities to
    the two directions of copying.
  – An urgent need of the truth discovery process is to determine what sources
    we can trust. We present a fully unsupervised algorithm that can assign an
    authority score to each source for each domain. This process is based on the
    natural habit of choosing, to copy a missing value, the source that provides
    the correct value with the highest probability - in other words, the most
    authoritative one.
  – We present an improved algorithm for assessing values’ veracity in a multi-
    truth discovery process, exploiting source authority in copy detection, posi-
    tively rewarding sources according to their authority.
In Section 2 we present preliminary information, Section 3 provides the details
about our approach and in Section 4 we show the experimental results.


2    Background and Preliminaries
We now present more in details two methods that have been of great importance
for our work.

DART. This algorithm exploits an iterative domain-aware Bayesian approach to
do multi-truth discovery over a dataset composed starting from different sources.
Its key intuition is that, in general, a source may have different quality of data for
different domains. For each source, they define the domain expertise score edi (s),
measuring the source’s experience in a given domain, and assign a confidence
cos (v) to each value v provided by a source s, reflecting how much s is convinced
that the value v is (part of) the correct value(s) for object o.
      The veracity σo (v) of value v for object o is the probability that v is a true
value of o, which is better estimated at each iteration of the discovery process.
The goal of the DART algorithm is to evaluate the probability that a value v is true
given the observation of the claimed data ψ(o) (i.e. P (v|ψ(o))). Being P (ψ(o)|v)
and P (ψ(o)|v̄) the probabilities of having the observation ψ(o) when v is true or
false respectively, Bayesian inference can be used to express P (v|ψ(o)) as shown
below:
                  P (ψ(o)|v)P (v)                   P (ψ(o)|v) σo (v)
     P (v|ψ(o)) =                  =
                      P (ψ(o))        P (ψ(o)|v) σo (v) + P (ψ(o)|v̄)(1 − σ0 (v))
                                                                                   (1)
      Our main criticism to DART is the assumption that sources are independent,
which is a clear oversimplification of the real world. We will explain how we have
relaxed this assumption in the following section.

MBM is a Bayesian algorithm for multi-truth finding that takes into consideration
also the problem of source dependence. It computes, for each source and set of
values, an independence score based on the values provided by all the sources.
The independence score is then used to discredit, in the voting phase, sources
that don’t provide their values independently.
    Our criticisms to MBM are the assumption that there is no mutual copying
between sources in the whole dataset and the fact that the algorithm is not able
to distinguish the direction of copying. In the following section we will describe
how we have relaxed these assumptions.
    Table 1 describes the notation that will be used in the following sections.


3    Methodology
We now present ADAM (Authority Domain Aware Multi-truth data fusion), a
method based on Bayesian inference and source authority that iteratively refines
the probability that a provided value for a data item is true.
3.1     Copy detection               4      Fabio Azzalini, Davide Piantella and Letizia Tanca

Starting from [6], we have de-              Notation Description
                                               O(s)    Set of all objects provided by source s
vised a domain-aware copy                     Od (s) Set of objects in domain d provided by source s
detection algorithm to assign                 Vs (o) Set of all values claimed for object o by source s
                                              Vs (o) Set of all values claimed for object o by sources 6= s
different probabilities to the                Sod (v) Sources that provide value v for object o in domain d
two directions of copy. This                  Sod (v̄) Sources that don’t provide value v for object o in domain d
                                               ed (s)  Expertise of source s in domain d
model works at domain gran-                      o (v) Veracity of value v for object o
ularity, therefore it can more               ⌧drec (s) Recall of source s in domain d
                                              ⌧dsp (s) Specificity of source s in domain d
accurately approximate the                     cos (v) Confidence score of value v of object o related to source s
real world behaviour of corre-               si ! sj Source i is copying at object level from source j
                                             si ? sj Sources i and j are independent at object level
lated copying [2].                              d
                                              si ! sj Sources i is copying from source j for domain d
                                                   d
                                                 ⇥ij     Set of common objects in domain d between sources i and j
Scope. Given an object o and                cij (o) =: c Values provided by both sources i and j for object o
                                               o
                                               ij =: c Observation of c
two sources si and sj , we de-                    (o)    Observation of the values provided by object o
         o
note by ψij the observation of                 Ad (s) Authority of source s in domain d
                                                                        Table 1. Notation
the common values cij (o) for
                        d
a common object o ∈ Θij   in domain d provided by two source si and sj .
                                     3   Methodology
Assumptions. In our copy detection       algorithm we assume that there is no mutual
                             We now present ADAM (Authority Domain Aware Multi-truth data fusion), a
copying at domain level, i.e.,method
                               if source    s1Bayesian
                                     based on   copiesinference
                                                         from source      s2 authority
                                                                 and scource  regarding   that domain
                                                                                               iteratively refine
¯ then s2 can copy from sthe
d,                               probability that a provided value for a data item ˜is true.¯
                             1 only values for objects in domains d 6= d; we also
assume that two sources can  3.1only  bedetection
                                   Copy    either independent or copiers.
                                     Starting from [6], we have derived a domain aware copy detection algorithm
Object copying. For each pair     ofofsources
                            capable                i, j, after
                                       assigning di↵erent             we have
                                                                probabilities      defined
                                                                              to the          the truth
                                                                                     two directions of copy. This
probability of the group of model
                            values works at domain granularity, therefore it can more accurately approximate
                                      in  c  as   the    probability        that
                            the real world behaviour of correlated copying [2].
                                                                                   all the   values     are
correct (Eq. 2), we can compute the likelihood of ψc in different cases of source o
                            Scope. Given an object o and two sources si and sj , we denote by ij the
dependence and truthfulness    of c. Similarly to [6], we state that if si haso 2copied
                            observation of the common values cij (o) for a common object             d
                                                                                                    ⇥ij in domain
from sj , or the other way round,    then
                            d provided by twothey
                                               sourceprovide
                                                         si and sjthe
                                                                    .     same common values c,
no matter the veracity of c Assumptions.
                             (Eq. 3). In our copy detection algorithm we assume that there is no mutual
                            copying at domain Y level, i.e., if source s1 copies from source s2 regarding domain
                            ¯ then s2 can copy from s1 only values for objects in domains d˜ 6= d;       ¯ we also
                            d,     σ(c) =          σ(v)                                                 (2)
                                 assume that two sources can only be either independent or copiers.
                                                 v∈c
                (                Object copying. For each pair of sources i, j, after we have defined the truth
                 P (ψc |si → sj ,probability
                                   c true) of=the  group of values in c as the probability that all the values are
                                                P (ψ    |s → si , c true)
                                 correct (Equation 2), cwe jcan compute
                                                                                 =1
                                                                        the likelihood of c in di↵erent (3)cases of
                 P (ψc |si → sj ,source
                                   c false)  = P (ψ
                                        dependence  andc |s j → si , of
                                                         truthfulness c c.false)  = to
                                                                            Similarly 1 [6], we state that if si has

Eq.s 4 and 5 define the probabilities that both sources provide the same group
of values c independently of each other, in the two cases that c is true and false.

         P (ψc |s1 ⊥s2 , c true) = τ rec (s1 ) τ rec (s2 ) [1 − τ sp (s1 )] [1 − τ sp (s2 )]                  (4)

         P (ψc |s1 ⊥s2 , c false) = τ sp (s1 ) τ sp (s2 ) [1 − τ rec (s1 )] [1 − τ rec (s2 )]                 (5)


Bayesian model. If we apply a Bayesian inference approach we can now compute
the probability of two sources being dependent or independent, and in the first
case we can also define which of the two is the copier.
With Y = {si → sj ; sj → si ; si ⊥sj } we define the three possible outcomes.
                      P (ψc |y) P (y)
 P (y|ψc ) = P                   0      0
                  y 0 ∈Y P (ψc |y ) P (y )
                                                                                              (6)
           P (y) [P (ψc |y, c true) σ (c) + P (ψc |y, c false) (1 − σ (c))]
  =P              0           0                          0
        y ∈Y P (y ) [P (ψc |y , c true) σ (c) + P (ψc |y , c false) (1 − σ (c))]
         0


We now have to find a way to estimate the prior probability of the Bayesian
model: P (si → sj ), P (sj → si ) and P (sj ⊥si ), that are all the different con-
figurations of object copying between sources si and sj . We define this as the
probability of the two sources being independent or copiers in the domain of the
object we are considering, defined in Eq. 11. For ease of notation we apply the
                                                                 d
following definition, recalling that d is the same domain of Θij   3 o where o is
the object of ψc that we are analyzing.
                          P (si → sj ) =: ρdij
                          
                             P (sj → si ) =: ρdji                              (7)
                                                  d    d
                          
                             P (sj ⊥si )  = 1 − ρij − ρji
                          

and replace Eq.s 7, 2, 3, 4 and 5 into Eq. 6, with the following result:
                                                 ρdij
               P (si → sj |ψc ) = d                                                           (8)
                                   ρij + ρdji + 1 − ρdij − ρdji Pu
                                                               

where
  Pu := σ(c) [τ rec (si ) · τ rec (sj ) · (1 − τ sp (si )) · (1 − τ sp (sj ))] +
                                                                                              (9)
             + (1 − σ(v)) [τ sp (si ) · τ sp (sj ) · (1 − τ rec (si )) · (1 − τ rec (sj ))]
Non-shared values. With Eq. 8 we have expressed the probability that a source
si has copied from another source sj their common values c for object o. We
now have to take into consideration other possible non-in-common values to
opportunely compute the probability that c were really copied. We have chosen
to scale the copy probability by the Jaccard similarity of the two sets of values
of o claimed by the two sources si and sj , as shown in Eq. 10.
                                          Vsi (o) ∩ Vsj (o)
                      Jij (o) = Jji (o) =                                    (10)
                                          Vsi (o) ∪ Vsj (o)
Domain-level copying. We can use the concept of copying an object o to define
the act of copying with respect to a domain d as defined in Eq. 11.
                                   P
                                       d P (si → sj |ψc ) · Jij (o)
                                      o∈Θij
                     d      d
               P si −
                    → sj Θij :=                   d
                                                                        (11)
                                                 Θij
Initialization. Since in the initialization phase we have no prior knowledge of
ρdij , we decided to exploit the fact that sources with high expertise in domain
d are less likely to be copiers for domain d and that sources with low expertise
in d tend to copy from sources with higher expertise in d. These ideas can be
summarized in the initialization    expressed in Eq. 12.
                ρdij = 1 − ed (si ) ed (sj )
                      
                                                ∀si , sj ∈ S ∧ si 6= sj     (12)
3.2 Source authority
The key idea to define the authority of a source in a specific domain with respect
to the outcomes of the copy detection process is that, if many sources copy some
values from the same source sa , it is because sa is considered authoritative and
more trustworthy. For each source sj ∈ S, we define Cd (sj ) in Eq. 13 as the
set of all the sources that copy from source sj with probability above a given
threshold Γ :                  n
                                              d       d
                                                              o
                    Cd (sj ) := si ∈ S P (si −
                                             → sj |Θij  )>Γ                   (13)
Qualitatively, the unadjusted authority score of source s in domain d is how much
source s is copied in d w.r.t. how much all sources are copied in d (Eq. 14).
                                  P                     d      d
                                     si ∈Cd (sj ) P (si −
                                                        → sj |Θij )
                   ad (sj ) := P                            d
                                                                              (14)
                                                                    d )
                                      P
                                sk ∈S    sl ∈Cd (sk ) P (sl −
                                                            → sk |Θkl
Note that in general the cardinality of S (i.e. the number of sources) is high and
the parameter Γ should not be set too close to 1 to better exploit the variety of
outcomes of the copy detection process. This configuration leads to ad (s)  1.
We can accordingly apply a linear conversion to ad (s) in order to map it on
the interval [0; 1]. We denote this new score as Ad (s) or authority of source s in
domain d, computed as:
                                              ad (s) − amin
                                                        d
                                 Ad (s) := max                                          (15)
                                              ad − amin d
3.3 Veracity
We have extended the DART Bayesian inference model in order to exploit the au-
thority score of each source. Our key idea is to positively reward sources accord-
ing to their authority, which can be achieved with Eq.s 16 and 17, respectively.
                                                                          e (s)c (v)+Ad (s)
                                                            (1 − τdsp (s)) d s
                   Y                                   Y
  P (ψ(o)|v) =          τdrec (s)ed (s)cs (v)+Ad (s)
                  s∈Sod (v)                                   s∈Sod (v̄)
                                                                                                           (16)
                                                                                           ed (s)cs (v)+Ad (s)
                               τdsp (s)ed (s)cs (v)+Ad (s)
                    Y                                          Y
  P (ψ(o)|v̄) =                                                          (1 − τdrec (s))
                  s∈Sod (v̄)                                 s∈Sod (v)
                                                                                           (17)
    In a multi-truth context, precision cannot be the only metrics for source
trustworthiness [7], but we should use recall and specificity: source recall is the
probability that true values are claimed as true (Eq. 18), while source specificity
is the probability that false values are claimed as false (Eq. 19).
                                                                                     (1 − σo (v 0 ))
             P            P                                 P          P
                o∈O d (s)   v∈V    (o) σo (v)      sp        o∈O d (s)   v 0 ∈Vs (o)
    rec
   τd (s) =       P              s               τd   (s) =
                     o∈O d (s) |Vs (o)|
                                                                   P
                                                                       o∈O d (s) Vs (o)
                                            (18)                                            (19)
At each iteration of the algorithm veracity scores of values are refined, this leads
to a better estimation of copy detection and source authority, that in turn will
improve again values’ veracity in the next iteration. The algorithm stops iter-
ating when the updates of all veracities are less then a given threshold δ. The
output of the algorithm is, for each object o in the dataset, a set of values whose
veracities are greater or equal to a given threshold θ.
                                        4     Experimental Results

                                        We now present the result of an experimental comparison between our algo-
                                        rithm,ADAM (Authority Domain Aware Multi-truth data fusion), and the original
                                        DART in di↵erent configurations of the input data.


                                        4.1   Dataset

                                      We have used as input data a subset of the same book dataset that has been
4 Experimental Results used                  for the evaluation of the DART algorithm, kindly made available by Xueling
                                      Lin, one of the authors of [8]. Our goal is to discover the correct values of the
                                      multi-truth parameter authors using the category attribute to clusterize books
We now present the result of an experimental       comparison between our algorithm,
                                      into domains.
ADAM (Authority Domain Aware Multi-truth           data fusion), and the original DART
                                          For our experiments we have been able to use a subset of this dataset match-
in different configurations of the input   data.
                                      ing another validated and trustworthy dataset considered as golden truth for
                                      the book-authors binding. The dataset used in our experiments is composed by
4.1 Dataset                           90,867 tuples from 2,680 sources and 1,958 books spanning on all the 18 domains
                                      (i.e. categories of book genres) of the original dataset.
We have used as input data a subset of the same book dataset that has been
                                          Our algorithm depends on several parameters, we report in Table 2 the value
used for the evaluation of the DART algorithm,
                                      used for eachkindly    madeWhen
                                                      parameter.    available  by Xueling
                                                                         an indication was present in [8] we use the same
Lin, one of the authors of [5]. Our goal
                                      provided value to ensure the comparability of
                                            was   to discover  the  correct  values    the the two algorithms.
                                                                                     between
multi-truth parameter authors using the category attribute to clusterize books
into domains.
                                                                       Parameter Value
    For our experiments, we have been able to use a sub-                     ↵      1.5
set of this dataset matching another validated and trustwor-                         0
thy dataset considered as golden truth for the book-authors                         0.1
binding. The dataset used in our experiments is composed                     ⌘      0.2
                                                                             ✓      0.5
by 90,867 tuples from 2,680 sources and 1,958 books, span-                   ¯      0.5
ning all the 18 domains (i.e. categories of book genres) of               ⌧¯rec     0.8
the original dataset.                                                      ⌧¯sp     0.9
                                                                      Table 2. Parameters
   Our algorithm depends on several parameters; Table 2
reports the value used for each of them. When an indication was present in [5],
we used the same provided value to ensure optimal comparability between the
two algorithms.
                                        4.2   Results
4.2 Results
                                     We have develop an implementation of DART algorithm following as precisely as
We have developed in Python 3.7 both       anthe
                                     possible  implementation      of DART
                                                  guidelines expressed  in [8](following
                                                                               and our extension ADAM both in Python
as precisely as possible the guidelines expressed  in [5]), and   our  extension
                                     3.7. Even though our interest was in determineADAM.the impact of our extensions on
Even though our interest was in determine        the impact
                                     DART performances,         of also
                                                          we have  our developed
                                                                         extensions    on version of MajorityVote
                                                                                   a simple
                                     as baselineacomparison,
DART performances, we have also developed                     transferring
                                                  simple version           the classic single-truth voting system in the
                                                                     of MajorityVote
as baseline comparison.              multi-truth context by considering true all the values of object o that have been
                                        voted by at least 60% of the sources that provide a value for o.
    ADAM has its F-1 score higher than DART in the 76% of the times. Moreover in
our experiments ADAM has required strictly less iterations before convergence in
the 65% of the times with respect to DART, in some cases the number of iterations
required was less than a half. At first sight this faster convergence might seem
to be due only to the increment of A(s) in the exponent in Eq.s 16 and 17 but
with a more precise analysis we discover that A(s) 6∼ 0 only for a small fraction
of the sources, modeling in a correct manner the desired meaning of authority
which by definition should be related to only a small subset of objects.
   We have run 37 comparison between DART, ADAM and MajorityVote using the
same input data for the three algorithms at each run, focusing on both input
regarding single and multiple domains. We particularly focus in this section on
a subset of 10 runs, reporting in Table 3 the metrics of DART and ADAM of those
runs and finally in Table 4 we aggregate the results of all 37 runs reporting the
averaged metrics of MajorityVote, DART and ADAM.
     Domain Records |D| |O| |S|              DART                 ADAM
                                     Prec. Rec. F-1 Prec. Rec. F-1
     Travels      221      1 44 120 0.9167     1   0.9565 0.9767 0.9545 0.9655
    Reference     497      1 19 279 0.6875 0.8148 0.7458 0.84 0.7778 0.8077
     History     1639      1 114 565 0.9416 0.8487 0.8927 0.9918 0.8176 0.8963
      Arts       1114      1 22 505 0.5161 0.8889 0.6531 0.9545 0.6176 0.75
    Random       2764     14 50 615 0.8852 0.9474 0.9153 0.98 0.875 0.9245
    Random       2408     13 50 811 0.9016 0.873 0.8871 0.9808 0.85 0.9107
    Random       5010     16 100 880 0.8252 0.9147 0.8676 0.9903 0.8361 0.9067
    Random       4772     16 100 866 0.837 0.8433 0.8401 0.9706 0.7615 0.8534
    Random∗      4276     17 100 976 0.5359 0.7366 0.6205 0.9225 0.5265 0.6704
    Random       1998     14 50 572 0.9 0.9643 0.931         1   0.9273 0.9623
    Table 3. Experimental results (*contains only books with at least 2 authors)

                          Method Precision Recall F-1
                        MajorityVote 0.9354 0.6958 0.7820
                            DART       0.7953 0.8621 0.8273
                            ADAM       0.9182 0.7727 0.8392
                          Table 4. Average results of 37 runs
5   Conclusions
We presented ADAM, an improved algorithm for multi-truth data fusion. A quicker
termination and better results confirm that our idea to reward authoritative
sources has led to an increase in the algorithm performance and accuracy.

References
1. Dong, Xin Luna and Berti-Equille, Laure and Srivastava, Divesh: Truth Discovery
   and Copying Detection in a Dynamic World. VLDB (2009)
2. Blanco, Lorenzo and Crescenzi, Valter and Merialdo, Paolo and Papotti, Paolo:
   Probabilistic Models to Reconcile Complex Data from Inaccurate Data Sources.
   Advanced Information Systems Eng (2010)
3. Li, Xian and Dong, Xin Luna and Lyons, Kenneth and Meng, Weiyi and Srivastava,
   Divesh: Truth Finding on the Deep Web: Is the Problem Solved? CoRR (2015)
4. Dong, Xin Luna and Berti-Equille, Laure and Srivastava, Divesh: Integrating Con-
   flicting Data: The Role of Source Dependence. VLDB (2009)
5. Lin, Xueling and Chen, Lei: Domain-aware Multi-truth Discovery from Conflicting
   Sources. VLDB (2018)
6. Wang, Xianzhi and Sheng, Quan Z. and Fang, Xiu Susie and Yao, Lina and Xu,
   Xiaofei and Li, Xue: An Integrated Bayesian Approach for Effective Multi-Truth
   Discovery. CIKM (2015)
7. Bo, Zhao and Benjamin, Rubinstein and Jim, Gemmell and Jiawei, Han: A Bayesian
   Approach to Discovering Truth from Conflicting Sources for Data Integration. CoRR
   (2012)
8. Xiaoxin, Yin and Jiawei, Han and Philip, Yu: Truth Discovery with Multiple Con-
   flicting Information Providers on the Web. TKDE (2007)
9. Dong, Xin Luna and Saha, Barna and Srivastava, Divesh. Less is more: Selecting
   sources wisely for integration. VLDB (2012)
10. Abiteboul, Serge and Hull, Richard and Vianu, Victor: Foundations of databases:
   the logical level, Addison-Wesley (1995)