<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data fusion with source authority and multiple truth (Discussion Paper)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <email>name.surnameg@polimi.it</email>
          <aff>Politecnico di Milano, Italy</aff>
        </contrib>
      </contrib-group>
      <abstract>
        <p>The abundance of data available on the Web makes it more and more probable to find that different sources contain (partially or completely) different values for the same item. Data Fusion is the problem of discovering the true values of a data item when two entities representing it have been found and their values differ. Recent studies have shown that, when we rely only on majority voting to find the true value of an object, results may be wrong for up to 30% of the data items, since false values spread very easily because data sources frequently copy from one another. Therefore, the problem must be solved by assessing the quality of the sources and giving more importance to the values coming from trusted sources. State-of-the-art Data Fusion systems define source trustworthiness on the basis of the accuracy of the provided values and of the dependence on other sources. In this paper we propose an improved algorithm for Data Fusion that extends existing methods based on accuracy and correlation between sources by also taking into account source authority, defined on the basis of the knowledge of which sources copy from which ones. Our method has been designed to work well also in the multi-truth case, that is, when a data item can have multiple true values. Preliminary experimental results on a multi-truth real-world dataset show that our algorithm outperforms previous state-of-the-art approaches.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The massive use of user-generated content, the Internet of Things and the tendency to transform every real-world interaction into digital data have led to the problem of how to make sense of the huge mass of data available nowadays. In this context, not only can a single source store a previously unimaginable amount of data, but the number of sources that can provide information relevant to a query also increases dramatically, even in very specific contexts.</p>
      <p>
        With all these conflicting data available on the Web, discovering their true values is of primary importance. The solution to this problem is Data Fusion, where the true value of each data item is decided. Redundancy per se is not enough, since it has been shown in [<xref ref-type="bibr" rid="ref3">3</xref>] that, if we rely only on majority vote, we can get wrong results in up to 30% of the cases. In order to get more accurate results we propose a Bayesian approach able to evaluate source quality.
      </p>
      <p>
        Data fusion algorithms can be divided into two sub-classes: single-truth and multi-truth, the latter denoting the case when a data item may have multiple true values. Such scenarios are common in everyday life, where many actors can play in a movie or a book can have many authors, like Alice's book "Foundations of Databases" [<xref ref-type="bibr" rid="ref10">10</xref>] by Serge Abiteboul, Rick Hull and Victor Vianu. We decided to design our model to work also in the multi-truth case.
      </p>
      <p>Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy.</p>
      <p>
        Currently, many single-truth data fusion algorithms exist in the literature, and a few of them exploit Bayesian inference to estimate the veracity of each value and the trustworthiness of the sources. TruthFinder [<xref ref-type="bibr" rid="ref8">8</xref>] applies Bayesian analysis to compute the probability of a value being true, conditioned on the observation of the values provided by the sources. Accu [<xref ref-type="bibr" rid="ref4">4</xref>] applies a Bayesian iterative approach to compute the veracity of values, assuming a uniform distribution of the false values for each data item and source independence. These two assumptions have been relaxed by PopAccu [<xref ref-type="bibr" rid="ref9">9</xref>] and AccuCopy [<xref ref-type="bibr" rid="ref1">1</xref>], respectively.
      </p>
      <p>
        Less attention has been devoted to studying the problem of multi-truth finding: to our knowledge, only three algorithms try to solve it. MBM [<xref ref-type="bibr" rid="ref6">6</xref>] approaches multi-truth data fusion with a model that focuses on mappings and relations between sources and sets of provided values, also introducing a copy-detection phase to discover dependencies between sources. DART [<xref ref-type="bibr" rid="ref5">5</xref>] computes, for each source, a domain expertise score relative to the domains of the input data. This score is used in a Bayesian inference process to model source trustworthiness and value confidence; sources are assumed to be independent. LTM [<xref ref-type="bibr" rid="ref7">7</xref>] exploits probabilistic graphical models to find all the true values claimed for each data item.
      </p>
      <p>State-of-the-art Data Fusion systems define source trustworthiness based on the accuracy of the provided values and on the dependence on other sources. In this paper we propose an improved algorithm for Data Fusion. Our method extends existing methods based on accuracy and correlation between sources by also taking into consideration the authority of the sources. Authoritative sources are defined as the ones that have been copied by many sources: the key idea is that, when source administrators decide to copy data, they will choose the sources that they perceive as most trustworthy.</p>
      <p>To summarize, in this paper we make the following contributions:
– We present a new formula for domain-aware copy detection, with the goal of determining the probability that source Si copies from source Sj data items belonging to a specific domain. Our copy detection process exploits the domain expertise of the sources and can also assign different probabilities to the two directions of copying.
– An urgent need of the truth discovery process is to determine which sources we can trust. We present a fully unsupervised algorithm that can assign an authority score to each source for each domain. This process is based on the natural habit of choosing, when copying a missing value, the source that provides the correct value with the highest probability, in other words, the most authoritative one.
– We present an improved algorithm for assessing the veracity of values in a multi-truth discovery process, exploiting source authority in copy detection and positively rewarding sources according to their authority.</p>
      <p>In Section 2 we present preliminary information, Section 3 provides the details of our approach, and Section 4 shows the experimental results.</p>
    </sec>
    <sec id="sec-2">
      <title>Background and Preliminaries</title>
      <p>We now present in more detail two methods that have been of great importance for our work.</p>
      <p>DART. This algorithm exploits an iterative domain-aware Bayesian approach to perform multi-truth discovery over a dataset composed starting from different sources. Its key intuition is that, in general, a source may have a different quality of data for different domains. For each source, they define the domain expertise score ed(s), measuring the source's experience in a given domain, and assign a confidence cs^o(v) to each value v provided by a source s, reflecting how much s is convinced that the value v is (part of) the correct value(s) for object o.</p>
      <p>The veracity σo(v) of value v for object o is the probability that v is a true value of o, which is better estimated at each iteration of the discovery process. The goal of the DART algorithm is to evaluate the probability that a value v is true given the observation of the claimed data Ψ(o) (i.e. P(v | Ψ(o))). Being P(Ψ(o) | v) and P(Ψ(o) | v̄) the probabilities of having the observation Ψ(o) when v is true or false respectively, Bayesian inference can be used to express P(v | Ψ(o)) as shown below:</p>
      <p>P(v | Ψ(o)) = P(Ψ(o) | v) P(v) / P(Ψ(o)) = P(Ψ(o) | v) σo(v) / [P(Ψ(o) | v) σo(v) + P(Ψ(o) | v̄) (1 − σo(v))]   (1)</p>
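      <p>As a minimal illustration of the update in Eq. 1 (a sketch in our own notation, not the paper's implementation; all argument names are ours):</p>

```python
def veracity_posterior(p_obs_given_true: float,
                       p_obs_given_false: float,
                       prior: float) -> float:
    """Bayesian update of Eq. 1: P(v given Psi(o)).

    p_obs_given_true  plays the role of P(Psi(o) when v is true)
    p_obs_given_false plays the role of P(Psi(o) when v is false)
    prior             plays the role of the current veracity sigma_o(v)
    """
    num = p_obs_given_true * prior
    den = num + p_obs_given_false * (1.0 - prior)
    return num / den if den > 0 else 0.0
```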
      <p>Our main criticism of DART is the assumption that sources are independent, which is a clear oversimplification of the real world. We will explain how we have relaxed this assumption in the following section.</p>
      <p>MBM is a Bayesian algorithm for multi-truth finding that also takes into consideration the problem of source dependence. It computes, for each source and set of values, an independence score based on the values provided by all the sources. The independence score is then used to discredit, in the voting phase, sources that do not provide their values independently.</p>
      <p>Our criticisms of MBM are the assumption that there is no mutual copying between sources in the whole dataset, and the fact that the algorithm is not able to distinguish the direction of copying. In the following section we will describe how we have relaxed these assumptions.</p>
      <p>Table 1 describes the notation that will be used in the following sections.</p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>We now present ADAM (Authority Domain Aware Multi-truth data fusion), a method based on Bayesian inference and source authority that iteratively refines the probability that a provided value for a data item is true.</p>
      <sec id="sec-3-1">
        <title>Copy detection</title>
        <p>Starting from [<xref ref-type="bibr" rid="ref6">6</xref>], we have devised a domain-aware copy detection algorithm that can assign different probabilities to the two directions of copying. This model works at domain granularity, therefore it can more accurately approximate the real-world behaviour of correlated copying [<xref ref-type="bibr" rid="ref2">2</xref>].</p>
        <p>Scope. Given an object o and two sources si and sj, we denote by Ψioj the observation of the common values cij(o) for a common object o ∈ Θij^d in domain d provided by the two sources si and sj.</p>
        <p>Table 1. Notation
O(s): Set of all objects provided by source s
Od(s): Set of objects in domain d provided by source s
Vs(o): Set of all values claimed for object o by source s
Vs̄(o): Set of all values claimed for object o by sources ≠ s
So^d(v): Sources that provide value v for object o in domain d
So^d(v̄): Sources that don't provide value v for object o in domain d
ed(s): Expertise of source s in domain d
σo(v): Veracity of value v for object o
τrec^d(s): Recall of source s in domain d
τsp^d(s): Specificity of source s in domain d
cs^o(v): Confidence score of value v of object o related to source s
si → sj: Source i is copying at object level from source j
si ⊥ sj: Sources i and j are independent at object level
si →d sj: Source i is copying from source j for domain d
Θij^d: Set of common objects in domain d between sources i and j
cij(o) =: c: Values provided by both sources i and j for object o
Ψioj =: Ψc: Observation of c
Ψ(o): Observation of the values provided for object o
Ad(s): Authority of source s in domain d</p>
        <p>Assumptions. In our copy detection algorithm we assume that there is no mutual copying at domain level, i.e., if source s1 copies from source s2 regarding domain d̄, then s2 can copy from s1 only values for objects in domains d̃ ≠ d̄; we also assume that two sources can only be either independent or copiers.</p>
        <p>Object copying. For each pair of sources si, sj, after we have defined the truth probability of the group of values in c as the probability that all the values are correct (Eq. 2), we can compute the likelihood of Ψc in the different cases of source dependence and truthfulness of c. Similarly to [<xref ref-type="bibr" rid="ref6">6</xref>], we state that if si has copied from sj, or the other way round, then the two sources provide the same common values c, no matter the veracity of c (Eq. 3).</p>
        <p>σ(c) = ∏_{v∈c} σo(v)   (2)</p>
        <p>P(Ψc | si → sj, c true) = P(Ψc | sj → si, c true) = 1
P(Ψc | si → sj, c false) = P(Ψc | sj → si, c false) = 1   (3)</p>
        <p>Eqs. 4 and 5 define the probabilities that both sources provide the same group of values c independently of each other, in the two cases that c is true and false.</p>
        <p>P(Ψc | s1 ⊥ s2, c true) = τrec(s1) τrec(s2) [1 − τsp(s1)] [1 − τsp(s2)]   (4)
P(Ψc | s1 ⊥ s2, c false) = τsp(s1) τsp(s2) [1 − τrec(s1)] [1 − τrec(s2)]   (5)</p>
        <p>Bayesian model. If we apply a Bayesian inference approach we can now compute the probability of two sources being dependent or independent, and in the first case we can also determine which of the two is the copier.</p>
        <p>With Y = {si → sj, sj → si, si ⊥ sj} we define the three possible outcomes.</p>
        <p>P(y | Ψc) = P(Ψc | y) P(y) / Σ_{y′∈Y} P(Ψc | y′) P(y′)
= P(y) [P(Ψc | y, c true) σ(c) + P(Ψc | y, c false) (1 − σ(c))] / Σ_{y′∈Y} P(y′) [P(Ψc | y′, c true) σ(c) + P(Ψc | y′, c false) (1 − σ(c))]   (6)</p>
        <p>We now have to find a way to estimate the prior probabilities of the Bayesian model: P(si → sj), P(sj → si) and P(sj ⊥ si), that is, all the different configurations of object copying between sources si and sj. We define them as the probabilities of the two sources being copiers or independent in the domain of the object we are considering, as defined in Eq. 11. For ease of notation we apply the following definitions, recalling that d is the domain of Θij^d ∋ o, where o is the object of c that we are analyzing.</p>
        <p>P(si → sj) =: λij^d,   P(sj → si) =: λji^d,   P(sj ⊥ si) = 1 − λij^d − λji^d   (7)</p>
        <p>and replace Eqs. 7, 2, 3, 4 and 5 into Eq. 6, with the following result:</p>
        <p>P(si → sj | Ψc) = λij^d / [λij^d + λji^d + (1 − λij^d − λji^d) Pu]   (8)</p>
        <p>where</p>
        <p>Pu := σ(c) [τrec(si) τrec(sj) (1 − τsp(si)) (1 − τsp(sj))] + (1 − σ(c)) [τsp(si) τsp(sj) (1 − τrec(si)) (1 − τrec(sj))]   (9)</p>
Non-shared values. With Eq. 8 we have expressed the probability that a source
si has copied from another source sj their common values c for object o. We
now have to take into consideration other possible non-in-common values to
opportunely compute the probability that c were really copied. We have chosen
to scale the copy probability by the Jaccard similarity of the two sets of values
of o claimed by the two sources si and sj , as shown in Eq. 10.</p>
        <p>Jij (o) = Jji(o) =</p>
        <p>Vsi (o) \ Vsj (o)</p>
        <p>Vsi (o) [ Vsj (o)
Domain-level copying. We can use the concept of copying an object o to de ne
the act of copying with respect to a domain d as de ned in Eq. 11.</p>
        <p>P si !d sj idj := Po2 idj P (si !d sj j c) Jij (o) (11)
ij
Initialization. Since in the initialization phase we have no prior knowledge of
idj , we decided to exploit the fact that sources with high expertise in domain
d are less likely to be copiers for domain d and that sources with low expertise
in d tend to copy from sources with higher expertise in d. These ideas can be
summarized in the initialization expressed in Eq. 12.</p>
        <p>idj = 1 ed (si) ed (sj ) 8si; sj 2 S ^ si 6= sj (12)</p>
        <p>ad(sj ) :=</p>
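        <p>The copy detection step of Eqs. 8-11 can be sketched in Python as follows (an illustrative rendering in our own notation, not the authors' code; lam_ij and lam_ji stand for the priors λij^d and λji^d):</p>

```python
def p_u(sigma_c, rec_i, rec_j, sp_i, sp_j):
    """Eq. 9: likelihood that si and sj provide c independently."""
    return (sigma_c * rec_i * rec_j * (1 - sp_i) * (1 - sp_j)
            + (1 - sigma_c) * sp_i * sp_j * (1 - rec_i) * (1 - rec_j))

def copy_posterior(lam_ij, lam_ji, pu):
    """Eq. 8: P(si copies from sj, given the observation of c)."""
    return lam_ij / (lam_ij + lam_ji + (1.0 - lam_ij - lam_ji) * pu)

def jaccard(vals_i, vals_j):
    """Eq. 10: overlap of the value sets claimed for the same object."""
    union = vals_i.union(vals_j)
    return len(vals_i.intersection(vals_j)) / len(union) if union else 0.0

def domain_copy_prob(per_object):
    """Eq. 11: lambda_ij^d as the Jaccard-scaled average over the
    common objects; per_object is a list of (posterior, jaccard) pairs."""
    if not per_object:
        return 0.0
    return sum(p * j for p, j in per_object) / len(per_object)
```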
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Source authority</title>
        <p>The key idea used to define the authority of a source in a specific domain with respect to the outcomes of the copy detection process is that, if many sources copy some values from the same source sa, it is because sa is considered authoritative and more trustworthy. For each source sj ∈ S, we define Cd(sj) in Eq. 13 as the set of all the sources that copy from source sj with probability above a given threshold η:</p>
        <p>Cd(sj) := { si ∈ S | P(si →d sj | λij^d) > η }   (13)</p>
        <p>Qualitatively, the unadjusted authority score of source s in domain d is how much source s is copied in d w.r.t. how much all sources are copied in d (Eq. 14).</p>
        <p>ad(sj) := Σ_{si∈Cd(sj)} P(si →d sj | λij^d) / Σ_{sk∈S} Σ_{sl∈Cd(sk)} P(sl →d sk | λkl^d)   (14)</p>
        <p>Note that in general the cardinality of S (i.e. the number of sources) is high, and the parameter η should not be set too close to 1, so as to better exploit the variety of outcomes of the copy detection process. This configuration leads to ad(s) ≪ 1. We can accordingly apply a linear conversion to ad(s) in order to map it onto the interval [0, 1]. We denote this new score as Ad(s), the authority of source s in domain d, computed as:</p>
        <p>Ad(s) := (ad(s) − ad^min) / (ad^max − ad^min)   (15)</p>
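        <p>A minimal sketch of the authority computation of Eqs. 13-15 (illustrative only; the dictionary-based encoding and the function name are our own assumptions):</p>

```python
def authority_scores(copy_prob, eta=0.2):
    """Eqs. 13-15 for one domain d.

    copy_prob: dict mapping (si, sj) to P(si copies from sj in d).
    Returns the normalized authority A_d(s) for every source.
    """
    sources = {s for pair in copy_prob for s in pair}
    # Eqs. 13-14: unadjusted authority = how much each source is copied,
    # counting only copy probabilities above the threshold eta.
    raw = {s: 0.0 for s in sources}
    for (si, sj), p in copy_prob.items():
        if p > eta:
            raw[sj] += p
    total = sum(raw.values())
    a = {s: (raw[s] / total if total else 0.0) for s in sources}
    # Eq. 15: linear rescaling of a_d(s) onto [0, 1].
    lo, hi = min(a.values()), max(a.values())
    return {s: (a[s] - lo) / (hi - lo) if hi > lo else 0.0 for s in a}
```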
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Veracity</title>
        <p>We have extended the DART Bayesian inference model in order to exploit the authority score of each source. Our key idea is to positively reward sources according to their authority, which is achieved with Eqs. 16 and 17, respectively.</p>
        <p>P(Ψ(o) | v) = ∏_{s∈So^d(v)} τrec^d(s)^(ed(s)·cs^o(v)+Ad(s)) · ∏_{s∈So^d(v̄)} (1 − τrec^d(s))^(ed(s)·cs^o(v)+Ad(s))   (16)</p>
        <p>P(Ψ(o) | v̄) = ∏_{s∈So^d(v)} (1 − τsp^d(s))^(ed(s)·cs^o(v)+Ad(s)) · ∏_{s∈So^d(v̄)} τsp^d(s)^(ed(s)·cs^o(v)+Ad(s))   (17)</p>
        <p>In a multi-truth context, precision cannot be the only metric for source trustworthiness [<xref ref-type="bibr" rid="ref7">7</xref>]: we should rather use recall and specificity. Source recall is the probability that true values are claimed as true (Eq. 18), while source specificity is the probability that false values are claimed as false (Eq. 19).</p>
        <p>τrec^d(s) = Σ_{o∈Od(s)} Σ_{v∈Vs(o)} σo(v) / Σ_{o∈Od(s)} |Vs(o)|   (18)</p>
        <p>τsp^d(s) = Σ_{o∈Od(s)} Σ_{v′∈Vs̄(o)} (1 − σo(v′)) / Σ_{o∈Od(s)} |Vs̄(o)|   (19)</p>
        <p>At each iteration of the algorithm the veracity scores of the values are refined; this leads to a better estimation of copy detection and source authority, which in turn improves the values' veracity at the next iteration. The algorithm stops iterating when the updates of all veracities are smaller than a given threshold. The output of the algorithm is, for each object o in the dataset, the set of values whose veracity is greater than or equal to a given threshold θ.</p>
        <p>We now present the results of an experimental comparison between our algorithm ADAM and the original DART in different configurations of the input data.</p>
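        <p>The structure of Eq. 16, with each source's vote weighted by the authority-boosted exponent ed(s)·cs^o(v)+Ad(s), can be sketched as follows (an illustrative rendering, not the authors' code; the per-source dictionary encoding is our assumption):</p>

```python
from math import prod

def likelihood_given_true(providers, non_providers):
    """Structure of Eq. 16: P(observation of o, when v is true).

    providers / non_providers: lists of dicts with keys
    'rec' (tau_rec^d), 'exp' (e_d(s)), 'conf' (c_s^o(v)), 'auth' (A_d(s)).
    Raising each factor to e_d(s)*c_s^o(v) + A_d(s) rewards
    authoritative sources with a stronger vote.
    """
    return (prod(s['rec'] ** (s['exp'] * s['conf'] + s['auth'])
                 for s in providers)
            * prod((1 - s['rec']) ** (s['exp'] * s['conf'] + s['auth'])
                   for s in non_providers))
```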
        <sec id="sec-3-3-1">
          <title>Dataset</title>
          <p>We have used as input data a subset of the same book dataset that has been used for the evaluation of the DART algorithm, kindly made available by Xueling Lin, one of the authors of [<xref ref-type="bibr" rid="ref5">5</xref>]. Our goal was to discover the correct values of the multi-truth parameter authors, using the category attribute to clusterize books into domains.</p>
          <p>For our experiments, we have been able to use a subset of this dataset matching another validated and trustworthy dataset, considered as golden truth for the book-authors binding. The dataset used in our experiments is composed of 90,867 tuples from 2,680 sources and 1,958 books, spanning all the 18 domains (i.e. categories of book genres) of the original dataset.</p>
          <p>Our algorithm depends on several parameters; Table 2 reports the value used for each of them. When an indication was present in [<xref ref-type="bibr" rid="ref5">5</xref>], we used the same provided value to ensure the comparability of the two algorithms.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>Results</title>
          <p>Table 2. Parameters
α 1.5
0
0.1
η 0.2
θ 0.5
¯ 0.5
τ̄rec 0.8
τ̄sp 0.9</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Results</title>
        <p>
          We have developed in Python 3.7 both the implementation of DART (following as precisely as possible the guidelines expressed in [<xref ref-type="bibr" rid="ref5">5</xref>]) and our extension ADAM. Even though our interest was in determining the impact of our extensions on DART performances, we have also developed a simple version of MajorityVote as a baseline comparison, transferring the classic single-truth voting system to the multi-truth context by considering true all the values of object o that have been voted by at least 60% of the sources that provide a value for o.
        </p>
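        <p>The multi-truth MajorityVote baseline just described can be sketched as follows (a minimal illustration; the claims encoding is our assumption):</p>

```python
from collections import defaultdict

def majority_vote(claims, threshold=0.6):
    """Multi-truth MajorityVote: a value is accepted for object o when it
    is voted by at least `threshold` of the sources providing a value for o.

    claims: iterable of (source, obj, value) tuples.
    Returns: dict mapping obj to its set of accepted values.
    """
    voters = defaultdict(set)  # obj -> sources providing any value for it
    votes = defaultdict(lambda: defaultdict(set))  # obj -> value -> voters
    for source, obj, value in claims:
        voters[obj].add(source)
        votes[obj][value].add(source)
    return {obj: {v for v, ss in votes[obj].items()
                  if len(ss) >= threshold * len(voters[obj])}
            for obj in voters}
```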
        <p>ADAM achieves a higher F1 score than DART in 76% of the runs. Moreover, in our experiments ADAM required strictly fewer iterations before convergence than DART in 65% of the runs; in some cases the number of iterations required was less than a half. At first sight this faster convergence might seem to be due only to the increment of Ad(s) in the exponents of Eqs. 16 and 17, but a more precise analysis shows that Ad(s) ≠ 0 only for a small fraction of the sources, correctly modeling the desired meaning of authority, which by definition should be related to only a small subset of objects.</p>
        <p>We have run 37 comparisons between DART, ADAM and MajorityVote, using the same input data for the three algorithms at each run and considering inputs regarding both single and multiple domains. In this section we particularly focus on a subset of 10 runs, reporting in Table 3 the metrics of DART and ADAM for those runs; finally, in Table 4 we aggregate the results of all 37 runs, reporting the averaged metrics of MajorityVote, DART and ADAM.</p>
        <p>Table 3. Domain, Records, |D|, |O|, |S|</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5 Conclusions</title>
      <p>We presented ADAM, an improved algorithm for multi-truth data fusion. A quicker termination and better results confirm that our idea of rewarding authoritative sources has led to an increase in the algorithm's performance and accuracy.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Dong</surname>
          </string-name>
          , Xin Luna and
          <string-name>
            <surname>Berti-Equille</surname>
          </string-name>
          ,
          <article-title>Laure and Srivastava, Divesh: Truth Discovery and Copying Detection in a Dynamic World</article-title>
          . VLDB (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Blanco</surname>
          </string-name>
          ,
          <article-title>Lorenzo and Crescenzi, Valter and Merialdo, Paolo and Papotti, Paolo: Probabilistic Models to Reconcile Complex Data from Inaccurate Data Sources</article-title>
          .
          <source>Advanced Information Systems Eng</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Xian and Dong, Xin Luna and Lyons, Kenneth and Meng, Weiyi and Srivastava, Divesh: Truth Finding on the Deep Web: Is the Problem Solved?</article-title>
          <source>CoRR</source> (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dong</surname>
          </string-name>
          , Xin Luna and
          <string-name>
            <surname>Berti-Equille</surname>
          </string-name>
          ,
          <article-title>Laure and Srivastava, Divesh: Integrating Conflicting Data: The Role of Source Dependence</article-title>
          .
          <source>VLDB</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Xueling and Chen, Lei: Domain-aware Multi-truth Discovery from Conflicting Sources</article-title>
          .
          <source>VLDB</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Wang</surname>
            , Xianzhi and Sheng, Quan
            <given-names>Z.</given-names>
          </string-name>
          and
          <article-title>Fang, Xiu Susie and Yao, Lina and Xu, Xiaofei and Li, Xue: An Integrated Bayesian Approach for Effective Multi-Truth Discovery</article-title>
          .
          <source>CIKM</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Bo</surname>
          </string-name>
          ,
          <article-title>Zhao and Benjamin, Rubinstein and Jim, Gemmell and Jiawei, Han: A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration</article-title>
          .
          <source>CoRR</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Xiaoxin</surname>
            , Yin and Jiawei, Han and
            <given-names>Philip</given-names>
          </string-name>
          , Yu:
          <article-title>Truth Discovery with Multiple Conflicting Information Providers on the Web</article-title>
          .
          <source>TKDE</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Dong</surname>
          </string-name>
          ,
          <article-title>Xin Luna and Saha, Barna and Srivastava, Divesh. Less is more: Selecting sources wisely for integration</article-title>
          .
          <source>VLDB</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Abiteboul</surname>
          </string-name>
          , Serge and Hull, Richard and Vianu, Victor:
          <article-title>Foundations of databases: the logical level</article-title>
          . Addison-Wesley
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>