=Paper= {{Paper |id=Vol-2157/paper5 |storemode=property |title=Leveraging Probabilistic Existential Rules for Adversarial Deduplication |pdfUrl=https://ceur-ws.org/Vol-2157/paper5.pdf |volume=Vol-2157 |authors=José N. Paredes,Maria Vanina Martinez,Gerardo I. Simari,Marcelo A. Falappa |dblpUrl=https://dblp.org/rec/conf/cade/ParedesMSF18 }} ==Leveraging Probabilistic Existential Rules for Adversarial Deduplication== https://ceur-ws.org/Vol-2157/paper5.pdf
       Leveraging Probabilistic Existential Rules
             for Adversarial Deduplication

                     Jose N. Paredes, Maria Vanina Martinez,
                    Gerardo I. Simari, and Marcelo A. Falappa
               {jose.paredes,mvm,gis,mfalappa}@cs.uns.edu.ar

    Dept. of Computer Science and Engineering, Universidad Nacional del Sur (UNS)
           Institute for Computer Science and Engineering (UNS–CONICET)
                     San Andres 800, (8000) Bahia Blanca, Argentina



        Abstract. The entity resolution problem in traditional databases, also
        known as deduplication, seeks to map multiple virtual objects to its cor-
        responding set of real-world entities. Though the problem is challenging,
        it can be tackled in a variety of ways by means of leveraging several
        simplifying assumptions, such as the fact that the multiple virtual ob-
        jects appear as the result of name or attribute ambiguity, clerical errors
        in data entry or formatting, missing or changing values, or abbrevia-
        tions. However, in cyber security domains the entity resolution problem
        takes on a whole different form, since malicious actors that operate in
        certain environments like hacker forums and markets are highly moti-
        vated to remain semi-anonymous—this is because, though they wish to
        keep their true identities secret from law enforcement, they also have
        a reputation with their customers. The above simplifying assumptions
        cannot be made in this setting, and we therefore coin the term “adver-
        sarial deduplication”. In this paper, we propose the use of probabilistic
        existential rules (also known as Datalog+/–) to model knowledge engi-
        neering solutions to this problem; we show that tuple-generating depen-
        dencies can be used to generate probabilistic deduplication hypotheses,
        and equality-generating dependencies can later be applied to leverage
        existing data towards grounding such hypotheses. The main advantage
        with respect to existing deduplication tools is that our model operates
        under the open-world assumption, and thus is capable of modeling hy-
        potheses over unknown objects, which can later become known if new
        data becomes available.


1     Introduction
Deduplication of objects in databases, also known as entity resolution, is a classi-
cal problem in data cleaning [18]; the basic idea is that databases contain objects
that are potential duplicates—seemingly different records that correspond to the
same entity or object in the real world—that need to be identified and merged [8,
14]. There has been a lot of interest on this topic in the last 20 years, yielding a
large body of work that lies mostly in the databases literature. Traditionally, enti-
ties are resolved using pairwise similarity over the attributes of reference [18]. The
2

most recent and promising proposal are those based on so-called collective entity
resolution [6], which exploits additional relational information in the data since
references to different entities may co-occur. For instance, one such approach,
called iterative blocking [30], iteratively groups together matching records in
blocks, giving rise to the possibility of using the information processed so far
to inform further decisions. Data-driven approaches [6, 7, 13] leverage examples
to determine if which pairs of records are duplicates. Recently, [3] has defined
a declarative framework for entity resolution based on matching dependencies
(MDs). This formalism, which was first introduced in [16, 17], consists of declar-
ative rules that generalize entity resolution tasks, eliminating duplicates via a
matching process. These rules state that certain attribute values in relational
tuples, under certain similarity conditions over possibly other attribute values
in those tuples, must be made the same. The original semantics for MDs, defined
in [17], was redefined and extended in [5].
    All of the above-mentioned methods operate under the assumption that the
undesired situation that multiple records refer to the same real-world object is
caused by a mix of clerical errors in data entry (simple typos), ambiguity in
names or other attributes (consider, for instance, the result of transliterations
such as “Kurt Gödel” being mapped to “Kurt Goedel”), inconsistent abbrevia-
tions and formatting, missing values (for instance, phone number not provided),
or changing values (for example, one record has an old phone number while oth-
ers have an updated one). Though these assumptions adequately reflect many
real-world scenarios, we are interested in others in which they cannot be as-
sumed to hold. One example of such a scenario is in cyber-security domains
such as malicious hacker forums on the dark/deep web (the part that is not
directly accessible by traditional browsers and search engines)—the entity reso-
lution problem in this setting becomes quite different. In [26], a system for cyber
threat intelligence was developed that gathers data from various social platforms
on the Internet, focusing particularly on sites in the dark net and deep net. They
collect and store information from hacker forum discussions and marketplaces
offering products and services that focus on malicious hacking. The system is de-
signed to extract specific information from marketplaces (regarding sales of mal-
ware/exploits) and hacker forums (discussions regarding services and threats).
This extracted information is well-structured and can be stored in a relational
database; they maintain two databases, one for marketplaces and the other for
forums. The crawling and parsing of these sites is carried out periodically so that
time-varying data can be collected. For markets, they collect product fields such
as item title, item description, vendor name, CVE number (identifier given by
the National Vulnerability Database [11, 25]), shipping details, item reviews, etc.
For forums, they collect other fields such as topic content, post content, topic
author, post author, author status, reputation, and topic interest.
   These forums and marketplaces have an interesting characteristic: though
malicious actors of course wish to remain anonymous from law enforcement
agents that might be patrolling, they also enjoy a reputation with their cus-
tomers and readers, and so they also wish to remain identifiable. In particular,
                                                                                             3

the same actor typically operates under different screen names, keeping certain
aspects constant so that others can recognize them. Furthermore, and perhaps
most importantly, they also leave involuntary traces behind that can be leveraged
in deduplication efforts. We refer to this as the adversarial deduplication/entity
resolution problem. The following is a simplified example of the kind of informa-
tion that we can obtain from dark web forums and marketplaces.
Example 1. As a running example, consider the simplified domain of the mali-
cious hacker forums. Consider the following relational schema regarding infor-
mation about users and their posts in forums:
                       User(Id u, Nick), Post(Id po, Id u, Id t),
                   HasPosted(Id u, Id t), WritStSim(Id u, Id u),
which represent users of these forums, posts by users, topics in posts, which
users have posted on which topics, and the writing style similarity between users,
respectively. Furthermore, id u and id po are the keys for relations User and Post,
respectively. Let D1 be an instance of this schema, where t1 is “private sqli tool”,
t2 is “Sql Injection”, and t3 is “trojan servers as a backdoor”:
      D1 = { d1 : Post(p1 , a1 , t1 ), d2 : Post(p2 , a2 , t2 ), d3 : Post(p3 , a3 , t3 ),
             d4 : User(a1 , ‘Blackstar’), d5 : User(a2 , ‘Archer’),
             d6 : User(a3 , ‘Angel Eyes’), d7 : WritStSim(a1 , a2 )
      }
                                                                                             


    Finally, another basic assumption generally made by traditional approaches
to entity resolution—which can be seen as a consequence of the other assumptions—
is that the information contained in the databases is complete. In our setting,
on the other hand, it may be the case that the relevant information is yet to be
discovered; in these settings, the open-world assumption becomes crucial, allow-
ing us to entertain deduplication hypotheses over unnamed entities. In the next
section, we present a formalism that allows for this kind of reasoning, as well as
deal with probabilistic uncertainty.
   The rest of this paper is organized as follows: Section 2 presents the ex-
tension of Datalog+/– with probabilistic annotations and stratified negation,
Section 3 presents a two-stage chase procedure that extends the classical chase
with propagation of probabilistic events and EGDs. Sections 4 and 5 introduces
deduplication programs and deduplication threshold queries that can be an-
swered via the two-stage chase, respectively. We discuss conclusions and future
work in Section 6.

2   Probabilistic Datalog+/– with Stratified Negation
In this section, we present the basics of the language of Datalog+/– from [22],
its extension with probabilistic uncertainty from [19], and extend it with strat-
ified negation; we redefine some of the basic concepts to adjust them to our
presentation.
4

    We assume the existence of the following pairwise disjoint (countably infinite)
sets: a set ∆ of constants, a set V of variables, and a set N of nulls. Different
constants represent different values (unique name assumption), while different
nulls may represent the same value. We assume a lexicographic order on ∆ ∪ N ,
with every symbol in N following all symbols in ∆. We denote by X sequences
of variables X1 , . . . , Xk with k > 0.
    Consider relational schemas R with a possibly infinite data domain ∆, a finite
set of database predicates (relations) R ∈ R, and a set of built-in predicates
for evaluating “=”, “¬”, “≤”, comparisons, and arithmetic operations. Each
relation R ∈ R has an associated arity n, thus R has the form R(A1 , . . . , An ),
where each attribute Ai has associated domain Dom(Ai ) ⊆ ∆. For the sake of
simplicity we may assume that the Ai ’s are different, and different predicates do
not share attributes. However, different attributes may share the same domain.
An atomic formula (or atom) has the form P (t1 , . . . , tn ), where ti ∈ Dom(Ai )∪V.
A conjunction of atoms is often identified with the set of all its atoms. A database
instance D for R is a (possibly infinite) set of ground atoms (or tuples) with
predicates from R and arguments from ∆.
    A homomorphism is a mapping µ : ∆ ∪ V ∪ N → ∆ ∪ V ∪ N such that (i)
if c ∈ ∆ then µ(c) = c, (ii) if c ∈ V ∪ N then µ(c) ∈ ∆ ∪ N , and (iii) µ is
naturally extended to atoms, sets of atoms, and conjunctions of atoms, that is,
µ(p(t1 , . . . , tn )) = p(µ(t1 ), . . . , µ(tn )). A grounding is a homomorphism ρ such
that ρ(c) ∈ ∆ for every c ∈ ∆ ∪ V ∪ N . Also, given a conjunction of atoms
Φ(X), we denote with pos(Φ(X)) the conjunction of atoms in Φ(X) that appear
positive, and with neg(Φ(X)) the conjunction of atoms in Φ(X) that appear
negated.
TGDs. A tuple generating dependency (TGD) is a first-order formula:

                          (σ) ∀XY Φ(X, Y) → ∃ZΨ (X, Z)                              (1)

where X, Y, Z are sequences of variables, Φ(X, Y) is a conjunction of atoms and
negated atoms over R (called the body of σ, denoted body(σ)), and Ψ (X, Z) is
a conjunction of atoms over R (called the head of σ, denoted head(σ)). We use
body+ (σ) and body− (σ) to denote pos(Φ(X, Y)) and neg(Φ(X, Y)), respectively.
For ease of presentation, we sometimes omit the universal quantifiers, and assume
that all variables without quantifier are universally quantified.
    We restrict the language so that all the variables in body− (σ) appear already
in some atom in body+ (σ). Furthermore, we assume Σ is stratified [20], i.e., Σ
can be partitioned into mutually disjoint sets Σ1 , . . . , Σk such that:
    – If Σi contains a predicate A that appears positively in some rule, then there
      is no rule in Σi+1 ∪ . . . ∪ Σk with A in the head.
    – If Σi contains a predicate A that appears negated in some rule, then there
      is no rule in Σi ∪ . . . ∪ Σk with A in the head.
We call hΣ1 , . . . , Σk i a stratification of Σ. As in the case of standard TGDs [22],
w.l.o.g., we can assume that Ψ (X, Z) is a single atom (any set not satisfying this
condition can be translated to an equivalent set that does).
                                                                                    5

Definition 1. Let σ be a TGD with body(σ) = ∀XY Φ(X, Y). We say that
body(σ) is satisfied in D through homomorphism h if and only if h is such that
h(pos(Φ(X, Y))) ⊆ D and h(neg(Φ(X, Y))) 6⊆ D.
    A TGD σ is satisfied in D if and only if whenever the body of σ is satisfied
through homomorphism h, then there exists an extension h0 of h such that
h0 (head(σ)) ⊆ D.

EGDs. Equality generating dependencies (EGDs) are first order formulas of the
form: ∀ XΦ(X) → Xi = Xj , where Φ(X) is a conjunction of atoms, X is a
sequence of variables, and Xi , Xj ∈ X. As with TGDs, we sometimes omit the
universal quantifiers.
    The body of an EGD , denoted body(), corresponds to Φ(X), and the head,
denoted head(), corresponds to the equality atom on the right hand side of the
rule. An instance D satisfies an EGD , denoted D |= , if whenever there exists
a homomorphism h such that h(body(Φ)) ⊆ D, then it must be the case that
h(Xi ) = h(Xj ). Otherwise, we say that  is violated in D.


Probabilistic Annotations

Following [19], we further assume the existence of a finite set of constants ∆M ,
a finite set of predicate names RM such that R ∩ RM = ∅, and an infinite set
of variables VM , such that V ∩ VM = ∅. These sets give rise to corresponding
Herbrand bases and Herbrand universes consisting of all possible ground terms
and atoms, respectively; these are denoted with HM and UM , respectively.
    Let M be a probabilistic model that represents a joint probability distribution
over a (finite) set of random Boolean variables X = {X1 , . . . , Xn } such that there
is a 1-to-1 mapping from X to the set of all ground atoms over HM . Examples
of the type of probabilistic models that we assume in this work are Markov
Logic [28], Bayesian networks [27], etc.
    A (probabilistic) annotation λ, relative to M , is a (finite) set of expressions
hA = xi i, where A is an atom over RM , ∆M , and VM , xi ∈ {0, 1}, and the Xi ’s
are pairwise distinct. A probabilistic annotation is valid if and only if for any two
different expressions A = x, B = y ∈ λ, there does not exist a substitution θ such
that θ(A) = θ(B). A probabilistic scenario is a valid probabilistic annotation λ
for which |λ| = |X| and all hA = xi i are such that A is ground. We use scn(M )
to denote the set of scenarios in M . A probabilistic annotation is equivalent to
a formula that consists of the disjunction of all the scenarios that belong to the
set denoted by the annotation. Therefore, with a slight abuse of notation, we
sometimes combine annotations with logical operators.
    Intuitively, a probabilistic annotation is used to describe an event in which the
random variables in a probabilistic model M are compatible with the settings
of the random variables described by λ, i.e., each Xi has the value xi . We
will attach probabilistic annotations λ to formulas in our language to produce
annotated formulas F : λ, where F is an atom, a TGD, or an EGD. Intuitively,
the annotation means that F holds whenever the event associated with λ occurs.
Note that whenever a random variable’s value is left unspecified in a probabilistic
6

annotation, the variable is unconstrained; in particular, a formula annotated with
an empty probabilistic annotation means that the formula holds in every world.
An annotated formula F : λ is referred to as a probabilistic atom whenever F
is atom, a probabilistic TGD (pTGD) if F is a TGD, and a probabilistic EGD
(pEGD) whenever F is an EGD.

Definition 2. A probabilistic Datalog+/– knowledge base is a triple KB =
(O, M, af), where O is a finite set of atoms, TGDs, and EGDs, M is probabilistic
model, and af is an annotation function that maps formulas to probabilistic
annotations relative to M .

   For ease of presentation, we sometimes use notation “σ : e” to denote
“af(σ) = e”. Given a probabilistic Datalog+/– KB KB = (O, M, af) and a
scenario λ ∈ scn(M ), Oλ denotes the (non-probabilistic) Datalog+/– knowledge
base induced by λ. Formally, Oλ consists of all formulas Fi (atoms, TGDs, and
EGDs) such that λi ⊆ λ for some Fi : λi ∈ O.

Example 2. Consider the DB from Example 1, and let KB1 = (O1 , M1 , af1 ) be
a Datalog+/– KB. Then, O1 could contain the following TGD ω1 :

         (ω1 ) User(UID, N ) ∧ hasPosted(UID, t1 ) → hasPosted(UID, t2 )

The intuition of this rule is: “Every User that has posts on topic t1 , also has posts
on topic t2 ”. The annotation function af1 is defined such that af1 (ω1 ) = e1 (t1 , t2 )
and e1 (t1 , t2 ) is an event associated with the probability that a user posted
something related to t2 given that he has posted something related to t1 .
O1 could contain the following EGD 1

             (1 ) User(UID1 , N ) ∧ User(UID2 , N ) → UID1 = UID2 .

The intuition of this rule is: “Two users cannot share the same user name.”.
This rule is annotated with the empty event, that is af1 (1 ) = ∅—therefore, it
always holds.                                                                 



    Probabilistic model M1 in the above example can be learned from domain
information, and it allows to infer probabilities about possible duplicate user
and correlations regarding certain events, such as posts by users given specific
topics. We will discuss this in greater detail in Section 4.


3    A Two-stage Chase Procedure
We now extend the classical operational semantics for Datalog+/–, based on the
chase procedure, for the case of the extensions discussed above. This procedure
provides the semantics for the application of rules and constraints, extending
the well-known (oblivious) chase algorithm for testing implications of data de-
pendencies [4, 23] or completing deductive databases [1]. The chase provides a
mechanism for repairing a database instance relative to a set of dependencies,
                                                                                       7

so that the result of the chase (a database instance) satisfies the dependencies.
Note that, though [19] presents a chase procedure for probabilistic Datalog+/–
ontologies, our two-stage chase includes stratified negation and EGDs.
    The building block for the chase procedure is the TGD chase rule; in our
setting, we must also deal with the propagation of probabilistic annotations.
pTGD Chase Rule. We will now define the probabilistic TGD chase rule for
a knowledge base KB = (O = (D, Σ), M, af). Let σ ∈ O be a pTGD of the form
∀XY Φ(X, Y) → ∃ZΨ (X, Z) : eσ ; then, σ is applicable to D if and only if there
exists a homomorphism h such that one the following cases occur:

a) 1) h maps the atoms in pos(Φ(X, Y)) and neg(Φ(X, Y)) to probabilistic
      atoms {α1 : e1 , ..., αk : ek } ⊆ D and {αk+1 : ek+1 , ..., αj : ej } * D,
      and V
   2) h(eσ ∧ i=1,...,k ei ) is a valid probabilistic annotation.
b) 1) h maps the atoms in pos(Φ(X, Y)) and neg(Φ(X, Y)) to probabilistic
      atoms {α1 : e1 , ..., αk : ek } ⊆ D and {αk+1 : ek+1 , ..., αj : ej } ⊆ D,
      and V                  V
   2) h(eσ ∧ i=1,...,k ei ∧ i=k+1,...,j ¬ei ) is a valid probabilistic annotations.

    Let σ be applicable to D, and h0 be a homomorphism that extends h as
follows: for each term Xi ∈ X, h0 (Xi ) = h(Xi ) and h0 (Z) = zj , where Z is
existentially quantified in head(σ) and zj is a “fresh” null, i.e., zj ∈ ∆N , zj does
not occur in D, and zj lexicographically follows all other nulls already introduced.
The application of σ on D adds to D one of the following probabilistic atoms
according to the case where σ is applicable:

a) h0 (Ψ (X, Z)) : h0 (eσ ∧ i=1,...,k ei ).
                           V

b) h0 (Ψ (X, Z)) : h0 (eσ ∧ i=1,...,k ei ∧ i=k+1,...,j ¬ei ).
                           V                V

    As a second step, we apply the equality-generating dependencies. We do this
in a separate step over the result of the pTGD chase in order to avoid undesired
interactions between TGDs and EGDs that can lead to undecidability [22], since
the standard way to avoid this via separability is too strong an assumption in
our case (see Section 4).
pEGD Chase Rule. Let KB = (O = (D, Σ), M, af) be a probabilistic Datalog+/–
KB, and  ∈ Σ be a pEGD of the form ∀X Φ(X) → Xm = Xn : e . Then,  is
applicable to D if and only if there exists a homomorphism h such that:

a) h maps the atoms in Φ(X) to probabilistic atoms {α1 : e1 , ..., αk : ek } ⊆ D,
   and
         V
b) h(e ∧ i=1,...,k ei ) is a valid probabilistic annotation.

Let  be applicable to D, and ΦXm = {α1 : e1 , ..., αl : el } ⊆ {α1 : e1 , ..., αk : ek }
be the set of probabilistic atoms where Xm occurs in α1≤i≤l ; then, the application
of  on D results in:
 8

 (i) The chase fails if h(Xm ) and h(Xn ) are constants; this is called a hard vio-
     lation. Otherwise,                                                 V
(ii) for all αi : ei ∈ ΦXm replace h(Xm ) with h(Xn ) and ei with h(e ∧ i=1,...,k ei ),
     if h(Xm ) is a null and h(Xn ) is a constant, or h(Xm ) precedes h(Xn ) in the
     lexicographic order.
    The following section shows how this chase procedure can be used to answer
 probabilistic deduplication queries.


 4    Deduplication Programs
 In this section we define a subclass of Probabilistic Datalog+/– programs that
 are specifically tailored to declaratively capture the requirements for a dedupli-
 cation task in an open-world, adversarial environment.
     Towards this end, we assume that there exists a set E ⊆ R that we identify
 as entity relations, i.e., relations that represent entities of interest in the appli-
 cation domain. Tuples in these relations have identifiers—keys—that allows us
 to compare extensions of the same predicate in different instances and to trace
 changes to them. Tuple identifiers can be accommodated by simply adding to
 each predicate E ∈ E an extra attribute (usually denoted with T ) that acts as a
 primary key in E. Finally, there is also a predefined predicate called dedup hypth.
     Our formalism is inspired in (relational) matching dependencies (RMDs) [5],
 which provide a declarative way of defining conditions under which records (or
 attribute values) should be made equal—they also establish how this matching is
 supposed to be done by means of matching functions. However, as we discussed
 before, RMDs are developed under a closed-world assumption, i.e., everything
 that is not contained in the database instance is assumed to be false (or to
 not exist). Nevertheless, we argue that in the type of domain applications we
 are interested in, this assumption is not realistic and therefore we must agree
 in that the database instances we are dealing with are potentially incomplete.
 Under such scenarios, RMDs are not expressive enough to capture all potential
 matchings, since there might be matchings for which we cannot establish specific
 conditions to find them but rather just make hypotheses based on (weaker) evi-
 dence of other nature—the possibility of value invention afforded by existential
 rules thus becomes crucial.
     To specify the kind of rules we require, we slightly modify pTGDs so they
 can be used to infer potential matchings (or matching hypotheses). For this, we
 assume that for every attribute Aj appearing in a n-nary relation R ∈ R, where
 the schema of R is R(A1 , . . . , An ), there exists a binary similarity relation ≈Aj
 that is reflexive and symmetric. Thus, they are capable of expressing similarity
 conditions over attribute values as done in RMDs, since the “≈” predicates can
 simply occur in the body.
 dhTGDs. A Duplication Hypothesis Tuple Generating Dependency (or dhTGD)
 σ is simply a pTGD of the following form:
     ∀XY Φ(X, Y) → ∃Zdedup hypth(X, Z) ∧ Γ (X, Z)
                                                                                      9

           INPUT                                                   OUTPUT

                         RAR          r1                             
     DB                  MINING
                                                          body(ri)  …  not(head(ri))
                   DB                 ri
                                                             dedup_hyp  …
                                      
      DB                              rn
                                                                      



Fig. 1. Overview of a possible dhTGD learning process: relational association rule
mining algorithms are applied over a set of available data sources; each such rule then
leads to one output dhTGD characterizing situations that violate the corresponding
association rule and therefore are the basis of a duplicate entity hypothesis.



where X, Y, Z are sequences of variables, X and Z are variables and X ∈ X, Z ∈
Z, Φ(X, Y) is a conjunction of atoms and negated atoms over R, and Γ is also
a conjunction of atoms.
    dhTGDs are a special class of pTGDs that generate deduplication hypotheses—
essentially, they allow to create hypotheses about potential matchings, which
are kept and later confirmed or discarded whenever new information is added to
(or extracted from) the data sources. Informally, dhTGDs capture situations in
which we have reasons to believe that entities may be duplicated but we do not
know which other entity could correspond to them. Figure 1 describes one possi-
ble way in which dhTGDs could be automatically derived from data. A relational
association rule mining algorithm (such as [12]) is used to discover patterns in
the available data sources, which are rules of the form r : body(r) → head(r));
therefore, entities that satisfy the conditions in the body but not those in the
head are potential candidates for entities that have duplicate identities—see Ex-
ample 3 below for a concrete scenario.

pEGDs as companions to dhTGDs. As we saw in Section 3, equality-
generating dependencies are applied as a second step after all (dh)TGDs are
applied; the goal of this step is to complete the database with the information
that might be available with respect to the null values in the dedup hypth atoms.
One kind of pEGDs that might be useful can be automatically derived from
dhTGDs σ by taking the negation of body− (σ) as the body of the pEGD. Fig-
ure 2 illustrates this procedure, and the following is a simple example based on
the running example.

Example 3. Consider the running example, and suppose we have obtained a
relational association rule that states that “if a user has posts related to topic
t1 , then he has posts related to topic t2 ”. This can be translated into the following
dhTGD:

    (ω2 ) User(UID, N ) ∧ hasPosted (UID, t1 ) ∧ not(hasPosted (UID, t2 )) →
                ∃Z dedup hypth(UID, Z) ∧ hasPosted(Z, t2 ).
10


                                                                                             same atoms

                                                 body (2)
                                                     
                                                                      dedup_hyp atom         of body(2)



(𝜔2 ) 𝑈𝑠𝑒𝑟(𝑈𝐼𝐷, 𝑁) ∧ ℎ𝑎𝑠𝑃𝑜𝑠𝑡𝑒𝑑(𝑈𝐼𝐷, 𝑡1 ) ∧ 𝑛𝑜𝑡(ℎ𝑎𝑠𝑃𝑜𝑠𝑡𝑒𝑑(𝑈𝐼𝐷, 𝑡2 )) → ∃𝑍 𝑑𝑒𝑑𝑢𝑝_ℎ𝑦𝑝(𝑈𝐼𝐷, 𝑍) ∧ ℎ𝑎𝑠𝑃𝑜𝑠𝑡𝑒𝑑(𝑍, 𝑡2 )




                                              same atoms              atoms to find
      dedup_hyp atom                          of body(2)            similar entity



(𝜖2 ) 𝑑𝑒𝑑𝑢𝑝_ℎ𝑦𝑝(𝑈𝐼𝐷1 , 𝑍) ∧ 𝑈𝑠𝑒𝑟(𝑈𝐼𝐷2 , 𝑁) ∧ ℎ𝑎𝑠𝑃𝑜𝑠𝑡𝑒𝑑(𝑈𝐼𝐷2 , 𝑡2 ) ∧ 𝑊𝑟𝑡𝑆𝑡𝑆𝑖𝑚(𝑈𝐼𝐷1 , 𝑈𝐼𝐷2 ) → 𝑍 = 𝑈𝐼𝐷2



Fig. 2. Illustration of one possible procedure with which a pEGD can be derived from
an existing dhTGD (σ)—the (unnegated) atoms from body− (σ) plus additional condi-
tions that help find candidate entities to instantiate the deduplication hypothesis.


The intuition of this rule is “If a user that has some post on topic t1 but he does
not have any post on topic t2 , then we generate the hypothesis that there exists
another user corresponding to him (whose identity is currently unknown) who
has posted on topic t2 ”.
   As shown in Figure 2, an equality-generating dependency can be derived
from ω2 by taking the atoms from body− (ω2 ) and some additional conditions
that, when true, can help obtain the identity of the hitherto unknown entity
that participates in the deduplication hypothesis:

       (2 ) dedup hypth(UID1 , Z) ∧ User(UID2 , N ) ∧ hasPosted(UID2 , t2 ) ∧
             WritStSim(UID1 , UID2 ) → Z = UID2 .

The intuition of this rule is “If we have a duplicate user hypothesis between user
UID1 and an unknown user Z, and user UID2 has posts on topic t2 with similar
writing style to user UID1 , then Z may correspond to UID2 ”.                    


   Putting all of this together, the following example illustrates how the chase
procedure unfolds over the scenario in the running example.
Example 4. Considering the DB D1 presented in Example 1, a probabilistic Ex-
tended Datalog+/– KB representing a deduplication program can be KB2 =
(O2 , M, af) where O2 = (D1 , Σ2 ) and M1 is the same probabilistic model from
Example 2. The set Σ2 contains the (classical) TGD:

           (σ) User(UID, N ) ∧ Post(Id po, UID, T ) → hasPosted (UID, T ),

and also contains the dhTGD ω2 and pEGD 2 from the previous example. The
annotation function af is defined as follows:
 – af(ω2 ) = e1 (t1 , t2 ), where the event e1 (t1 , t2 ) is associated with the proba-
   bility that a user has some post on t2 given that he has some post on t1 .
                                                                                                                                  11


 User(a1,‘Blackstar’)       Post(p1,a1,t1)         User(a2,‘Archer’)      Post(p2,a2,t2)      User(a3,‘Angel Eyes’)   Post(p3,a3,t3)

                   TGD                                        TGD                                 TGD 


             hasPosted(a1,t1)                                 hasPosted(a2,t2)             hasPosted(a3,t3)

                                                         pEGD 2
   e1(t1,t2) dhTGD 2
                                dedup_hypth(a1,z1)                         WritStSim(a1,a2)

                                       e1(t1,t2)
 hasPosted(z1, t2)
                                                         e2
 e1(t1,t2)
                                              dedup_hypth(a1,a2)
                                                   e1(t1,t2)  e2


Fig. 3. Graphic representation of the chase for Example 4. Boxes with bold outline
represent atoms in D1 , boxes with dashed outline represent atoms that contain null
values, and boxes with standard outline represent atoms without nulls obtained from
some dhTGD or pEGD. The events associated with each atom are represented under-
neath the corresponding box; all atoms in D1 are associated with empty events, and
the events propagated are represented next to the arrow of the corresponding dhTGD
or pEGD.

                                              λi e1 (t1 , t2 ) e2 Probability
                                              λ1 true true           0.76
                                              λ2 true false          0.19
                                              λ3 false true          0.04
                                              λ4 false false         0.01


Fig. 4. Joint probability distribution over the events e1 (t1 , t2 ) and e2 used in the run-
ning example. In general, such distributions can be compactly specified via a proba-
bilistic model like a Bayesian network.


 – af(2 ) = e2 , where the event e2 is associated with the probability that two
   users with similar writing style actually correspond to the same underlying
   user.
     Figure 3 illustrates the chase for this example. Note that dhTGD ω2 is ap-
plicable because body+ (ω2 ) is mapped to D1 and body− (ω2 ) is not mapped to
D1 ; also note how the event e1 (t1 , t2 ) is propagated to the generated atoms.
Then, pEGD 2 is applicable, so null value z1 is instantiated and the events
e1 (t1 , t2 ) and e2 are propagated. Finally, for simplicity we specify M directly as
a joint probability distribution, considering all possible worlds λi for the events
e1 (t1 , t2 ) and e2 (cf. Figure 4).                                                


   In the previous example, that the application of the dhTGD ω2 produces
a deduplication hypothesis for user a2 , but the lack of information causes the
candidate for such a duplicate to remain as a null value. However, the subsequent
application of pEGD 2 lets us conclude that user a2 —who has a similar writing
12

                INPUT                                                           OUTPUT

                  cf. Figure 2
                                                   Two-stage Chase
     𝑝𝐸𝐺𝐷𝑠                                                           Annotations associated with
                                                                     each dedup_hyp atom yield a
       𝑄(𝑋) = 𝑑𝑒𝑑𝑢𝑝(𝑒𝑖𝑑1 , 𝑋, 𝜃)
                                                                     probability value denoting
                                                    dedup_hyp
                  cf. Figure 1                                       the confidence at least  with
     𝑑ℎ𝑇𝐺𝐷𝑠                                         atoms
                                                    without nulls    which it is believed to hold.
                                   dedup_hyp        dedup_hyp
        DB                         atoms            atoms                𝑒𝑖𝑑2              𝑒𝑖𝑑3
                                   with nulls       with nulls
                        DB                                                        𝑒𝑖𝑑1
                                     DB               DB

         DB                                                               𝑒𝑖𝑑4             𝑒𝑖𝑑5



Fig. 5. Overview of the application of probabilistic Datalog+/– to answer deduplica-
tion queries. A set of input databases is considered along with a set of dhTGDs. In the
first stage of the chase procedure, a set of dedup hypth atoms is generated; in the second
stage, a set of pEGDs is used to assign concrete values to nulls in such atoms—both
stages involve keeping track of probabilistic events stating conditions under which in-
ferences hold. The output is a set of dedup hypth atoms with probability values, which
can be used to answer the input query.


style to a1 —is a likely option (event e1 (t1 , t2 ) ∧ e2 has probability 0.76 according
to Figure 4). Note that the dhTGD could be applied multiple times, so that if
there are other possible candidates, fresh dedup hypth atoms can be generated
as needed.


5      Finding Duplicates: Threshold Deduplication Queries
The problem of identifying seemingly different entities within a knowledge base
(or set of knowledge bases) that actually correspond to the same real world
entity can be addressed as a query answering problem: given a knowledge base,
we want to have a principled way of answering questions regarding the occurrence
of duplicate entities. The chase procedure presented above can be used to answer
queries of the form:
                             Q(X) = dedup(eid, X, θ),
where eid is a specific entity of interest, and θ ∈ [0, 1] represents a threshold prob-
ability value. Answers to queries of this form simply consist of possible values for
variable X such that the chase procedure generates atoms dedup hypth(eid,val )
with an associated event e with PrM (e) ≥ θ. Figure 5 illustrates the overall
query answering process.
    The following result states that the two-stage chase procedure described in
Section 3 can be stopped after a finite number of steps in order to answer thresh-
old deduplication queries whenever the underlying classical TGDs belong to a
fragment of Datalog+/– for which conjunctive queries are decidable; this can be
guaranteed via syntactic conditions such such as guardedness [22], stickiness [10],
acyclicity, or their weak counterparts [9, 10, 15], or semantic conditions such as
                                                                                13

shyness [21], finite expansion sets (fes), bounded treewidth sets (bts), and finite
unification sets (fus)) [2]—cf. [29, Section 1.3.3] for a brief overview of many
such conditions.

Proposition 1. Let KB = (O = (D, Σ), M, af) be a probabilistic Datalog+/–
ontology with stratified negation, and Q(X) = dedup(eid, X, θ) be a threshold
deduplication query. If O belongs to a fragment of Datalog+/– for which a finite
portion of chase(D, Σ) can be computed in order to answer conjunctive queries,
then there exists a finite portion of the two-stage probabilistic chase procedure
that can be used to answer threshold deduplication query Q over KB.

    Essentially, this result holds since the first stage of our chase procedure ap-
plies TGDs and dhTGDs (which are a special case of classical TGDs) in the
order given by the stratification, and the second stage is simply an application
of pEGDs over a fixed structure. Both stages involve carrying over probabilistic
events, which are then used in conjunction with the probabilistic model to check
which atoms surpass the threshold given in the query.


6   Conclusions and Future Work

In this paper, we have introduced the adversarial deduplication problem as a
variant of the classical deduplication (or entity resolution) problem in which
one cannot assume that the existence of multiple records corresponding to the
same real-world entity is the result of innocent or casual errors such as typos,
transliterations, or inconsistently-updated values, but rather the result of actors
deliberately trying to avoid identification. As a motivating scenario we took
malicious hacker forums and marketplaces on the dark web, where this sort of
practice is commonplace and having tools at hand to address it would have a
great impact on efforts such as early-warning of cyber attacks, understanding
the social networks of buyers and sellers, and identification of leading vendors
of specific types of products.
    We proposed the use of probabilistic Datalog+/– as a way to tackle the prob-
lem of finding pairs of virtual objects for which we have reasons to believe that
they correspond to the same real-world entity, where a specific kind of probabilis-
tic TGDs (called deduplication hypothesis-generating dependencies) and EGDs
can be applied in order to derive deduplication hypotheses. The main strength
of this approach is that it allows to reason about unknown objects via value
invention, which is essential in this setting—not having this capability renders
traditional entity resolution tools ineffective. Another advantage of our approach
is that dhTGDs can be automatically obtained via relational association rule
mining algorithms, and corresponding EGDs can also be derived with additional
domain knowledge. Finally, we developed a two-stage probabilistic chase pro-
cedure in which (dh)TGDs are first exhaustively applied, and then pEGDs are
used to associate concrete values to null values generated in the first step. We
showed that this modified chase procedure can be used to answer deduplication
14

threshold queries, leveraging chase termination results from the existential rules
and Datalog+/– literature.
    Ongoing and future work involves investigating other kinds of deduplication
queries, such as instance checking (“what is the probability that two given virtual
objects correspond to the same real-world entity?”), all pairs threshold (“which
pairs of virtual objects are suspected to correspond to the same real-world entity
with probability at least p?”), and more general conjunctive queries. We are
also interested in investigating the complexity of query answering algorithms
for all such queries, and under which conditions they can be guaranteed to run
in polynomial time. We plan to evaluate the performance of our approach on
both synthetic and real-world data on malicious hacker forums and marketplaces
on the dark web obtained via a collaboration with a cyber threat intelligence
company. Finally, we would also like to explore the relationship between our
formalism and other approaches to probabilistic modeling with unknown objects,
such as [24]


Acknowledgments. This work was supported by funds provided by CONICET,
Agencia Nacional de Promoción Cientı́fica y Tecnológica, Universidad Nacional
del Sur, Argentina, by the EU H2020 research and innovation programme under
the Marie Sklodowska-Curie grant agreement 690974 for the project “MIREL:
MIning and REasoning with Legal texts”, and by the US Department of the
Navy, Office of Naval Research, grant N00014-15-1-2742. Any opinions, findings,
and conclusions or recommendations expressed in this material are those of the
authors and do not necessarily reflect the views of the Office of Naval Research.


References
 1. Abiteboul, S., Hull, R., Vianu, V. (eds.): Foundations of Databases: The Logical
    Level. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA (1995)
 2. Baget, J., Mugnier, M., Rudolph, S., Thomazo, M.: Walking the complexity lines
    for generalized guarded existential rules. In: Proce. IJCAI. pp. 712–717 (2011)
 3. Bahmani, Z., Bertossi, L.E., Vasiloglou, N.: Erblox: Combining matching depen-
    dencies with machine learning for entity resolution. Int. J. Approx. Reasoning 83,
    118–141 (2017)
 4. Beeri, C., Vardi, M.Y.: A proof procedure for data dependencies. J. ACM 31(4),
    718–741 (1984)
 5. Bertossi, L.E., Kolahi, S., Lakshmanan, L.V.S.: Data cleaning and query answering
    with matching dependencies and matching functions. Theory Comput. Syst. 52(3),
    441–482 (2013)
 6. Bhattacharya, I., Getoor, L.: Collective entity resolution in relational data. ACM
    Trans. Knowl. Discov. Data 1(1) (2007)
 7. Bhattacharya, I., Getoor, L.: Query-time entity resolution. J. Artif. Intell. Res. 30,
    621–657 (2007)
 8. Bleiholder, J., Naumann, F.: Data fusion. ACM Comput. Surv. 41(1), 1–41 (2009)
 9. Calı̀, A., Gottlob, G., Kifer, M.: Taming the infinite chase: Query answering under
    expressive relational constraints. J. Artif. Intell. Res. 48, 115–174 (2013)
                                                                                     15

10. Calı̀, A., Gottlob, G., Pieris, A.: Towards more expressive ontology languages: The
    query answering problem. Artif. Intell. 193, 87–128 (2012)
11. CVE: Common vulnerabilities and exposures: The standard for information secu-
    rity vulnerability names. http://cve.mitre.org/ (2018)
12. Dehaspe, L., Toivonen, H.: Discovery of relational association rules. In: Relational
    data mining, pp. 189–212. Springer (2001)
13. Diaz-Valenzuela, I., Martin-Bautista, M.J., Vila, M.A., Campaña, J.R.: An au-
    tomatic system for identifying authorities in digital libraries. Expert Syst. Appl.
    40(10), 3994–4002 (2013)
14. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: A
    survey. IEEE T. Knowl. Data En. 19(1), 1–16 (2007)
15. Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data exchange: semantics and
    query answering. Theor. Comput. Sci. 336(1), 89–124 (2005)
16. Fan, W.: Dependencies revisited for improving data quality. In: Proc. ACM PODS.
    pp. 159–170 (2008)
17. Fan, W., Jia, X., Li, J., Ma, S.: Reasoning about record matching rules. PVLDB
    2(1), 407–418 (2009)
18. Getoor, L., Machanavajjhala, A.: Entity resolution: Theory, practice and open
    challenges. PVLDB 5(12), 2018–2019 (2012)
19. Gottlob, G., Lukasiewicz, T., Martinez, M.V., Simari, G.I.: Query answering under
    probabilistic uncertainty in Datalog+/– ontologies. Ann. Math. Artif. Intell. 69(1),
    37–72 (2013)
20. Gottlob, G., Rudolph, S., Simkus, M.: Expressiveness of guarded existential rule
    languages. In: Proc. ACM PODS. pp. 27–38 (2014)
21. Leone, N., Manna, M., Terracina, G., Veltri, P.: Efficiently computable datalog∃
    programs. In: Proc. KR (2012)
22. Lukasiewicz, T., Cali, A., Gottlob, G.: A general Datalog-based framework for
    tractable query answering over ontologies. J. Web Semant. 14(0) (2012)
23. Maier, D., Mendelzon, A.O., Sagiv, Y.: Testing implications of data dependencies.
    ACM T. Database Syst. 4(4), 455–469 (1979)
24. Milch, B., Marthi, B., Russell, S.J., Sontag, D., Ong, D.L., Kolobov, A.: BLOG:
    probabilistic models with unknown objects. In: Proc. IJCAI. pp. 1352–1359 (2005)
25. NIST: National Vulnerability Database. https://nvd.nist.gov/ (2018)
26. Nunes, E., Diab, A., Gunn, A.T., Marin, E., Mishra, V., Paliath, V., Robertson, J.,
    Shakarian, J., Thart, A., Shakarian, P.: Darknet and deepnet mining for proactive
    cybersecurity threat intelligence. In: Proc. ISI. pp. 7–12 (2016)
27. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
    Inference. Morgan Kaufmann Publishers Inc. (1988)
28. Richardson, M., Domingos, P.: Markov logic networks. Mach. Learn. 62(1-2), 107–
    136 (2006)
29. Simari, G.I., Molinaro, C., Martinez, M.V., Lukasiewicz, T., Predoiu, L.: Ontology-
    Based Data Access Leveraging Subjective Reports. Springer Briefs in Computer
    Science, Springer (2017)
30. Whang, S.E., Menestrina, D., Koutrika, G., Theobald, M., Garcia-Molina, H.: En-
    tity resolution with iterative blocking. In: Proc. SIGMOD. Stanford (2009)