<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Leveraging Probabilistic Existential Rules for Adversarial Deduplication</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jose</forename><forename type="middle">N</forename><surname>Paredes</surname></persName>
							<email>jose.paredes@cs.uns.edu.ar</email>
							<affiliation key="aff0">
								<orgName type="department">Dept. of Computer Science and Engineering</orgName>
								<orgName type="institution">Universidad Nacional del Sur (UNS) Institute for Computer Science and Engineering (UNS-CONICET)</orgName>
								<address>
									<addrLine>San Andres 800</addrLine>
<postCode>8000</postCode>
									<settlement>Bahia Blanca</settlement>
									<country key="AR">Argentina</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Maria</forename><surname>Vanina Martinez</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Dept. of Computer Science and Engineering</orgName>
								<orgName type="institution">Universidad Nacional del Sur (UNS) Institute for Computer Science and Engineering (UNS-CONICET)</orgName>
								<address>
									<addrLine>San Andres 800</addrLine>
<postCode>8000</postCode>
									<settlement>Bahia Blanca</settlement>
									<country key="AR">Argentina</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gerardo</forename><forename type="middle">I</forename><surname>Simari</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Dept. of Computer Science and Engineering</orgName>
								<orgName type="institution">Universidad Nacional del Sur (UNS) Institute for Computer Science and Engineering (UNS-CONICET)</orgName>
								<address>
									<addrLine>San Andres 800</addrLine>
<postCode>8000</postCode>
									<settlement>Bahia Blanca</settlement>
									<country key="AR">Argentina</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marcelo</forename><forename type="middle">A</forename><surname>Falappa</surname></persName>
							<email>mfalappa@cs.uns.edu.ar</email>
							<affiliation key="aff0">
								<orgName type="department">Dept. of Computer Science and Engineering</orgName>
								<orgName type="institution">Universidad Nacional del Sur (UNS) Institute for Computer Science and Engineering (UNS-CONICET)</orgName>
								<address>
									<addrLine>San Andres 800</addrLine>
<postCode>8000</postCode>
									<settlement>Bahia Blanca</settlement>
									<country key="AR">Argentina</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Leveraging Probabilistic Existential Rules for Adversarial Deduplication</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">58BBB890A1153A49B962B900E3213EB2</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T07:32+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The entity resolution problem in traditional databases, also known as deduplication, seeks to map multiple virtual objects to their corresponding real-world entities. Though the problem is challenging, it can be tackled in a variety of ways by leveraging several simplifying assumptions, such as that the multiple virtual objects arise from name or attribute ambiguity, clerical errors in data entry or formatting, missing or changing values, or abbreviations. However, in cyber security domains the entity resolution problem takes on a whole different form, since malicious actors that operate in certain environments, like hacker forums and markets, are highly motivated to remain semi-anonymous: though they wish to keep their true identities secret from law enforcement, they also have a reputation to maintain with their customers. The above simplifying assumptions cannot be made in this setting, and we therefore coin the term "adversarial deduplication". In this paper, we propose the use of probabilistic existential rules (also known as Datalog+/-) to model knowledge engineering solutions to this problem; we show that tuple-generating dependencies can be used to generate probabilistic deduplication hypotheses, and equality-generating dependencies can later be applied to leverage existing data towards grounding such hypotheses. The main advantage with respect to existing deduplication tools is that our model operates under the open-world assumption, and is thus capable of modeling hypotheses over unknown objects, which can later become known if new data becomes available.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Deduplication of objects in databases, also known as entity resolution, is a classical problem in data cleaning <ref type="bibr" target="#b17">[18]</ref>; the basic idea is that databases contain objects that are potential duplicates-seemingly different records that correspond to the same entity or object in the real world-that need to be identified and merged <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b13">14]</ref>. There has been a great deal of interest in this topic over the last 20 years, yielding a large body of work that lies mostly in the databases literature. Traditionally, entities are resolved using pairwise similarity over the attributes of references <ref type="bibr" target="#b17">[18]</ref>. The most recent and promising proposals are those based on so-called collective entity resolution <ref type="bibr" target="#b5">[6]</ref>, which exploits additional relational information in the data, since references to different entities may co-occur. For instance, one such approach, called iterative blocking <ref type="bibr" target="#b29">[30]</ref>, iteratively groups together matching records in blocks, giving rise to the possibility of using the information processed so far to inform further decisions. Data-driven approaches <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b12">13</ref>] leverage examples to determine which pairs of records are duplicates. Recently, <ref type="bibr" target="#b2">[3]</ref> defined a declarative framework for entity resolution based on matching dependencies (MDs). This formalism, which was first introduced in <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17]</ref>, consists of declarative rules that generalize entity resolution tasks, eliminating duplicates via a matching process. 
These rules state that certain attribute values in relational tuples, under certain similarity conditions over possibly other attribute values in those tuples, must be made the same. The original semantics for MDs, defined in <ref type="bibr" target="#b16">[17]</ref>, was redefined and extended in <ref type="bibr" target="#b4">[5]</ref>.</p><p>All of the above-mentioned methods operate under the assumption that the undesired situation that multiple records refer to the same real-world object is caused by a mix of clerical errors in data entry (simple typos), ambiguity in names or other attributes (consider, for instance, the result of transliterations such as "Kurt Gödel" being mapped to "Kurt Goedel"), inconsistent abbreviations and formatting, missing values (for instance, phone number not provided), or changing values (for example, one record has an old phone number while others have an updated one). Though these assumptions adequately reflect many real-world scenarios, we are interested in others in which they cannot be assumed to hold. One example of such a scenario is in cyber-security domains such as malicious hacker forums on the dark/deep web (the part that is not directly accessible by traditional browsers and search engines)-the entity resolution problem in this setting becomes quite different. In <ref type="bibr" target="#b25">[26]</ref>, a system for cyber threat intelligence was developed that gathers data from various social platforms on the Internet, focusing particularly on sites in the dark net and deep net. They collect and store information from hacker forum discussions and marketplaces offering products and services that focus on malicious hacking. The system is designed to extract specific information from marketplaces (regarding sales of malware/exploits) and hacker forums (discussions regarding services and threats). 
This extracted information is well-structured and can be stored in a relational database; they maintain two databases, one for marketplaces and the other for forums. The crawling and parsing of these sites is carried out periodically so that time-varying data can be collected. For markets, they collect product fields such as item title, item description, vendor name, CVE number (identifier given by the National Vulnerability Database <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b24">25]</ref>), shipping details, item reviews, etc. For forums, they collect other fields such as topic content, post content, topic author, post author, author status, reputation, and topic interest.</p><p>These forums and marketplaces have an interesting characteristic: though malicious actors of course wish to remain anonymous from law enforcement agents that might be patrolling, they also enjoy a reputation with their customers and readers, and so they also wish to remain identifiable. In particular, the same actor typically operates under different screen names, keeping certain aspects constant so that others can recognize them. Furthermore, and perhaps most importantly, they also leave involuntary traces behind that can be leveraged in deduplication efforts. We refer to this as the adversarial deduplication/entity resolution problem. The following is a simplified example of the kind of information that we can obtain from dark web forums and marketplaces.</p><p>Example 1. As a running example, consider the simplified domain of the malicious hacker forums. Consider the following relational schema regarding information about users and their posts in forums:</p><formula xml:id="formula_0">User(Id u, Nick), Post(Id po, Id u, Id t), HasPosted(Id u, Id t), WritStSim(Id u, Id u),</formula><p>which represent users of these forums, posts by users, topics in posts, which users have posted on which topics, and the writing style similarity between users, respectively. 
Furthermore, id u and id po are the keys for relations User and Post, respectively. Let D 1 be an instance of this schema, where t 1 is "private sqli tool", t 2 is "Sql Injection", and t 3 is "trojan servers as a backdoor":</p><formula xml:id="formula_1">D 1 = { d 1 : Post(p 1 , a 1 , t 1 ), d 2 : Post(p 2 , a 2 , t 2 ), d 3 : Post(p 3 , a 3 , t 3 ), d 4 : User(a 1 , 'Blackstar'), d 5 : User(a 2 , 'Archer'), d 6 : User(a 3 , 'Angel Eyes'), d 7 : WritStSim(a 1 , a 2 ) }</formula><p>Finally, another basic assumption generally made by traditional approaches to entity resolution, which can be seen as a consequence of the other assumptions, is that the information contained in the databases is complete. In our setting, on the other hand, it may be the case that the relevant information is yet to be discovered; in these settings, the open-world assumption becomes crucial, allowing us to entertain deduplication hypotheses over unnamed entities. In the next section, we present a formalism that allows for this kind of reasoning, as well as dealing with probabilistic uncertainty.</p><p>The rest of this paper is organized as follows: Section 2 presents the extension of Datalog+/- with probabilistic annotations and stratified negation; Section 3 presents a two-stage chase procedure that extends the classical chase with propagation of probabilistic events and EGDs. Sections 4 and 5 introduce deduplication programs and deduplication threshold queries that can be answered via the two-stage chase, respectively. We discuss conclusions and future work in Section 6.</p></div>
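To make the setting concrete, the instance D 1 above can be sketched in a few lines of Python; the similarity check and the candidate-generation logic below are illustrative assumptions for this sketch, not part of the formalism.

```python
# Running-example instance D1 (Example 1) as plain Python structures:
# User(id_u, nick), Post(id_po, id_u, id_t), WritStSim(id_u, id_u).
users = {"a1": "Blackstar", "a2": "Archer", "a3": "Angel Eyes"}
posts = [("p1", "a1", "t1"), ("p2", "a2", "t2"), ("p3", "a3", "t3")]
writ_st_sim = {("a1", "a2")}

def attribute_match(nick1: str, nick2: str) -> bool:
    """Traditional pairwise attribute similarity (here just case-insensitive
    equality); adversarial actors choose deliberately distinct screen names,
    so this kind of check finds nothing."""
    return nick1.lower() == nick2.lower()

traditional_candidates = [
    (u, v) for u in sorted(users) for v in sorted(users)
    if u < v and attribute_match(users[u], users[v])
]

# Involuntary traces (here: writing-style similarity) survive the adversary's
# renaming, so they remain usable as deduplication evidence.
adversarial_candidates = sorted(writ_st_sim)

print(traditional_candidates)   # -> []
print(adversarial_candidates)   # -> [('a1', 'a2')]
```

The point of the sketch is the contrast: attribute-based matching yields no candidates on D 1, while the writing-style trace does.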
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Probabilistic Datalog+/- with Stratified Negation</head><p>In this section, we present the basics of the language of Datalog+/- from <ref type="bibr" target="#b21">[22]</ref> and its extension with probabilistic uncertainty from <ref type="bibr" target="#b18">[19]</ref>, and we further extend it with stratified negation; we redefine some of the basic concepts to adjust them to our presentation.</p><p>We assume the existence of the following pairwise disjoint (countably infinite) sets: a set ∆ of constants, a set V of variables, and a set N of nulls. Different constants represent different values (unique name assumption), while different nulls may represent the same value. We assume a lexicographic order on ∆ ∪ N , with every symbol in N following all symbols in ∆. We denote by X sequences of variables X 1 , . . . , X k with k &gt; 0.</p><p>Consider relational schemas R with a possibly infinite data domain ∆, a finite set of database predicates (relations) R ∈ R, and a set of built-in predicates for evaluating "=", "¬", "≤", comparisons, and arithmetic operations. Each relation R ∈ R has an associated arity n; thus, R has the form R(A 1 , . . . , A n ), where each attribute A i has associated domain Dom(A i ) ⊆ ∆. For the sake of simplicity, we may assume that the A i 's are different, and that different predicates do not share attributes. However, different attributes may share the same domain. An atomic formula (or atom) has the form P (t 1 , . . . , t n ), where t i ∈ Dom(A i )∪V. A conjunction of atoms is often identified with the set of all its atoms. 
A database instance D for R is a (possibly infinite) set of ground atoms (or tuples) with predicates from R and arguments from ∆.</p><p>A homomorphism is a mapping µ :</p><formula xml:id="formula_2">∆ ∪ V ∪ N → ∆ ∪ V ∪ N such that (i) if c ∈ ∆ then µ(c) = c, (ii) if c ∈ V ∪ N then µ(c) ∈ ∆ ∪ N ,</formula><p>and (iii) µ is naturally extended to atoms, sets of atoms, and conjunctions of atoms, that is, µ(p(t 1 , . . . , t n )) = p(µ(t 1 ), . . . , µ(t n )). A grounding is a homomorphism ρ such that ρ(c) ∈ ∆ for every c ∈ ∆ ∪ V ∪ N . Also, given a conjunction of atoms Φ(X), we denote with pos(Φ(X)) the conjunction of atoms in Φ(X) that appear positive, and with neg(Φ(X)) the conjunction of atoms in Φ(X) that appear negated.</p><p>TGDs. A tuple generating dependency (TGD) is a first-order formula:</p><formula xml:id="formula_3">(σ) ∀XY Φ(X, Y) → ∃ZΨ (X, Z)<label>(1)</label></formula><p>where X, Y, Z are sequences of variables, Φ(X, Y) is a conjunction of atoms and negated atoms over R (called the body of σ, denoted body(σ)), and Ψ (X, Z) is a conjunction of atoms over R (called the head of σ, denoted head(σ)). We use body + (σ) and body − (σ) to denote pos(Φ(X, Y)) and neg(Φ(X, Y)), respectively. For ease of presentation, we sometimes omit the universal quantifiers, and assume that all variables without quantifier are universally quantified. We restrict the language so that all the variables in body − (σ) appear already in some atom in body + (σ). Furthermore, we assume Σ is stratified <ref type="bibr" target="#b19">[20]</ref>, i.e., Σ can be partitioned into mutually disjoint sets Σ 1 , . . . , Σ k such that:</p><p>-If Σ i contains a predicate A that appears positively in some rule, then there is no rule in Σ i+1 ∪ . . . ∪ Σ k with A in the head. -If Σ i contains a predicate A that appears negated in some rule, then there is no rule in Σ i ∪ . . . ∪ Σ k with A in the head.</p><p>We call Σ 1 , . . . , Σ k a stratification of Σ. 
As in the case of standard TGDs <ref type="bibr" target="#b21">[22]</ref>, w.l.o.g., we can assume that Ψ (X, Z) is a single atom (any set not satisfying this condition can be translated to an equivalent set that does).</p><p>Definition 1. Let σ be a TGD with body(σ) = ∀XY Φ(X, Y). We say that body(σ) is satisfied in D through homomorphism h if and only if h is such that</p><formula xml:id="formula_4">h(pos(Φ(X, Y))) ⊆ D and h(neg(Φ(X, Y))) ∩ D = ∅.</formula><p>A TGD σ is satisfied in D if and only if whenever the body of σ is satisfied through homomorphism h, there exists an extension h′ of h such that h′(head(σ)) ⊆ D.</p><p>EGDs. Equality generating dependencies (EGDs) are first-order formulas of the form: ∀X Φ(X) → X i = X j , where Φ(X) is a conjunction of atoms, X is a sequence of variables, and X i , X j ∈ X. As with TGDs, we sometimes omit the universal quantifiers.</p><p>The body of an EGD ε, denoted body(ε), corresponds to Φ(X), and the head, denoted head(ε), corresponds to the equality atom on the right-hand side of the rule. An instance D satisfies an EGD ε, denoted D |= ε, if whenever there exists a homomorphism h such that h(body(ε)) ⊆ D, it must be the case that h(X i ) = h(X j ). Otherwise, we say that ε is violated in D.</p></div>
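The homomorphism-based satisfaction conditions above can be sketched operationally. The following Python backtracking search is a minimal illustration (not the paper's implementation): atoms are `(predicate, args)` tuples, and strings prefixed with `?` play the role of variables, while everything else is a constant mapped to itself.

```python
def extend(mapping, term, value):
    """Extend a partial substitution with term -> value, or return None."""
    if not term.startswith("?"):          # constants must map to themselves
        return mapping if term == value else None
    if term in mapping:                   # variable already bound
        return mapping if mapping[term] == value else None
    new = dict(mapping)
    new[term] = value
    return new

def homomorphisms(body, D, mapping=None):
    """Yield every substitution h with h(body) contained in instance D."""
    mapping = mapping if mapping is not None else {}
    if not body:
        yield mapping
        return
    (pred, args), rest = body[0], body[1:]
    for (p, vals) in D:
        if p != pred or len(vals) != len(args):
            continue
        m = mapping
        for t, v in zip(args, vals):
            m = extend(m, t, v)
            if m is None:
                break
        if m is not None:
            yield from homomorphisms(rest, D, m)

D = {("User", ("a1", "Blackstar")), ("Post", ("p1", "a1", "t1"))}
body = [("User", ("?U", "?N")), ("Post", ("?P", "?U", "?T"))]
print(list(homomorphisms(body, D)))
# exactly one homomorphism, binding ?U=a1, ?N=Blackstar, ?P=p1, ?T=t1
```

Checking a negated body atom then amounts to verifying that the mapped atom does not occur in D, as in the satisfaction condition above.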
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Probabilistic Annotations</head><p>Following <ref type="bibr" target="#b18">[19]</ref>, we further assume the existence of a finite set of constants ∆ M , a finite set of predicate names R M such that R ∩ R M = ∅, and an infinite set of variables V M , such that V ∩ V M = ∅. These sets give rise to a corresponding Herbrand universe and Herbrand base consisting of all possible ground terms and atoms, respectively; these are denoted with U M and H M , respectively.</p><p>Let M be a probabilistic model that represents a joint probability distribution over a (finite) set of random Boolean variables X = {X 1 , . . . , X n } such that there is a 1-to-1 mapping from X to the set of all ground atoms in H M . Examples of the type of probabilistic models that we assume in this work are Markov Logic <ref type="bibr" target="#b27">[28]</ref>, Bayesian networks <ref type="bibr" target="#b26">[27]</ref>, etc.</p><p>A (probabilistic) annotation λ, relative to M , is a (finite) set of expressions A = x i , where A is an atom over R M , ∆ M , and V M , x i ∈ {0, 1}, and the X i 's are pairwise distinct. A probabilistic annotation is valid if and only if for any two different expressions A = x, B = y ∈ λ, there does not exist a substitution θ such that θ(A) = θ(B). A probabilistic scenario is a valid probabilistic annotation λ for which |λ| = |X| and all A = x i are such that A is ground. We use scn(M ) to denote the set of scenarios in M . A probabilistic annotation is equivalent to a formula that consists of the disjunction of all the scenarios that belong to the set denoted by the annotation. Therefore, with a slight abuse of notation, we sometimes combine annotations with logical operators.</p><p>Intuitively, a probabilistic annotation is used to describe an event in which the random variables in a probabilistic model M are compatible with the settings of the random variables described by λ, i.e., each X i has the value x i . 
We will attach probabilistic annotations λ to formulas in our language to produce annotated formulas F : λ, where F is an atom, a TGD, or an EGD. Intuitively, the annotation means that F holds whenever the event associated with λ occurs. Note that whenever a random variable's value is left unspecified in a probabilistic annotation, the variable is unconstrained; in particular, a formula annotated with an empty probabilistic annotation holds in every world. An annotated formula F : λ is referred to as a probabilistic atom whenever F is an atom, a probabilistic TGD (pTGD) if F is a TGD, and a probabilistic EGD (pEGD) whenever F is an EGD. Definition 2. A probabilistic Datalog+/- knowledge base is a triple KB = (O, M, af), where O is a finite set of atoms, TGDs, and EGDs, M is a probabilistic model, and af is an annotation function that maps formulas to probabilistic annotations relative to M .</p><p>For ease of presentation, we sometimes use the notation "σ : e" to denote "af(σ) = e". Given a probabilistic Datalog+/- KB KB = (O, M, af) and a scenario λ ∈ scn(M ), O λ denotes the (non-probabilistic) Datalog+/- knowledge base induced by λ. Formally, O λ consists of all formulas F i (atoms, TGDs, and EGDs) such that λ i ⊆ λ for some</p><formula xml:id="formula_5">F i : λ i ∈ O.</formula><p>Example 2. Consider the DB from Example 1, and let KB 1 = (O 1 , M 1 , af 1 ) be a probabilistic Datalog+/- KB. Then, O 1 could contain the following TGD ω 1 :</p><formula xml:id="formula_6">(ω 1 ) User(UID, N ) ∧ hasPosted(UID, t 1 ) → hasPosted(UID, t 2 )</formula><p>The intuition of this rule is: "Every user that has posts on topic t 1 also has posts on topic t 2 ". 
The annotation function af 1 is defined such that af 1 (ω 1 ) = e 1 (t 1 , t 2 ), where e 1 (t 1 , t 2 ) is an event associated with the probability that a user posted something related to t 2 given that he has posted something related to t 1 .</p><p>O 1 could also contain the following EGD ε 1 :</p><formula xml:id="formula_7">(ε 1 ) User(UID 1 , N ) ∧ User(UID 2 , N ) → UID 1 = UID 2 .</formula><p>The intuition of this rule is: "Two users cannot share the same user name". This rule is annotated with the empty event, that is, af 1 (ε 1 ) = ∅; therefore, it always holds.</p><p>The probabilistic model M 1 in the above example can be learned from domain information, and it allows us to infer probabilities about possible duplicate users and correlations regarding certain events, such as posts by users given specific topics. We will discuss this in greater detail in Section 4.</p></div>
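The notions of a valid (ground) annotation and of the induced knowledge base O λ can be sketched as follows; this is a simplified, ground-only illustration in Python, with formulas represented by hypothetical string names and annotations as sets of (atom, value) pairs.

```python
def is_valid(annotation):
    """A ground annotation is valid iff no atom is assigned two values."""
    settings = {}
    for atom, value in annotation:
        if settings.setdefault(atom, value) != value:
            return False
    return True

def induced_kb(O, af, scenario):
    """O_lambda: the formulas whose annotation is contained in the scenario."""
    return [F for F in O if af(F) <= scenario]

# omega1 holds only when e1(t1,t2) is true; epsilon1 has the empty
# annotation, so it holds in every world.
O = ["omega1", "epsilon1"]
annotations = {"omega1": {("e1(t1,t2)", 1)}, "epsilon1": set()}
af = lambda F: annotations[F]

scenario_yes = {("e1(t1,t2)", 1), ("e2", 0)}
scenario_no = {("e1(t1,t2)", 0), ("e2", 1)}

print(induced_kb(O, af, scenario_yes))  # -> ['omega1', 'epsilon1']
print(induced_kb(O, af, scenario_no))   # -> ['epsilon1']
```

Note how the empty annotation behaves exactly as in the text: it is a subset of every scenario, so the annotated formula survives in every induced knowledge base.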
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">A Two-stage Chase Procedure</head><p>We now extend the classical operational semantics for Datalog+/-, based on the chase procedure, to the case of the extensions discussed above. This procedure provides the semantics for the application of rules and constraints, extending the well-known (oblivious) chase algorithm for testing implications of data dependencies <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b22">23]</ref> or completing deductive databases <ref type="bibr" target="#b0">[1]</ref>. The chase provides a mechanism for repairing a database instance relative to a set of dependencies, so that the result of the chase (a database instance) satisfies the dependencies. Note that, though <ref type="bibr" target="#b18">[19]</ref> presents a chase procedure for probabilistic Datalog+/- ontologies, our two-stage chase includes stratified negation and EGDs.</p><p>The building block for the chase procedure is the TGD chase rule; in our setting, we must also deal with the propagation of probabilistic annotations.</p><p>pTGD Chase Rule. We will now define the probabilistic TGD chase rule for a knowledge base KB = (O = (D, Σ), M, af). Let σ ∈ O be a pTGD of the form ∀XY Φ(X, Y) → ∃ZΨ (X, Z) : e σ ; then, σ is applicable to D if and only if there exists a homomorphism h such that one of the following cases occurs: a) 1) h maps the atoms in pos(Φ(X, Y)) to probabilistic atoms {α 1 : e 1 , ..., α k : e k } ⊆ D and no atom in h(neg(Φ(X, Y))) occurs in D, and 2) h(e σ ∧ i=1,...,k e i ) is a valid probabilistic annotation. 
b) 1) h maps the atoms in pos(Φ(X, Y)) and neg(Φ(X, Y)) to probabilistic atoms {α 1 : e 1 , ..., α k : e k } ⊆ D and {α k+1 : e k+1 , ..., α j : e j } ⊆ D, respectively, and 2) h(e σ ∧ i=1,...,k e i ∧ i=k+1,...,j ¬e i ) is a valid probabilistic annotation.</p><p>Let σ be applicable to D, and let h′ be a homomorphism that extends h as follows: for each term X i ∈ X, h′(X i ) = h(X i ), and h′(Z) = z j , where Z is existentially quantified in head(σ) and z j is a "fresh" null, i.e., z j ∈ N , z j does not occur in D, and z j lexicographically follows all other nulls already introduced. The application of σ on D adds to D one of the following probabilistic atoms, according to the case in which σ is applicable: a) h′(Ψ (X, Z)) : h′(e σ ∧ i=1,...,k e i ). b) h′(Ψ (X, Z)) : h′(e σ ∧ i=1,...,k e i ∧ i=k+1,...,j ¬e i ).</p><p>As a second step, we apply the equality-generating dependencies. We do this in a separate step over the result of the pTGD chase in order to avoid undesired interactions between TGDs and EGDs that can lead to undecidability <ref type="bibr" target="#b21">[22]</ref>, since the standard way to avoid this via separability is too strong an assumption in our case (see Section 4). pEGD Chase Rule. Let KB = (O = (D, Σ), M, af) be a probabilistic Datalog+/- KB, and let ε ∈ Σ be a pEGD of the form ∀X Φ(X) → X m = X n : e ε . Then, ε is applicable to D if and only if there exists a homomorphism h such that: a) h maps the atoms in Φ(X) to probabilistic atoms {α 1 : e 1 , ..., α k : e k } ⊆ D, and b) h(e ε ∧ i=1,...,k e i ) is a valid probabilistic annotation.</p><p>Let ε be applicable to D, and Φ Xm = {α 1 : e 1 , ..., α l : e l } ⊆ {α 1 : e 1 , ..., α k : e k } be the set of probabilistic atoms where X m occurs in α 1≤i≤l ; then, the application of ε on D results in:</p><p>(i) The chase fails if h(X m ) and h(X n ) are distinct constants; this is called a hard violation. 
Otherwise, (ii) for all α i : e i ∈ Φ Xm , replace h(X m ) with h(X n ) and e i with h(e ε ∧ i=1,...,k e i ), if h(X m ) is a null and h(X n ) is a constant, or h(X m ) precedes h(X n ) in the lexicographic order.</p><p>The following section shows how this chase procedure can be used to answer probabilistic deduplication queries.</p></div>
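The event propagation performed by the two chase stages can be illustrated with a deliberately simplified Python sketch (single-atom bodies, events as sets conjoined by union); it is not the full chase, but it shows how annotations accumulate in stage 1 (pTGD application with a fresh null) and stage 2 (pEGD instantiation of that null).

```python
import itertools

_null_counter = itertools.count(1)

def apply_ptgd(D, body_pred, head_pred, rule_event):
    """Stage 1: for each atom matching the (single-atom) body, add a head
    atom with a fresh null, annotated with the rule's event conjoined with
    the body atom's events."""
    for (pred, args), events in list(D.items()):
        if pred == body_pred:
            z = f"z{next(_null_counter)}"          # fresh null value
            D[(head_pred, (args[0], z))] = events | {rule_event}

def apply_pegd(D, null, constant, rule_event):
    """Stage 2: replace a null by a constant everywhere, conjoining the
    pEGD's event into the annotations of all affected atoms."""
    for (pred, args) in list(D):
        if null in args:
            events = D.pop((pred, args)) | {rule_event}
            new_args = tuple(constant if a == null else a for a in args)
            D[(pred, new_args)] = events

# hasPosted(a1, t1) holds unconditionally (empty event set).
D = {("hasPosted", ("a1", "t1")): set()}
apply_ptgd(D, "hasPosted", "dedup_hypth", "e1")   # adds dedup_hypth(a1, z1) : {e1}
apply_pegd(D, "z1", "a2", "e2")                   # grounds z1 to a2 : {e1, e2}
print(D)
```

After both stages, the atom dedup_hypth(a1, a2) carries the conjoined event set {e1, e2}, mirroring how annotations are propagated through both chase rules above.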
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Deduplication Programs</head><p>In this section, we define a subclass of probabilistic Datalog+/- programs that are specifically tailored to declaratively capture the requirements for a deduplication task in an open-world, adversarial environment.</p><p>Towards this end, we assume that there exists a set E ⊆ R that we identify as entity relations, i.e., relations that represent entities of interest in the application domain. Tuples in these relations have identifiers (keys) that allow us to compare extensions of the same predicate in different instances and to trace changes to them. Tuple identifiers can be accommodated by simply adding to each predicate E ∈ E an extra attribute (usually denoted with T ) that acts as a primary key in E. Finally, there is also a predefined predicate called dedup hypth.</p><p>Our formalism is inspired by (relational) matching dependencies (RMDs) <ref type="bibr" target="#b4">[5]</ref>, which provide a declarative way of defining conditions under which records (or attribute values) should be made equal; they also establish how this matching is supposed to be done by means of matching functions. However, as we discussed before, RMDs are developed under a closed-world assumption, i.e., everything that is not contained in the database instance is assumed to be false (or to not exist). Nevertheless, we argue that in the type of domain applications we are interested in, this assumption is not realistic, and therefore we must accept that the database instances we are dealing with are potentially incomplete. 
Under such scenarios, RMDs are not expressive enough to capture all potential matchings, since there might be matchings for which we cannot establish specific conditions to find them, but can rather only make hypotheses based on (weaker) evidence of a different nature; the possibility of value invention afforded by existential rules thus becomes crucial.</p><p>To specify the kind of rules we require, we slightly modify pTGDs so they can be used to infer potential matchings (or matching hypotheses). For this, we assume that for every attribute A j appearing in an n-ary relation R ∈ R, where the schema of R is R(A 1 , . . . , A n ), there exists a binary similarity relation ≈ Aj that is reflexive and symmetric. Thus, these rules are capable of expressing similarity conditions over attribute values as done in RMDs, since the "≈" predicates can simply occur in the body.</p></div>
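As an illustration of such a similarity relation, the following Python sketch implements a reflexive and symmetric attribute-level predicate; the normalized edit-distance threshold is a hypothetical stand-in for whatever domain-specific measure ≈ Aj would actually be used.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similar(a: str, b: str, threshold: float = 0.3) -> bool:
    """a ~ b iff their normalized edit distance is below the threshold.
    The relation is reflexive (a ~ a) and symmetric by construction."""
    if a == b:
        return True
    return edit_distance(a, b) / max(len(a), len(b)) < threshold

print(similar("Blackstar", "Blackstarr"))  # -> True
print(similar("Blackstar", "Archer"))      # -> False
```

Because the measure is symmetric in its arguments and every string is similar to itself, the two required properties of ≈ Aj hold by construction; transitivity is deliberately not required.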
<div xmlns="http://www.tei-c.org/ns/1.0"><head>dhTGDs. A Duplication Hypothesis Tuple Generating Dependency (or dhTGD)</head><p>σ is simply a pTGD of the following form, where X, Y, Z are sequences of variables, X and Z are variables with X ∈ X and Z ∈ Z, Φ(X, Y) is a conjunction of atoms and negated atoms over R, and Γ is also a conjunction of atoms:</p><formula xml:id="formula_8">∀XY Φ(X, Y) → ∃Z dedup hypth(X, Z) ∧ Γ (X, Z)</formula><p>dhTGDs are a special class of pTGDs that generate deduplication hypotheses; essentially, they allow us to create hypotheses about potential matchings, which are kept and later confirmed or discarded whenever new information is added to (or extracted from) the data sources. Informally, dhTGDs capture situations in which we have reasons to believe that entities may be duplicated but we do not know which other entity could correspond to them. Figure <ref type="figure" target="#fig_0">1</ref> describes one possible way in which dhTGDs could be automatically derived from data. A relational association rule mining algorithm (such as <ref type="bibr" target="#b11">[12]</ref>) is used to discover patterns in the available data sources, which are rules of the form r : body(r) → head(r); therefore, entities that satisfy the conditions in the body but not those in the head are potential candidates for entities that have duplicate identities (see Example 3 below for a concrete scenario).</p><p>pEGDs as companions to dhTGDs. As we saw in Section 3, equality-generating dependencies are applied as a second step after all (dh)TGDs are applied; the goal of this step is to complete the database with the information that might be available with respect to the null values in the dedup hypth atoms. One kind of pEGD that might be useful can be automatically derived from a dhTGD σ by taking the negation of body − (σ) as the body of the pEGD. 
Figure <ref type="figure">2</ref> illustrates this procedure, and the following is a simple example based on the running example.</p><p>Example 3. Consider the running example, and suppose we have obtained a relational association rule that states that "if a user has posts related to topic t 1 , then he has posts related to topic t 2 ". This can be translated into the following dhTGD:</p><formula xml:id="formula_9">(ω 2 ) User(UID, N ) ∧ hasPosted (UID, t 1 ) ∧ not(hasPosted (UID, t 2 )) → ∃Z dedup hypth(UID, Z) ∧ hasPosted(Z, t 2 ).</formula><p>Fig. <ref type="figure">2</ref>. Illustration of one possible procedure with which a pEGD can be derived from an existing dhTGD (σ): the (unnegated) atoms from body − (σ) plus additional conditions that help find candidate entities to instantiate the deduplication hypothesis.</p><p>The intuition of this rule is "If a user has some post on topic t 1 but does not have any post on topic t 2 , then we generate the hypothesis that there exists another user corresponding to him (whose identity is currently unknown) who has posted on topic t 2 ".</p><p>As shown in Figure <ref type="figure">2</ref>, an equality-generating dependency can be derived from ω 2 by taking the atoms from body − (ω 2 ) and some additional conditions that, when true, can help obtain the identity of the hitherto unknown entity that participates in the deduplication hypothesis:</p><formula xml:id="formula_11">(ε 2 ) dedup hypth(UID 1 , Z) ∧ User(UID 2 , N ) ∧ hasPosted(UID 2 , t 2 ) ∧ WritStSim(UID 1 , UID 2 ) → Z = UID 2 .</formula><p>The intuition of this rule
is "If we have a duplicate user hypothesis between user UID 1 and an unknown user Z, and user UID 2 has posts on topic t 2 with a similar writing style to user UID 1 , then Z may correspond to UID 2 ".</p><p>Putting all of this together, the following example illustrates how the chase procedure unfolds over the scenario in the running example. The annotation function assigns af(ω 2 ) = e 1 (t 1 , t 2 ), where the event e 1 (t 1 , t 2 ) is associated with the probability that a user has some post on t 2 given that he has some post on t 1 , and af(ε 2 ) = e 2 , where the event e 2 is associated with the probability that two users with similar writing style actually correspond to the same underlying user.</p><p>Figure <ref type="figure">3</ref> illustrates the chase for this example. Note that dhTGD ω 2 is applicable because body + (ω 2 ) is mapped to D 1 and body − (ω 2 ) is not mapped to D 1 ; also note how the event e 1 (t 1 , t 2 ) is propagated to the generated atoms. Then, pEGD ε 2 is applicable, so null value z 1 is instantiated and the events e 1 (t 1 , t 2 ) and e 2 are propagated. Finally, for simplicity we specify M directly as a joint probability distribution, considering all possible worlds λ i for the events e 1 (t 1 , t 2 ) and e 2 (cf. Figure <ref type="figure" target="#fig_1">4</ref>).</p><p>In the previous example, the application of the dhTGD ω 2 produces a deduplication hypothesis for user a 1 , but the lack of information causes the candidate for such a duplicate to remain as a null value. However, the subsequent application of pEGD ε 2 lets us conclude that user a 2 , who has a similar writing style to a 1 , is a likely option (the event e 1 (t 1 , t 2 ) ∧ e 2 has probability 0.76 according to Figure <ref type="figure" target="#fig_1">4</ref>). Note that the dhTGD could be applied multiple times, so that if there are other possible candidates, fresh dedup_hypth atoms can be generated as needed.</p><p>Fig. <ref type="figure">5</ref>. Overview of the application of probabilistic Datalog+/- to answer deduplication queries. A set of input databases is considered along with a set of dhTGDs (cf. Figure <ref type="figure">2</ref>). In the first stage of the chase procedure, a set of dedup_hypth atoms is generated; in the second stage, a set of pEGDs is used to assign concrete values to nulls in such atoms; both stages involve keeping track of probabilistic events stating conditions under which inferences hold. The output is a set of dedup_hypth atoms with probability values, which can be used to answer the input query.</p></div>
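The two-stage procedure just described can be sketched in code. The following is a minimal, hypothetical simulation (the data structures and helper names are ours, not from the paper) of the chase over the running example: the classical TGD first, then the dhTGD generating a fresh null, then the pEGD instantiating that null, with the event annotation of each derived atom carried along at every step.

```python
from itertools import count

fresh = count(1)

# Stage 0: input database D1; every atom maps to the set of events under
# which it holds (atoms in D1 hold under the empty event).
D = {
    ("User", "a1", "Blackstar"): frozenset(),
    ("Post", "p1", "a1", "t1"): frozenset(),
    ("User", "a2", "Archer"): frozenset(),
    ("Post", "p2", "a2", "t2"): frozenset(),
    ("WritStSim", "a1", "a2"): frozenset(),
}

# Classical TGD: User(U, N) and Post(P, U, T) imply hasPosted(U, T).
for (pred, *args), ev in list(D.items()):
    if pred == "Post":
        _, uid, topic = args
        D[("hasPosted", uid, topic)] = ev

# dhTGD: a user who posted on t1 but not on t2 yields a duplicate
# hypothesis with a fresh null; annotate with event e1(t1,t2).
for (pred, *args), ev in list(D.items()):
    if pred == "hasPosted" and args[1] == "t1":
        uid = args[0]
        if ("hasPosted", uid, "t2") not in D:   # negated body atom
            z = f"z{next(fresh)}"               # value invention
            e = frozenset(ev | {"e1(t1,t2)"})
            D[("dedup_hypth", uid, z)] = e
            D[("hasPosted", z, "t2")] = e

# pEGD: a user with similar writing style who did post on t2
# instantiates the null; annotate with event e2.
for (pred, *args), ev in list(D.items()):
    if pred == "dedup_hypth" and args[1].startswith("z"):
        uid1, z = args
        for (p2, *a2), _ in list(D.items()):
            if p2 == "WritStSim" and a2[0] == uid1:
                cand = a2[1]
                if ("hasPosted", cand, "t2") in D:
                    D[("dedup_hypth", uid1, cand)] = frozenset(ev | {"e2"})

print(D[("dedup_hypth", "a1", "a2")])  # carries events e1(t1,t2) and e2
```

This reproduces the chase of Figure 3: dedup_hypth(a1, z1) is generated first, and the pEGD then adds dedup_hypth(a1, a2) annotated with both events.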
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Finding Duplicates: Threshold Deduplication Queries</head><p>The problem of identifying seemingly different entities within a knowledge base (or set of knowledge bases) that actually correspond to the same real-world entity can be addressed as a query answering problem: given a knowledge base, we want to have a principled way of answering questions regarding the occurrence of duplicate entities. The chase procedure presented above can be used to answer queries of the form:</p><formula xml:id="formula_13">Q(X) = dedup(eid, X, θ),</formula><p>where eid is a specific entity of interest, and θ ∈ [0, 1] represents a threshold probability value. Answers to queries of this form simply consist of possible values for variable X such that the chase procedure generates atoms dedup_hypth(eid, val) with an associated event e such that Pr M (e) ≥ θ. Figure <ref type="figure">5</ref> illustrates the overall query answering process.</p><p>The following result states that the two-stage chase procedure described in Section 3 can be stopped after a finite number of steps in order to answer threshold deduplication queries whenever the underlying classical TGDs belong to a fragment of Datalog+/- for which conjunctive queries are decidable; this can be guaranteed via syntactic conditions such as guardedness <ref type="bibr" target="#b21">[22]</ref>, stickiness <ref type="bibr" target="#b9">[10]</ref>, acyclicity, or their weak counterparts <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b14">15]</ref>, or semantic conditions such as shyness <ref type="bibr" target="#b20">[21]</ref>, finite expansion sets (fes), bounded treewidth sets (bts), and finite unification sets (fus) <ref type="bibr" target="#b1">[2]</ref>; cf. <ref type="bibr" target="#b28">[29,</ref><ref type="bibr">Section 1.3.3]</ref> for a brief overview of many such conditions. Proposition 1. 
Let KB = (O = (D, Σ), M, af) be a probabilistic Datalog+/- ontology with stratified negation, and Q(X) = dedup(eid, X, θ) be a threshold deduplication query. If O belongs to a fragment of Datalog+/- for which a finite portion of chase(D, Σ) can be computed in order to answer conjunctive queries, then there exists a finite portion of the two-stage probabilistic chase procedure that can be used to answer the threshold deduplication query Q over KB.</p><p>Essentially, this result holds since the first stage of our chase procedure applies TGDs and dhTGDs (which are a special case of classical TGDs) in the order given by the stratification, and the second stage is simply an application of pEGDs over a fixed structure. Both stages involve carrying over probabilistic events, which are then used in conjunction with the probabilistic model to check which atoms surpass the threshold given in the query.</p></div>
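As an illustration of this query answering scheme, the following sketch (hypothetical code, with the joint distribution of Figure 4 hard-coded as a list of possible worlds) answers a threshold deduplication query by summing the probabilities of the worlds that satisfy the event attached to each dedup_hypth atom and keeping the values that reach the threshold:

```python
# Model M: joint distribution over worlds for events e1(t1,t2) and e2.
worlds = [
    ({"e1(t1,t2)": True,  "e2": True},  0.76),
    ({"e1(t1,t2)": True,  "e2": False}, 0.19),
    ({"e1(t1,t2)": False, "e2": True},  0.04),
    ({"e1(t1,t2)": False, "e2": False}, 0.01),
]

def prob(event_conjunction):
    """Pr_M(e): sum of the probabilities of the worlds satisfying e."""
    return sum(p for w, p in worlds
               if all(w[e] for e in event_conjunction))

# dedup_hypth atoms produced by the chase, with their associated events.
chase_output = {("a1", "a2"): {"e1(t1,t2)", "e2"}}

def answer(eid, theta):
    """Answers Q(X) = dedup(eid, X, theta) over the chase output."""
    return {val for (e, val), ev in chase_output.items()
            if e == eid and prob(ev) >= theta}

print(answer("a1", 0.7))   # {'a2'}: Pr(e1(t1,t2) and e2) = 0.76
print(answer("a1", 0.8))   # set(): 0.76 does not reach the threshold
```

In practice the distribution would be specified compactly, e.g. via a Bayesian network, rather than enumerated world by world.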
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusions and Future Work</head><p>In this paper, we have introduced the adversarial deduplication problem as a variant of the classical deduplication (or entity resolution) problem in which one cannot assume that the existence of multiple records corresponding to the same real-world entity is the result of innocent or casual errors such as typos, transliterations, or inconsistently-updated values, but rather the result of actors deliberately trying to avoid identification. As a motivating scenario we took malicious hacker forums and marketplaces on the dark web, where this sort of practice is commonplace and having tools at hand to address it would have a great impact on efforts such as early warning of cyber attacks, understanding the social networks of buyers and sellers, and identification of leading vendors of specific types of products.</p><p>We proposed the use of probabilistic Datalog+/- as a way to tackle the problem of finding pairs of virtual objects for which we have reasons to believe that they correspond to the same real-world entity, where specific kinds of probabilistic TGDs (called deduplication hypothesis-generating dependencies) and EGDs can be applied in order to derive deduplication hypotheses. The main strength of this approach is that it allows us to reason about unknown objects via value invention, which is essential in this setting: not having this capability renders traditional entity resolution tools ineffective. Another advantage of our approach is that dhTGDs can be automatically obtained via relational association rule mining algorithms, and corresponding EGDs can also be derived with additional domain knowledge. Finally, we developed a two-stage probabilistic chase procedure in which (dh)TGDs are first exhaustively applied, and then pEGDs are used to associate concrete values with the null values generated in the first step. 
We showed that this modified chase procedure can be used to answer threshold deduplication queries, leveraging chase termination results from the existential rules and Datalog+/- literature.</p><p>Ongoing and future work involves investigating other kinds of deduplication queries, such as instance checking ("what is the probability that two given virtual objects correspond to the same real-world entity?"), all pairs threshold ("which pairs of virtual objects are suspected to correspond to the same real-world entity with probability at least p?"), and more general conjunctive queries. We are also interested in investigating the complexity of query answering algorithms for all such queries, and under which conditions they can be guaranteed to run in polynomial time. We plan to evaluate the performance of our approach on both synthetic and real-world data on malicious hacker forums and marketplaces on the dark web obtained via a collaboration with a cyber threat intelligence company. Finally, we would also like to explore the relationship between our formalism and other approaches to probabilistic modeling with unknown objects, such as <ref type="bibr" target="#b23">[24]</ref>.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Overview of a possible dhTGD learning process: relational association rule mining algorithms are applied over a set of available data sources; each such rule then leads to one output dhTGD characterizing situations that violate the corresponding association rule and therefore are the basis of a duplicate entity hypothesis.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Example 4 .</head><label>4</label><figDesc>Considering the DB D 1 presented in Example 1, a probabilistic Extended Datalog+/- KB representing a deduplication program can be KB 2 = (O 2 , M, af) where O 2 = (D 1 , Σ 2 ) and M is the same probabilistic model from Example 2. The set Σ 2 contains the (classical) TGD: (σ) User(UID, N ) ∧ Post(Id_po, UID, T ) → hasPosted (UID, T ), and also contains the dhTGD ω 2 and pEGD ε 2 from the previous example. The annotation function af is defined as follows:</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 4 .</head><label>4</label><figDesc>Fig. 4. Joint probability distribution over the events e1(t1, t2) and e2 used in the running example. In general, such distributions can be compactly specified via a probabilistic model like a Bayesian network.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>the confidence at least  with which it is believed to hold.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Fig. 3 .</head><label>3</label><figDesc>Graphic representation of the chase for Example 4. Boxes with bold outline represent atoms in D1, boxes with dashed outline represent atoms that contain null values, and boxes with standard outline represent atoms without nulls obtained from some dhTGD or pEGD. The events associated with each atom are represented underneath the corresponding box; all atoms in D1 are associated with empty events, and the events propagated are represented next to the arrow of the corresponding dhTGD or pEGD.</figDesc><table><row><cell cols="2">User(a 1 ,'Blackstar')</cell><cell>Post(p 1 ,a 1 ,t 1 )</cell><cell>User(a 2 ,'Archer')</cell><cell>Post(p 2 ,a 2 ,t 2 )</cell><cell>User(a 3 ,'Angel Eyes')</cell><cell>Post(p 3 ,a 3 ,t 3 )</cell></row><row><cell></cell><cell>TGD σ</cell><cell></cell><cell>TGD σ</cell><cell>TGD σ</cell></row><row><cell></cell><cell cols="2">hasPosted(a 1 ,t 1 )</cell><cell cols="2">hasPosted(a 2 ,t 2 )</cell><cell>hasPosted(a 3 ,t 3 )</cell></row><row><cell>e 1 (t 1 ,t 2 )</cell><cell>dhTGD ω 2</cell><cell></cell><cell>pEGD ε 2</cell><cell>WritStSim(a 1 ,a 2 )</cell></row><row><cell></cell><cell></cell><cell>e 1 (t 1 ,t 2 )</cell><cell></cell></row><row><cell>e 1 (t 1 ,t 2 )</cell><cell></cell><cell></cell><cell>e 2</cell></row><row><cell></cell><cell></cell><cell></cell><cell>e 1 (t 1 ,t 2 ) ∧ e 2</cell></row><row><cell></cell><cell></cell><cell cols="2">λ i e 1 (t 1 ,t 2 ) e 2</cell><cell>Probability</cell></row><row><cell></cell><cell></cell><cell cols="2">λ1 true true</cell><cell>0.76</cell></row><row><cell></cell><cell></cell><cell cols="2">λ2 true false</cell><cell>0.19</cell></row><row><cell></cell><cell></cell><cell cols="2">λ3 false true</cell><cell>0.04</cell></row><row><cell></cell><cell></cell><cell cols="2">λ4 false false</cell><cell>0.01</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgments. This work was supported by funds provided by CONICET, Agencia Nacional de Promoción Científica y Tecnológica, Universidad Nacional del Sur, Argentina, by the EU H2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement 690974 for the project "MIREL: MIning and REasoning with Legal texts", and by the US Department of the Navy, Office of Naval Research, grant N00014-15-1-2742. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Office of Naval Research.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Foundations of Databases: The Logical Level</title>
		<editor>Abiteboul, S., Hull, R., Vianu, V.</editor>
		<imprint>
			<date type="published" when="1995">1995</date>
			<publisher>Addison-Wesley Longman Publishing Co., Inc</publisher>
			<pubPlace>Boston, MA, USA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Walking the complexity lines for generalized guarded existential rules</title>
		<author>
			<persName><forename type="first">J</forename><surname>Baget</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mugnier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rudolph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Thomazo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. IJCAI</title>
				<meeting>IJCAI</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="712" to="717" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Erblox: Combining matching dependencies with machine learning for entity resolution</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Bahmani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">E</forename><surname>Bertossi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Vasiloglou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Int. J. Approx. Reasoning</title>
		<imprint>
			<biblScope unit="volume">83</biblScope>
			<biblScope unit="page" from="118" to="141" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A proof procedure for data dependencies</title>
		<author>
			<persName><forename type="first">C</forename><surname>Beeri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">Y</forename><surname>Vardi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. ACM</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="718" to="741" />
			<date type="published" when="1984">1984</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Data cleaning and query answering with matching dependencies and matching functions</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">E</forename><surname>Bertossi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kolahi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">V S</forename><surname>Lakshmanan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Theory Comput. Syst</title>
		<imprint>
			<biblScope unit="volume">52</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="441" to="482" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Collective entity resolution in relational data</title>
		<author>
			<persName><forename type="first">I</forename><surname>Bhattacharya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Getoor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Trans. Knowl. Discov. Data</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">1</biblScope>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Query-time entity resolution</title>
		<author>
			<persName><forename type="first">I</forename><surname>Bhattacharya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Getoor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Artif. Intell. Res</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="page" from="621" to="657" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Data fusion</title>
		<author>
			<persName><forename type="first">J</forename><surname>Bleiholder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Naumann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="41" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Taming the infinite chase: Query answering under expressive relational constraints</title>
		<author>
			<persName><forename type="first">A</forename><surname>Calì</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gottlob</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kifer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Artif. Intell. Res</title>
		<imprint>
			<biblScope unit="volume">48</biblScope>
			<biblScope unit="page" from="115" to="174" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Towards more expressive ontology languages: The query answering problem</title>
		<author>
			<persName><forename type="first">A</forename><surname>Calì</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gottlob</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pieris</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artif. Intell</title>
		<imprint>
			<biblScope unit="volume">193</biblScope>
			<biblScope unit="page" from="87" to="128" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<ptr target="http://cve.mitre.org/" />
		<title level="m">CVE: Common vulnerabilities and exposures: The standard for information security vulnerability names</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Discovery of relational association rules</title>
		<author>
			<persName><forename type="first">L</forename><surname>Dehaspe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Toivonen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Relational data mining</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="189" to="212" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">An automatic system for identifying authorities in digital libraries</title>
		<author>
			<persName><forename type="first">I</forename><surname>Diaz-Valenzuela</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Martin-Bautista</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Vila</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Campaña</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Expert Syst. Appl</title>
		<imprint>
			<biblScope unit="volume">40</biblScope>
			<biblScope unit="issue">10</biblScope>
			<biblScope unit="page" from="3994" to="4002" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Duplicate record detection: A survey</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Elmagarmid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">G</forename><surname>Ipeirotis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">S</forename><surname>Verykios</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE T. Knowl. Data En</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="16" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Data exchange: semantics and query answering</title>
		<author>
			<persName><forename type="first">R</forename><surname>Fagin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">G</forename><surname>Kolaitis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Popa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Theor. Comput. Sci</title>
		<imprint>
			<biblScope unit="volume">336</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="89" to="124" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Dependencies revisited for improving data quality</title>
		<author>
			<persName><forename type="first">W</forename><surname>Fan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ACM PODS</title>
				<meeting>ACM PODS</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="159" to="170" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Reasoning about record matching rules</title>
		<author>
			<persName><forename type="first">W</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ma</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PVLDB</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="407" to="418" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Entity resolution: Theory, practice and open challenges</title>
		<author>
			<persName><forename type="first">L</forename><surname>Getoor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Machanavajjhala</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PVLDB</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">12</biblScope>
			<biblScope unit="page" from="2018" to="2019" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Query answering under probabilistic uncertainty in Datalog+/-ontologies</title>
		<author>
			<persName><forename type="first">G</forename><surname>Gottlob</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lukasiewicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">V</forename><surname>Martinez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">I</forename><surname>Simari</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Ann. Math. Artif. Intell</title>
		<imprint>
			<biblScope unit="volume">69</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="37" to="72" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Expressiveness of guarded existential rule languages</title>
		<author>
			<persName><forename type="first">G</forename><surname>Gottlob</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rudolph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Simkus</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ACM PODS</title>
				<meeting>ACM PODS</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="27" to="38" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Efficiently computable datalog∃ programs</title>
		<author>
			<persName><forename type="first">N</forename><surname>Leone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Manna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Terracina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Veltri</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. KR</title>
				<meeting>KR</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">A general Datalog-based framework for tractable query answering over ontologies</title>
		<author>
			<persName><forename type="first">T</forename><surname>Lukasiewicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Calì</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gottlob</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Web Semant</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="issue">0</biblScope>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Testing implications of data dependencies</title>
		<author>
			<persName><forename type="first">D</forename><surname>Maier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">O</forename><surname>Mendelzon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sagiv</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM T. Database Syst</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="455" to="469" />
			<date type="published" when="1979">1979</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">BLOG: probabilistic models with unknown objects</title>
		<author>
			<persName><forename type="first">B</forename><surname>Milch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Marthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Russell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Sontag</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">L</forename><surname>Ong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolobov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. IJCAI</title>
				<meeting>IJCAI</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="1352" to="1359" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<ptr target="https://nvd.nist.gov/" />
		<title level="m">National Vulnerability Database</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
		<respStmt>
			<orgName>NIST</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Darknet and deepnet mining for proactive cybersecurity threat intelligence</title>
		<author>
			<persName><forename type="first">E</forename><surname>Nunes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Diab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">T</forename><surname>Gunn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Marin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Mishra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Paliath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Robertson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shakarian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Thart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shakarian</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ISI</title>
				<meeting>ISI</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="7" to="12" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<title level="m" type="main">Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pearl</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1988">1988</date>
			<publisher>Morgan Kaufmann Publishers Inc</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Markov logic networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Richardson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Domingos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Mach. Learn</title>
		<imprint>
			<biblScope unit="volume">62</biblScope>
			<biblScope unit="issue">1-2</biblScope>
			<biblScope unit="page" from="107" to="136" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Ontology-Based Data Access Leveraging Subjective Reports</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">I</forename><surname>Simari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Molinaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">V</forename><surname>Martinez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lukasiewicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Predoiu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="s">Springer Briefs in Computer Science</title>
		<imprint>
			<date type="published" when="2017">2017</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Entity resolution with iterative blocking</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Whang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Menestrina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Koutrika</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Theobald</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Garcia-Molina</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. SIGMOD</title>
		<meeting>SIGMOD</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
