Training Relation Embeddings under Logical Constraints

Pushpendre Rastogi1 Adam Poliak1 Benjamin Van Durme1,2
pushpendre@jhu.edu azpoliak@cs.jhu.edu vandurme@cs.jhu.edu
1 Center for Language and Speech Processing
2 Human Language Technology Center of Excellence
Johns Hopkins University

Abstract

We present ways of incorporating logical rules into the construction of embedding based Knowledge Base Completion (KBC) systems. Enforcing "logical consistency" in the predictions of a KBC system guarantees that the predictions comply with logical rules such as symmetry, implication and generalized transitivity. Our method encodes logical rules about entities and relations as convex constraints in the embedding space, enforcing the condition that the score of a logically entailed fact must never be less than the minimum score of its antecedent facts. Such constraints provide a weak guarantee that the predictions made by our KBC model will match the output of a logical knowledge base for many types of logical inferences. We validate our method via experiments on a knowledge graph derived from WordNet.

1 Introduction

A number of state of the art methods for Knowledge Base Completion (KBC) use a representation learning framework and learn distributed vector representations, i.e. embeddings, of the entities and relations in a Knowledge Base (KB). Although such models make correct predictions on a sizable portion of the data, they cannot be guaranteed to follow logical rules or to make inferences consistent with those rules. For example, no existing KBC method can guarantee that if an embedding based KB predicts the fact that Alice murdered Bob, (Murdered,(Alice,Bob)), then it will also predict that Alice killed Bob, even though this is trivial to enforce in a traditional logical inference system by specifying the rule that Murdered implies Killed. In this paper we present a novel method for directly encoding logical rules via convex constraints on the embeddings. Such methods for directly "shaping" the feasible subspaces of embeddings based on logical properties of relations have not been deeply explored before, and we show through our experiments that such a method can improve the performance of an existing KBC system.

2 Method

Let a knowledge base be defined as a tuple (F, L), with F a set of statements and L a set of first order logic rules. Every element f ∈ F is itself a nested tuple (r,(e,e′)) which states that the entities e and e′ are connected via the relation r. Let E and R be the sets of all entities and relations respectively. Let T be the set of all entity tuples that appear in F, and let U denote the universe of all possible facts, i.e. T = {t | (r,t) ∈ F} and U = {(r,(e,e′)) | r ∈ R, e,e′ ∈ E}. Note that T ≤ F (per standard convention we denote the size of a set by the corresponding roman symbol, e.g. E is the size of E). Finally, Fᶜ = U \ F is the set of unknown facts. The goal of a KBC system is to rank the elements of Fᶜ so that correct facts receive a smaller rank than incorrect facts.

Embedding Model: We assume that every relation r ∈ R and entity e ∈ E can be represented by real valued vectors r ∈ R^d and e ∈ R^d̃; d and d̃ may have different values. The vector representation of each tuple t is computed from its constituent entities via a composition function c : R^d̃ × R^d̃ → R^d, i.e. t = c(e, e′). For example, c may denote vector concatenation, in which case
t = [eᵀ, e′ᵀ]ᵀ. We will use the semicolon symbol ; as an infix operator to denote vector concatenation, i.e. (x; y) = [xᵀ, yᵀ]ᵀ. Finally, x ≥ y denotes that the vector x is elementwise larger than y, and B(x, ρ) denotes the L2 ball centered at x with radius ρ.

Score Function: A majority of the existing work on embedding based KBC measures the correctness of a fact via a scoring function, score : R × E × E → R, with the property that when score(f) > score(f′), fact f is more likely to be correct than f′. The two major classes of score functions are:

score(f) = ⟨r, t⟩    (1)
score(f) = −‖r − t‖²    (2)

In Equations (1–2), r and t are vector representations of r and t = (e, e′), respectively, that are constituents of f = (r,(e,e′)). For brevity, we will omit this expansion from here onwards.

Unconstrained Objectives for Learning the Score Function: Rendle et al. (2009) proposed the Bayesian Personalized Ranking (BPR) objective as a way of tuning recommendation systems when a user can only observe positive training data, such as correct facts, while the facts that are absent may be either correct or incorrect. In this paper we focus on the BPR objective since it has been used for learning the parameters of KBC systems by various researchers (Rendle et al., 2009; Demeester et al., 2016; Riedel et al., 2013). Wang and Cohen (2016) experimentally showed that the BPR objective outperforms other objectives such as hinge loss and log loss. BPR posits that the training data is a single joint sample of U(U−1) Bernoulli random variables {b_ff′ | f ∈ U, f′ ∈ U, f′ ≠ f}. b_ff′ equals 1 when f ∈ F and f′ ∈ Fᶜ, and 0 otherwise. b_ff′ is parameterized by its probability p_ff′, and all b_ff′ are conditionally independent given the probabilities p_ff′. The probability values must obey the reasonable condition that p_ff′ = 1 − p_f′f. A natural way to satisfy this condition is to parameterize p_ff′ as σ(score(f) − score(f′)), where σ is the sigmoid function, σ(x) = 1/(1 + exp(−x)), which has the useful properties that σ(x) + σ(−x) = 1 and dσ(x)/dx = σ(x)σ(−x). The BPR estimator is simply the L2 regularized MLE estimator of this probabilistic model, with regularization strength α. Table 1 lists instances of the BPR objective that arise with different score functions.

Models A (t = (e; e′)) and R (t = e ⊗ e′), score (1):
J = Σ_{(f,f̄)∈F×Fᶜ} −log σ(⟨r,t⟩ − ⟨r̄,t̄⟩) + α(Σ_{r∈R} ‖r‖² + Σ_{e∈E} ‖e‖²)

Model B (t = (e; e′; ⟨e,e′⟩)), score (1):
J = Σ_{(f,f̄)∈F×Fᶜ} −log σ(⟨r1,e⟩ + ⟨r2,e′⟩ + ⟨e,e′⟩ − ⟨r̄1,ē⟩ − ⟨r̄2,ē′⟩ − ⟨ē,ē′⟩) + α(Σ_{r∈R} ‖r‖² + Σ_{e∈E} ‖e‖²)

Models C (t = (e; e′)) and T (t = e − e′), score (2):
J = Σ_{(f,f̄)∈F×Fᶜ} −log σ(‖r̄−t̄‖² − ‖r−t‖²) + α(Σ_{r∈R} ‖r‖² + Σ_{e∈E} ‖e‖²)

Model D (t = (e; e′; ‖e−e′‖)), score (2):
J = Σ_{(f,f̄)∈F×Fᶜ} −log σ(‖r̄1−ē‖² + ‖r̄2−ē′‖² + ‖ē−ē′‖² − ‖r1−e‖² − ‖r2−e′‖² − ‖e−e′‖²) + α(Σ_{r∈R} ‖r‖² + Σ_{e∈E} ‖e‖²)

Table 1: Instances of the BPR objective corresponding to different choices of composition and score functions. For example, if c(e,e′) = (e; e′) and Equation 1 is used as the score function then we need to minimize the function in the first row with respect to r and e. In the first and third rows r = (r1; r2), in the second row r = (r1; r2; 1) and in the last row r = (r1; r2; 0). The symbol ⊗ refers to the vector outer product operator that takes two vectors of size d̃ and produces a vector of size d̃².
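To make the notation concrete, the following is a minimal sketch, in NumPy, of the Model A score function and a single term of the BPR objective. The helper names (compose_concat, score_inner, bpr_pair_loss) are our own illustrative choices, and the regularization term is omitted; this is not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def compose_concat(e, e_prime):
    # Model A composition: t = (e; e'), i.e. vector concatenation.
    return np.concatenate([e, e_prime])

def score_inner(r, t):
    # Score function (1): score(f) = <r, t>.
    return float(np.dot(r, t))

def bpr_pair_loss(score_pos, score_neg):
    # One term of the BPR objective: -log sigma(score(f) - score(f_bar)).
    return -np.log(sigmoid(score_pos - score_neg))

# Toy usage: a known fact should end up scored above a corrupted fact.
d_tilde = 4
rng = np.random.default_rng(0)
r = rng.normal(size=2 * d_tilde)                      # r = (r1; r2)
e, e_prime, e_corrupt = (rng.normal(size=d_tilde) for _ in range(3))
loss = bpr_pair_loss(score_inner(r, compose_concat(e, e_prime)),
                     score_inner(r, compose_concat(e, e_corrupt)))
print(loss)
```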
The scoring function of the model with t = e ⊗ e′ is equivalent to that of RESCAL, so we call that model R; similarly, the scoring function of model T is the same as that of TransE.

Logical Consistency of Embeddings through Constraints: Our general scheme for incorporating logical rules into embeddings is to ensure, during the learning of the vector representations of entities and relations, that the score of a consequent fact is never less than the minimum score of its antecedents. In other words, if f1,...,f_{n−1} ⟹ f_n then score(f_n) ≥ min_{i∈[1,n−1]} score(f_i). If this does not hold, then it is possible for our KB to assign a low score to a logically entailed fact even though all of its antecedents have a high score. We analyze common logical rules found in large scale KBs and, for different combinations of a logical rule and a scoring function, we devise inequalities that the score function should satisfy. We translate those inequalities into constraints that restrict the entity and relation representations in a KB. We use the projected subgradient descent algorithm for learning the parameters of our KBC system. Algorithm 1 shows a specific instance, for Model A and batch size 1, of our parameter learning algorithm with a general set of rules L. We now show how to construct convex constraints from logical rules.

2.1 Constraints for Logical Consistency: Relational Implication

We now present the constraints for guaranteeing that the predictions of embedding based KBC systems are consistent with logical rules, starting with implication rules. An implication rule of the form RELIMP(r,r′) specifies that if a fact f = (r,t) is correct, then (r′,t) must also be correct. For example, the rule RELIMP(HusbandOf, SpouseOf) enforces that if our KB predicts the fact (HusbandOf,(Don,Mel)), then it will also predict (SpouseOf,(Don,Mel)). As explained above, we can enforce such a rule by ensuring that score(r′,t) ≥ score(r,t) for all t ∈ T (we abuse notation in writing score(r,(x,y)) as score(r,x,y)).

When we use the inner product score function (1), this inequality can be enforced by ensuring that ⟨r′ − r, t⟩ ≥ 0 for all t ∈ T. If we constrain t to lie in a subset of R^d, say T, then the implication rule can be enforced by constraining r′ − r to lie in the dual cone of T, denoted T*. A very convenient special case arises when we choose T to be a "self dual cone", for which T = T*. The set of vectors with nonnegative entries, R^d_+, is one example of such a self dual cone (other self-dual cones distinct from R^d_+ also exist, such as the Lorentz cone {x ∈ R^d | x_d ≥ √(Σ_{i=1}^{d−1} x_i²)}; we refer the reader to Gruber (2007) for more details on the geometry of closed convex cones and their polar and dual sets).

When we use the L2 distance score function (2), the restriction on the score function translates into the following constraints on the vector representations: ‖r − t‖² − ‖r′ − t‖² ≥ 0 ⟹ ⟨r − r′, r + r′ − 2t⟩ ≥ 0 ⟹ ⟨r − r′, r + r′⟩/2 ≥ h_T(r − r′). Here h_T(x) is the value of the support function of T at x, defined as h_T(x) = sup_{t∈T} ⟨x, t⟩. For this constraint to be feasible it is necessary and sufficient that h_T be finite at x = r − r′ for at least one choice of r − r′. Once we have fixed r − r′, the vector r + r′ can easily be chosen from the halfspace H⁻(r′ − r, 2h_T(r − r′)). Note that if h_T is difficult to compute then implementing this constraint will also be difficult, therefore we must choose T wisely. One good choice of T is R^d_+: h_{R^d_+}(r − r′) is finite (and equal to zero) iff r − r′ ≤ 0, in which case r + r′ must lie in the halfspace ⟨r − r′, r + r′⟩ ≥ 0.
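For the inner product models with the self-dual-cone choice T = R^d_+ (models A, R and B in Table 2), the resulting constraints are simply e ≥ 0 for every entity and r ≤ r′ for every RELIMP(r, r′) rule. Below is a minimal sketch, under our own naming, of the corresponding Euclidean projections; it illustrates the idea rather than reproducing the authors' implementation.

```python
import numpy as np

def project_entity_nonneg(e):
    # Keep entity embeddings in the self-dual cone R^d_+ (elementwise clip).
    return np.maximum(e, 0.0)

def project_relimp_pair(r, r_prime):
    # Euclidean projection of (r, r') onto {(r, r') : r <= r' elementwise}.
    # Coordinates that violate r_i <= r'_i are both moved to their midpoint,
    # which is the closest feasible point for that coordinate pair.
    r, r_prime = r.copy(), r_prime.copy()
    bad = r > r_prime
    mid = 0.5 * (r[bad] + r_prime[bad])
    r[bad], r_prime[bad] = mid, mid
    return r, r_prime

# With every entity embedding nonnegative and r <= r', we get
# <r' - r, t> >= 0 for every tuple embedding t = (e; e'), which is exactly
# the RELIMP score inequality under the inner product score.
```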
Unfortunately, the problem of finding r and r′ vectors that satisfy the constraint ⟨r − r′, r + r′⟩ ≥ 0 with r − r′ ≤ 0 is non-convex, and it is not possible to project onto this set given a pair of vectors that violates it. We remedy this situation by adding the additional constraint that r + r′ must also lie in the negative orthant, i.e. r + r′ ≤ 0. Table 2 presents all the derived constraints. Unfortunately, since the T model defines t = e − e′, it is not possible to set T = R^d_+. In the case of the T model, if we constrain e to lie in B(0, ρ) then t must lie in B(0, 2ρ) and h_T(r − r′) = 2ρ‖r − r′‖, so the constraint becomes ⟨r − r′, r + r′⟩/‖r − r′‖ ≥ 4ρ. One way to make this constraint amenable to efficient projection is to enforce that r + r′ = 4ρ(r − r′) and ‖r − r′‖ ≥ 1, which implies ‖r′‖₂ ≥ |2ρ − 0.5|. This constraint becomes trivial if ρ = 0.25.

2.1.1 Reverse Relational Implication and Symmetry

A reverse relational implication rule, denoted REVIMP(r,r′), specifies that if (r,(x,y)) is correct then (r′,(y,x)) is also correct, for all (x,y) ∈ T. This rule can be enforced through the inequality score(r′,y,x) ≥ score(r,x,y). Depending on the model, let r = (r1; r2), (r1; r2; 1) or (r1; r2; 0), as shown in Table 1, and decompose r′ similarly; we will omit this detail in later sections. Under models A and B this inequality translates to the constraint ⟨y, r′1⟩ + ⟨x, r′2⟩ ≥ ⟨x, r1⟩ + ⟨y, r2⟩, and under models C and D the necessary constraints are ⟨r1 − r′2, r1 + r′2 − 2x⟩ ≥ ⟨r′1 − r2, r′1 + r2 − 2y⟩. Stronger versions of these constraints, which are more efficient to enforce, are shown in Table 2.

A symmetry rule, denoted SYMM(r), specifies that if the fact (r,(e,e′)) is known to be correct then (r,(e′,e)) is also correct. We can only comply with this logical rule in an embedding based KB by ensuring that score(r,e,e′) = score(r,e′,e). Under all four score models this rule can be enforced only by ensuring that r1 = r2.

2.1.2 Entailment

A type A entailment rule, denoted ENTAILA(r,e,r′,e′), specifies that (r,(e,x)) implies (r′,(e′,x)) for all x ∈ E (we use the term entailment in the sense of entailment of properties; note that this is different from implication). A type B entailment rule, ENTAILB(r,e,r′,e′), specifies that (r,(x,e)) implies (r′,(x,e′)); r and r′ may denote the same relation. For example, the rule ENTAILB(IsA, Man, IsA, Mortal) can be used to enforce that if the KB predicts (IsA,(Socrates,Man)) then it must also predict (IsA,(Socrates,Mortal)). The final constraints required to implement a type B entailment rule are shown in Table 3. In detail, to implement a type B entailment rule we need to ensure that score(r′,x,e′) ≥ score(r,x,e) for all x ∈ E. Under model A this inequality translates to ⟨r′1 − r1, x⟩ ≥ ⟨r2, e⟩ − ⟨r′2, e′⟩. Model B requires ⟨r′1 − r1 + e′ − e, x⟩ ≥ ⟨r2, e⟩ − ⟨r′2, e′⟩. Model C requires ⟨r1 − r′1, r1 + r′1 − 2x⟩ ≥ ⟨r′2 − e′ + r2 − e, r′2 − e′ − r2 + e⟩, and finally the constraint on model D's score function is ⟨r1 − r′1, r1 + r′1⟩ + ⟨e − e′, e + e′⟩ − ⟨r′2 − e′ − r2 + e, r′2 − e′ + r2 − e⟩ ≥ 2⟨r1 − r′1 + e − e′, x⟩.
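As a sanity check on the model A row of Table 3, the following sketch (our own, with illustrative names) samples parameters, keeps only those that satisfy the constraints r′1 ≥ r1 and ⟨r2, e⟩ ≤ ⟨r′2, e′⟩ with nonnegative entity embeddings, and verifies that the ENTAILB score inequality then holds for random nonnegative x.

```python
import numpy as np

def score_model_A(r1, r2, head, tail):
    # Model A: score((r1; r2), (head; tail)) = <r1, head> + <r2, tail>.
    return float(np.dot(r1, head) + np.dot(r2, tail))

def entail_b_constraints_hold(r1, r2, rp1, rp2, e, e_prime):
    # Table 3, model A constraints for ENTAILB(r, e, r', e').
    return bool(np.all(rp1 >= r1)) and np.dot(r2, e) <= np.dot(rp2, e_prime)

d, rng = 5, np.random.default_rng(1)
checked = 0
for _ in range(5000):
    r1, r2, rp1, rp2 = (rng.normal(size=d) for _ in range(4))
    e, e_prime = rng.uniform(0, 1, size=d), rng.uniform(0, 1, size=d)
    if not entail_b_constraints_hold(r1, r2, rp1, rp2, e, e_prime):
        continue
    x = rng.uniform(0, 1, size=d)   # an arbitrary nonnegative entity embedding
    assert score_model_A(rp1, rp2, x, e_prime) >= score_model_A(r1, r2, x, e) - 1e-9
    checked += 1
print(f"verified the implication on {checked} sampled parameter settings")
```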
2.1.3 Property Transitivity

A property transitivity rule, denoted PROTRANS(r,r′,e′,r″,e″), specifies that if (r,(x,y)) and (r′,(y,e′)) are true then (r″,(x,e″)) is also true. For example, the rule PROTRANS(Partner, Convicted, Criminal, Suspected, Criminal) can be used to incorporate into the embedding based KB the common sense rule that the partner of a convicted criminal will be suspected of being a criminal. Note that the score of the hypothesis fact (r″,(x,e″)) should be high if the antecedent facts have high scores for any possible entity y. A natural way to incorporate such a rule into score based KBC models is to ensure that score(r″,x,e″) ≥ max_{y∈E} min(score(r,x,y), score(r′,y,e′)). In order to derive efficient constraints that enforce this inequality, we strengthen the requirement on the score function by replacing the min in the lower bound with a convex combination of the scores: for λ ∈ (0,1) we enforce score(r″,x,e″) ≥ max_{y∈E} [λ score(r,x,y) + (1−λ) score(r′,y,e′)]. Since a convex combination of two values is never less than their minimum, this inequality is stronger. For model A it translates to the constraint ⟨e″, r″2⟩ − (1−λ)⟨e′, r′2⟩ + ⟨x, r″1 − λr1⟩ ≥ ⟨y, λr2 + (1−λ)r′1⟩. Let a = (r″1 − λr1 + e″)/λ, b = −(λr2 + (1−λ)(r′1 + e′))/λ, c = (⟨r″2, e″⟩ − (1−λ)⟨r′2, e′⟩)/λ, and let Ê denote a set that contains the embedding of every entity in E. For model B, the above inequality on the score function leads to the constraint: ∀x, y ∈ Ê, ⟨x, y⟩ ≤ ⟨x, a⟩ + ⟨y, b⟩ + c. Remember that our goal is to devise a set Ê, and constraints on the relation embeddings, such that it is efficient to project onto them and such that the above inequality is guaranteed. The following proposition (proof omitted for lack of space) shows how to construct such a set:

Proposition 1. Let x, y be members of R^d_+ ∩ B(a, ‖a‖) and a ≥ 0; then ⟨x, y⟩ ≤ ⟨x + y, a⟩.

The proposition shows that if a = b and c ≥ 0 then by setting Ê = R^d_+ ∩ B(a, ‖a‖) we can satisfy the above constraints.

2.1.4 Type Implication

A type implication rule, denoted TYPEIMP(r,e,r′), specifies that if the fact (r,(x,y)) is correct then (r′,(x,e)) is also correct, for all (x,y) ∈ T. In other words, this rule enforces that positional arguments of a relation possess certain properties. For example, the rule TYPEIMP(HusbandOf, Man, Gender) can enforce that if our KB predicts the fact (HusbandOf,(Don,Mel)) then it also predicts (Gender,(Don,Man)). Under model A the TYPEIMP(r,e,r′) rule translates to the following inequality on the parameters: ⟨x, r′1⟩ − ⟨x, r1⟩ ≥ ⟨y, r2⟩ − ⟨e, r′2⟩ for all (x,y) ∈ T. Let a = e + r′1 − r1, b = −r2 and c = ⟨r′2, e⟩. Under model B, the restriction on the score function translates to ⟨x, y⟩ ≤ ⟨a, x⟩ + ⟨b, y⟩ + c. The analysis for this case again relies on Proposition 1; the analysis for models C and D is yet out of reach. See Table 4.

Algorithm 1: Projected SGD for Model A, batch size = 1.
Given: F, Fᶜ, L. Hyperparameters: α, η, S.
for each fact f ∈ F do
    for S steps do
        Sample f̄ = (r̄, t̄) from Fᶜ
        Let v = σ(⟨r̄, t̄⟩ − ⟨r, t⟩)
        ▷ Fix e and optimize J
        ∂J(f)/∂r = −tv,  ∂J(f)/∂r̄ = t̄v
        (r; r̄) ← proj_L((r; r̄) − η((∂J(f)/∂r; ∂J(f)/∂r̄) + 2α(r; r̄)))
        ▷ Fix r and optimize J
        ∂J(f)/∂t = −rv,  ∂J(f)/∂t̄ = r̄v
        (t; t̄) ← proj_L((t; t̄) − η((∂J(f)/∂t; ∂J(f)/∂t̄) + 2α(t; t̄)))

RELIMP(r,r′), models A, R, B: r ≤ r′
RELIMP(r,r′), models C, D: r ≤ r′ ≤ −r
REVIMP(r,r′), models A, B: r′2 ≥ r1, r′1 ≥ r2
REVIMP(r,r′), models C, D: r1 ≤ r′2 ≤ −r1, r2 ≤ r′1 ≤ −r2
REVIMP(r,r′), model R: matrix(r′) ≥ matrix(r)

Table 2: Constraints sufficient for enforcing RELIMP(r,r′) and REVIMP(r,r′). The constraint e ≥ 0 for all e ∈ E applies for all models. matrix is the inverse of the operation that converts a matrix to a vector by concatenating its columns, i.e. matrix(r) denotes the matrix form of the vector r.
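The following is a minimal sketch, under our own naming, of the kind of update Algorithm 1 describes: one BPR gradient step for Model A on the relation vectors, followed by a projection proj_L that here enforces nonnegative entities and the RELIMP constraint r ≤ r′ from Table 2. The analogous step that fixes r and updates the entity vectors is omitted for brevity; this is an illustration, not the authors' released implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def proj_rules(R, E, relimp_rules):
    # proj_L: entities stay in R^d_+; each RELIMP(r, r') forces r <= r'.
    for name in E:
        E[name] = np.maximum(E[name], 0.0)
    for r_name, rp_name in relimp_rules:
        bad = R[r_name] > R[rp_name]
        mid = 0.5 * (R[r_name][bad] + R[rp_name][bad])
        R[r_name][bad], R[rp_name][bad] = mid, mid
    return R, E

def sgd_relation_step(R, E, fact, neg_fact, relimp_rules, eta=0.1, alpha=1e-3):
    # One relation-side update of Algorithm 1 for Model A, batch size 1.
    (r, (e, ep)), (rb, (eb, epb)) = fact, neg_fact
    t = np.concatenate([E[e], E[ep]])
    tb = np.concatenate([E[eb], E[epb]])
    v = sigmoid(np.dot(R[rb], tb) - np.dot(R[r], t))
    # dJ/dr = -t v and dJ/dr_bar = t_bar v, plus the L2 regularizer.
    R[r] = R[r] - eta * (-t * v + 2 * alpha * R[r])
    R[rb] = R[rb] - eta * (tb * v + 2 * alpha * R[rb])
    return proj_rules(R, E, relimp_rules)

# Toy usage with two relations, three entities, and one RELIMP rule.
rng = np.random.default_rng(0)
E = {name: rng.uniform(0, 1, size=3) for name in ["Alice", "Bob", "Carol"]}
R = {name: rng.normal(size=6) for name in ["Murdered", "Killed"]}
R, E = sgd_relation_step(R, E, ("Murdered", ("Alice", "Bob")),
                         ("Murdered", ("Alice", "Carol")),
                         relimp_rules=[("Murdered", "Killed")])
```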
ENTAILB, model A: r′1 ≥ r1, ⟨r2, e⟩ ≤ ⟨r′2, e′⟩
ENTAILB, model B: r′1 ≥ r1 + e − e′, ⟨r2, e⟩ ≤ ⟨r′2, e′⟩
ENTAILB, model C: r1 ≤ r′1 ≤ −r1, r2 − e ≤ r′2 − e′, r′2 + r2 ≤ e′ + e
ENTAILB, model D: r1 − r′1 ≤ e′ − e, e ≥ e′, r1 ≤ r′1 ≤ −r1, r2 − e ≤ r′2 − e′, r′2 + r2 ≤ e′ + e

Table 3: Sufficient constraints for ENTAILB(r,e,r′,e′). The constraint e ≥ 0 for all e ∈ E applies for all models.

PROTRANS, model A: r″1 ≥ λr1, λr2 + (1−λ)r′1 ≤ 0, ⟨e″, r″2⟩ ≥ (1−λ)⟨e′, r′2⟩
PROTRANS, model B: r″1 + e″ + (1−λ)(r′1 + e′) = 0, ⟨r″2, e″⟩ ≥ (1−λ)⟨r′2, e′⟩, and a ≥ 0, ∀e ∈ E: e ∈ R^d_+ ∩ B(a, ‖a‖)
TYPEIMP, model A: r′1 ≥ r1, ⟨e, r′2⟩ ≥ 0 and r2 ≤ 0
TYPEIMP, model B: e + r′1 = r1 − r2, ⟨r′2, e⟩ > 0, and −r2 ≥ 0, ∀e ∈ E: e ∈ R^d_+ ∩ B(−r2, ‖r2‖)

Table 4: Constraints for enforcing PROTRANS(r,r′,e′,r″,e″) and TYPEIMP(r,e,r′). Here a = (r″1 − λr1 + e″)/λ.

3 Related Work

The problem of enforcing consistency between the predictions made by a machine learning system and a first order logic system, which is what our work attempts to do, has a long history of research, but we can only review recent work on learning representations of the entities and relations of a knowledge graph; we refer the reader to reviews of neural-symbolic systems (Garcez et al., 2002; Hammer and Hitzler, 2007) for more references. Grefenstette 2013 presented a novel model for simulating propositional logic with the help of tensors; however, the model relied on high-dimensional boolean embeddings of the entities and relations, and of the rules presented in this paper it only guaranteed adherence to RELIMP. Rocktäschel et al. 2014; 2015 generalized Grefenstette's work by learning embeddings of entities and relations that are real valued and low dimensional, and their learning mechanism can accommodate arbitrary first order logic formulae in the parameter learning objective by propositionalizing the formulae. Their method has two drawbacks in comparison to our proposal: 1) the process of propositionalization can be very expensive, especially for rules like PROTRANS and TYPEIMP that quantify over tuples of entities, and 2) their method of differentiation through logic does not guarantee that the learnt embeddings will always be able to predict unseen relations that are logically entailed by the rules and the training data. Bowman et al. 2015a,b presented a neural network based method for predicting the existence of natural logic relations between two entities; their approach too had the drawback that it could not guarantee the inference of logically entailed facts. Guo et al. 2015 presented a method based on LLE (Roweis and Saul, 2000) for incorporating side information in the form of semantic categories of entities, but their method cannot incorporate the range of logical rules that we can. Demeester et al. 2016 and Vendrov et al. 2016 proposed constraining the learnt embeddings in a way that is identical to the method prescribed by us in Subsection 2.1. Our work generalizes their approach in two ways: first, we generalize their proposed constraints using the language of convex geometry, and second, we propose constraints for many more logical rules and score functions than either of the two papers. Wang and Cohen 2016 presented a novel method of factorizing the adjacency matrix of a proof graph of a probabilistic logic language to learn embeddings of first order logic formulas; our method is conceptually simpler than theirs and requires fewer training stages. Finally, Guo et al. 2016 proposed an alternative method for embedding rules and entities based on t-norm fuzzy logics which is very similar to Rocktäschel et al.
2015's approach.

4 Experiment: Logical Deduction and Knowledge Base Completion on WordNet

Our method for training embedding based KBC systems allows for a very interesting application: solving logical puzzles with an embedding based KBC system, without using an external logical-symbolic subsystem. We perform a controlled experiment in which we compare the performance of an embedding based KBC system trained with the constraints against a system trained without them.

Data: Consider the logical deduction problem shown in Table 5. This is a simplified version of a logical puzzle presented in Russell and Norvig 1995. In this puzzle, Nono is a country that possesses a WMD and Benedict has traded with Nono. The KB has to deduce whether Benedict is a criminal based on just two input facts and three rules. The total number of possible facts is 5² × 4 = 100.

Rules:
RELIMP(TradeWith, TransactWith)
ENTAILB(Possess, WMD, Considered, Enemy)
PROTRANS(TransactWith, Enemy, Considered, Criminal, Considered)
Facts:
(Possess,(Nono,WMD))
(TradeWith,(Benedict,Nono))
Query:
? (Considered,(Benedict,Criminal))

Table 5: A logical deduction problem. Based on the rules and facts, a KB should infer that Benedict is a criminal.

Model | Baseline P@10 / MRR / MAP | ELKB P@10 / MRR / MAP
A | 0.00 / 0.02 / 0.01 | 0.20† / 0.44† / 0.83†
B | 0.00 / 0.03 / 0.03 | 0.17† / 0.26† / 0.35†

Table 6: Table of results. The baseline of R is equivalent to the RESCAL method. Bold marks the higher average performance. † indicates that the difference is significant with a two-tailed p-value ≤ 0.005 as measured by a matched pair t-test.

Evaluation: We train two versions of two KBC systems, models A and B, with batch size = 1, α = 0.001, η = 0.1, S = 200, d = 50 and d̃ = 25, using Algorithm 1. Both KBs were trained in one pass over the two training facts. The only difference is that the baseline system does not constrain the embeddings to obey the logically derived geometric constraints. After training we queried the KBs for the scores of all possible facts. We ranked all the facts based on their scores, excluding the training facts, and marked all facts that could be logically entailed from the two training facts as correct results and the rest as incorrect. We performed 10 runs and in each run computed the MRR, P@10 and MAP for the two models; finally we averaged these quantities over the 10 runs.

Results: Table 6 shows that our method ranks logically entailed facts with much higher precision and recall than the baseline systems. This validates our intuition that logical rules can be usefully incorporated into the parameter learning mechanism of a KBC system via simple geometric constraints, even for low dimensional embeddings. The reason for the large improvement of the ELKB system over the baseline is that the constraints force the scores of entailed facts to be higher than the scores of non-entailed facts during learning. E.g. the scores of entailed facts such as (Considered,(Nono,Enemy)) and (TransactWith,(Benedict,Nono)) are forced to be high in comparison to non-entailed facts such as (TradeWith,(Benedict,WMD)). The baseline method does not have this systematic advantage and its scores remain unchanged.

4.1 Link Prediction on WordNet

In the link prediction task, the KBC system is given incomplete facts with either a missing head entity or a missing tail entity, i.e. given either (r,(_,e′)) or (r,(e,_)) the system has to predict e or e′ respectively.
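For reference, ranking metrics of the kind reported below can be computed as in this minimal sketch (our own illustration, not the paper's evaluation code): candidates for the missing entity are scored, sorted, and MRR and HITS@k are derived from the rank of the correct candidate.

```python
import numpy as np

def rank_of_correct(scores, correct_idx):
    # 1-based rank of the correct candidate under descending score order;
    # ties are broken by candidate index.
    order = np.argsort(-scores, kind="stable")
    return int(np.where(order == correct_idx)[0][0]) + 1

def mrr_and_hits(ranks, k=10):
    ranks = np.asarray(ranks, dtype=float)
    return {"MRR": float(np.mean(1.0 / ranks)),
            f"HITS@{k}": float(np.mean(ranks <= k))}

# Toy usage: two queries, each with five candidate entities.
ranks = [rank_of_correct(np.array([0.1, 0.9, 0.3, 0.2, 0.0]), correct_idx=1),
         rank_of_correct(np.array([0.5, 0.4, 0.6, 0.1, 0.2]), correct_idx=0)]
print(mrr_and_hits(ranks, k=3))
```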
We evaluated the utility of the proposed constraints by comparing the performance of models A and B trained with and without the constraints. We now present the results of our experiments on the WN18 knowledge graph,7 derived from WordNet and released by Bordes et al. 2013, which is a popular testbed for KBC algorithms (Wang et al., 2014; Lin et al., 2015; Toutanova et al., 2015; Yang et al., 2015).

Data: The WN18 dataset comes with standard train, development and test splits containing 141,442, 5,000 and 5,000 facts respectively. The dataset has 18 relations and 40,943 entities. Recently Guo et al. 2016 publicly released a list of logical rules8 which we incorporated directly into our framework; all of their rules are REVIMP rules. For all models we fixed batch size = 10, α = 0.001, η = 0.125, S = 200 and d̃ = 100; for model T, d = 100, and otherwise d = 200. Following existing work we measured the MRR, HITS@3 and HITS@10 metrics and report their average over the two tasks of head entity prediction and tail entity prediction. Instead of training in a single pass we trained our models for 50 epochs on the WN18 dataset and chose the best parameters by early stopping on the validation set; in other words, we used the parameters from the epoch that performed best on the validation set in terms of the HITS@10 metric. Finally, we combined the predictions of the best performing T model and the best performing system based on model B. To combine the two ranking systems we trained a logistic regression classifier, using the default settings in Vowpal Wabbit,9 to first predict whether model T or model B will produce the better ranking and then output that system's ranking over entities for evaluation. Our logistic regression classifier had 73% accuracy on the training data and 70% accuracy on the test data. Using this third system we were able to create a single ranking system that performs better than model T, which is very similar to the TransE model.10

Model | Project | MRR | HITS@3 | HITS@10
A | No | 0.0152 | 0.016 | 0.0330
A | Yes | 0.0238 | 0.03 | 0.0514
B | No | 0.0677 | 0.072 | 0.137
B | Yes | 0.241 | 0.283 | 0.50
T | No | 0.311 | 0.412 | 0.66
B (project) + T | – | 0.367 | 0.475 | 0.708

Table 7: MRR, HITS@3 and HITS@10 of the constrained and unconstrained versions of models A and B, and of unconstrained T. B+T reports the results of combining model B and model T.

Results: Table 7 shows that both the constrained and unconstrained versions of model A perform quite poorly. This is to be expected since model A scores a triplet (r,(e,e′)) as ⟨r1,e⟩ + ⟨r2,e′⟩: regardless of the entity that is given, the ranking over candidates for the missing entity remains the same. Therefore model A is clearly unsuitable for this task, and for a similar reason model C is also unsuitable. However, the drastic improvement in the performance of model B when it is trained with the constraints corresponding to the REVIMP rules demonstrates the utility of our proposed constraints: after adding the constraints, the MRR increased from 0.0677 to 0.241 and HITS@10 from 0.137 to 0.50. Recall that at test / inference time the constraints play no role, so their only effect is as a form of regularization on the parameters of the model.

5 Conclusion

We have presented a novel method for incorporating logical constraints into an embedding based knowledge base by constraining the parameters of the KB.
Our experiments on a small logical deduction problem, and on WordNet, indicate that our idea of imposing geometric constraints on embeddings to enforce logical rules is sound, and that it can improve the generalization of models that are otherwise hard to train. Although the KBC models A, B, C and D do not perform as well as existing models trained without constraints, such as TransE, we show that they can be used as part of a combination of systems to improve upon existing methods.

7 We found that the performance of models C, D and R was too low, therefore we do not report their results.
8 aclweb.org/anthology/attachments/D/D16/D16-1019.Attachment.zip
9 https://github.com/JohnLangford/vowpal_wabbit
10 The main differences between model T and TransE are that TransE uses hinge loss rather than the BPR objective, and that TransE does not regularize the relation embeddings and forces the entity embeddings to lie on the unit sphere, whereas in model T we add a quadratic regularization term to regularize the embeddings.

References

Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and Yakhnenko, O. (2013). Translating embeddings for modeling multi-relational data. In NIPS, pages 2787–2795.
Bowman, S. R., Potts, C., and Manning, C. D. (2015a). Learning distributed word representations for natural logic reasoning. In Proceedings of the AAAI Spring Symposium on Knowledge Representation and Reasoning.
Bowman, S. R., Potts, C., and Manning, C. D. (2015b). Recursive neural networks can learn logical semantics. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pages 12–21, Beijing, China. ACL.
Demeester, T., Rocktäschel, T., and Riedel, S. (2016). Lifted rule injection for relation embeddings. In Proceedings of EMNLP, pages 1389–1399, Austin, Texas. ACL.
Garcez, A. S. d., Gabbay, D. M., and Broda, K. B. (2002). Neural-Symbolic Learning Systems: Foundations and Applications. Springer-Verlag New York, Inc., Secaucus, NJ, USA.
Grefenstette, E. (2013). Towards a formal distributional semantics: Simulating logical calculi with tensors. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1, pages 1–10, Atlanta, Georgia, USA. ACL.
Gruber, P. M. (2007). Convex and Discrete Geometry. Springer.
Guo, S., Wang, Q., Wang, B., Wang, L., and Guo, L. (2015). Semantically smooth knowledge graph embedding. In Proceedings of ACL, pages 84–94, Beijing, China. ACL.
Guo, S., Wang, Q., Wang, L., Wang, B., and Guo, L. (2016). Jointly embedding knowledge graphs and logical rules. In Proceedings of EMNLP, pages 192–202, Austin, Texas. ACL.
Hammer, B. and Hitzler, P. (2007). Perspectives of Neural-Symbolic Integration, volume 77. Springer.
Lin, Y., Liu, Z., Sun, M., Liu, Y., and Zhu, X. (2015). Learning entity and relation embeddings for knowledge graph completion. In The 29th AAAI Conference on Artificial Intelligence, pages 2181–2187. AAAI.
Rendle, S., Freudenthaler, C., Gantner, Z., and Schmidt-Thieme, L. (2009). BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of UAI, pages 452–461. AUAI Press.
Riedel, S., Yao, L., McCallum, A., and Marlin, B. M. (2013). Relation extraction with matrix factorization and universal schemas. In Proceedings of NAACL-HLT, pages 74–84, Atlanta, Georgia. ACL.
Rocktäschel, T., Bošnjak, M., Singh, S., and Riedel, S. (2014). Low-dimensional embeddings of logic. In Proceedings of the ACL 2014 Workshop on Semantic Parsing, pages 45–49, Baltimore, MD. ACL.
Rocktäschel, T., Singh, S., and Riedel, S. (2015). Injecting logical background knowledge into embeddings for relation extraction. In Proceedings of NAACL.
Roweis, S. T. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326.
Russell, S. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ.
Toutanova, K., Chen, D., Pantel, P., Poon, H., Choudhury, P., and Gamon, M. (2015). Representing text for joint embedding of text and knowledge bases. In Proceedings of EMNLP, pages 1499–1509, Lisbon, Portugal. ACL.
Vendrov, I., Kiros, J. R., Fidler, S., and Urtasun, R. (2016). Order-embeddings of images and language. In Proceedings of the International Conference on Learning Representations.
Wang, W. Y. and Cohen, W. W. (2016). Learning first-order logic embeddings via matrix factorization. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI 2016), New York, NY. AAAI.
Wang, Z., Zhang, J., Feng, J., and Chen, Z. (2014). Knowledge graph embedding by translating on hyperplanes. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 1112–1119. AAAI Press.
Yang, B., Yih, W.-t., He, X., Gao, J., and Deng, L. (2015). Embedding entities and relations for learning and inference in knowledge bases. In Proceedings of the International Conference on Learning Representations.