Training Relation Embeddings under Logical Constraints

Pushpendre Rastogi1 Adam Poliak1 Benjamin Van Durme1,2
pushpendre@jhu.edu azpoliak@cs.jhu.edu vandurme@cs.jhu.edu
1 Center for Language and Speech Processing
2 Human Language Technology Center of Excellence
Johns Hopkins University

Abstract

We present ways of incorporating logical rules into the construction of embedding based Knowledge Base Completion (KBC) systems. Enforcing "logical consistency" in the predictions of a KBC system guarantees that the predictions comply with logical rules such as symmetry, implication and generalized transitivity. Our method encodes logical rules about entities and relations as convex constraints in the embedding space, enforcing the condition that the score of a logically entailed fact must never be less than the minimum score of its antecedent facts. Such constraints provide a weak guarantee that the predictions made by our KBC model will match the output of a logical knowledge base for many types of logical inferences. We validate our method via experiments on a knowledge graph derived from WordNet.

1 Introduction

A number of state of the art methods for Knowledge Base Completion (KBC) use a representation learning framework and learn distributed vector representations, i.e. embeddings, of the entities and relations in a Knowledge Base (KB). Although such models make correct predictions on a sizable portion of the data, they cannot be guaranteed to follow logical rules or to make inferences consistent with those rules. For example, no existing KBC method can guarantee that if an embedding based KB predicts the fact that Alice murdered Bob, (Murdered,(Alice,Bob)), then it will also predict that Alice killed Bob, even though this is trivial to enforce in a traditional logical inference system by specifying the rule that Murdered implies Killed. In this paper we present a novel method for directly encoding logical rules via convex constraints on the embeddings. Such methods for directly "shaping" the feasible subspaces of embeddings based on logical properties of relations have not been deeply explored before, and we show through our experiments that such a method can improve the performance of an existing KBC system.

2 Method

Let a knowledge base be defined as a tuple (F, L), with F a set of statements and L a set of first order logic rules. Every element f ∈ F is itself a nested tuple (r,(e,e′)) which states that the entities e and e′ are connected via the relation r. Let E and R be the sets of all entities and relations respectively. Let T be the set of all entity tuples that appear in F, and let U denote the universe of all possible facts, i.e. T = {t | (r,t) ∈ F} and U = {(r,(e,e′)) | r ∈ R, e,e′ ∈ E}. Note that T ≤ F (per standard convention we denote the size of a set by the corresponding roman symbol, e.g. E is the size of E). Finally, Fᶜ = U \ F is the set of unknown facts. The goal of a KBC system is to rank the elements of Fᶜ so that correct facts receive a smaller rank than incorrect facts.

Embedding Model: We assume that every relation r ∈ R and entity e ∈ E can be represented by real valued vectors r ∈ R^d and e ∈ R^d̃; d and d̃ may have different values. The vector representation of each tuple t is computed from its constituent entities via a composition function c : R^d̃ × R^d̃ → R^d, i.e. t = c(e, e′). For example, c may denote vector concatenation, in which case
t = [eᵀ, e′ᵀ]ᵀ. We will use the semicolon symbol ; as an infix operator to denote vector concatenation, i.e. (x; y) = [xᵀ, yᵀ]ᵀ. Finally, x ≥ y denotes that the vector x is elementwise larger than y, and B(x, ρ) denotes the L2 ball centered at x with radius ρ.

Score Function: A majority of the existing work on embedding based KBC measures the correctness of a fact via a scoring function, score : R × E × E → R, with the property that when score(f) > score(f′), fact f is more likely to be correct than f′. The two major classes of score functions are:

score(f) = ⟨r, t⟩    (1)
score(f) = −‖r − t‖²    (2)

In Equations (1–2), r and t are vector representations of r and t = (e, e′), respectively, that are constituents of f = (r,(e,e′)). For brevity, we will omit this expansion from here onwards.

Unconstrained Objectives for Learning the Score Function: Rendle et al. (2009) proposed the Bayesian Personalized Ranking (BPR) objective as a way of tuning recommendation systems when a user can only observe positive training data, such as correct facts, while the facts that are absent may be either correct or incorrect. In this paper we focus on the BPR objective since it has been used for learning the parameters of KBC systems by various researchers (Rendle et al., 2009; Demeester et al., 2016; Riedel et al., 2013). Wang and Cohen (2016) experimentally showed that the BPR objective outperforms other objectives such as hinge loss and log loss. BPR posits that the training data is a single joint sample of U(U−1) Bernoulli random variables {b_ff′ | f ∈ U, f′ ∈ U, f′ ≠ f}. b_ff′ equals 1 when f ∈ F and f′ ∈ Fᶜ, and 0 otherwise. b_ff′ is parameterized by its probability p_ff′, and all b_ff′ are conditionally independent given the probabilities p_ff′. The probability values must obey the reasonable condition that p_ff′ = 1 − p_f′f. A natural way to satisfy this condition is to parameterize p_ff′ as σ(score(f) − score(f′)), where σ is the sigmoid function, σ(x) = 1/(1 + exp(−x)), which has the useful properties that σ(x) + σ(−x) = 1 and dσ(x)/dx = σ(x)σ(−x). The BPR estimator is simply the L2 regularized MLE estimator of this probabilistic model, with regularization strength α. Table 1 lists instances of the BPR objective that arise with different score functions.

Models A (t = (e; e′)) and R (t = e ⊗ e′), score (1):
J = Σ_{(f,f̄)∈F×Fᶜ} −log σ(⟨r,t⟩ − ⟨r̄,t̄⟩) + α(Σ_{r∈R} ‖r‖² + Σ_{e∈E} ‖e‖²)

Model B (t = (e; e′; ⟨e,e′⟩)), score (1):
J = Σ_{(f,f̄)∈F×Fᶜ} −log σ(⟨r1,e⟩ + ⟨r2,e′⟩ + ⟨e,e′⟩ − ⟨r̄1,ē⟩ − ⟨r̄2,ē′⟩ − ⟨ē,ē′⟩) + α(Σ_{r∈R} ‖r‖² + Σ_{e∈E} ‖e‖²)

Models C (t = (e; e′)) and T (t = e − e′), score (2):
J = Σ_{(f,f̄)∈F×Fᶜ} −log σ(‖r̄−t̄‖² − ‖r−t‖²) + α(Σ_{r∈R} ‖r‖² + Σ_{e∈E} ‖e‖²)

Model D (t = (e; e′; ‖e−e′‖)), score (2):
J = Σ_{(f,f̄)∈F×Fᶜ} −log σ(‖r̄1−ē‖² + ‖r̄2−ē′‖² + ‖ē−ē′‖² − ‖r1−e‖² − ‖r2−e′‖² − ‖e−e′‖²) + α(Σ_{r∈R} ‖r‖² + Σ_{e∈E} ‖e‖²)

Table 1: Instances of the BPR objective corresponding to different choices of composition and score functions. For example, if c(e,e′) = (e; e′) and Equation 1 is used as the score function then we need to minimize the function in the first row with respect to r and e. In the first and third rows r = (r1; r2), in the second row r = (r1; r2; 1) and in the last row r = (r1; r2; 0). The symbol ⊗ refers to the vector outer product operator that takes two vectors of size d̃ and produces a vector of size d̃².
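To make the notation concrete, the following is a minimal sketch, in NumPy, of the Model A score function and a single term of the BPR objective. The helper names (compose_concat, score_inner, bpr_pair_loss) are our own illustrative choices, and the regularization term is omitted; this is not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def compose_concat(e, e_prime):
    # Model A composition: t = (e; e'), i.e. vector concatenation.
    return np.concatenate([e, e_prime])

def score_inner(r, t):
    # Score function (1): score(f) = <r, t>.
    return float(np.dot(r, t))

def bpr_pair_loss(score_pos, score_neg):
    # One term of the BPR objective: -log sigma(score(f) - score(f_bar)).
    return -np.log(sigmoid(score_pos - score_neg))

# Toy usage: a known fact should end up scored above a corrupted fact.
d_tilde = 4
rng = np.random.default_rng(0)
r = rng.normal(size=2 * d_tilde)                      # r = (r1; r2)
e, e_prime, e_corrupt = (rng.normal(size=d_tilde) for _ in range(3))
loss = bpr_pair_loss(score_inner(r, compose_concat(e, e_prime)),
                     score_inner(r, compose_concat(e, e_corrupt)))
print(loss)
```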
The scoring function of the model with t = e ⊗ e′ is equivalent to that of RESCAL, so we call that model R; similarly, the scoring function of model T is the same as that of TransE.

Logical Consistency of Embeddings through Constraints: Our general scheme for incorporating logical rules into embeddings is to ensure, during the learning of the vector representations of entities and relations, that the score of a consequent fact is never less than the minimum score of its antecedents. In other words, if f1,...,f_{n−1} ⟹ f_n then score(f_n) ≥ min_{i∈[1,n−1]} score(f_i). If this does not hold, then it is possible for our KB to assign a low score to a logically entailed fact even though all of its antecedents have a high score. We analyze common logical rules found in large scale KBs and, for different combinations of a logical rule and a scoring function, we devise inequalities that the score function should satisfy. We translate those inequalities into constraints that restrict the entity and relation representations in a KB. We use the projected subgradient descent algorithm for learning the parameters of our KBC system. Algorithm 1 shows a specific instance, for Model A and batch size 1, of our parameter learning algorithm with a general set of rules L. We now show how to construct convex constraints from logical rules.

2.1 Constraints for Logical Consistency: Relational Implication

We now present the constraints for guaranteeing that the predictions of embedding based KBC systems are consistent with logical rules, starting with implication rules. An implication rule of the form RELIMP(r,r′) specifies that if a fact f = (r,t) is correct, then (r′,t) must also be correct. For example, the rule RELIMP(HusbandOf, SpouseOf) enforces that if our KB predicts the fact (HusbandOf,(Don,Mel)), then it will also predict (SpouseOf,(Don,Mel)). As explained above, we can enforce such a rule by ensuring that score(r′,t) ≥ score(r,t) for all t ∈ T (we abuse notation in writing score(r,(x,y)) as score(r,x,y)).

When we use the inner product score function (1), this inequality can be enforced by ensuring that ⟨r′ − r, t⟩ ≥ 0 for all t ∈ T. If we constrain t to lie in a subset of R^d, say T, then the implication rule can be enforced by constraining r′ − r to lie in the dual cone of T, denoted T*. A very convenient special case arises when we choose T to be a "self dual cone", for which T = T*. The set of vectors with nonnegative entries, R^d_+, is one example of such a self dual cone (other self-dual cones distinct from R^d_+ also exist, such as the Lorentz cone {x ∈ R^d | x_d ≥ √(Σ_{i=1}^{d−1} x_i²)}; we refer the reader to Gruber (2007) for more details on the geometry of closed convex cones and their polar and dual sets).

When we use the L2 distance score function (2), the restriction on the score function translates into the following constraints on the vector representations: ‖r − t‖² − ‖r′ − t‖² ≥ 0 ⟹ ⟨r − r′, r + r′ − 2t⟩ ≥ 0 ⟹ ⟨r − r′, r + r′⟩/2 ≥ h_T(r − r′). Here h_T(x) is the value of the support function of T at x, defined as h_T(x) = sup_{t∈T} ⟨x, t⟩. For this constraint to be feasible it is necessary and sufficient that h_T be finite at x = r − r′ for at least one choice of r − r′. Once we have fixed r − r′, the vector r + r′ can easily be chosen from the halfspace H⁻(r′ − r, 2h_T(r − r′)). Note that if h_T is difficult to compute then implementing this constraint will also be difficult, therefore we must choose T wisely. One good choice of T is R^d_+: h_{R^d_+}(r − r′) is finite (and equal to zero) iff r − r′ ≤ 0, in which case r + r′ must lie in the halfspace ⟨r − r′, r + r′⟩ ≥ 0.
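For the inner product models with the self-dual-cone choice T = R^d_+ (models A, R and B in Table 2), the resulting constraints are simply e ≥ 0 for every entity and r ≤ r′ for every RELIMP(r, r′) rule. Below is a minimal sketch, under our own naming, of the corresponding Euclidean projections; it illustrates the idea rather than reproducing the authors' implementation.

```python
import numpy as np

def project_entity_nonneg(e):
    # Keep entity embeddings in the self-dual cone R^d_+ (elementwise clip).
    return np.maximum(e, 0.0)

def project_relimp_pair(r, r_prime):
    # Euclidean projection of (r, r') onto {(r, r') : r <= r' elementwise}.
    # Coordinates that violate r_i <= r'_i are both moved to their midpoint,
    # which is the closest feasible point for that coordinate pair.
    r, r_prime = r.copy(), r_prime.copy()
    bad = r > r_prime
    mid = 0.5 * (r[bad] + r_prime[bad])
    r[bad], r_prime[bad] = mid, mid
    return r, r_prime

# With every entity embedding nonnegative and r <= r', we get
# <r' - r, t> >= 0 for every tuple embedding t = (e; e'), which is exactly
# the RELIMP score inequality under the inner product score.
```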
Unfortunately, the problem of finding r and r′ vectors that satisfy the constraint ⟨r − r′, r + r′⟩ ≥ 0 with r − r′ ≤ 0 is non-convex, and it is not possible to project onto this set given a pair of vectors that violates it. We remedy this situation by adding the additional constraint that r + r′ must also lie in the negative orthant, i.e. r + r′ ≤ 0. Table 2 presents all the derived constraints. Unfortunately, since the T model defines t = e − e′, it is not possible to set T = R^d_+. In the case of the T model, if we constrain e to lie in B(0, ρ) then t must lie in B(0, 2ρ) and h_T(r − r′) = 2ρ‖r − r′‖, so the constraint becomes ⟨r − r′, r + r′⟩/‖r − r′‖ ≥ 4ρ. One way to make this constraint amenable to efficient projection is to enforce that r + r′ = 4ρ(r − r′) and ‖r − r′‖ ≥ 1, which implies ‖r′‖₂ ≥ |2ρ − 0.5|. This constraint becomes trivial if ρ = 0.25.

2.1.1 Reverse Relational Implication and Symmetry

A reverse relational implication rule, denoted REVIMP(r,r′), specifies that if (r,(x,y)) is correct then (r′,(y,x)) is also correct, for all (x,y) ∈ T. This rule can be enforced through the inequality score(r′,y,x) ≥ score(r,x,y). Depending on the model, let r = (r1; r2), (r1; r2; 1) or (r1; r2; 0), as shown in Table 1, and decompose r′ similarly; we will omit this detail in later sections. Under models A and B this inequality translates to the constraint ⟨y, r′1⟩ + ⟨x, r′2⟩ ≥ ⟨x, r1⟩ + ⟨y, r2⟩, and under models C and D the necessary constraints are ⟨r1 − r′2, r1 + r′2 − 2x⟩ ≥ ⟨r′1 − r2, r′1 + r2 − 2y⟩. Stronger versions of these constraints, which are more efficient to enforce, are shown in Table 2.

A symmetry rule, denoted SYMM(r), specifies that if the fact (r,(e,e′)) is known to be correct then (r,(e′,e)) is also correct. We can only comply with this logical rule in an embedding based KB by ensuring that score(r,e,e′) = score(r,e′,e). Under all four score models this rule can be enforced only by ensuring that r1 = r2.

2.1.2 Entailment

A type A entailment rule, denoted ENTAILA(r,e,r′,e′), specifies that (r,(e,x)) implies (r′,(e′,x)) for all x ∈ E (we use the term entailment in the sense of entailment of properties; note that this is different from implication). A type B entailment rule, ENTAILB(r,e,r′,e′), specifies that (r,(x,e)) implies (r′,(x,e′)); r and r′ may denote the same relation. For example, the rule ENTAILB(IsA, Man, IsA, Mortal) can be used to enforce that if the KB predicts (IsA,(Socrates,Man)) then it must also predict (IsA,(Socrates,Mortal)). The final constraints required to implement a type B entailment rule are shown in Table 3. In detail, to implement a type B entailment rule we need to ensure that score(r′,x,e′) ≥ score(r,x,e) for all x ∈ E. Under model A this inequality translates to ⟨r′1 − r1, x⟩ ≥ ⟨r2, e⟩ − ⟨r′2, e′⟩. Model B requires ⟨r′1 − r1 + e′ − e, x⟩ ≥ ⟨r2, e⟩ − ⟨r′2, e′⟩. Model C requires ⟨r1 − r′1, r1 + r′1 − 2x⟩ ≥ ⟨r′2 − e′ + r2 − e, r′2 − e′ − r2 + e⟩, and finally the constraint on model D's score function is ⟨r1 − r′1, r1 + r′1⟩ + ⟨e − e′, e + e′⟩ − ⟨r′2 − e′ − r2 + e, r′2 − e′ + r2 − e⟩ ≥ 2⟨r1 − r′1 + e − e′, x⟩.
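As a sanity check on the model A row of Table 3, the following sketch (our own, with illustrative names) samples parameters, keeps only those that satisfy the constraints r′1 ≥ r1 and ⟨r2, e⟩ ≤ ⟨r′2, e′⟩ with nonnegative entity embeddings, and verifies that the ENTAILB score inequality then holds for random nonnegative x.

```python
import numpy as np

def score_model_A(r1, r2, head, tail):
    # Model A: score((r1; r2), (head; tail)) = <r1, head> + <r2, tail>.
    return float(np.dot(r1, head) + np.dot(r2, tail))

def entail_b_constraints_hold(r1, r2, rp1, rp2, e, e_prime):
    # Table 3, model A constraints for ENTAILB(r, e, r', e').
    return bool(np.all(rp1 >= r1)) and np.dot(r2, e) <= np.dot(rp2, e_prime)

d, rng = 5, np.random.default_rng(1)
checked = 0
for _ in range(5000):
    r1, r2, rp1, rp2 = (rng.normal(size=d) for _ in range(4))
    e, e_prime = rng.uniform(0, 1, size=d), rng.uniform(0, 1, size=d)
    if not entail_b_constraints_hold(r1, r2, rp1, rp2, e, e_prime):
        continue
    x = rng.uniform(0, 1, size=d)   # an arbitrary nonnegative entity embedding
    assert score_model_A(rp1, rp2, x, e_prime) >= score_model_A(r1, r2, x, e) - 1e-9
    checked += 1
print(f"verified the implication on {checked} sampled parameter settings")
```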
2.1.3 Property Transitivity

A property transitivity rule, denoted PROTRANS(r,r′,e′,r″,e″), specifies that if (r,(x,y)) and (r′,(y,e′)) are true then (r″,(x,e″)) is also true. For example, the rule PROTRANS(Partner, Convicted, Criminal, Suspected, Criminal) can be used to incorporate into the embedding based KB the common sense rule that the partner of a convicted criminal will be suspected of being a criminal. Note that the score of the hypothesis fact (r″,(x,e″)) should be high if the antecedent facts have high scores for any possible entity y. A natural way to incorporate such a rule into score based KBC models is to ensure that score(r″,x,e″) ≥ max_{y∈E} min(score(r,x,y), score(r′,y,e′)). In order to derive efficient constraints that enforce this inequality, we strengthen the requirement on the score function by replacing the min in the lower bound with a convex combination of the scores: for λ ∈ (0,1) we enforce score(r″,x,e″) ≥ max_{y∈E} [λ score(r,x,y) + (1−λ) score(r′,y,e′)]. Since a convex combination of two values is never less than their minimum, this inequality is stronger. For model A it translates to the constraint ⟨e″, r″2⟩ − (1−λ)⟨e′, r′2⟩ + ⟨x, r″1 − λr1⟩ ≥ ⟨y, λr2 + (1−λ)r′1⟩. Let a = (r″1 − λr1 + e″)/λ, b = −(λr2 + (1−λ)(r′1 + e′))/λ, c = (⟨r″2, e″⟩ − (1−λ)⟨r′2, e′⟩)/λ, and let Ê denote a set that contains the embedding of every entity in E. For model B, the above inequality on the score function leads to the constraint: ∀x, y ∈ Ê, ⟨x, y⟩ ≤ ⟨x, a⟩ + ⟨y, b⟩ + c. Remember that our goal is to devise a set Ê, and constraints on the relation embeddings, such that it is efficient to project onto them and such that the above inequality is guaranteed. The following proposition (proof omitted for lack of space) shows how to construct such a set:

Proposition 1. Let x, y be members of R^d_+ ∩ B(a, ‖a‖) and a ≥ 0; then ⟨x, y⟩ ≤ ⟨x + y, a⟩.

The proposition shows that if a = b and c ≥ 0 then by setting Ê = R^d_+ ∩ B(a, ‖a‖) we can satisfy the above constraints.

2.1.4 Type Implication

A type implication rule, denoted TYPEIMP(r,e,r′), specifies that if the fact (r,(x,y)) is correct then (r′,(x,e)) is also correct, for all (x,y) ∈ T. In other words, this rule enforces that positional arguments of a relation possess certain properties. For example, the rule TYPEIMP(HusbandOf, Man, Gender) can enforce that if our KB predicts the fact (HusbandOf,(Don,Mel)) then it also predicts (Gender,(Don,Man)). Under model A the TYPEIMP(r,e,r′) rule translates to the following inequality on the parameters: ⟨x, r′1⟩ − ⟨x, r1⟩ ≥ ⟨y, r2⟩ − ⟨e, r′2⟩ for all (x,y) ∈ T. Let a = e + r′1 − r1, b = −r2 and c = ⟨r′2, e⟩. Under model B, the restriction on the score function translates to ⟨x, y⟩ ≤ ⟨a, x⟩ + ⟨b, y⟩ + c. The analysis for this case again relies on Proposition 1; the analysis for models C and D is yet out of reach. See Table 4.

Algorithm 1: Projected SGD for Model A, batch size = 1.
Given: F, Fᶜ, L. Hyperparameters: α, η, S.
for each fact f ∈ F do
    for S steps do
        Sample f̄ = (r̄, t̄) from Fᶜ
        Let v = σ(⟨r̄, t̄⟩ − ⟨r, t⟩)
        ▷ Fix e and optimize J
        ∂J(f)/∂r = −tv,  ∂J(f)/∂r̄ = t̄v
        (r; r̄) ← proj_L((r; r̄) − η((∂J(f)/∂r; ∂J(f)/∂r̄) + 2α(r; r̄)))
        ▷ Fix r and optimize J
        ∂J(f)/∂t = −rv,  ∂J(f)/∂t̄ = r̄v
        (t; t̄) ← proj_L((t; t̄) − η((∂J(f)/∂t; ∂J(f)/∂t̄) + 2α(t; t̄)))

RELIMP(r,r′), models A, R, B: r ≤ r′
RELIMP(r,r′), models C, D: r ≤ r′ ≤ −r
REVIMP(r,r′), models A, B: r′2 ≥ r1, r′1 ≥ r2
REVIMP(r,r′), models C, D: r1 ≤ r′2 ≤ −r1, r2 ≤ r′1 ≤ −r2
REVIMP(r,r′), model R: matrix(r′) ≥ matrix(r)

Table 2: Constraints sufficient for enforcing RELIMP(r,r′) and REVIMP(r,r′). The constraint e ≥ 0 for all e ∈ E applies for all models. matrix is the inverse of the operation that converts a matrix to a vector by concatenating its columns, i.e. matrix(r) denotes the matrix form of the vector r.
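The following is a minimal sketch, under our own naming, of the kind of update Algorithm 1 describes: one BPR gradient step for Model A on the relation vectors, followed by a projection proj_L that here enforces nonnegative entities and the RELIMP constraint r ≤ r′ from Table 2. The analogous step that fixes r and updates the entity vectors is omitted for brevity; this is an illustration, not the authors' released implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def proj_rules(R, E, relimp_rules):
    # proj_L: entities stay in R^d_+; each RELIMP(r, r') forces r <= r'.
    for name in E:
        E[name] = np.maximum(E[name], 0.0)
    for r_name, rp_name in relimp_rules:
        bad = R[r_name] > R[rp_name]
        mid = 0.5 * (R[r_name][bad] + R[rp_name][bad])
        R[r_name][bad], R[rp_name][bad] = mid, mid
    return R, E

def sgd_relation_step(R, E, fact, neg_fact, relimp_rules, eta=0.1, alpha=1e-3):
    # One relation-side update of Algorithm 1 for Model A, batch size 1.
    (r, (e, ep)), (rb, (eb, epb)) = fact, neg_fact
    t = np.concatenate([E[e], E[ep]])
    tb = np.concatenate([E[eb], E[epb]])
    v = sigmoid(np.dot(R[rb], tb) - np.dot(R[r], t))
    # dJ/dr = -t v and dJ/dr_bar = t_bar v, plus the L2 regularizer.
    R[r] = R[r] - eta * (-t * v + 2 * alpha * R[r])
    R[rb] = R[rb] - eta * (tb * v + 2 * alpha * R[rb])
    return proj_rules(R, E, relimp_rules)

# Toy usage with two relations, three entities, and one RELIMP rule.
rng = np.random.default_rng(0)
E = {name: rng.uniform(0, 1, size=3) for name in ["Alice", "Bob", "Carol"]}
R = {name: rng.normal(size=6) for name in ["Murdered", "Killed"]}
R, E = sgd_relation_step(R, E, ("Murdered", ("Alice", "Bob")),
                         ("Murdered", ("Alice", "Carol")),
                         relimp_rules=[("Murdered", "Killed")])
```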
ENTAILB, model A: r′1 ≥ r1, ⟨r2, e⟩ ≤ ⟨r′2, e′⟩
ENTAILB, model B: r′1 ≥ r1 + e − e′, ⟨r2, e⟩ ≤ ⟨r′2, e′⟩
ENTAILB, model C: r1 ≤ r′1 ≤ −r1, r2 − e ≤ r′2 − e′, r′2 + r2 ≤ e′ + e
ENTAILB, model D: r1 − r′1 ≤ e′ − e, e ≥ e′, r1 ≤ r′1 ≤ −r1, r2 − e ≤ r′2 − e′, r′2 + r2 ≤ e′ + e

Table 3: Sufficient constraints for ENTAILB(r,e,r′,e′). The constraint e ≥ 0 for all e ∈ E applies for all models.

PROTRANS, model A: r″1 ≥ λr1, λr2 + (1−λ)r′1 ≤ 0, ⟨e″, r″2⟩ ≥ (1−λ)⟨e′, r′2⟩
PROTRANS, model B: r″1 + e″ + (1−λ)(r′1 + e′) = 0, ⟨r″2, e″⟩ ≥ (1−λ)⟨r′2, e′⟩, and a ≥ 0, ∀e ∈ E: e ∈ R^d_+ ∩ B(a, ‖a‖)
TYPEIMP, model A: r′1 ≥ r1, ⟨e, r′2⟩ ≥ 0 and r2 ≤ 0
TYPEIMP, model B: e + r′1 = r1 − r2, ⟨r′2, e⟩ > 0, and −r2 ≥ 0, ∀e ∈ E: e ∈ R^d_+ ∩ B(−r2, ‖r2‖)

Table 4: Constraints for enforcing PROTRANS(r,r′,e′,r″,e″) and TYPEIMP(r,e,r′). Here a = (r″1 − λr1 + e″)/λ.

3 Related Work

The problem of enforcing consistency between the predictions made by a machine learning system and a first order logic system, which is what our work attempts to do, has a long history of research, but we can only review recent work on learning representations of the entities and relations of a knowledge graph; we refer the reader to reviews of neural-symbolic systems (Garcez et al., 2002; Hammer and Hitzler, 2007) for more references. Grefenstette 2013 presented a novel model for simulating propositional logic with the help of tensors; however, the model relied on high-dimensional boolean embeddings of the entities and relations, and of the rules presented in this paper it only guaranteed adherence to RELIMP. Rocktäschel et al. 2014; 2015 generalized Grefenstette's work by learning embeddings of entities and relations that are real valued and low dimensional, and their learning mechanism can accommodate arbitrary first order logic formulae in the parameter learning objective by propositionalizing the formulae. Their method has two drawbacks in comparison to our proposal: 1) the process of propositionalization can be very expensive, especially for rules like PROTRANS and TYPEIMP that quantify over tuples of entities, and 2) their method of differentiation through logic does not guarantee that the learnt embeddings will always be able to predict unseen relations that are logically entailed by the rules and the training data. Bowman et al. 2015a,b presented a neural network based method for predicting the existence of natural logic relations between two entities; their approach too had the drawback that it could not guarantee the inference of logically entailed facts. Guo et al. 2015 presented a method based on LLE (Roweis and Saul, 2000) for incorporating side information in the form of semantic categories of entities, but their method cannot incorporate the range of logical rules that we can. Demeester et al. 2016 and Vendrov et al. 2016 proposed constraining the learnt embeddings in a way that is identical to the method prescribed by us in Subsection 2.1. Our work generalizes their approach in two ways: first, we generalize their proposed constraints using the language of convex geometry, and second, we propose constraints for many more logical rules and score functions than either of the two papers. Wang and Cohen 2016 presented a novel method of factorizing the adjacency matrix of a proof graph of a probabilistic logic language to learn embeddings of first order logic formulas; our method is conceptually simpler than theirs and requires fewer training stages. Finally, Guo et al. 2016 proposed an alternative method for embedding rules and entities based on t-norm fuzzy logics which is very similar to Rocktäschel et al.
2015's approach.

4 Experiment: Logical Deduction and Knowledge Base Completion on WordNet

Our method for training embedding based KBC systems allows for a very interesting application: solving logical puzzles with an embedding based KBC system, without using an external logical-symbolic subsystem. We perform a controlled experiment in which we compare the performance of an embedding based KBC system trained with the constraints against a system trained without them.

Data: Consider the logical deduction problem shown in Table 5. This is a simplified version of a logical puzzle presented in Russell and Norvig 1995. In this puzzle, Nono is a country that possesses a WMD and Benedict has traded with Nono. The KB has to deduce whether Benedict is a criminal based on just two input facts and three rules. The total number of possible facts is 5² × 4 = 100.

Rules:
RELIMP(TradeWith, TransactWith)
ENTAILB(Possess, WMD, Considered, Enemy)
PROTRANS(TransactWith, Enemy, Considered, Criminal, Considered)
Facts:
(Possess,(Nono,WMD))
(TradeWith,(Benedict,Nono))
Query:
? (Considered,(Benedict,Criminal))

Table 5: A logical deduction problem. Based on the rules and facts, a KB should infer that Benedict is a criminal.

Model | Baseline P@10 / MRR / MAP | ELKB P@10 / MRR / MAP
A | 0.00 / 0.02 / 0.01 | 0.20† / 0.44† / 0.83†
B | 0.00 / 0.03 / 0.03 | 0.17† / 0.26† / 0.35†

Table 6: Table of results. The baseline of R is equivalent to the RESCAL method. Bold marks the higher average performance. † indicates that the difference is significant with a two-tailed p-value ≤ 0.005 as measured by a matched pair t-test.

Evaluation: We train two versions of two KBC systems, models A and B, with batch size = 1, α = 0.001, η = 0.1, S = 200, d = 50 and d̃ = 25, using Algorithm 1. Both KBs were trained in one pass over the two training facts. The only difference is that the baseline system does not constrain the embeddings to obey the logically derived geometric constraints. After training we queried the KBs for the scores of all possible facts. We ranked all the facts based on their scores, excluding the training facts, and marked all facts that could be logically entailed from the two training facts as correct results and the rest as incorrect. We performed 10 runs and in each run computed the MRR, P@10 and MAP for the two models; finally we averaged these quantities over the 10 runs.

Results: Table 6 shows that our method ranks logically entailed facts with much higher precision and recall than the baseline systems. This validates our intuition that logical rules can be usefully incorporated into the parameter learning mechanism of a KBC system via simple geometric constraints, even for low dimensional embeddings. The reason for the large improvement of the ELKB system over the baseline is that the constraints force the scores of entailed facts to be higher than the scores of non-entailed facts during learning. E.g. the scores of entailed facts such as (Considered,(Nono,Enemy)) and (TransactWith,(Benedict,Nono)) are forced to be high in comparison to non-entailed facts such as (TradeWith,(Benedict,WMD)). The baseline method does not have this systematic advantage and its scores remain unchanged.

4.1 Link Prediction on WordNet

In the link prediction task, the KBC system is given incomplete facts with either a missing head entity or a missing tail entity, i.e. given either (r,(_,e′)) or (r,(e,_)) the system has to predict e or e′ respectively.
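For reference, ranking metrics of the kind reported below can be computed as in this minimal sketch (our own illustration, not the paper's evaluation code): candidates for the missing entity are scored, sorted, and MRR and HITS@k are derived from the rank of the correct candidate.

```python
import numpy as np

def rank_of_correct(scores, correct_idx):
    # 1-based rank of the correct candidate under descending score order;
    # ties are broken by candidate index.
    order = np.argsort(-scores, kind="stable")
    return int(np.where(order == correct_idx)[0][0]) + 1

def mrr_and_hits(ranks, k=10):
    ranks = np.asarray(ranks, dtype=float)
    return {"MRR": float(np.mean(1.0 / ranks)),
            f"HITS@{k}": float(np.mean(ranks <= k))}

# Toy usage: two queries, each with five candidate entities.
ranks = [rank_of_correct(np.array([0.1, 0.9, 0.3, 0.2, 0.0]), correct_idx=1),
         rank_of_correct(np.array([0.5, 0.4, 0.6, 0.1, 0.2]), correct_idx=0)]
print(mrr_and_hits(ranks, k=3))
```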
We evaluated the utility of the proposed constraints by comparing the performance of models A and B trained with and without the constraints. We now present the results of our experiments on the WN18 knowledge graph,7 derived from WordNet and released by Bordes et al. 2013, which is a popular testbed for KBC algorithms (Wang et al., 2014; Lin et al., 2015; Toutanova et al., 2015; Yang et al., 2015).

Data: The WN18 dataset comes with standard train, development and test splits containing 141,442, 5,000 and 5,000 facts respectively. The dataset has 18 relations and 40,943 entities. Recently Guo et al. 2016 publicly released a list of logical rules8 which we incorporated directly into our framework; all of their rules are REVIMP rules. For all models we fixed batch size = 10, α = 0.001, η = 0.125, S = 200 and d̃ = 100; for model T, d = 100, and otherwise d = 200. Following existing work we measured the MRR, HITS@3 and HITS@10 metrics and report their average over the two tasks of head entity prediction and tail entity prediction. Instead of training in a single pass we trained our models for 50 epochs on the WN18 dataset and chose the best parameters by early stopping on the validation set; in other words, we used the parameters from the epoch that performed best on the validation set in terms of the HITS@10 metric. Finally, we combined the predictions of the best performing T model and the best performing system based on model B. To combine the two ranking systems we trained a logistic regression classifier, using the default settings in Vowpal Wabbit,9 to first predict whether model T or model B will produce the better ranking and then output that system's ranking over entities for evaluation. Our logistic regression classifier had 73% accuracy on the training data and 70% accuracy on the test data. Using this third system we were able to create a single ranking system that performs better than model T, which is very similar to the TransE model.10

Model | Project | MRR | HITS@3 | HITS@10
A | No | 0.0152 | 0.016 | 0.0330
A | Yes | 0.0238 | 0.03 | 0.0514
B | No | 0.0677 | 0.072 | 0.137
B | Yes | 0.241 | 0.283 | 0.50
T | No | 0.311 | 0.412 | 0.66
B (project) + T | – | 0.367 | 0.475 | 0.708

Table 7: MRR, HITS@3 and HITS@10 of the constrained and unconstrained versions of models A and B, and of unconstrained T. B+T reports the results of combining model B and model T.

Results: Table 7 shows that both the constrained and unconstrained versions of model A perform quite poorly. This is to be expected since model A scores a triplet (r,(e,e′)) as ⟨r1,e⟩ + ⟨r2,e′⟩: regardless of the entity that is given, the ranking over candidates for the missing entity remains the same. Therefore model A is clearly unsuitable for this task, and for a similar reason model C is also unsuitable. However, the drastic improvement in the performance of model B when it is trained with the constraints corresponding to the REVIMP rules demonstrates the utility of our proposed constraints: after adding the constraints, the MRR increased from 0.0677 to 0.241 and HITS@10 from 0.137 to 0.50. Recall that at test / inference time the constraints play no role, so their only effect is as a form of regularization on the parameters of the model.

5 Conclusion

We have presented a novel method for incorporating logical constraints into an embedding based knowledge base by constraining the parameters of the KB.
Our experiments on a small logical deduction problem, and on WordNet, indicate that our idea of imposing geometric constraints on embeddings to enforce logical rules is sound, and that it can improve the generalization of models that are otherwise hard to train. Although the KBC models A, B, C and D do not perform as well as existing models trained without constraints, such as TransE, we show that they can be used as part of a combination of systems to improve upon existing methods.

7 We found that the performance of models C, D and R was too low, therefore we do not report their results.
8 aclweb.org/anthology/attachments/D/D16/D16-1019.Attachment.zip
9 https://github.com/JohnLangford/vowpal_wabbit
10 The main differences between model T and TransE are that TransE uses hinge loss rather than the BPR objective, and that TransE does not regularize the relation embeddings and forces the entity embeddings to lie on the unit sphere, whereas in model T we add a quadratic regularization term to regularize the embeddings.

References

Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and Yakhnenko, O. (2013). Translating embeddings for modeling multi-relational data. In NIPS, pages 2787–2795.
Bowman, S. R., Potts, C., and Manning, C. D. (2015a). Learning distributed word representations for natural logic reasoning. In Proceedings of the AAAI Spring Symposium on Knowledge Representation and Reasoning.
Bowman, S. R., Potts, C., and Manning, C. D. (2015b). Recursive neural networks can learn logical semantics. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pages 12–21, Beijing, China. ACL.
Demeester, T., Rocktäschel, T., and Riedel, S. (2016). Lifted rule injection for relation embeddings. In Proceedings of EMNLP, pages 1389–1399, Austin, Texas. ACL.
Garcez, A. S. d., Gabbay, D. M., and Broda, K. B. (2002). Neural-Symbolic Learning Systems: Foundations and Applications. Springer-Verlag New York, Inc., Secaucus, NJ, USA.
Grefenstette, E. (2013). Towards a formal distributional semantics: Simulating logical calculi with tensors. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1, pages 1–10, Atlanta, Georgia, USA. ACL.
Gruber, P. M. (2007). Convex and Discrete Geometry. Springer.
Guo, S., Wang, Q., Wang, B., Wang, L., and Guo, L. (2015). Semantically smooth knowledge graph embedding. In Proceedings of ACL, pages 84–94, Beijing, China. ACL.
Guo, S., Wang, Q., Wang, L., Wang, B., and Guo, L. (2016). Jointly embedding knowledge graphs and logical rules. In Proceedings of EMNLP, pages 192–202, Austin, Texas. ACL.
Hammer, B. and Hitzler, P. (2007). Perspectives of Neural-Symbolic Integration, volume 77. Springer.
Lin, Y., Liu, Z., Sun, M., Liu, Y., and Zhu, X. (2015). Learning entity and relation embeddings for knowledge graph completion. In The 29th AAAI Conference on Artificial Intelligence, pages 2181–2187. AAAI.
Rendle, S., Freudenthaler, C., Gantner, Z., and Schmidt-Thieme, L. (2009). BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of UAI, pages 452–461. AUAI Press.
Riedel, S., Yao, L., McCallum, A., and Marlin, B. M. (2013). Relation extraction with matrix factorization and universal schemas. In Proceedings of NAACL-HLT, pages 74–84, Atlanta, Georgia. ACL.
Rocktäschel, T., Bošnjak, M., Singh, S., and Riedel, S. (2014). Low-dimensional embeddings of logic. In Proceedings of the ACL 2014 Workshop on Semantic Parsing, pages 45–49, Baltimore, MD. ACL.
Rocktäschel, T., Singh, S., and Riedel, S. (2015). Injecting logical background knowledge into embeddings for relation extraction. In Proceedings of NAACL.
Roweis, S. T. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326.
Russell, S. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ.
Toutanova, K., Chen, D., Pantel, P., Poon, H., Choudhury, P., and Gamon, M. (2015). Representing text for joint embedding of text and knowledge bases. In Proceedings of EMNLP, pages 1499–1509, Lisbon, Portugal. ACL.
Vendrov, I., Kiros, J. R., Fidler, S., and Urtasun, R. (2016). Order-embeddings of images and language. In Proceedings of the International Conference on Learning Representations.
Wang, W. Y. and Cohen, W. W. (2016). Learning first-order logic embeddings via matrix factorization. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI 2016), New York, NY. AAAI.
Wang, Z., Zhang, J., Feng, J., and Chen, Z. (2014). Knowledge graph embedding by translating on hyperplanes. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 1112–1119. AAAI Press.
Yang, B., Yih, W.-t., He, X., Gao, J., and Deng, L. (2015). Embedding entities and relations for learning and inference in knowledge bases. In Proceedings of the International Conference on Learning Representations.