<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>AIC</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Quantifying the Impact of Predicate Similarities on Knowledge Graph Triple Embeddings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander Kalinowski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuan An</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Drexel University, Department of Information Science</institution>
          ,
          <addr-line>3141 Chestnut Street, Philadelphia, PA 19104</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>8</volume>
      <abstract>
        <p>In devising methods for representing knowledge graph triples in low-dimensional spaces, care must be taken to quantify the similarities between all components, especially the predicate components common to all triples. Unfortunately, knowledge graph benchmarks do not come equipped with scores indicating the semantic similarity between two arbitrary triples. To proxy these scores, we introduce a weakly supervised method we call PTSS, or pairwise triple similarity scoring. A neural model then utilizes this information to update triple representations. We conduct experiments using this method by substituting three methods for predicate relatedness measures: linear algebraic similarities of predicate embeddings, predicate frequency/inverse predicate frequency, and KL-divergence of predicate distributions. We analyze the information captured by these approaches and their impacts when utilized as weak supervision signals for triple representations to test the hypothesis that these approaches reflect a notion of cognitive similarity. Our findings indicate that a combined model using scores derived from neural embeddings for entities and fact distributions for predicates achieves the best results, highlighting the efficacy of combining neural and distributional approaches and suggesting that this pattern of combination may be fruitful in other cognitively inspired AI solutions.</p>
      </abstract>
      <kwd-group>
        <kwd>knowledge representation</kwd>
        <kwd>semantic web</kwd>
        <kwd>triple embedding</kwd>
        <kwd>knowledge graph embedding</kwd>
        <kwd>predicate similarities</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Knowledge graphs (KGs) serve as a representation framework for real-world facts, stored as
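triples featuring a head entity <inline-formula><tex-math>h</tex-math></inline-formula>, a tail entity <inline-formula><tex-math>t</tex-math></inline-formula> and a predicate <inline-formula><tex-math>p</tex-math></inline-formula> expressing a relation between
<inline-formula><tex-math>h</tex-math></inline-formula> and <inline-formula><tex-math>t</tex-math></inline-formula>. Knowledge graphs may store millions of such triples; determining how similar each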
triple is to all other triples in the graph is still an area in need of development. The majority
of knowledge graph research is in link prediction and is heavily dependent on latent vector
representations, herein referred to as knowledge graph embeddings (KGE). Treating the heads,
tails and predicates as distinct embedding vectors requires the combination of all three pieces
to arrive at a single vector representation for a triple. To this end, we propose a weakly
supervised method called PTSS (pairwise triple similarity scoring) for triple embeddings. The
method utilizes existing knowledge graph embeddings to create weakly supervised semantic
similarity scores between triples. These scores can then be used for determining a single vector
representation of a triple.</p>
      <p>The PTSS method achieves significant improvements in two downstream applications, triple
classification and triple clustering tasks, compared to the state-of-the-art baselines. The detailed
design and evaluation of the proposed PTSS method are under review for publication. In this
workshop paper, we expand on this work and probe the efficacy of three different scoring
functions of predicate similarity. The rest of the paper begins with the problem formulation and
study motivation, followed by details on the PTSS method. The remaining sections describe our
results on evaluating the impacts of different predicate similarity functions using an
information-theoretic metric and a downstream triple classification task.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Problem Formulation and Motivation</title>
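      <p>
        We begin with definitions of knowledge graphs and knowledge graph embeddings, followed by
motivation for why triple embedding methods require novel approaches. A knowledge graph
(KG) <inline-formula><tex-math>\mathcal{G} = \{(h, p, t)\} \subset \mathcal{E} \times \mathcal{P} \times \mathcal{E}</tex-math></inline-formula> consists of triples <inline-formula><tex-math>T = (h, p, t)</tex-math></inline-formula>, where <inline-formula><tex-math>h \in \mathcal{E}</tex-math></inline-formula> is the
head entity, <inline-formula><tex-math>t \in \mathcal{E}</tex-math></inline-formula> is the tail entity and <inline-formula><tex-math>p \in \mathcal{P}</tex-math></inline-formula> is a predicate, or relation, between <inline-formula><tex-math>h</tex-math></inline-formula> and <inline-formula><tex-math>t</tex-math></inline-formula>. Here,
<inline-formula><tex-math>\mathcal{E}</tex-math></inline-formula> and <inline-formula><tex-math>\mathcal{P}</tex-math></inline-formula> are, respectively, the sets of entities and predicates; we denote the number of entities
as <inline-formula><tex-math>\#\mathcal{E}</tex-math></inline-formula> and the number of predicates as <inline-formula><tex-math>\#\mathcal{P}</tex-math></inline-formula>. A knowledge graph embedding (KGE) method encodes
the patterns and regularities in a knowledge graph as low-dimensional, dense vectors, where
entities <inline-formula><tex-math>e \in \mathcal{E}</tex-math></inline-formula> are <inline-formula><tex-math>d</tex-math></inline-formula>-dimensional vectors <inline-formula><tex-math>\mathbf{e}_e \in \mathbb{R}^d</tex-math></inline-formula> and predicates <inline-formula><tex-math>p \in \mathcal{P}</tex-math></inline-formula> are scoring functions
<inline-formula><tex-math>f_p : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}</tex-math></inline-formula>. Each KGE method applies a different scoring function, and, as a result,
predicates can be embedded as a vector <inline-formula><tex-math>\mathbf{e}_p \in \mathbb{R}^d</tex-math></inline-formula> as in TransE [<xref ref-type="bibr" rid="ref1">1</xref>] or a matrix <inline-formula><tex-math>\mathbf{E}_p \in \mathbb{R}^{d \times d}</tex-math></inline-formula> as in
DistMult [<xref ref-type="bibr" rid="ref2">2</xref>]. A triple embedding (TE) method encodes the regularity of a triple <inline-formula><tex-math>\langle h, p, t \rangle</tex-math></inline-formula> in a
knowledge graph as a dense numeric vector. A highly applicable regularity among triples is
the semantic similarity between triples. In particular, if two triples share one or more common
elements, the embeddings of the triples should reflect a certain degree of geometrical proximity.
      </p>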
      <sec id="sec-2-1">
        <title>2.1. Motivation: Semantic Aggregation of KGEs for TEs</title>
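        <p>A great deal of effort has been put into developing KGE methods (see Section 7 for more details).
Let <inline-formula><tex-math>M</tex-math></inline-formula> be an arbitrary KGE model which generates a set of vectors <inline-formula><tex-math>\mathbf{e}_e</tex-math></inline-formula> for entities and <inline-formula><tex-math>\mathbf{e}_p</tex-math></inline-formula> for
predicates. Let <inline-formula><tex-math>T_1 = \langle h_1, p_1, t_1 \rangle</tex-math></inline-formula> be an arbitrary triple from the graph. To move into a space that
is representative of the triple, we can exploit the existing KGE by aggregating the embedding
vectors of its components for the whole triple embedding, i.e.,</p>
        <p><disp-formula><tex-math>\mathbf{e}_{T_1} = \operatorname{agg}(\mathbf{e}_{h_1}, \mathbf{e}_{p_1}, \mathbf{e}_{t_1}).</tex-math></disp-formula></p>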
        <p>As many of these models are built for link prediction, meaningful aggregations only make
use of entity embeddings. Table 1 summarizes several common aggregation operations in the
literature.</p>
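        <p>To make the entity-only nature of these aggregations concrete, the following minimal sketch implements the five aggregation names used later in this paper (AVG, HAD, L1, L2, HT). The operator definitions are assumptions modelled on common edge-feature operators (average, Hadamard product, absolute difference, squared difference, head-tail concatenation) and may differ from those in Table 1; the toy vectors are made up.</p>
        <preformat><![CDATA[
```python
from typing import Callable, Dict, List

Vector = List[float]

# Assumed definitions of the aggregation names AVG, HAD, L1, L2 and HT;
# Table 1 is not reproduced here, so treat these as illustrative only.
AGGREGATIONS: Dict[str, Callable[[Vector, Vector], Vector]] = {
    "AVG": lambda h, t: [(a + b) / 2.0 for a, b in zip(h, t)],
    "HAD": lambda h, t: [a * b for a, b in zip(h, t)],
    "L1": lambda h, t: [abs(a - b) for a, b in zip(h, t)],
    "L2": lambda h, t: [(a - b) ** 2 for a, b in zip(h, t)],
    "HT": lambda h, t: list(h) + list(t),
}


def aggregate(name: str, e_h: Vector, e_p: Vector, e_t: Vector) -> Vector:
    """Entity-only aggregation: e_p is accepted but never used."""
    return AGGREGATIONS[name](e_h, e_t)


# Toy embeddings (made-up values): one head, one tail, two predicates.
e_h, e_t = [0.2, -0.5, 0.1], [0.4, 0.3, -0.7]
e_p1, e_p2 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]

# Changing the predicate never changes the aggregated vector.
predicate_is_ignored = all(
    aggregate(name, e_h, e_p1, e_t) == aggregate(name, e_h, e_p2, e_t)
    for name in AGGREGATIONS
)
```
]]></preformat>
        <p>Under these assumed operators, <monospace>predicate_is_ignored</monospace> holds for every aggregation, since the predicate embedding never enters the computation.</p>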
        <p>The issue with this type of aggregation arises when two triples share the same head and tail
entities but have differing predicates. Let <inline-formula><tex-math>T_1 = \langle h_1, p_1, t_1 \rangle</tex-math></inline-formula> and <inline-formula><tex-math>T_2 = \langle h_1, p_2, t_1 \rangle</tex-math></inline-formula> be two such triples.
Substituting into the aggregations in Table 1, we find that the two representations would be exactly
the same, although <inline-formula><tex-math>T_1</tex-math></inline-formula> and <inline-formula><tex-math>T_2</tex-math></inline-formula> clearly represent two different facts in the graph.
This motivates research into methods that represent triples in a more semantic way that
respects the information contained in the predicates.</p>
        <p><bold>3. PTSS: A Weakly Supervised Method for Building Triple Embeddings</bold></p>
        <p>Our main goal in this line of research is to build a method for embedding knowledge graph
triples in a semantically oriented way, namely, respecting that triples sharing similar entities,
predicates, or meanings should have high similarity when compared in their latent spaces. For
this purpose, we propose a weakly supervised method called pairwise triple similarity scoring
(PTSS).</p>
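        <p>
          Let <inline-formula><tex-math>\mathcal{G}</tex-math></inline-formula> be a KG and <inline-formula><tex-math>M</tex-math></inline-formula> a corresponding KGE function. For triples <inline-formula><tex-math>\langle h, p, t \rangle</tex-math></inline-formula>, the embeddings
learned through <inline-formula><tex-math>M</tex-math></inline-formula> may already encode vector space notions of similarities between entities and
predicates. For example, two entities sharing many properties and relationships will participate
in many of the same triples, and we assume that their respective embeddings should exhibit
these similarities. This assumption is highly dependent on the selection of the pre-trained
embedding function <inline-formula><tex-math>M</tex-math></inline-formula>, and there exists a line of research into how much semantic information
these embeddings actually encode [<xref ref-type="bibr" rid="ref3">3</xref>]. We can use these embeddings, henceforth referred to as
seed embeddings, as a means to proxy pairwise triple similarity scores. Specifically, given two
triples <inline-formula><tex-math>T_i = \langle h_i, p_i, t_i \rangle</tex-math></inline-formula> and <inline-formula><tex-math>T_j = \langle h_j, p_j, t_j \rangle</tex-math></inline-formula> in the knowledge graph, we define the pairwise
triple similarity score (PTSS) between <inline-formula><tex-math>T_i</tex-math></inline-formula> and <inline-formula><tex-math>T_j</tex-math></inline-formula> as an average of the similarities between the
embeddings of their head entities, tail entities, and predicates, respectively.
        </p>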
        <p>Formally, the PTSS is defined as</p>
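        <p><disp-formula><tex-math>\operatorname{PTSS}(T_i, T_j) = \operatorname{avg}\bigl(\operatorname{sim}(\mathbf{e}_{h_i}, \mathbf{e}_{h_j}),\ \operatorname{sim}(\mathbf{e}_{t_i}, \mathbf{e}_{t_j}),\ \operatorname{sim}(\mathbf{e}_{p_i}, \mathbf{e}_{p_j})\bigr),</tex-math></disp-formula>
where <inline-formula><tex-math>\mathbf{e}_{h_i}</tex-math></inline-formula> is the embedding vector of the head entity <inline-formula><tex-math>h_i</tex-math></inline-formula> and so on, <inline-formula><tex-math>\operatorname{avg}(x, y, z)</tex-math></inline-formula> computes
an average of its arguments, and <inline-formula><tex-math>\operatorname{sim}(\mathbf{e}_1, \mathbf{e}_2)</tex-math></inline-formula> computes the
similarity of two embeddings. The similarity function <inline-formula><tex-math>\operatorname{sim}(\mathbf{e}_1, \mathbf{e}_2)</tex-math></inline-formula> could be an arbitrary
function capturing the geometric structures in KG embedding spaces. The current PTSS employs
cosine similarity and arithmetic mean, leaving the evaluation of more general functions to
future work. Our evaluation demonstrates that these simple functions generate useful weak
supervision signals.</p>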
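        <p>The score defined above can be sketched in a few lines. This is a minimal illustration using cosine similarity and the arithmetic mean, as stated in the text; the function names, the dictionary-based embedding lookup and the toy vectors are assumptions made for the example and are not taken from the authors' implementation.</p>
        <preformat><![CDATA[
```python
import math
from typing import Dict, Sequence, Tuple

Triple = Tuple[str, str, str]  # (head, predicate, tail)
Embeddings = Dict[str, Sequence[float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def ptss(a: Triple, b: Triple, entity_emb: Embeddings, predicate_emb: Embeddings) -> float:
    """Arithmetic mean of the head, tail and predicate cosine similarities."""
    (h_i, p_i, t_i), (h_j, p_j, t_j) = a, b
    sims = (
        cosine(entity_emb[h_i], entity_emb[h_j]),
        cosine(entity_emb[t_i], entity_emb[t_j]),
        cosine(predicate_emb[p_i], predicate_emb[p_j]),
    )
    return sum(sims) / len(sims)


# Toy seed embeddings (made-up values, not from any trained model).
entity_emb = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
predicate_emb = {"knows": [1.0, 0.0], "employs": [0.0, 1.0]}
score = ptss(("alice", "knows", "bob"), ("alice", "employs", "bob"),
             entity_emb, predicate_emb)
```
]]></preformat>
        <p>In this toy case the two triples share head and tail but have orthogonal predicate vectors, so the score is the mean of 1, 1 and 0.</p>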
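        <p>For weakly supervised training, we extract from the knowledge graph a set of training
examples containing both "positive" and "negative" examples. For a triple <inline-formula><tex-math>T = \langle h, p, t \rangle</tex-math></inline-formula>, we
define positive training examples as a set of potential candidate matches. In particular, triple <inline-formula><tex-math>T</tex-math></inline-formula>
is composed of three elements, or three potential 'slots' to match on: the head, the tail and the
predicate. For each 'slot' we select <inline-formula><tex-math>n</tex-math></inline-formula> other triples with the same head entity, <inline-formula><tex-math>n</tex-math></inline-formula> other triples
with the same tail entity, and <inline-formula><tex-math>n</tex-math></inline-formula> other triples with the same predicate. For negative examples, we
select <inline-formula><tex-math>n</tex-math></inline-formula> other triples with no commonalities in any available slot. In instances where the set of
candidate matches has cardinality less than <inline-formula><tex-math>n</tex-math></inline-formula>, we add the entire set to be scored. Thus, for
each triple in the graph, we build a set of at most <inline-formula><tex-math>4n</tex-math></inline-formula> triples to compare and contrast with. In
all following experiments, we set <inline-formula><tex-math>n = 5</tex-math></inline-formula>, leaving an ablation of this sampling hyperparameter
to future work.</p>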
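        <p>The sampling procedure described above might be sketched as follows. This is an illustrative reading, not the authors' code: the function name, the slot labels, the fixed random seed and the slot-wise definition of "no commonalities" (different head, different predicate and different tail) are all assumptions.</p>
        <preformat><![CDATA[
```python
import random
from typing import Dict, List, Optional, Sequence, Tuple

Triple = Tuple[str, str, str]  # (head, predicate, tail)


def build_candidates(anchor: Triple, triples: Sequence[Triple], n: int = 5,
                     rng: Optional[random.Random] = None) -> Dict[str, List[Triple]]:
    """Return up to n candidates per slot plus up to n negatives (at most 4n in total)."""
    rng = rng or random.Random(0)
    h, p, t = anchor
    others = [x for x in triples if x != anchor]
    pools = {
        "same_head": [x for x in others if x[0] == h],
        "same_tail": [x for x in others if x[2] == t],
        "same_predicate": [x for x in others if x[1] == p],
        # Negatives: no slot in common with the anchor (slot-wise comparison only).
        "negative": [x for x in others if x[0] != h and x[1] != p and x[2] != t],
    }
    # If a pool has fewer than n members, keep the entire pool.
    return {name: list(pool) if len(pool) <= n else rng.sample(pool, n)
            for name, pool in pools.items()}


# Hypothetical toy graph.
toy_graph = [
    ("a", "likes", "b"), ("a", "knows", "c"), ("d", "likes", "b"),
    ("d", "knows", "e"), ("e", "owns", "f"),
]
candidates = build_candidates(("a", "likes", "b"), toy_graph, n=5)
```
]]></preformat>
        <p>Each returned pair (anchor, candidate) would then be scored with PTSS to produce one weakly labelled training example.</p>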
        <p>These scores and training examples can then be utilized as an input to a fine-tuning process
to generate low-dimensional representations of the composite triple. In particular, we define
a Siamese-like neural network that takes two triples as input and initializes the values of its
embedding layer with aggregations of the embeddings of the triple
components. The network subsequently forward propagates the embedding layer through a
series of encoding (ENC) layers. The final layer is a scoring (SCO) layer where the network
computes the cosine similarity of the representations. This score is then compared to the
estimated pairwise similarity scores (PTSS), with the errors back-propagated through the
network. To update the triple embeddings, we make the embedding layer tunable and use the
final values as the triple embeddings.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Measuring Predicate Similarity in PTSS</title>
      <p>In computing <inline-formula><tex-math>\operatorname{PTSS}(T_i, T_j)</tex-math></inline-formula>, we have several choices for building the entity and predicate
signals. This paper focuses on the selection of predicate similarity measures, <inline-formula><tex-math>\operatorname{sim}(\mathbf{e}_{p_i}, \mathbf{e}_{p_j})</tex-math></inline-formula>.
Specifically, we home in on three major approaches: those based on predicate vector similarities,
those based on predicate frequency measures, and those based on fact distributions with respect
to predicates. Each of these then defines a symmetric similarity matrix <inline-formula><tex-math>S</tex-math></inline-formula> used as input to the
PTSS scoring function.</p>
      <p>Predicate Vector Similarity. We utilize the similarity functions shown in Table 2 for
similarities based on predicate vectors. The column 'Method' lists the embedding methods
selected for this study (see Section 5.1). The pairwise similarity functions give rise to a symmetric
similarity matrix <inline-formula><tex-math>S</tex-math></inline-formula>. We refer to this approach as EMB.</p>
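      <p>Predicate Frequency Measure. For similarities based on predicate frequency measures, we
use the method of [4], who define triple frequency and inverse triple frequency as
<disp-formula><tex-math>\operatorname{TF}(p_i, p_j) = \log(1 + C_{i,j}),</tex-math></disp-formula>
<disp-formula><tex-math>\operatorname{ITF}(p_j, E) = \log \frac{|E|}{|\{p_i : C_{i,j} &gt; 0\}|},</tex-math></disp-formula>
where <inline-formula><tex-math>C_{i,j}</tex-math></inline-formula> counts the number of times the predicates <inline-formula><tex-math>p_i</tex-math></inline-formula> and <inline-formula><tex-math>p_j</tex-math></inline-formula> link the same heads and tails,
and <inline-formula><tex-math>E</tex-math></inline-formula> is the set of edges. These metrics are used to build a symmetric matrix
<disp-formula><tex-math>\mathbf{C}(i, j) = \operatorname{TF}(p_i, p_j) \times \operatorname{ITF}(p_j, E)</tex-math></disp-formula>
to represent a vector for each predicate in the graph; each vector in <inline-formula><tex-math>\mathbf{C}</tex-math></inline-formula> can then be used
to build the matrix <inline-formula><tex-math>S</tex-math></inline-formula> of pairwise similarities between predicates. Herein, we refer to this
approach as PF/IPF.</p>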
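      <p>As a rough sketch of how the weighted co-occurrence matrix above could be computed from a list of triples: the function names are hypothetical, and we assume here that the size of the edge set in the inverse triple frequency is the number of distinct predicates, which the text does not spell out.</p>
      <preformat><![CDATA[
```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head, predicate, tail)


def co_occurrence(triples: Iterable[Triple]) -> Tuple[List[str], List[List[int]]]:
    """counts[i][j] = number of (head, tail) pairs linked by both predicates i and j."""
    pairs: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for h, p, t in triples:
        pairs[p].add((h, t))
    predicates = sorted(pairs)
    counts = [[len(pairs[a] & pairs[b]) for b in predicates] for a in predicates]
    return predicates, counts


def tf_itf(counts: List[List[int]]) -> List[List[float]]:
    """Weighted matrix with entry (i, j) = TF(p_i, p_j) * ITF(p_j, E)."""
    size = len(counts)  # assumption: |E| is the number of distinct predicates
    weighted = [[0.0] * size for _ in range(size)]
    for j in range(size):
        # Number of predicates that co-occur with predicate j at least once.
        support = sum(1 for i in range(size) if counts[i][j] > 0)
        itf = math.log(size / support) if support else 0.0
        for i in range(size):
            weighted[i][j] = math.log(1 + counts[i][j]) * itf
    return weighted


# Hypothetical toy graph: p1 and p2 link the same pair, p3 links a different one.
predicates, counts = co_occurrence(
    [("a", "p1", "b"), ("a", "p2", "b"), ("c", "p3", "d")]
)
weighted = tf_itf(counts)
```
]]></preformat>
      <p>Each row of <monospace>weighted</monospace> then serves as the vector for one predicate, from which pairwise similarities can be computed.</p>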
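      <p>Predicate Distribution Measure. An alternative quantification of predicate similarity can
be found in [5], where conditional probability distributions based on the occurrences of <inline-formula><tex-math>h</tex-math></inline-formula>, <inline-formula><tex-math>p</tex-math></inline-formula>
and <inline-formula><tex-math>t</tex-math></inline-formula> are used. Specifically, each predicate is assigned a conditional probability distribution
<inline-formula><tex-math>P(h, t \mid p)</tex-math></inline-formula>, where similar predicates are expected to show similar probability distributions, i.e.,
if two predicates <inline-formula><tex-math>p_1</tex-math></inline-formula> and <inline-formula><tex-math>p_2</tex-math></inline-formula> share many of the same (head, tail) pairs, they should have similar
probability mass for <inline-formula><tex-math>(h, t)</tex-math></inline-formula>. Here, rather than directly enumerating all <inline-formula><tex-math>(h, t)</tex-math></inline-formula> pairs for each <inline-formula><tex-math>p</tex-math></inline-formula>,
the conditional probability distribution is parameterized via a neural network with two components. This
network can be expressed as</p>
      <p><disp-formula><tex-math>\tilde{F}_{\theta_1}(h; p) = \operatorname{MLP}_{\theta_1}(\mathbf{p})^{\top} \mathbf{h}, \qquad \tilde{F}_{\theta_2}(t; h, p) = \operatorname{MLP}_{\theta_2}([\mathbf{h}; \mathbf{p}])^{\top} \mathbf{t}.</tex-math></disp-formula></p>
      <p>Here, <inline-formula><tex-math>\mathbf{h}, \mathbf{p}, \mathbf{t}</tex-math></inline-formula> are embeddings of <inline-formula><tex-math>h, p, t</tex-math></inline-formula>, respectively. The output of this network can then be
utilized to build a final conditional probability</p>
      <p><disp-formula><tex-math>P^{*}(h, t \mid p) = \exp\bigl(F_{\theta_1}(h; p) + F_{\theta_2}(t; h, p)\bigr).</tex-math></disp-formula>
To compute the similarity between any arbitrary pair of predicates, the KL divergence of their
respective conditional probability distributions is computed via a sampling-based approach.
Important to this sampling approach is the number of samples drawn from each conditional
probability distribution when estimating the KL divergence; the authors fix this choice at 20,
and we follow suit, leaving exploration of other choices to future work. For each experiment,
we normalize and diagonalize the resulting KL divergence scores to build the matrix <inline-formula><tex-math>S</tex-math></inline-formula> of
predicate similarities. For the remainder of this paper, we will refer to this method as KL-div.</p>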
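      <p>To illustrate the sampling-based estimate, the sketch below computes a Monte Carlo approximation of the KL divergence between two predicate distributions. It deliberately swaps the neural parameterization of [5] for explicit toy probability tables over (head, tail) pairs, so it shows only the estimation step; the function name, the smoothing constant and the fixed seed are assumptions.</p>
      <preformat><![CDATA[
```python
import math
import random
from typing import Dict, Hashable, Optional


def sampled_kl(p: Dict[Hashable, float], q: Dict[Hashable, float],
               num_samples: int = 20, rng: Optional[random.Random] = None,
               eps: float = 1e-12) -> float:
    """Monte Carlo estimate of KL(p || q) from outcomes drawn from p."""
    rng = rng or random.Random(0)
    outcomes = list(p)
    draws = rng.choices(outcomes, weights=[p[x] for x in outcomes], k=num_samples)
    # eps guards against log(0) when q assigns no mass to a sampled outcome.
    return sum(math.log(p[x]) - math.log(q.get(x, 0.0) + eps) for x in draws) / num_samples


# Toy distributions over (head, tail) pairs for two hypothetical predicates.
p_dist = {("a", "b"): 1.0}
q_dist = {("a", "b"): 0.5, ("c", "d"): 0.5}
estimate = sampled_kl(p_dist, q_dist)  # close to log(2), the exact KL here
```
]]></preformat>
      <p>The default of 20 samples mirrors the setting quoted above; turning the resulting divergences into a similarity matrix would still require the normalization step described in the text.</p>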
      <p>Our goal is to compare and contrast these approaches, evaluating the information gained
from each and their eventual impact on triple classification.</p>
    </sec>
    <sec id="sec-4">
      <title>5. Evaluation Approach</title>
      <p>With an interest in quantifying the amount of semantic information captured by both neural
embedding models and information-theoretic approaches, we perform our experiments in two
branches. First, we aim to quantify the amount of information contained in these
representations, specifically in their pairwise similarities. Second, we aim to quantify the impact of this
information on a task related to triple representations: triple classification. We outline these
approaches in the following sections, followed by coverage of our benchmark datasets.</p>
      <sec id="sec-4-1">
        <title>5.1. Selecting Representations</title>
        <p>
          To evaluate the degree to which knowledge graph embeddings capture predicate semantics,
we selected six popular neural embedding architectures and two information-theoretic models.
For embeddings, we utilized pre-trained models from the LibKGE [6] library, selecting ComplEx [7],
ConvE [8], DistMult [<xref ref-type="bibr" rid="ref2">2</xref>], RotatE [9], RESCAL [10], and TransE [<xref ref-type="bibr" rid="ref1">1</xref>]. These models were selected
for their variety in scoring functions and training hyperparameters, covering a large portion of
research in this area while taking advantage of pre-trained model weights for initialisation. We
cover the details of these approaches in Section 7.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Quantifying Predicate Representation Informativeness</title>
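        <p>To start, we wish to quantify the amount of information contained in the matrices <inline-formula><tex-math>S</tex-math></inline-formula> described
above. One such numerical quantification is the von Neumann entropy of a matrix, defined as
<disp-formula><tex-math>\mathrm{entropy} = -\operatorname{tr}(S \log S).</tex-math></disp-formula></p>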
        <p>Here, we posit that higher entropy measures correspond to higher capture of information and
should be an indicator of which representations are best. We further hypothesize that the
representation of predicate similarities with the best entropy will maximize the impact on our
downstream triple embedding tasks, detailed in Section 5.3.</p>
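        <p>As a worked illustration of the entropy measure, the trace can be evaluated from the spectrum of the matrix. The identity below holds for any symmetric positive semi-definite matrix with eigenvalues <inline-formula><tex-math>\lambda_k</tex-math></inline-formula> (taking <inline-formula><tex-math>0 \log 0 = 0</tex-math></inline-formula>). The rescaling to unit trace in the numeric example is an assumption made for this illustration, since the normalization applied in the experiments is not stated here; the toy matrix is made up and natural logarithms are used.</p>
        <p><disp-formula><tex-math><![CDATA[
\begin{aligned}
-\operatorname{tr}(S \log S) &= -\sum_{k} \lambda_k \log \lambda_k, \\
S = \tfrac{1}{2}\begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}
&\;\Rightarrow\; \lambda_1 = 0.75,\ \lambda_2 = 0.25, \\
-\left(0.75 \log 0.75 + 0.25 \log 0.25\right) &\approx 0.2158 + 0.3466 \approx 0.562 .
\end{aligned}
]]></tex-math></disp-formula></p>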
        <p>In addition to numerical metrics on the information captured by each predicate similarity
matrix, we are interested in the extent to which these similarities capture actual semantics
about the graph. We perform an exploration of the semantics of each space, making sure that
predicates that are ranked as 'most similar' are actually in line with human expectations. We can
also compare how similar each approach is by measuring the overlap in each of the similarity
lists. To do so, for each pair of approaches, we compute the Jaccard similarity of the top five
similar predicates for every predicate in the list. We then average over all predicates to arrive
at a single metric. This approach allows us to quantify how similar the predicted predicate
similarities are and determine which approaches have agreeable rankings. Approaches with
highest overlapping predicate similarities are expected to have similar performance on the
downstream tasks.</p>
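        <p>The overlap measure described above can be sketched as follows: for every predicate, take the top-k most similar predicates under two approaches, compute the Jaccard similarity of the two sets, and average. The nested-dictionary layout of the similarity scores, the function names and the alphabetical tie-breaking are assumptions; the text uses k = 5, while the toy example uses k = 1 to stay small.</p>
        <preformat><![CDATA[
```python
from typing import Dict, Set

Similarity = Dict[str, Dict[str, float]]  # similarity[p][q] = score of q for p


def top_k(similarity: Similarity, predicate: str, k: int = 5) -> Set[str]:
    """The k most similar predicates to `predicate`, excluding itself."""
    scores = similarity[predicate]
    ranked = sorted((q for q in scores if q != predicate),
                    key=lambda q: (-scores[q], q))  # ties broken alphabetically
    return set(ranked[:k])


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_top_k_overlap(sim_a: Similarity, sim_b: Similarity, k: int = 5) -> float:
    """Average Jaccard similarity of top-k lists over predicates present in both."""
    predicates = sorted(set(sim_a) & set(sim_b))
    return sum(jaccard(top_k(sim_a, p, k), top_k(sim_b, p, k))
               for p in predicates) / len(predicates)


# Two hypothetical approaches that disagree only on the neighbours of "p".
approach_a = {"p": {"q": 0.9, "r": 0.1}, "q": {"p": 0.9, "r": 0.2}, "r": {"p": 0.1, "q": 0.2}}
approach_b = {"p": {"q": 0.1, "r": 0.9}, "q": {"p": 0.9, "r": 0.2}, "r": {"p": 0.1, "q": 0.2}}
overlap = mean_top_k_overlap(approach_a, approach_b, k=1)
```
]]></preformat>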
      </sec>
      <sec id="sec-4-3">
        <title>5.3. Probing Triple Embeddings</title>
        <p>For each selected representation model (see Section 5.1), we apply the aggregation schemes
found in Table 1 to build the initialization of the triple vectors. This creates five different PTSS
models for each selected representation scheme, for example, TransE_AVG, TransE_HAD,
TransE_L1, TransE_L2, and TransE_HT. We then compute the PTSS scores using the EMB,
PF/IPF and KL-div approaches. This creates three training datasets
used as input to each model, for example,</p>
        <p>TransE_AVG_EMB, TransE_AVG_PFIPF, TransE_AVG_KLDIV.</p>
        <p>Thus, there are three models per aggregation, five aggregations per model and six representation
approaches for a total of ninety models for each experiment. Each of these generates a unique
triple representation used in the following evaluation tasks.</p>
        <p>We probe the resulting triple embeddings for their ability to predict their respective predicate
labels, a task of triple classification. Following the work of [11] and [12], we perform the
classification using both a low-capacity and high-capacity model. For our low-capacity model,
we employ one-vs-rest logistic regression, providing the model with triple representations
as features and the predicate indices as labels. For the high-capacity model, we replace the
logistic regression with a multi-layer perceptron with 512 hidden nodes, trained with the Adam
optimizer and batch size of 256 for a total of 10 epochs. In each case, we split the triples into
training and test sets, varying the number of training triples to be between 20% and 90% of all
available triples, repeated using ten different random seeds. Both models are then evaluated on
their respective Micro-F1 scores.</p>
      </sec>
      <sec id="sec-4-4">
        <title>5.4. Knowledge Graph Benchmarks</title>
        <p>In this work, we utilize two well-established knowledge graphs for evaluation, WordNet
(WN18RR) and Freebase (FB15K-237). Summary statistics for both datasets are presented
in Table 3. Freebase contains general facts about people, places and events. In our case, we
selected the FB15K-237 benchmark, where inverse relations that are simple to learn (and thus
inflate performance metrics) have been removed, leaving 237 distinct relations to be modelled.
WordNet covers lexical and grammatical knowledge; we selected the WN18RR benchmark,
which similarly removes inverse relations. These benchmark datasets were selected for two
reasons. First, the frequency of usage in the research community allows us to take advantage
of publicly available, pre-trained models, such as those found in LibKGE [6]. Secondly, these
graphs are complex in nature, making our evaluation more reflective of knowledge graphs
that occur in practice. Of major importance in our approach is determining the number of
multi-edge triples, i.e. those that contain matching head and tail entities but multiple predicates.
As outlined in Section 2, these are the predicates that suffer most from methods focused on
aggregation of head and tail vectors for representing a triple. Our methodology is specifically
focused on adding predicate information back to triple representations, capturing the semantics
of these predicates and helping to build triple representations that can be differentiated along
this dimension.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Analysis of Results</title>
      <p>We present our findings in two categories: analysis of the information encoded in Section 6.1,
and impact on downstream tasks in Section 6.2.</p>
      <sec id="sec-5-1">
        <title>6.1. Analysis of Available Information</title>
        <p>In terms of entropy, the largest scores (and hence where most information is captured) come
from the neural approaches for FB15K-237, but this is flipped for WN18RR, where
information-theoretic approaches shine. We suspect this is largely due to the diversity in predicates; Freebase
has many predicates and associated facts, allowing neural approaches more opportunities for
learning while misleading information-theoretic approaches. WordNet, on the other hand,
has fewer, more concentrated predicates and regularities in facts that information-theoretic
approaches are likely to capture. From an entropy point-of-view, the best models of predicate
similarity are TransE for FB15K-237 and KL-div for WN18RR, followed closely by TransE.
This relationship is particularly interesting as KL-div is dependent on TransE embeddings as
input, hence the close entropy scores should almost be expected, while the deviation in entropy
for Freebase highlights an area for further investigation.</p>
        <p>For FB15K-237, we find that there is a great deal of diversity in the similarity rankings between
approaches. No pair of methods has an average Jaccard similarity exceeding 0.3125, which
is the similarity between ComplEx and DistMult. The remaining approaches show little to
no overlap, suggesting that each approach is modeling the similarities between predicates in
entirely different ways. It is interesting to find that there are very few commonalities between
the neural approaches, many of which utilize similar architectures. We also acknowledge that
the KL-div approach has the least overlap with other approaches. Upon further investigation,
we believe this is due to improper model specification. To highlight this issue, we selected the
five most frequent predicates in FB15K-237. For each of these predicates, we computed the top
five most similar predicates using each of the selected approaches. Overall, we found that the
KL-div method favored one particular predicate
as the most similar to all other predicates. We suspect that this predicate serves as a hub,
where its frequency across triples is 'about average'. Clearly, this type of inference is
incorrect and the representation of this predicate derived from the KL-div method dominates
the predicate space. For other approaches, the similarities are more semantically interpretable:
predicates related to acting and film are all found to be similar to one another, and similar
patterns exist for sports-related predicates and place/location predicates.</p>
      </sec>
      <sec id="sec-5-2">
        <title>6.2. Analysis of Triple Classification</title>
        <p>In work currently under review, we show that the PTSS post-processing scheme leads to
dramatic improvements in triple classification when compared to the previous state-of-the-art
benchmark, Triple2vec [12]. In particular, we show a 67% improvement in Micro-F1 scores on
WN18RR (.4008 to .6723) and 1247% improvement in Micro-F1 scores for FB15K-237 (.059 to
.7949). These advances come from the ability of PTSS to leverage information already learned
through the KGE methodologies serving as seed vectors as well as deviating from the random
walk methodologies used in Triple2vec, which we believe are less applicable to sparse knowledge
graphs such as Freebase. Given these improvements, we ran the following experiments to test
the efficacy of the predicate similarity function and exploit any further gains from alternate
similarity measures. To summarize the results of the ninety experiments done on each dataset,
we have compiled the following high-level findings.</p>
        <p>First, we identify the best performing models for each dataset, which we define as the
maximum Micro-F1 score when training on 80% of the available triples. For FB15K-237, the
best benchmark model is RotatE_HT_EMB with an average score of 0.725 and 0.754 for the
low and high capacity models, respectively. This benchmark is improved upon when using the
scores derived from KL-div, where the best model is ConvE_HT_KLDIV with average scores
of 0.792 and 0.796. While we see an improvement when using the KL-div scores in conjunction
with the PTSS scoring, the case is very different when comparing PF/IPF, where there is no
clear individual model winner, with the best approach being RotatE_HT_PFIPF having scores
of 0.485 and 0.513. Substituting KL-div for the baseline embedding similarity results in lifts of
9.24% and 5.57% for low and high capacity models, respectively.</p>
        <p>For WN18RR, the findings are consistent, with the KL-div method giving best results,
followed by EMB and PF/IPF. Using the PF/IPF scores gives best performance with the model
ConvE_HAD_PFIPF with scores of 0.347 and 0.35. When using the KL-div features, the
same base model and aggregation yields best performance, with ConvE_HAD_KLDIV having
maximum scores of 0.522 and 0.545. Performance on ConvE_HAD_EMB resulted in scores of
0.397 and 0.5. This translates to a lift of 31.48% and 9% when moving from the embedding-based
model to KL-div model for the low and high capacity models, respectively. The greater lift
seen on the WN18RR dataset correlates back to the finding that the entropy of the predicate
similarity matrix was greatest when using this approach. In the case of both datasets, including
the KL-div results leads to models that improve on the baseline embedding models, suggesting
that the information captured by this method is more reflective of predicate semantics. We
present the full results for evaluation of triple classification when using KL-div in Figures 6.2
and 6.2.</p>
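        <p>The percentage lifts quoted in this section are relative changes in Micro-F1. The short sketch below recomputes them from the score pairs given in the text; the function name and dictionary keys are ours. For the WN18RR low-capacity pair the unrounded value is roughly 31.486, so the 31.48% quoted above appears to be truncated rather than rounded.</p>
        <preformat><![CDATA[
```python
def relative_lift(baseline: float, improved: float) -> float:
    """Percentage change from baseline to improved."""
    return 100.0 * (improved - baseline) / baseline


# (baseline, improved) Micro-F1 pairs quoted in this section.
reported = {
    "Triple2vec to PTSS, WN18RR": (0.4008, 0.6723),
    "Triple2vec to PTSS, FB15K-237": (0.059, 0.7949),
    "EMB to KL-div, FB15K-237, low capacity": (0.725, 0.792),
    "EMB to KL-div, FB15K-237, high capacity": (0.754, 0.796),
    "EMB to KL-div, WN18RR, low capacity": (0.397, 0.522),
    "EMB to KL-div, WN18RR, high capacity": (0.5, 0.545),
}
lifts = {name: round(relative_lift(*pair), 2) for name, pair in reported.items()}
```
]]></preformat>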
      </sec>
    </sec>
    <sec id="sec-6">
      <title>7. Related Work</title>
      <p>
        We cover the requisite background on the knowledge graph embedding models tested in our
experiments. For a full coverage of these techniques, please see [13]. These methods fall into
three main groups: translation-based models, semantic-matching models, and graph-based
models. The intuition behind translation-based models is to build vector representations of
<inline-formula><tex-math>\langle h, p, t \rangle</tex-math></inline-formula> such that <inline-formula><tex-math>\mathbf{h} + \mathbf{p} \approx \mathbf{t}</tex-math></inline-formula>. Model choices then depend on which space or spaces the entities
and relations are embedded in as well as the scoring function used to help the model learn to
differentiate between true triples from the graph and noise triples that do not reflect real-world
facts. TransE [<xref ref-type="bibr" rid="ref1">1</xref>] is the simplest of these models; it embeds entities and predicates using
a distance function defined by <inline-formula><tex-math>f_p(\mathbf{h}, \mathbf{t}) = -\lVert \mathbf{h} + \mathbf{p} - \mathbf{t} \rVert_{1/2}</tex-math></inline-formula>, where the subscript denotes either the L1 or the L2 norm. While this model is simple, it
struggles to properly encode one-to-many triples, where a single relation may hold between a
head entity and several tail entities. Semantic-matching models move away from the
distance-based assumption and focus on using similarity-based scoring functions. These methods leverage
dot-product-like scoring functions to measure angles between low-dimensional representations,
sometimes referred to as 'semantic energy' functions. The simplest of such models is RESCAL
[10], which relies on a tensor representation <inline-formula><tex-math>\mathbf{X}</tex-math></inline-formula> of the underlying knowledge graph <inline-formula><tex-math>\mathcal{G}</tex-math></inline-formula>, where
each entry of the tensor <inline-formula><tex-math>\mathbf{X}_{hpt} = 1</tex-math></inline-formula> if the fact is represented in the knowledge graph, otherwise
zero. The tensor can then be decomposed into latent components, where each predicate-specific
matrix <inline-formula><tex-math>\mathbf{E}_p</tex-math></inline-formula> is a matrix of dimension <inline-formula><tex-math>d \times d</tex-math></inline-formula> representing interactions between each corresponding
component. In a simplification, DistMult [<xref ref-type="bibr" rid="ref2">2</xref>] requires each <inline-formula><tex-math>\mathbf{E}_p</tex-math></inline-formula> to be diagonal, reducing the
parameters of the model while sacrificing some of its representational capacity. This reduction
in capacity is especially felt when modeling anti-symmetric relations, as interactions in these
diagonal matrices have no notion of directionality. To circumvent this issue, the ComplEx [7]
model allows for the low-dimensional representations to live in the complex space <inline-formula><tex-math>\mathbb{C}^d</tex-math></inline-formula>. The
RotatE [9] approach also utilizes complex-valued representations, but it additionally models each individual
relation as a rotation between the head and tail entities. This additional parameterization allows
for increased learning capacity with the end goal of capturing anti-symmetry, inversion, and
composition. The work of [8] takes this one step further, defining the ConvE model where
entities interact through the convolution operator, introducing additional non-linearities to
increase the capacity for learning complicated relational structures.
      </p>
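      <p>To make the contrast between the translation-based and semantic-matching scores concrete, the sketch below implements the TransE distance score and the DistMult score, the latter in its standard form as a sum of element-wise products with the diagonal of the predicate matrix stored as a vector. Function names and toy vectors are illustrative assumptions. The example shows that DistMult assigns the same score when head and tail are swapped, which is the lack of directionality noted above, whereas TransE does not.</p>
      <preformat><![CDATA[
```python
import math
from typing import Sequence


def transe_score(h: Sequence[float], p: Sequence[float], t: Sequence[float],
                 norm: int = 1) -> float:
    """Negative L1 (norm=1) or L2 (norm=2) distance between h + p and t."""
    diffs = [abs(a + b - c) for a, b, c in zip(h, p, t)]
    if norm == 1:
        return -sum(diffs)
    return -math.sqrt(sum(d * d for d in diffs))


def distmult_score(h: Sequence[float], p_diag: Sequence[float], t: Sequence[float]) -> float:
    """Bilinear score with a diagonal predicate matrix: sum_k h_k * p_k * t_k."""
    return sum(a * b * c for a, b, c in zip(h, p_diag, t))


# Toy embeddings (made-up values).
h, p, t = [1.0, 2.0], [0.5, -1.0], [3.0, 4.0]

distmult_forward = distmult_score(h, p, t)
distmult_reverse = distmult_score(t, p, h)  # identical: no notion of direction
transe_forward = transe_score(h, p, t)
transe_reverse = transe_score(t, p, h)      # differs from the forward score
```
]]></preformat>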
      <p>
        These models have been probed on the amount of semantic information they capture. The
work of [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] finds that entity semantics are not universal; only a small subset of entities follow
expected semantic patterns. While we agree that semantics are not the target of KGE models,
our work shows that they may still be leveraged via weak supervision to reintroduce semantics
to a triple representation. Our ablation of predicate similarity measures shows there is room for
improvement on pure associative vector representations; our future research will continue to
explore blended methodologies.
      </p>
    </sec>
    <sec id="sec-7">
      <title>8. Conclusion</title>
      <p>In this work, we introduce a novel method for quantifying the similarity between knowledge
graph triples and utilize these scores to build neural representations of triples as first-class
objects. We find that the information contained in traditional knowledge graph embeddings
is not enough to build semantically grounded triple representations. Instead, we find that
introducing additional measures of predicate relatedness, namely through the KL-divergence of
fact distributions introduced by [5], leads to improvements when coupled with our methodology.
Even with this gain, there is clear room for improvement, such as removing the tendency
to identify predicate hubs detailed in Section 6.1. Future work will continue to explore the
relationship between neural and distributional approaches and how they best combine for
semantic knowledge graph triple representation.
      </p>
      <ref-list>
        <ref id="ref4">
          <mixed-citation>[4] G. Pirrò, Building relatedness explanations from knowledge graphs, Semantic Web 10 (2019) 963–990.</mixed-citation>
        </ref>
        <ref id="ref5">
          <mixed-citation>[5] W. Chen, H. Zhu, X. Han, Z. Liu, M. Sun, Quantifying similarity between relations with fact distribution, CoRR abs/1907.08937 (2019). URL: http://arxiv.org/abs/1907.08937. arXiv:1907.08937.</mixed-citation>
        </ref>
        <ref id="ref6">
          <mixed-citation>[6] S. Broscheit, D. Ruffinelli, A. Kochsiek, P. Betz, R. Gemulla, LibKGE - A knowledge graph embedding library for reproducible research, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations, 2020, pp. 165–174.</mixed-citation>
        </ref>
        <ref id="ref7">
          <mixed-citation>[7] T. Trouillon, C. R. Dance, É. Gaussier, J. Welbl, S. Riedel, G. Bouchard, Knowledge graph completion via complex tensor factorization, Journal of Machine Learning Research (JMLR) 18 (2017) 1–38.</mixed-citation>
        </ref>
        <ref id="ref8">
          <mixed-citation>[8] T. Dettmers, P. Minervini, P. Stenetorp, S. Riedel, Convolutional 2D knowledge graph embeddings, in: Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018, pp. 1811–1818. URL: https://arxiv.org/abs/1707.01476.</mixed-citation>
        </ref>
        <ref id="ref9">
          <mixed-citation>[9] Z. Sun, Z.-H. Deng, J.-Y. Nie, J. Tang, RotatE: Knowledge graph embedding by relational rotation in complex space, 2019. arXiv:1902.10197.</mixed-citation>
        </ref>
        <ref id="ref10">
          <mixed-citation>[10] M. Nickel, V. Tresp, H.-P. Kriegel, A three-way model for collective learning on multirelational data, in: Proceedings of the 28th International Conference on Machine Learning, ICML’11, Omnipress, Madison, WI, USA, 2011, pp. 809–816.</mixed-citation>
        </ref>
        <ref id="ref11">
          <mixed-citation>[11] A. Conneau, D. Kiela, SentEval: An evaluation toolkit for universal sentence representations, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, 2018.</mixed-citation>
        </ref>
        <ref id="ref12">
          <mixed-citation>[12] V. Fionda, G. Pirrò, Learning triple embeddings from knowledge graphs, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 3874–3881. URL: https://ojs.aaai.org/index.php/AAAI/article/view/5800. doi:10.1609/aaai.v34i04.5800.</mixed-citation>
        </ref>
        <ref id="ref13">
          <mixed-citation>[13] Q. Wang, Z. Mao, B. Wang, L. Guo, Knowledge graph embedding: A survey of approaches and applications, IEEE Transactions on Knowledge and Data Engineering PP (2017) 1–1.</mixed-citation>
        </ref>
      </ref-list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usunier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García-Durán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Yakhnenko</surname>
          </string-name>
          ,
          <article-title>Translating embeddings for modeling multi-relational data</article-title>
          , in:
          <string-name>
            <given-names>C. J. C.</given-names>
            <surname>Burges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems (NeurIPS 2013)</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>2787</fpage>
          -
          <lpage>2795</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <article-title>Embedding entities and relations for learning and inference in knowledge bases</article-title>
          , in:
          <source>3rd International Conference on Learning Representations, ICLR 2015</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Kalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-T.</given-names>
            <surname>Balke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krestel</surname>
          </string-name>
          ,
          <article-title>Do embeddings actually capture knowledge graph semantics?</article-title>
          , in:
          <source>Eighteenth Extended Semantic Web Conference (ESWC) - Research Track</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>