<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Consistent Relational Transfer Learning with Auto-encoders</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Harald Strömfelt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luke Dickens</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Artur d'Avila Garcez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandra Russo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Representation Learning, Relation Learning, Variational AutoEncoders, Concept Learning</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>City, University of London</institution>
          ,
          <addr-line>Northampton Square, London EC1V 0HB</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Imperial College London</institution>
          ,
          <addr-line>Exhibition Rd, South Kensington, London SW7 2BX</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University College London</institution>
          ,
          <addr-line>Gower St, London WC1E 6BT</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>Human-defined concepts are inherently transferable, but it is not clear under what conditions they can be modelled effectively by non-symbolic artificial learners. This paper argues that for a transferable concept to be learned, the system of relations that define it must be coherent across domains and properties. That is, they should be consistent with respect to relational constraints, and this consistency must extend beyond the representations encountered in the source domain. Further, where relations are modelled by differentiable functions, their gradients must conform: the functions must at times move together to preserve consistency. We propose a Partial Relation Transfer (PRT) task which exposes how well relation-decoders model these properties, and exemplify this with an ordinality prediction transfer task, including a new data set for the transfer domain. We evaluate this on existing relation-decoder models, as well as a novel model designed around the principles of consistency and gradient conformity. Results show that consistency across broad regions of input space indicates good transfer performance, and that good gradient conformity facilitates consistency.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In many situations, concepts that pertain to one set of data can also be relevant to another
[
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Take, for instance, the general concept of ordinality, whose semantics are defined
by relations: isSuccessor, isPredecessor, isGreater, isLess and isEqual; together with their
constraints. Successfully capturing this concept involves learning the corresponding relations
such that they maintain data set and property independence, with no retraining. This is to say
that they have been abstracted from the specific property and act instead as a generic set of
characterizing relations for the semantics of ordinality. For this, we argue that the relations
must be consistent with their expected constraints and coherent across ordinal properties
spanning different data sets, which means their consistency is maintained regardless of data set
or particular ordinal property.
      </p>
      <p>As a concrete example, suppose that we have successfully learned to order images of numbers
by their abstract digit identity, and are presented with a new data set containing images of
individual stacks of blocks. Suppose then that we wish to obtain an ordering over them, such
that we can compare arbitrary data instances using the above relations. Provided that the
learned relations are consistent with their expected constraints, it should be possible to obtain
an encoding that establishes each successor, via our isSuccessor relation, and immediately be
able to compare data instances across the remaining relations. Following this logic, the primary
purpose of this paper is to evaluate under which conditions a relation-decoder model is able to
obtain the ordinality concept. We do this by taking a set of popular relation-decoder models,
including a proposed Dynamic Comparator (DC) model, and assessing (1) their consistencies as
measured in the source data set, and (2) their ability to perform a Partial Relation Transfer
(PRT) task to a novel target data set, which measures the robustness of their consistencies
across domains. The evaluation takes place in two steps. In the first, we learn the above set of
ordinality relations by ordering MNIST images based on their abstract digit identity and report
each model’s consistency profile. In the next step, we take the now pretrained isSuccessor
relation-decoder and apply it to a proposed BlockStacks data set, which consists of images of
multicolored block stacks. Each stack contains a single red block at various heights, which we
use to test the degree to which ordering the encodings of each block stack image, subject to the
pretrained isSuccessor relation, leads to transferred prediction accuracy across the remaining
relations. In summary, the contributions of our work are:
• We devise an experimental setup that can expose the degree to which learning relations
leads to concept abstraction, together with a new BlockStacks data set that presents a
challenging ordering task based on a complex property.
• We introduce a set of data set agnostic characteristic measures for relation-decoders
which can help determine their ability to perform PRT.
• We present a Dynamic Comparator model that achieves excellent PRT.
• Finally, we present a comprehensive analysis of model characteristics against
corresponding PRT performance, for a set of popular relation-decoders.</p>
      <p>The rest of the paper is presented as follows. Section 2 firstly positions our paper with respect
to related work. Section 3 formalises the PRT task and outlines the architecture we employ to
solve it, including the proposed DC relation-decoder model. We then define how we compute
model consistency and gradient-conformity in Section 4. Finally, we provide results and analysis
in Section 5, with concluding remarks in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Relational representations play a prominent role in Knowledge Graph Embedding (KGE),
wherein sets of relation-decoders are jointly learned, through triplet link prediction, in
order to obtain a semantic latent factor representation for entities [
        <xref ref-type="bibr" rid="ref10 ref11 ref3 ref4 ref5 ref6 ref7 ref8 ref9">3, 4, 5, 6, 7, 8, 9, 10, 11</xref>
        ]. In
principle any KGE link prediction model can be employed in this work, but we focus on those
that assume a Euclidean representation space and do not require any additional per-triplet
engineering. Although KGE methods typically do not use a shared auto-encoder as we do in
this paper, Schlichtkrull et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] did adopt an auto-encoding framework, where a graph neural
network is used as the encoder; however, they did not work with visual data and the model was
not applied to transfer. Disentanglement, which also aims to learn semantic representations
for data, is of relevance to this work [
        <xref ref-type="bibr" rid="ref13 ref2">13, 2</xref>
        ], wherein multiple methods have been proposed, for
example using Generative Adversarial Networks [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and VAEs [
        <xref ref-type="bibr" rid="ref1 ref15 ref16 ref17 ref18 ref19 ref20">15, 1, 16, 17, 18, 19, 20</xref>
        ]. Of
particular relevance to our work are investigations looking at the transferability of disentangled
representations [
        <xref ref-type="bibr" rid="ref21 ref22 ref23">21, 22, 23</xref>
        ], but these did not include relation learning. A bridge between
relation learning and disentanglement, wherein relation-decoders are employed as a
semi-supervision to VAEs, can be found in [
        <xref ref-type="bibr" rid="ref24 ref25 ref26">24, 25, 26</xref>
        ]. Lastly, we note that our experimental setup is
most reminiscent of domain adaptation [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. To the best of our knowledge, no work has compared
relation-decoders in their ability to abstract concepts, as measured by their consistency and its
transfer across domains.
      </p>
    </sec>
    <sec id="sec-2b">
      <title>3. The Partial Relation Transfer Task and Model</title>
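      <p>
        Partial Relation Transfer (PRT) is at its core a domain adaptation task [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], wherein we have
a source and target data domain, consisting of a set of images, <inline-formula><tex-math>\mathcal{X}^{S}</tex-math></inline-formula> and <inline-formula><tex-math>\mathcal{X}^{T}</tex-math></inline-formula>, respectively, and a
set of shared relation prediction tasks, <inline-formula><tex-math>\mathcal{R} = \{r_1, \dots, r_K\}</tex-math></inline-formula>. We approximate each relation using a
relation-decoder <inline-formula><tex-math>f^{m}_{r} : \mathcal{Z} \times \mathcal{Z} \to [0, 1]</tex-math></inline-formula>, where <inline-formula><tex-math>\mathcal{Z}</tex-math></inline-formula> denotes a latent space that contains all image
encodings <inline-formula><tex-math>z_i \in \mathcal{Z}</tex-math></inline-formula>. The superscript <inline-formula><tex-math>m</tex-math></inline-formula> denotes a specific relation-decoder model, as we test
multiple variants. To obtain embeddings we use a domain-specific auto-encoder, consisting of
an encoder <inline-formula><tex-math>q_{\phi}^{S/T} : \mathcal{X} \to \mathcal{Z}</tex-math></inline-formula> and decoder <inline-formula><tex-math>p_{\theta}^{S/T} : \mathcal{Z} \to \mathcal{X}</tex-math></inline-formula>, which helps to minimise information loss
through reconstruction of the input image<sup>1</sup>.
      </p>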
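      <p>The evaluation takes place as a two-step procedure. In the first, all relation-decoders are
trained in the source domain, as a semi-supervision to the auto-encoder, using available labels,
<inline-formula><tex-math>Y^{S} \in \mathbb{R}^{|\mathcal{R}| \times |\mathcal{X}^{S}| \times |\mathcal{X}^{S}|}</tex-math></inline-formula>, that specify whether a relation <inline-formula><tex-math>r \in \mathcal{R}</tex-math></inline-formula> holds between images <inline-formula><tex-math>x_i, x_j \in \mathcal{X}^{S}</tex-math></inline-formula>. Here,
<inline-formula><tex-math>|\cdot|</tex-math></inline-formula> denotes the cardinality of the operand set, but in practice we only use a small fraction
of the available labels. In the second evaluation step, we initialise a new auto-encoder to be
applied to the target dataset and use a subset of the pretrained relation-decoders, with labels
<inline-formula><tex-math>Y^{T} \in \mathbb{R}^{|\mathcal{R}| \times |\mathcal{X}^{T}| \times |\mathcal{X}^{T}|}</tex-math></inline-formula>, to act as fixed-parameter ‘guides’ for the encoder.</p>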
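      <p><sup>1</sup>Further analysis on the performance of BlockStacks embeddings for domain-dependent tasks can be found in …</p>
      <p>[Figure caption fragment: “… and fixing parameters for each included relation-decoder.”]</p>
      <p>
        To obtain informative data encodings, we use a Variational AutoEncoder (VAE), specifically
the β-VAE, given its simplicity and demonstrated ability to separate distinct factors in the latent
representation [
        <xref ref-type="bibr" rid="ref1 ref15 ref28">1, 15, 28</xref>
        ]. The β-VAE achieves this by optimising the ELBO objective, which for
the purposes of this paper we express as a loss over both encoder and decoder:
        <disp-formula><label>(1)</label><tex-math>\mathcal{L}_{\beta\text{-VAE}} = \mathcal{L}_{\mathrm{rec}}\big(q_{\phi}^{S/T}, p_{\theta}^{S/T}\big) + \beta\, \mathcal{L}_{\mathrm{KL}}\big(q_{\phi}^{S/T}, \mathcal{N}(0, I)\big),</tex-math></disp-formula>
where an additional scalar hyperparameter β is used to influence disentanglement through
stronger distribution matching pressure to an isotropic zero-mean Gaussian prior, <inline-formula><tex-math>\mathcal{N}(0, I)</tex-math></inline-formula>.
When β = 1 we obtain the original VAE objective [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. We provide the full ELBO loss, with
a detailed explanation, in Appendix B.
      </p>
      <p>
        Each experiment involves taking embeddings from a
corresponding encoder and passing them through to sets of relation-decoders (either the full
set in the case of the source domain, or only a subset in the target domain). We can treat
each relation-decoder as producing a prediction <inline-formula><tex-math>\hat{y}_{r,i,j}</tex-math></inline-formula> for whether relation <inline-formula><tex-math>r</tex-math></inline-formula> holds between data
instances <inline-formula><tex-math>x_i</tex-math></inline-formula> and <inline-formula><tex-math>x_j</tex-math></inline-formula> [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Using the ground truth <inline-formula><tex-math>y_{r,i,j}</tex-math></inline-formula>, we can then compute the loss over all
relation-decoders, <inline-formula><tex-math>\mathcal{L}_{\mathrm{rel}}</tex-math></inline-formula>, as the binary cross-entropy of prediction versus ground truth. This gives us
the final joint objective between VAE and relation-decoders:
        <disp-formula><label>(2)</label><tex-math>\mathcal{L} = \mathcal{L}_{\beta\text{-VAE}} + \lambda\, \mathcal{L}_{\mathrm{rel}}, \qquad \mathcal{L}_{\mathrm{rel}} = -\sum_{r,i,j} \big[\, y_{r,i,j} \log(\hat{y}_{r,i,j}) + (1 - y_{r,i,j}) \log(1 - \hat{y}_{r,i,j}) \,\big],</tex-math></disp-formula>
where λ is a scalar weighting parameter.</p>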
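      <p>As a rough illustration of the joint objective in Equation (2), the following Python sketch adds a weighted binary cross-entropy over relation predictions to a precomputed β-VAE loss. The function names (bce, joint_loss), the flat-list inputs, the toy numbers and the clamping constant eps are assumptions made for this example; they are not taken from the authors' implementation.</p>
      <preformat><![CDATA[
```python
import math


def bce(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy summed over all (relation, i, j) label entries."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp so both logarithms stay finite
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total


def joint_loss(vae_loss, y_true, y_pred, lam):
    """Joint objective: VAE loss plus the lam-weighted relation loss."""
    return vae_loss + lam * bce(y_true, y_pred)


# Toy example: two labelled pairs for a single relation (made-up numbers).
labels = [1.0, 0.0]
predictions = [0.9, 0.2]
total = joint_loss(vae_loss=1.5, y_true=labels, y_pred=predictions, lam=10.0)
```
]]></preformat>
      <p>In this sketch the VAE loss is treated as an opaque scalar, so the weighting λ only rescales how strongly relation-prediction errors contribute relative to reconstruction and prior matching.</p>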
      <sec id="sec-2-1">
        <title>3.1. Dynamic Comparator</title>
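        <p>In our analysis, we include a proposed low-complexity, but nonetheless expressive, “Dynamic
Comparator” (DC) model, which is designed to model systems of relations, whilst encouraging
desirable properties for PRT. The overall DC model is composed of two modes, a distance-based
measure, <inline-formula><tex-math>f^{\dagger}_{r}</tex-math></inline-formula>, that can compute how close the vector difference between two inputs is to a
positive or negative valued reference vector, and a step-like function, <inline-formula><tex-math>f^{\ddagger}_{r}</tex-math></inline-formula>, that determines the
sign of the difference between two points, optionally with an offset. The overall DC model is
given by<sup>2</sup>:
          <disp-formula><label>(3)</label><tex-math>f^{\mathrm{DC}}_{r}(z_i, z_j) = a_0 \cdot \underbrace{\sigma_0\big(\eta_0 \lVert u \odot (z_i - z_j + b^{\dagger}) \rVert_2^2\big)}_{f^{\dagger}_{r}} + a_1 \cdot \underbrace{\sigma_1\big(\eta_1 \cdot u^{\top}(z_i - z_j + b^{\ddagger})\big)}_{f^{\ddagger}_{r}},</tex-math></disp-formula>
where <inline-formula><tex-math>a = \mathrm{Softmax}(\alpha) \in \mathbb{R}^{2}</tex-math></inline-formula> is an attention weighting between the two modes, and ensures
that <inline-formula><tex-math>f^{\mathrm{DC}}_{r}</tex-math></inline-formula> is bound to [0, 1]. <inline-formula><tex-math>\sigma_0, \sigma_1</tex-math></inline-formula> are an exponential and sigmoid function, respectively; <inline-formula><tex-math>u = \mathrm{Softmax}(\mu) \in \mathbb{R}^{z_{\mathrm{dim}}}</tex-math></inline-formula> is an attention mask which is applied to <inline-formula><tex-math>z_{\mathrm{dim}}</tex-math></inline-formula>-dimensional latent embeddings;
<inline-formula><tex-math>b^{\dagger}, b^{\ddagger} \in \mathbb{R}^{z_{\mathrm{dim}}}</tex-math></inline-formula> are learnable bias terms that enable an offset to each mode; and <inline-formula><tex-math>\eta_0 \in \mathbb{R}^{+}</tex-math></inline-formula> and <inline-formula><tex-math>\eta_1 \in \mathbb{R}</tex-math></inline-formula> are
a non-negative and an any-valued scalar term, respectively. Lastly, <inline-formula><tex-math>\odot</tex-math></inline-formula> denotes the Hadamard
product (elementwise multiplication) and <inline-formula><tex-math>\lVert \cdot \rVert_2</tex-math></inline-formula> is the 2-norm. Due to a convergence issue when
using a pretrained DC with fixed parameters, we needed to use a flexible fitting procedure in
which we enable the DC parameters to train in the target domain, but with the additional loss
term <inline-formula><tex-math>\lVert \theta^{*} - \theta \rVert</tex-math></inline-formula> between the pretrained parameters <inline-formula><tex-math>\theta^{*}</tex-math></inline-formula> and the parameters <inline-formula><tex-math>\theta</tex-math></inline-formula> being trained. In all cases we
evaluated the final parameter values in the target domain and found them to be approximately
equivalent to <inline-formula><tex-math>\theta^{*}</tex-math></inline-formula>. We did not apply this method to the other models as they were all able to
fit the isSuccessor relation in the target domain.
        </p>
        <p><sup>2</sup>In the main text we report results for this DC model, but we can use any function that has the required
characteristics for <inline-formula><tex-math>f^{\dagger}_{r}</tex-math></inline-formula> and <inline-formula><tex-math>f^{\ddagger}_{r}</tex-math></inline-formula>. We include results for other versions in Appendix D.</p>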
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Measuring relation-decoder characteristics</title>
      <p>In this section we describe a series of measures that we use to understand more about the
intrinsic characteristics of each relation-decoder, which together help identify the behaviour of
each relation-decoder model and provide insight regarding their respective PRT performance.</p>
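      <p>For any system of relations, we can write down a truth-table that defines the valid
truth-states that they may collectively take, which we expect our relation-decoders to model. For
example, we know that any time isGreater is true, isLess must not be. By assuming that each
relation-decoder output is pairwise conditionally independent given <inline-formula><tex-math>z_i, z_j</tex-math></inline-formula>, for instance,
        <disp-formula><tex-math>p(\mathrm{isGreater}, \mathrm{isLess} \mid z_i, z_j) = p(\mathrm{isGreater} \mid z_i, z_j)\, p(\mathrm{isLess} \mid z_i, z_j),</tex-math></disp-formula>
we can produce a probability statement for whether the relations are consistent with valid
entries to the truth-table. Taking <inline-formula><tex-math>r_1 = \mathrm{isGreater}</tex-math></inline-formula> and <inline-formula><tex-math>r_2 = \mathrm{isLess}</tex-math></inline-formula> as our entire system of relations,
we can produce the following truth-table conversion, where invalid entries are omitted:</p>
      <table-wrap>
        <table>
          <thead>
            <tr><th><inline-formula><tex-math>r_1(z_i, z_j)</tex-math></inline-formula></th><th><inline-formula><tex-math>r_2(z_i, z_j)</tex-math></inline-formula></th><th><inline-formula><tex-math>\mathcal{F}(r_1, r_2)</tex-math></inline-formula></th></tr>
          </thead>
          <tbody>
            <tr><td>True</td><td>False</td><td>True</td></tr>
            <tr><td>False</td><td>True</td><td>True</td></tr>
            <tr><td>False</td><td>False</td><td>True</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>This table corresponds to the logical statement
        <disp-formula><label>(4)</label><tex-math>\mathcal{F}(r_1, r_2) = \forall z_i, z_j\; \big( (r_1(z_i, z_j) \wedge \neg r_2(z_i, z_j)) \vee (\neg r_1(z_i, z_j) \wedge r_2(z_i, z_j)) \vee (\neg r_1(z_i, z_j) \wedge \neg r_2(z_i, z_j)) \big),</tex-math></disp-formula>
which, using our relation-decoders for each relation and with <inline-formula><tex-math>r_k(z_i, z_j) = f_{r_k}(z_i, z_j)</tex-math></inline-formula> and <inline-formula><tex-math>\neg r_k(z_i, z_j) = 1 - f_{r_k}(z_i, z_j)</tex-math></inline-formula>, lets us express the probability of <inline-formula><tex-math>\mathcal{F}</tex-math></inline-formula> being true as:
        <disp-formula><label>(5)</label><tex-math>p(\mathcal{F} \mid z_i, z_j) = f_{r_1}(z_i, z_j) \cdot (1 - f_{r_2}(z_i, z_j)) + (1 - f_{r_1}(z_i, z_j)) \cdot f_{r_2}(z_i, z_j) + (1 - f_{r_1}(z_i, z_j)) \cdot (1 - f_{r_2}(z_i, z_j)).</tex-math></disp-formula>
      </p>
      <p>Finally, since <inline-formula><tex-math>\mathcal{F}</tex-math></inline-formula> should hold for all input combinations, we heavily penalise violations by using
a binary cross-entropy loss between <inline-formula><tex-math>\mathcal{F}</tex-math></inline-formula> and the expected outcome:
        <disp-formula><label>(6)</label><tex-math>\mathrm{BCE}(p(\mathcal{F})) = -\frac{1}{N} \sum_{z_i, z_j \in \mathcal{Z}} 1 \cdot \log p(\mathcal{F} \mid z_i, z_j),</tex-math></disp-formula>
where <inline-formula><tex-math>\mathcal{Z}</tex-math></inline-formula> is the latent space, as we can compute this score for any samples from this space<sup>3</sup>, and <inline-formula><tex-math>N</tex-math></inline-formula>
is a normalising constant, equal to the number of <inline-formula><tex-math>z_i</tex-math></inline-formula> and <inline-formula><tex-math>z_j</tex-math></inline-formula> sample pairs used in the calculation.
We refer to this measure as Con-A, referring to the fact that we use it to measure consistency
across multiple relations.</p>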
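      <p>To make the Con-A computation concrete, the sketch below evaluates it for the two-relation system isGreater and isLess, where the only invalid truth-state is the one in which both relations hold. The function names (prob_consistent, con_a), the list-of-pairs input and the eps floor are assumptions made for this illustration, and the decoder outputs are made-up numbers rather than results from the paper.</p>
      <preformat><![CDATA[
```python
import math


def prob_consistent(p_greater, p_less):
    """Probability mass on the valid truth-table rows, assuming independence."""
    return (p_greater * (1.0 - p_less)             # isGreater only
            + (1.0 - p_greater) * p_less           # isLess only
            + (1.0 - p_greater) * (1.0 - p_less))  # neither relation holds


def con_a(pairs, eps=1e-12):
    """Mean negative log-probability of consistency over sampled input pairs."""
    total = 0.0
    for p_greater, p_less in pairs:
        # eps keeps the logarithm finite when both decoders output exactly 1
        total -= math.log(max(prob_consistent(p_greater, p_less), eps))
    return total / len(pairs)


# Toy decoder outputs (isGreater, isLess) for three sampled latent pairs.
pairs = [(0.99, 0.01), (0.02, 0.97), (0.5, 0.5)]
score = con_a(pairs)
```
]]></preformat>
      <p>A score of zero means every sampled pair places all probability mass on valid truth-states; the score grows as the two decoders jointly assign probability to the invalid state.</p>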
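      <p>To provide a deeper understanding about how relation-decoders collectively interact with
their inputs, we use a gradient evaluation to see whether models respond similarly to changes in
their input. For a set of relations, we define the gradient-conformity (GC) of relation <inline-formula><tex-math>r_k</tex-math></inline-formula> against
all others by the following cosine-similarity:
        <disp-formula><label>(7)</label><tex-math>\mathrm{GC}_{k,l} = \frac{\lvert g_k^{\top} g_l \rvert}{\lVert g_k \rVert_2\, \lVert g_l \rVert_2}, \quad \text{where } g_k = \left.\frac{\partial f_{r_k}}{\partial z}\right|_{z = \tilde{z}} \text{ and } g_l = \left.\frac{\partial f_{r_l}}{\partial z}\right|_{z = \tilde{z}}, \quad \forall\, l \neq k,</tex-math></disp-formula>
where <inline-formula><tex-math>\lvert \cdot \rvert</tex-math></inline-formula> denotes the absolute value of the operand and <inline-formula><tex-math>z</tex-math></inline-formula> is the concatenation of each
relation-decoder's inputs, with gradients evaluated at reference inputs <inline-formula><tex-math>\tilde{z}</tex-math></inline-formula>. GC will be 1 if gradients are
aligned and zero if orthogonal<sup>4</sup>.</p>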
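      <p>The gradient-conformity measure can be sketched numerically as the absolute cosine similarity between two decoders' input gradients. The example below is only an illustration: it uses central finite differences instead of automatic differentiation, and the helper names (num_grad, gradient_conformity) and the toy one-dimensional decoders is_greater and is_less are assumptions, not the models evaluated in the paper.</p>
      <preformat><![CDATA[
```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def num_grad(f, z, h=1e-6):
    """Central-difference gradient of a scalar function f at the point z."""
    grad = []
    for k in range(len(z)):
        up = list(z)
        up[k] += h
        down = list(z)
        down[k] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def gradient_conformity(f_a, f_b, z):
    """Absolute cosine similarity between the input gradients of f_a and f_b.

    Undefined (division by zero) if either gradient vanishes at z.
    """
    g_a, g_b = num_grad(f_a, z), num_grad(f_b, z)
    dot = sum(x * y for x, y in zip(g_a, g_b))
    norm_a = math.sqrt(sum(x * x for x in g_a))
    norm_b = math.sqrt(sum(x * x for x in g_b))
    return abs(dot) / (norm_a * norm_b)


# Toy decoders on a concatenated input z = [z_i, z_j] with 1-D encodings.
def is_greater(z):
    return sigmoid(z[0] - z[1])


def is_less(z):
    return sigmoid(z[1] - z[0])


gc = gradient_conformity(is_greater, is_less, [0.3, -0.2])
```
]]></preformat>
      <p>The two toy decoders have anti-parallel gradients, so the absolute value in the measure yields a GC close to 1; decoders that depend on disjoint input dimensions would instead give a GC of zero.</p>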
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <p>
        This section presents results for the PRT task on a range of relation-decoder models. In the
source domain, we learn a system of binary relations: ℛ = {isSuccessor (S), isPredecessor (P),
isGreater (G), isEqual (E), isLess (L)}, on digits represented in MNIST images, alongside a β-VAE.
In the target domain, we take the pretrained S relation as a fixed-parameter guide for a new
β-VAE applied to BlockStacks images (see Appendix A for BlockStacks image examples), and
then evaluate PRT accuracy on the held-out G, E, L and P relations. Relation-decoder models
compared here are: TransR [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], HolE [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ], NTN [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], our proposed DC and a basic
neural-network baseline, NN. NN is a simple four-layer (<inline-formula><tex-math>l_{\mathrm{in}}, l_1, l_2, l_{\mathrm{out}}</tex-math></inline-formula>) neural network with layer sizes
<inline-formula><tex-math>l_{\mathrm{in}} = 2 z_{\mathrm{dim}}</tex-math></inline-formula>, <inline-formula><tex-math>l_1 = 2 z_{\mathrm{dim}}</tex-math></inline-formula> and <inline-formula><tex-math>l_2 = z_{\mathrm{dim}}</tex-math></inline-formula>, with ReLU activations. The final output layer <inline-formula><tex-math>l_{\mathrm{out}}</tex-math></inline-formula> is a single value
passed through a sigmoid function, to bound the output to [0, 1]. Further model details are
provided in Appendix C.
      </p>
      <p>We vary β only in the source domain, ranging across values {1, 4, 8, 12}, but fix it in the target
domain. λ is fixed in both domains (see Appendix C.3 for further details on hyperparameter
settings). For Con-A and GC measures, we produce encodings for three data splits:
data-embeddings, where all inputs are encodings of a domain’s test data; interpolation, where we
obtain an empirical mean and variance for the domain’s data-embeddings and sample from a
corresponding Gaussian distribution; and extrapolation, where we sample from regions strictly
outside the data-embeddings region.</p>
      <p>Figure 2-top shows relation-decoder accuracies in the source MNIST (left)
and target BlockStacks (right) domains. Key observations are that DC produces excellent
PRT performance, whilst NN, NTN and HolE all see some degradation from their source
accuracies. TransR seems to maintain a similar accuracy profile. We include β’s impact on
these performances in Figure 2-bottom. Barring DC, which has little discernible change in either
domain, PRT performance is significantly impacted by β in all models, although β has little effect in
the source domain. Additionally, TransR has a strong positive correlation with β, whereas
NN, NTN and HolE produce the best PRT performance with intermediate disentanglement
pressure. To interrogate further how β affects each model, we provide: (Figure 3-top) mean
relation Con-A referenced to both source (left) and target (right) domain embeddings; and
(Figure 3-bottom) source domain referenced GC measures for each model. In the left and
bottom plots, blue (left group), green (middle group) and red (right group) show results for
the data-embeddings, interpolation and extrapolation regions of latent space, in respective
order. From the source domain Con-A results, we note that DC shows excellent consistency
across relations in all regions. Most other models have worse interpolation and extrapolation
consistency. Increasing β appears to give some improvement for all but HolE, but there are
indications that this trend does not persist into the largest β = 12 value. Interestingly, Con-A
values for target data-embeddings (right) are notably worse than for source data-embeddings,
with values closer to those for interpolation or extrapolation in the source domain. For GC, DC
performance is close to 1 for all β with no discernible change. All other models show a weaker
GC with positive correlation between GC and β. TransR and NN achieve significantly higher
GC than NTN and HolE.</p>
      <p><sup>3</sup>In practice, as we cannot include every encoding combination, we provide an estimate.</p>
      <p><sup>4</sup>We can evaluate against this measure for arbitrary samples from the latent space.</p>
      <sec id="sec-4-2">
        <title>5.1. Key Experimental Results</title>
        <sec id="sec-4-2-1">
          <title>5.1.1. Does good source task accuracy lead to successful PRT?</title>
          <p>Since we transfer pretrained models from source to target domain and ensure that the target
encoder fits its encodings to the pretrained S relation-decoder, we might expect that relation-decoding performance will
be the same in both domains. However, despite DC, NN, NTN and HolE all performing close to
100% accuracy, and TransR achieving above 80%, across all relations, and with all models able
to achieve similar prediction accuracy (or better in the case of DC) on the guide relation S, PRT
performance varies significantly across models. It is firstly evident that
DC is successful at PRT,
sustaining approximately 100% accuracy across all held-out relations. NN achieves mostly good
performance, with greater degradation across the P and E relations. Although HolE and NTN both
achieve good PRT for P, there is increasing degradation across the E, G and L relations. TransR is
able to achieve strong relative performance, where PRT accuracy per relation is comparable to
what was possible in the source domain. These results indicate that source accuracy alone is
not enough to determine whether models will be successful at PRT.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>5.1.2. How does β affect Con-A and GC and how does this impact model coherence?</title>
          <p>To provide an overview of how increased disentanglement pressure affects each model we can
firstly compare how β affects model performance in both source and target domain. Figure
2-bottom demonstrates that, although relation prediction accuracies for most models either do
not respond, or respond negatively, to increases in β in the source domain, their PRT behaviour
differs significantly across models: DC shows no discernible change, whilst NN, NTN and HolE
all show a parabolic response with a maximum PRT around β = 8; TransR shows a general
positive correlation but with diminishing returns above β = 8. To gain further insight into the
role of disentanglement pressure, it is necessary to look at how each model’s intrinsic behaviour
responds to β changes.</p>
          <p>First, we attempt to expose the relationship between β and consistency and whether this has
any effect on PRT performance. By Figure 3-top, DC clearly outperforms all other models on
Con-A and this coincides with better PRT performance. The next best performing model on
Con-A in the source domain is also the next best on PRT performance. In most cases Con-A
degrades for all models when moving from data-embeddings to interpolation and extrapolation,
but the degree of degradation changes depending on the model. Interestingly, across all models,
their target Con-A is notably close to that of interpolation or extrapolation in the source
domain analysis. This suggests that guiding the target encoder to fit relation S produces data embeddings
that lie in the interpolation or extrapolation regions with respect to MNIST embeddings. This in turn
suggests that a relation-decoder model’s ability to retain consistency over regions of latent space
beyond where MNIST embeddings are found leads to improved PRT. These findings provide
compelling evidence in support of our claim that consistency across relations is important for
PRT performance.</p>
          <p>Secondly, we examine how gradient-conformity affects PRT performance. To achieve
successful PRT, fitting the target encoder to a single pretrained relation should lead to embeddings
that are structured correctly with respect to the other pretrained relations. For this to be
possible there must be a degree of conformity between how each model computes its system of
relations. As an extreme case, suppose we have a two-dimensional latent representation, with
two relations that are each calculated using entirely different dimensions of latent space. By
fitting an encoder to one of these relations, there is no guarantee that the latent dimension
that the other relation requires receives the necessary guidance. DC shows excellent and stable
GC values (near 1) across all conditions. This is by design, as the use of masks per relation
ensures that if masks match for any two relations, then their gradients will be either parallel
or anti-parallel. Excluding HolE, all remaining models show a positive correlation between
GC and β, and it appears that models with either higher GC values, or a stronger response to β, typically
perform better at PRT. Together this provides tentative evidence to suggest that GC is important
to model coherence, as measured by their PRT performance. It is possible that we do not see a
monotonic benefit of GC against PRT, due to no further extrapolation or interpolation Con-A
gains with β &gt; 8.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion</title>
      <p>We provide a comprehensive analysis of relation-decoder characteristics when learning the
system of relations that together define the semantics of a concept. We then compare these
characteristics with a Partial Relation Transfer task setting, which determines whether, given
logical constraints between relations, fitting embeddings to one relation-decoder leads to
embeddings that satisfy all other relations in terms of their logical consistency and accuracy.
Our results demonstrate that model consistency, and possibly gradient-conformity, across
different regions of input space together determine whether a set of relation-decoders have
learned a consistent and coherent notion of a given concept, in this case ordinality. These
measures make it possible to check whether a set of relation-decoders have indeed learned a
transferable concept, or if they are limited to a single data domain and property.</p>
    </sec>
    <sec id="sec-6">
      <title>A. BlockStacks dataset description</title>
      <p>
        The BlockStacks dataset consists of 12,000 images (200 × 200 pixels, resized in code to 128 × 128)
of individual block stacks of varying height (between 1 and 10 blocks), block colors (uniformly
sampled from the options {gray, blue, green, brown, purple, cyan, yellow}) and position (uniformly
sampled from an (x, y) range of (-3, -3) to (3, 3)), with the requirement that each instance contains
a single red block at a random height (see Figure 4 for example images). These were rendered
using the CLEVR rendering agent with the help of code from [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. The dataset is divided into
9000:1500:1500 train, validation and test splits.
      </p>
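    </sec>
    <sec id="sec-6b">
      <title>B. Explanation of the β-VAE</title>
      <p>
        The VAE is derived by introducing an approximate posterior <inline-formula><tex-math>q_{\phi}(z \mid x)</tex-math></inline-formula>, from which a lower bound
(commonly referred to as the Evidence LOwer Bound (ELBO)) on the true marginal <inline-formula><tex-math>\log p_{\theta}(x)</tex-math></inline-formula>
can be obtained by using Jensen’s inequality [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. The VAE maximises the log-probability by
maximising this lower bound, given by:
        <disp-formula><label>(8)</label><tex-math>\mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{q_{\phi}(z \mid x)}\big[\log p_{\theta}(x \mid z)\big] - \beta\, \mathrm{KL}\big(q_{\phi}(z \mid x) \,\Vert\, p_{\theta}(z)\big),</tex-math></disp-formula>
where <inline-formula><tex-math>q_{\phi}(z \mid x)</tex-math></inline-formula> is the approximate posterior, typically modelled as a neural network encoder with
parameters <inline-formula><tex-math>\phi</tex-math></inline-formula>. Similarly, <inline-formula><tex-math>p_{\theta}(x \mid z)</tex-math></inline-formula> is modelled as a decoder with parameters <inline-formula><tex-math>\theta</tex-math></inline-formula> and is calculated as a
Monte Carlo estimation. A reparameterization trick is used to enable differentiation through an
otherwise undifferentiable sampling from <inline-formula><tex-math>q_{\phi}(z \mid x)</tex-math></inline-formula> (see [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]). In the β-VAE [
        <xref ref-type="bibr" rid="ref1 ref15">1, 15</xref>
        ], an additional
scalar hyperparameter β was added as it was found to influence disentanglement through stronger
distribution matching pressure with respect to the prior <inline-formula><tex-math>p_{\theta}(z)</tex-math></inline-formula>, where this prior is typically set
to an isotropic zero-mean Gaussian <inline-formula><tex-math>\mathcal{N}(0, I)</tex-math></inline-formula>. When β = 1 we obtain the standard VAE objective
[
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. Equation (1) in the main text expresses the negative of this bound as a loss to be minimised.
      </p>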
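      <p>For a diagonal Gaussian posterior and an isotropic zero-mean Gaussian prior, the KL term has a well-known closed form, which the sketch below uses to assemble a per-sample β-weighted loss (the negative of the bound). The function names, the log-variance parameterisation and the assumption that the reconstruction negative log-likelihood is supplied as a precomputed scalar are choices made for this illustration only.</p>
      <preformat><![CDATA[
```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def beta_vae_loss(recon_nll, mu, log_var, beta):
    """Negative beta-weighted ELBO for one sample.

    recon_nll is assumed to be a (Monte Carlo) estimate of -log p(x | z).
    """
    return recon_nll + beta * kl_to_standard_normal(mu, log_var)


# Toy posterior parameters for a 2-D latent space (made-up numbers).
loss = beta_vae_loss(recon_nll=2.0, mu=[1.0, 0.0], log_var=[0.0, 0.0], beta=4.0)
```
]]></preformat>
      <p>Setting beta to 1 recovers the standard VAE loss in this sketch, while larger values increase the pressure on the posterior to match the prior.</p>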
    </sec>
    <sec id="sec-7">
      <title>C. Model Descriptions</title>
      <p>In this section we provide model details for each relation-decoder that we use and the VAE
architecture that we employ for each data set.</p>
      <sec id="sec-7-1">
        <title>C.1. Relation Decoder implementations</title>
        <p>TransR:
with,
 TransR(  ,   ) = ‖ℎ +  −   ‖2
ℎ =    
and   =     .
 TransR+(  ,   ) ≈ 0. In all experiments we set  = 10 .</p>
        <sec id="sec-7-1-1">
          <title>NTN (modified version from [ 32, 33]):</title>
          <p>
            As we want to obtain a [
            <xref ref-type="bibr" rid="ref1">0,1</xref>
            ] output, we modify TransR through  TransR+
where  is a sigmoid function and c is a scalar that ensures that at  TransR+(  ,   ) = 0, then
=  ( −  
(9)
where   ∈ ℝ ,   ∈ ℝ(−1)⋅  ×(−1)⋅  × ,   ∈ ℝ×(−1)⋅  ) and   ∈ ℝ . The only hyperparameter to
consider is  , which controls the NTN’s capacity - in all experiments, we set this to 1. Here   is
a concatenation of the inputs  0, … ,   , which was introduced in [
            <xref ref-type="bibr" rid="ref32 ref33">32, 33</xref>
            ]. In contrast the original
NTN (see [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]) is only applicable to binary relations and does not include the outer sigmoid.
          </p>
          <p>HolE:
$$f_{\mathrm{HolE}}(z_i, z_j) = \sigma\bigl(r^{\top}(z_i \star z_j)\bigr),$$
where $\star : \mathbb{R}^{d_z} \times \mathbb{R}^{d_z} \to \mathbb{R}^{d_z}$ denotes the circular correlation operator and is given by
$$[z_i \star z_j]_l = \sum_{m=0}^{d_z - 1} z_{i,m} \, z_{j,(l+m) \bmod d_z}.$$
          </p>
          <p>NN: a simple four-layer neural network with layer sizes $d_{\mathrm{in}} = 2 d_z$, $d_1 = 2 d_z$ and $d_2 = d_z$,
with ReLU activations, for latent representations of size $d_z$. The final output layer, $d_{\mathrm{out}}$, is a
single value passed through a sigmoid function, to bound the output within [0,1].
          </p>
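          <p>The TransR+ and HolE decoders described in this section can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names, the toy parameter values, and the sign convention inside the TransR+ sigmoid (a small translation distance giving an output near 1) are assumptions rather than details confirmed by the text.</p>
          <preformat>
```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def transr_plus(z_i, z_j, m_r, r, c=10.0):
    """sigmoid(c - ||M_r z_i + r - M_r z_j||_2): near 1 when the distance is 0."""
    h, t = matvec(m_r, z_i), matvec(m_r, z_j)
    dist = math.sqrt(sum((hk + rk - tk) ** 2 for hk, rk, tk in zip(h, r, t)))
    return sigmoid(c - dist)


def circular_correlation(a, b):
    """[a star b]_l = sum_m a_m * b_((l + m) mod d)."""
    d = len(a)
    return [sum(a[m] * b[(l + m) % d] for m in range(d)) for l in range(d)]


def hole(z_i, z_j, r):
    """sigmoid(r^T (z_i star z_j))."""
    corr = circular_correlation(z_i, z_j)
    return sigmoid(sum(rk * ck for rk, ck in zip(r, corr)))


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    # The translation r maps z_i exactly onto z_j, so the distance is zero.
    print(transr_plus([0.0, 0.0], [1.0, 2.0], identity, [1.0, 2.0]))
    print(circular_correlation([1, 2, 3], [4, 5, 6]))
    print(hole([0.1, 0.2, 0.3], [0.3, 0.1, 0.2], [1.0, -1.0, 0.5]))
```
          </preformat>
          <p>In practice the circular correlation is usually computed with fast Fourier transforms; the direct double loop is kept here because it mirrors the summation term by term.</p>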
        </sec>
      </sec>
      <sec id="sec-7-2">
        <title>C.2. VAE configuration</title>
        <p>
          In all representation learning experiments, we use a $\beta$-VAE trained for 300,000 steps, following
accepted practice from [
          <xref ref-type="bibr" rid="ref20 ref22">20, 22</xref>
          ].
        </p>
        <p>The encoder-decoder model parameters are given in Table 1; we include the model
configurations used for both the MNIST and BlockStacks datasets.</p>
        <p>C.3. $\mathcal{L}$ configuration</p>
        <p>In the source domain, we vary $\beta$ values between {1, 4, 8, 12} and fix $\lambda = 10^{3}$. In the target domain,
we fix $\beta$ to $10^{-4}$ and $\lambda = 10^{-2}$, and normalise the $\mathcal{L}_{\beta\text{-VAE}}$ reconstruction term by dividing by a
factor $\frac{1}{\sqrt{H \cdot W \cdot C}}$, for height $H$, width $W$ and color channels $C$, and normalise $\mathcal{L}_{\mathrm{KL}}(q_{\phi}, \mathcal{N}(0, I))$ by a
factor $\frac{1}{d_z}$, for latent representation size $d_z$.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>D. Supplementary Results</title>
      <p>
        individual relation properties covering transitivity, asymmetry and reflexivity) and Con-A,
configured on the same data splits as described in the main text. These results cover variants of
the DC and NN models. DC variants include: DC-Basic, which uses the same $f_{\ddagger}$ as DC, but uses a
similar $f_{\dagger}$ to that of [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] while including the dynamic mask and $f_{\dagger}$ offset; DC-Gaus, which again uses the same
$f_{\ddagger}$ but uses a Gaussian function for $f_{\dagger}$; DC-Cauchy, which uses a Cauchy distribution form for $f_{\dagger}$
and a Cauchy cumulative distribution function for $f_{\ddagger}$; and finally DC-CCS, which employs a
modified Cauchy distribution for $f_{\dagger}$, via
$$\sigma\bigl(c\,(2 \cdot f_{\mathrm{DC\text{-}Cauchy},\dagger} - 1)\bigr),$$
where $\sigma$ is the sigmoid function and $c$ is a scalar value. This modification enables a cliff-like
shape for $f_{\dagger}$, such that it can output close to 1 for a wider vector difference range. Note that all
distribution forms are unnormalised so that they cover the interval [0,1].
      </p>
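      <p>The effect of the DC-CCS modification can be illustrated with a small sketch. It assumes an unnormalised Cauchy form $1 / (1 + (x / \gamma)^2)$ over a scalar difference $x$; the scale gamma, the sharpening scalar c = 10 and the function names are hypothetical choices made for illustration, not values reported in the paper.</p>
      <preformat>
```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cauchy_form(diff, gamma=1.0):
    """Unnormalised Cauchy shape: equals 1 at diff = 0 and decays towards 0."""
    return 1.0 / (1.0 + (diff / gamma) ** 2)


def cauchy_ccs(diff, gamma=1.0, c=10.0):
    """Sigmoid-sharpened Cauchy: sigmoid(c * (2 * cauchy_form - 1))."""
    return sigmoid(c * (2.0 * cauchy_form(diff, gamma) - 1.0))


if __name__ == "__main__":
    # Compare the plain and sharpened forms over a range of differences.
    for diff in (0.0, 0.5, 1.0, 2.0):
        print(diff, round(cauchy_form(diff), 3), round(cauchy_ccs(diff), 3))
```
      </preformat>
      <p>With these settings the sharpened form stays close to 1 for small differences and then falls away steeply around the point where the plain Cauchy form crosses 0.5, which is the cliff-like behaviour described above.</p>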
      <p>The NN variants vary layer depth and size, but all use a common input layer of size $d_{\mathrm{in}} = 2 d_z$.
NN2 is a three-layer neural network with hidden layer size   and NN3 is a four-layer neural
network which is the same as NN, but in contrast has a   pre-final layer size, thereby omitting
the bottleneck dimension reduction of NN. NN1-shallow includes only one hidden layer, like
NN2, of size $\frac{d_{\mathrm{in}}(d_{\mathrm{in}} - 1)}{2}$, which enables a pairwise comparison between each input dimension.
NN1-sig is the same as NN but employs sigmoid activations instead of ReLUs. NN-DC is
again the same as the NN from the main text, but includes an additional $f_{\dagger}$-type node that can
compute relative differences between inputs in the same way as DC.</p>
      <p>decoders for ordinality relation decoding. We attempt to predict the overall BlockStacks stack height on
the final fixed embeddings obtained after isSuccessor relation-decoder alignment.</p>
    </sec>
    <sec id="sec-9">
      <title>E. How does each model impact the retention of domain-dependent information</title>
      <p>on fixed encodings of each block stack, after isSuccessor relation-decoder alignment has been
applied. Note that $\beta$ is fixed in the target domain, so the only moving parts are the pretrained models,
which are trained with varied source $\beta$ values. Note also that DC has an unfair advantage here,
as the steered fitting approach allows more flexibility in the VAE learning phase; for this
reason the result is only included in the appendix. Since we are interested in capturing general
representations that encode both domain-dependent and -independent information, we use
each target encoder obtained from each PRT experiment to produce encodings for the full
BlockStacks test set. The resulting encodings are then divided into new train and test subsets,
which are used to train both a scikit-learn linear regressor and a Support Vector Machine regressor with
an RBF kernel [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ]. We present the resulting Mean Squared Errors (MSE) in Figure 7, with
Ordinary Least Squares (OLS) (a) and Support Vector Regression (SVR) (b).
      </p>
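      <p>The probing procedure above can be sketched with scikit-learn as follows. This is a hedged illustration rather than the authors' script: the helper name probe_mse, the 25% test split, the default SVR hyperparameters and the synthetic stand-in data are all assumptions, since the text does not specify them.</p>
      <preformat>
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR


def probe_mse(encodings, heights, seed=0):
    """Fit OLS and RBF-kernel SVR on encodings; return held-out MSE for each."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        encodings, heights, test_size=0.25, random_state=seed
    )
    results = {}
    for name, model in (("OLS", LinearRegression()), ("SVR", SVR(kernel="rbf"))):
        model.fit(x_tr, y_tr)
        results[name] = mean_squared_error(y_te, model.predict(x_te))
    return results


if __name__ == "__main__":
    # Synthetic stand-in for encoder outputs: the target depends
    # non-linearly on the first latent dimension.
    rng = np.random.default_rng(0)
    z = rng.normal(size=(400, 4))
    height = np.abs(z[:, 0]) + 0.05 * rng.normal(size=400)
    print(probe_mse(z, height))
```
      </preformat>
      <p>Comparing the two errors on the same split gives a rough indication of whether the factor of interest is encoded linearly in the latent space.</p>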
      <p>There are a number of noteworthy details. First, DC shows no dependence on $\beta$ and leads
to a lower MSE across all settings. Second, excluding DC, all models reach their optimum
MSE at $\beta = 8$, with TransR reaching DC's MSE performance for OLS and NN doing the same for
SVR. Lower MSE is obtained with non-linear regression,
which indicates that, to some degree, the block stack height factor is not encoded linearly,
regardless of the selected model. Next, by contrasting with Figure 3-bottom, these results suggest
that models with higher GC lead to embeddings that are more amenable to domain-specific
factor prediction. However, the parabolic trend, where increasing $\beta$ to 12 leads to an increase
in error, is in agreement with Figure 2-bottom-right, which showed that most models do not
improve at PRT for the largest $\beta$.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I.</given-names>
            <surname>Higgins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Matthey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Burgess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Glorot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Botvinick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerchner</surname>
          </string-name>
          ,
          <article-title>beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework</article-title>
          , in: 5th International Conference on Learning Representations, ICLR, Toulon, France,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I.</given-names>
            <surname>Higgins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pfau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Racaniere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Matthey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rezende</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerchner</surname>
          </string-name>
          ,
          <article-title>Towards a Definition of Disentangled Representations</article-title>
          , arXiv preprint arXiv:1812.02230 (
          <year>2018</year>
          ). URL: http://arxiv.org/abs/1812.02230
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <article-title>Reasoning With Neural Tensor Networks for Knowledge Base Completion</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>926</fpage>
          -
          <lpage>934</lpage>
          .
          arXiv:1301.3618v2
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Trouillon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Welbl</surname>
          </string-name>
          , S. Riedel, É. Gaussier, G. Bouchard,
          <article-title>Complex Embeddings for Simple Link Prediction</article-title>
          ,
          <source>in: Proceedings of the 33rd International Conference on Machine Learning, ICML</source>
          , New York, NY, USA,
          <year>2016</year>
          , pp.
          <fpage>2071</fpage>
          -
          <lpage>2080</lpage>
          .
          arXiv:1606.06357
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Trouillon</surname>
          </string-name>
          , É. Gaussier,
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Dance</surname>
          </string-name>
          , G. Bouchard,
          <article-title>On inductive abilities of latent factor models for relational learning</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>64</volume>
          (
          <year>2019</year>
          )
          <fpage>21</fpage>
          -
          <lpage>53</lpage>
          .
          doi:10.1613/jair.1.11305. arXiv:1709.05666.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usunier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garcia-Duran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Yakhnenko</surname>
          </string-name>
          ,
          <article-title>Translating Embeddings for Modeling Multi-relational Data</article-title>
          , in: C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, K. Q. Weinberger (Eds.),
          <source>Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems</source>
          , Curran Associates, Inc., Lake Tahoe, USA,
          <year>2013</year>
          , pp.
          <fpage>2787</fpage>
          -
          <lpage>2795</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nickel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Murphy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tresp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gabrilovich</surname>
          </string-name>
          ,
          <article-title>A review of relational machine learning for knowledge graphs</article-title>
          ,
          <source>Proceedings of the IEEE</source>
          <volume>104</volume>
          (
          <year>2016</year>
          )
          <fpage>11</fpage>
          -
          <lpage>33</lpage>
          .
          doi:10.1109/JPROC.2015.2483592. arXiv:1503.00759
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Knowledge graph embedding: A survey of approaches and applications</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>29</volume>
          (
          <year>2017</year>
          )
          <fpage>2724</fpage>
          -
          <lpage>2743</lpage>
          .
          doi:10.1109/TKDE.2017.2754499.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. N.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>A Survey on Knowledge Graph Embedding: Approaches, Applications and Benchmarks</article-title>
          ,
          <source>Electronics</source>
          <volume>9</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>29</lpage>
          .
          doi:10.3390/electronics9050750.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Kazemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Poole</surname>
          </string-name>
          ,
          <article-title>Simple embedding for link prediction in knowledge graphs</article-title>
          ,
          <source>Advances in Neural Information Processing Systems 2018-December</source>
          (
          <year>2018</year>
          )
          <fpage>4284</fpage>
          -
          <lpage>4295</lpage>
          .
          arXiv:1802.04868
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Abboud</surname>
          </string-name>
          , İ. İ. Ceylan,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lukasiewicz</surname>
          </string-name>
          , T. Salvatori,
          <article-title>BoxE: A box embedding model for knowledge base completion</article-title>
          , in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.),
          <source>Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020</source>
          , December 6-12, 2020, virtual,
          <year>2020</year>
          . URL: https://proceedings.neurips.cc/paper/2020/hash/6dbbe6abe5f14af882ff977fc3f35501-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schlichtkrull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. N.</given-names>
            <surname>Kipf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bloem</surname>
          </string-name>
          , R. van den Berg, I. Titov,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <article-title>Modeling Relational Data with Graph Convolutional Networks</article-title>
          ,
          <source>Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          <volume>10843 LNCS</volume>
          (
          <year>2018</year>
          )
          <fpage>593</fpage>
          -
          <lpage>607</lpage>
          . doi:10.1007/978-3-319-93417-4_38. arXiv:1703.06103
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vincent</surname>
          </string-name>
          ,
          <article-title>Representation learning: A review and new perspectives</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>35</volume>
          (
          <year>2013</year>
          )
          <fpage>1798</fpage>
          -
          <lpage>1828</lpage>
          .
          doi:10.1109/TPAMI.2013.50. arXiv:1206.5538
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Houthooft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <article-title>InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets</article-title>
          , in: D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016</source>
          , December 5-10, 2016, Barcelona, Spain,
          <year>2016</year>
          , pp.
          <fpage>2172</fpage>
          -
          <lpage>2180</lpage>
          . URL: https://proceedings.neurips.cc/paper/2016/hash/7c9d0b1f96aebd7b5eca8c3edaa19ebb-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C. P.</given-names>
            <surname>Burgess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Higgins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Matthey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Watters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Desjardins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerchner</surname>
          </string-name>
          ,
          <article-title>Understanding disentangling in $\beta$-VAE</article-title>
          , in:
          <source>Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          , NIPS, Long Beach, CA, USA,
          <year>2017</year>
          . URL: http://arxiv.org/abs/1804.03599. arXiv:1804.03599
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R. T. Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Grosse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Duvenaud</surname>
          </string-name>
          ,
          <article-title>Isolating Sources of Disentanglement in Variational Autoencoders</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems</source>
          , Montreal, Quebec, Canada,
          <year>2018</year>
          , pp.
          <fpage>2615</fpage>
          -
          <lpage>2625</lpage>
          .
          arXiv:1802.04942
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ridgeway</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Mozer</surname>
          </string-name>
          ,
          <article-title>Learning Deep Disentangled Embeddings With the F-Statistic Loss</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems</source>
          , Montreal, Quebec, Canada,
          <year>2018</year>
          , pp.
          <fpage>185</fpage>
          -
          <lpage>194</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C.</given-names>
            <surname>Eastwood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. K. I.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <article-title>A framework for the quantitative evaluation of disentangled representations</article-title>
          ,
          <source>in: 6th International Conference on Learning Representations, ICLR</source>
          , Vancouver, BC, Canada,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sattigeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Balakrishnan</surname>
          </string-name>
          ,
          <article-title>Variational inference of disentangled latent concepts from unlabeled observations</article-title>
          ,
          <source>in: 6th International Conference on Learning Representations, ICLR</source>
          , Vancouver, BC, Canada,
          <year>2018</year>
          .
          arXiv:1711.00848
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>F.</given-names>
            <surname>Locatello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lucic</surname>
          </string-name>
          , G. Rätsch, S. Gelly, B. Schölkopf,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bachem</surname>
          </string-name>
          ,
          <article-title>Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations</article-title>
          ,
          <source>in: Proceedings of the 36th International Conference on Machine Learning, ICML</source>
          , Long Beach, California, USA,
          <year>2019</year>
          , pp.
          <fpage>4114</fpage>
          -
          <lpage>4124</lpage>
          .
          arXiv:1811.12359v4
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>F.</given-names>
            <surname>Locatello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Poole</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rätsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bachem</surname>
          </string-name>
          , M. Tschannen,
          <article-title>Weakly-Supervised Disentanglement Without Compromises</article-title>
          , CoRR abs/2002.02886 (
          <year>2020</year>
          ). arXiv:2002.02886
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>X.</given-names>
            <surname>Steenbrugge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Leroux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Verbelen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dhoedt</surname>
          </string-name>
          ,
          <article-title>Improving Generalization for Abstract Reasoning Tasks Using Disentangled Feature Representations</article-title>
          ,
          <source>in: Neural Information Processing Systems (NeurIPS) Workshop on Relational Representation Learning</source>
          , Montreal, Canada,
          <year>2018</year>
          . URL: http://arxiv.org/abs/1811.04784.
          arXiv:<pub-id pub-id-type="arxiv">1811.04784v1</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>van Steenkiste</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Locatello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bachem</surname>
          </string-name>
          ,
          <article-title>Are Disentangled Representations Helpful for Abstract Visual Reasoning?</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems</source>
          , Vancouver, BC, Canada,
          <year>2019</year>
          , pp.
          <fpage>14222</fpage>
          -
          <lpage>14235</lpage>
          .
          arXiv:<pub-id pub-id-type="arxiv">1905.12506</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>T.</given-names>
            <surname>Karaletsos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rätsch</surname>
          </string-name>
          ,
          <article-title>When crowds hold privileges: Bayesian unsupervised representation learning with oracle constraints</article-title>
          ,
          <source>in: 4th International Conference on Learning Representations, ICLR</source>
          , San Juan, Puerto Rico,
          <year>2016</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
          arXiv:<pub-id pub-id-type="arxiv">1506.05011</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Batmanghelich</surname>
          </string-name>
          ,
          <article-title>Weakly Supervised Disentanglement by Pairwise Similarities</article-title>
          ,
          <source>in: Proceedings of the 34th AAAI Conference on Artificial Intelligence</source>
          , AAAI, New York, NY, USA,
          <year>2020</year>
          .
          arXiv:<pub-id pub-id-type="arxiv">1906.01044</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Batmanghelich</surname>
          </string-name>
          ,
          <article-title>Robust ordinal VAE: employing noisy pairwise comparisons for disentanglement</article-title>
          , CoRR abs/1910.05898 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1910.05898.
          arXiv:<pub-id pub-id-type="arxiv">1910.05898</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>I.</given-names>
            <surname>Redko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Habrard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Morvant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sebban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bennani</surname>
          </string-name>
          ,
          <source>Advances in Domain Adaptation Theory</source>
          ,
          <publisher-name>Elsevier</publisher-name>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <article-title>Auto-Encoding Variational Bayes</article-title>
          ,
          <source>in: Proceedings of the 2nd International Conference on Learning Representations</source>
          , Banff, Alberta, Canada,
          <year>2014</year>
          .
          doi:<pub-id pub-id-type="doi">10.1051/0004-6361/201527329</pub-id>
          .
          arXiv:<pub-id pub-id-type="arxiv">1312.6114</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Learning entity and relation embeddings for knowledge graph completion</article-title>
          , in:
          <string-name>
            <given-names>B.</given-names>
            <surname>Bonet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Koenig</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence</source>
          , January 25-30, 2015, Austin, Texas, USA, AAAI Press,
          <year>2015</year>
          , pp.
          <fpage>2181</fpage>
          -
          <lpage>2187</lpage>
          . URL: http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9571.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nickel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rosasco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Poggio</surname>
          </string-name>
          ,
          <article-title>Holographic embeddings of knowledge graphs</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Wellman</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence</source>
          , February 12-17, 2016, Phoenix, Arizona, USA, AAAI Press,
          <year>2016</year>
          , pp.
          <fpage>1955</fpage>
          -
          <lpage>1961</lpage>
          . URL: http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12484.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M.</given-names>
            <surname>Asai</surname>
          </string-name>
          ,
          <article-title>Photo-Realistic Blocksworld Dataset</article-title>
          , arXiv preprint arXiv:<pub-id pub-id-type="arxiv">1812.01818</pub-id> (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>I.</given-names>
            <surname>Donadello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Serafini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>d'Avila Garcez</surname>
          </string-name>
          ,
          <article-title>Logic Tensor Networks for Semantic Image Interpretation</article-title>
          ,
          <source>in: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1596</fpage>
          -
          <lpage>1602</lpage>
          .
          arXiv:<pub-id pub-id-type="arxiv">1705.08968</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>L.</given-names>
            <surname>Serafini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Garcez</surname>
          </string-name>
          ,
          <article-title>Logic tensor networks: Deep learning and logical reasoning from data and knowledge</article-title>
          ,
          <source>in: Proceedings of the 11th International Workshop on Neural-Symbolic Learning and Reasoning (NeSy'16) co-located with the Joint Multi-Conference on Human-Level Artificial Intelligence (HLAI 2016)</source>
          , New York, NY, USA,
          <year>2016</year>
          .
          arXiv:<pub-id pub-id-type="arxiv">1606.04422</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          ,
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>