<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Coherent and Consistent Relational Transfer Learning with Auto-encoders</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Harald</forename><surname>Strömfelt</surname></persName>
							<email>h.stromfelt17@imperial.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="institution">Imperial College London</orgName>
								<address>
									<addrLine>Exhibition Rd, South Kensington</addrLine>
									<postCode>SW7 2BX</postCode>
									<settlement>London</settlement>
									<country key="GB">UK</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="department">City</orgName>
								<orgName type="institution">University of London</orgName>
								<address>
									<addrLine>Northampton Square</addrLine>
									<postCode>EC1V 0HB</postCode>
									<settlement>London</settlement>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Luke</forename><surname>Dickens</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">University College London</orgName>
								<address>
									<addrLine>Gower St</addrLine>
									<postCode>WC1E 6BT</postCode>
									<settlement>London</settlement>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Artur</forename><surname>D'avila Garcez</surname></persName>
							<affiliation key="aff2">
								<orgName type="department">City</orgName>
								<orgName type="institution">University of London</orgName>
								<address>
									<addrLine>Northampton Square</addrLine>
									<postCode>EC1V 0HB</postCode>
									<settlement>London</settlement>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Alessandra</forename><surname>Russo</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Imperial College London</orgName>
								<address>
									<addrLine>Exhibition Rd, South Kensington</addrLine>
									<postCode>SW7 2BX</postCode>
									<settlement>London</settlement>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Coherent and Consistent Relational Transfer Learning with Auto-encoders</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">701C6A5157CBC6F1F8469CA39BF9EE7B</idno>
					<note type="submission">Submitted to the 15th International Workshop On Neural-Symbolic Learning and Reasoning (NeSy &apos;21)</note>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T21:31+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Representation Learning</term>
					<term>Relation Learning</term>
					<term>Variational AutoEncoders</term>
					<term>Concept Learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Human-defined concepts are inherently transferable, but it is not clear under what conditions they can be modelled effectively by non-symbolic artificial learners. This paper argues that for a transferable concept to be learned, the system of relations that define it must be coherent across domains and properties. That is, they should be consistent with respect to relational constraints, and this consistency must extend beyond the representations encountered in the source domain. Further, where relations are modelled by differentiable functions, their gradients must conform: the functions must at times move together to preserve consistency. We propose a Partial Relation Transfer (PRT) task which exposes how well relation-decoders model these properties, and exemplify this with an ordinality prediction transfer task, including a new data set for the transfer domain. We evaluate this on existing relation-decoder models, as well as a novel model designed around the principles of consistency and gradient conformity. Results show that consistency across broad regions of input space indicates good transfer performance, and that good gradient conformity facilitates consistency.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In many situations, concepts that pertain to one set of data can also be relevant to another <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>. Take, for instance, the general concept of ordinality, whose semantics are defined by the relations isSuccessor, isPredecessor, isGreater, isLess and isEqual, together with their constraints. Successfully capturing this concept involves learning the corresponding relations such that they maintain data set and property independence, with no retraining. This is to say that they have been abstracted from the specific property and act instead as a generic set of characterizing relations for the semantics of ordinality. For this, we argue that the relations must be consistent with their expected constraints and coherent across ordinal properties spanning different data sets, which means their consistency is maintained regardless of data set or particular ordinal property.</p><p>As a concrete example, suppose that we have successfully learned to order images of numbers by their abstract digit identity, and are presented with a new data set containing images of individual stacks of blocks. Suppose then that we wish to obtain an ordering over them, such that we can compare arbitrary data instances using the above relations. Provided that the learned relations are consistent with their expected constraints, it should be possible to obtain an encoding that establishes each successor, via our isSuccessor relation, and immediately be able to compare data instances over the remaining relations. Following this logic, the primary purpose of this paper is to evaluate under which conditions a relation-decoder model is able to obtain the ordinality concept. We do this by taking a set of popular relation-decoder models, including a proposed Dynamic Comparator (DC) model, and assess 1. 
their consistencies as measured in the source data set, and 2. their ability to perform a Partial Relation Transfer (PRT) task to a novel target data set, which measures the robustness of their consistencies across domains. The evaluation takes place in two steps. In the first, we learn the above set of ordinality relations by ordering MNIST images based on their abstract digit identity and report each model's consistency profile. In the next step, we take the now pretrained isSuccessor relation-decoder and apply it to a proposed BlockStacks data set, which consists of images of multicolored block stacks. Each stack contains a single red block at various heights, which we use to test the degree to which ordering the encodings of each block stack image, subject to the pretrained isSuccessor relation, leads to transferred prediction accuracy across the remaining relations. In summary, the contributions of our work are:</p><p>• We devise an experimental setup that can expose the degree to which learning relations leads to concept abstraction, together with a new BlockStacks data set that presents a challenging ordering task based on a complex property. • We introduce a set of data-set-agnostic characteristic measures for relation-decoders which can help determine their ability to perform PRT. • We present a Dynamic Comparator model that achieves excellent PRT.</p><p>• Finally, we present a comprehensive analysis of model characteristics against corresponding PRT performance, for a set of popular relation-decoders.</p><p>The rest of the paper is organised as follows. Section 2 firstly positions our paper with respect to related work. Section 3 formalises the PRT task and outlines the architecture we employ to solve it, including the proposed DC relation-decoder model. We then define how we compute model consistency and gradient-conformity in Section 4. Finally, we provide results and analysis in Section 5, with concluding remarks in Section 6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Relational representations play a prominent role in Knowledge Graph Embedding (KGE), wherein sets of relation-decoders are jointly learned, through triplet link prediction, in order to obtain a semantic latent factor representation for entities <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11]</ref>. In principle, any KGE link prediction model can be employed in this work, but we focus on those that assume a Euclidean representation space and do not require any additional per-triplet engineering. Although KGE methods typically do not use a shared auto-encoder as we do in this paper, Schlichtkrull et al. <ref type="bibr" target="#b11">[12]</ref> did adopt an auto-encoding framework, where a graph neural network is used as the encoder; however, they did not work with visual data, and the model was not applied to transfer. Disentanglement, which also aims to learn semantic representations for data, is of relevance to this work <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b1">2]</ref>, wherein multiple methods have been proposed, for example using Generative Adversarial Networks <ref type="bibr" target="#b13">[14]</ref> and VAEs <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b0">1,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b19">20]</ref>. 
Of particular relevance to our work are investigations looking at the transferability of disentangled representations <ref type="bibr" target="#b20">[21,</ref><ref type="bibr" target="#b21">22,</ref><ref type="bibr" target="#b22">23]</ref>, but these did not include relation learning. A bridge between relation learning and disentanglement, wherein relation-decoders are employed as semi-supervision for VAEs, can be found in <ref type="bibr" target="#b23">[24,</ref><ref type="bibr" target="#b24">25,</ref><ref type="bibr" target="#b25">26]</ref>. Lastly, we note that our experimental setup is most reminiscent of domain adaptation <ref type="bibr" target="#b26">[27]</ref>. To the best of our knowledge, no work has compared relation-decoders in their ability to abstract concepts, as measured by their consistency and its transfer across domains.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">The Partial Relation Transfer Task and Model</head><p>Partial Relation Transfer (PRT) is at its core a domain adaptation task <ref type="bibr" target="#b26">[27]</ref>, wherein we have a source and target data domain, consisting of a set of images, 𝑋 𝑠 and 𝑋 𝑡 , respectively, and a set of shared relation prediction tasks, ℛ = {𝑟 1 , … , 𝑟 𝑛 }. We approximate each relation using a relation-decoder</p><formula xml:id="formula_0">𝜙 𝑀 𝑟 ∶ 𝑍 × 𝑍 → [0, 1],</formula><p>where 𝑍 denotes a latent space that contains all image encodings 𝑧 𝑖 ∈ 𝑍. The superscript 𝑀 denotes a specific relation-decoder model, as we test multiple variants. To obtain embeddings we use a domain-specific auto-encoder, consisting of an encoder 𝜓 𝑠/𝑡 𝑒𝑛𝑐 ∶ 𝑋 → 𝑍 and decoder 𝜓 𝑠/𝑡 𝑑𝑒𝑐 ∶ 𝑍 → 𝑋, which helps to minimise information loss through reconstruction of the input image <ref type="foot" target="#foot_0">1</ref> .</p><p>The evaluation takes place as a two-step procedure. In the first, all relation-decoders are trained in the source domain, as semi-supervision to the auto-encoder, using available labels, 𝑦 𝑠 ∈ ℝ |ℛ|×|𝑋 𝑠 |×|𝑋 𝑠 | , that specify whether a relation 𝑟 ∈ ℛ holds between images 𝑥 𝑖 , 𝑥 𝑗 ∈ 𝑋 𝑠 . Here, | ⋅ | denotes the cardinality of the operand set, but in practice we only use a small fraction of the available labels. In the second evaluation step, we initialise a new auto-encoder to be applied to the target data set and use a subset of the pretrained relation-decoders, with labels 𝑦 𝑡 ∈ ℝ |ℛ|×|𝑋 𝑡 |×|𝑋 𝑡 | , to act as fixed-parameter 'guides' for the encoder.</p><p>To obtain informative data encodings, we use a Variational AutoEncoder (VAE), specifically the 𝛽-VAE, given its simplicity and demonstrated ability to separate distinct factors in the latent representation <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b27">28]</ref>. 
The 𝛽-VAE achieves this by optimising the ELBO objective, which for the purposes of this paper we express as a loss over both encoder and decoder:</p><formula xml:id="formula_1">ℒ 𝐸𝐿𝐵𝑂 𝛽-VAE = ℒ (𝜓 𝑠/𝑡 𝑒𝑛𝑐 , 𝜓 𝑠/𝑡 𝑑𝑒𝑐 ) + 𝛽ℒ (𝜓 𝑠/𝑡 𝑒𝑛𝑐 , 𝒩 (0, 𝟙)),<label>(1)</label></formula><p>where an additional 𝛽 scalar hyperparameter is used to influence disentanglement through stronger distribution matching pressure to an isotropic zero-mean Gaussian prior, 𝒩 (0, 𝟙).</p><p>When 𝛽 = 1 we obtain the original VAE objective <ref type="bibr" target="#b27">[28]</ref>. We provide the full ELBO loss, with a detailed explanation, in Appendix B. Each experiment involves taking embeddings from a corresponding encoder and passing them through to sets of relation-decoders (either the full set in the source domain, or only a subset in the target domain). We can treat each relation-decoder as producing a prediction ŷ 𝑟𝑖𝑗 for whether relation 𝑟 holds between data instances 𝑖 and 𝑗 <ref type="bibr" target="#b4">[5]</ref>. Using the ground truth 𝑦 𝑟𝑖𝑗 , we can then compute the loss over all relation-decoders, ℒ 𝑅𝑒𝑙𝐷𝑒𝑐 , as the binary cross-entropy of prediction versus ground truth. This gives us the final joint objective between VAE and relation-decoders:</p><formula xml:id="formula_2">ℒ 𝑗𝑜𝑖𝑛𝑡 = ℒ 𝐸𝐿𝐵𝑂 𝛽-VAE − 𝜆 𝔼 𝑟,𝑦 𝑟𝑖𝑗 ,𝑧 𝑖 ,𝑧 𝑗 [𝑦 𝑟𝑖𝑗 log( ŷ 𝑟𝑖𝑗 ) + (1 − 𝑦 𝑟𝑖𝑗 ) log(1 − ŷ 𝑟𝑖𝑗 )] ⏟ ℒ 𝑅𝑒𝑙𝐷𝑒𝑐 , (<label>2</label></formula><formula xml:id="formula_3">)</formula><p>where 𝜆 is a scalar weighting parameter.</p></div>
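The joint objective of Eq. 2 can be sketched as follows; the function names and NumPy formulation are ours, and the (negative) expectation of the log-likelihood is written as the equivalent positive cross-entropy term:

```python
import numpy as np

def relation_bce(y_true, y_pred, eps=1e-7):
    """L_RelDec: binary cross-entropy over relation-decoder predictions,
    averaged over sampled (relation, pair) combinations."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def joint_loss(elbo_loss, y_true, y_pred, lam=1.0):
    """Joint objective of Eq. 2: the beta-VAE ELBO loss plus the
    lambda-weighted relation-decoder loss. Note the sign: subtracting the
    expected log-likelihood equals adding the positive cross-entropy."""
    return elbo_loss + lam * relation_bce(y_true, y_pred)
```

Minimising this jointly shapes the latent space so that embeddings both reconstruct well and satisfy the supervised relations.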
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Dynamic Comparator</head><p>In our analysis, we include a proposed low-complexity, but nonetheless expressive, "Dynamic Comparator" (DC) model, which is designed to model systems of relations, whilst encouraging desirable properties for PRT. The DC model is composed of two modes: a distance-based measure, 𝜙 † 𝑟 , that can compute how close the vector difference between two inputs is to a positive or negative valued reference vector, and a step-like function, 𝜙 ‡ 𝑟 , that determines the sign of the difference between two points, optionally with an offset. The overall DC model is given by<ref type="foot" target="#foot_1">2</ref> :</p><formula xml:id="formula_4">𝜙 𝐷𝐶 𝑟 (𝑧 𝑖 , 𝑧 𝑗 ) = 𝑎 0 ⋅ 𝜎 0 (𝜂 0 (‖𝑢 ⊙ (𝑧 𝑖 − 𝑧 𝑗 + 𝑏 † )‖ 2 2 )) ⏟ 𝜙 † 𝑟 + 𝑎 1 ⋅ 𝜎 1 (𝜂 1 ⋅ 𝑢 ⊤ (𝑧 𝑖 − 𝑧 𝑗 + 𝑏 ‡ )) ⏟ 𝜙 ‡ 𝑟 . (<label>3</label></formula><formula xml:id="formula_5">)</formula><p>where 𝑎 = Softmax(𝐴) ∈ ℝ 2 is an attention weighting between the two modes, and ensures that 𝜙 𝐷𝐶 is bound to [0,1]. 𝜎 0 , 𝜎 1 are an exponential and sigmoid function, respectively; 𝑢 = Softmax(𝑈) ∈ ℝ 𝑚 is an attention mask which is applied to 𝑚-dimensional latent embeddings; 𝑏 † , 𝑏 ‡ ∈ ℝ 𝑚 are learnable bias terms that enable an offset for each mode; and 𝜂 0 ∈ ℝ + and 𝜂 1 ∈ ℝ are non-negative and any-valued scalar terms, respectively. Lastly, ⊙ denotes the Hadamard product (elementwise multiplication) and ‖ ⋅ ‖ 2 is the 𝐿2-norm. Due to a convergence issue when using a pretrained DC with fixed parameters, we needed to use a flexible fitting procedure in which we enable the DC parameters to train in the target domain, but with the additional loss term ‖𝜌 * − 𝜌‖ between pretrained parameters 𝜌 * and untrained parameters 𝜌, respectively. In all cases we evaluated the final parameter values in the target domain and found them to be approximately equivalent to 𝜌 * . 
We did not apply this method to the other models as they were all able to fit the isSuccessor relation in the target domain.</p></div>
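A minimal sketch of the DC decoder of Eq. 3. We assume 𝜎₀(𝑥) = exp(−𝑥), so that the distance mode is bounded to (0, 1] and the attention-weighted sum stays in [0, 1]; all parameter names below are ours:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dc_decoder(z_i, z_j, A, U, b_dist, b_step, eta0, eta1):
    """Dynamic Comparator (Eq. 3), assuming sigma_0(x) = exp(-x).
    A in R^2 and U in R^m are unnormalised attention logits; eta0 >= 0."""
    a = softmax(A)   # attention over the two modes
    u = softmax(U)   # attention mask over the m latent dimensions
    diff = z_i - z_j
    # distance mode: how close the masked difference is to the offset -b_dist
    phi_dist = np.exp(-eta0 * np.sum((u * (diff + b_dist)) ** 2))
    # step mode: sign of the masked difference, with optional offset
    phi_step = sigmoid(eta1 * u.dot(diff + b_step))
    return a[0] * phi_dist + a[1] * phi_step  # convex combination in [0, 1]
```

Because both modes share the mask 𝑢, two DC relations with matching masks have parallel (or anti-parallel) gradients, which is the design property discussed in Section 5.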
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Measuring relation-decoder characteristics</head><p>In this section we describe a series of measures that we use to understand more about the intrinsic characteristics of each relation-decoder model, which together help identify its behaviour and provide insight regarding its PRT performance.</p><p>For any system of relations, we can write down a truth-table that defines the valid truth-states that they may collectively take, which we expect our relation-decoders to model. For example, we know that any time isGreater is true, isLess must not be. By assuming that each relation-decoder output is pairwise conditionally independent given 𝑧 𝑖 , 𝑧 𝑗 , for instance,</p><formula xml:id="formula_6">𝑝(isGreater, isLess|𝑧 𝑖 , 𝑧 𝑗 ) = 𝑝(isGreater|𝑧 𝑖 , 𝑧 𝑗 )𝑝(isLess|𝑧 𝑖 , 𝑧 𝑗 ),</formula><p>we can produce a probability statement for whether the relations are consistent with valid entries of the truth-table. Taking 𝑟 1 = isGreater and 𝑟 2 = isLess as our entire system of relations, we can produce the following truth-table conversion, where invalid entries are omitted:</p><formula xml:id="formula_7">𝑟 1 (𝑥 𝑖 , 𝑥 𝑗 ) 𝑟 2 (𝑥 𝑖 , 𝑥 𝑗 ) ℱ (𝑟 1 , 𝑟 2 ) 𝑇 𝐹 𝑇 𝐹 𝑇 𝑇 𝐹 𝐹 𝑇 ⟹ ℱ (𝑟 1 , 𝑟 2 ) = ∀𝑥 𝑖 , 𝑥 𝑗 ((𝑟 1 (𝑥 𝑖 , 𝑥 𝑗 ) ∧ ¬𝑟 2 (𝑥 𝑖 , 𝑥 𝑗 )) ∨(¬𝑟 1 (𝑥 𝑖 , 𝑥 𝑗 ) ∧ 𝑟 2 (𝑥 𝑖 , 𝑥 𝑗 )) ∨(¬𝑟 1 (𝑥 𝑖 , 𝑥 𝑗 ) ∧ ¬𝑟 2 (𝑥 𝑖 , 𝑥 𝑗 )))<label>(4)</label></formula><p>which, using our relation-decoders for each relation and with 𝑧 𝑖,𝑗 = 𝜓 𝑒𝑛𝑐 (𝑥 𝑖,𝑗 ) and ¬𝜙 𝑟 (𝑧 𝑖 , 𝑧 𝑗 ) = 1 − 𝜙 𝑟 (𝑧 𝑖 , 𝑧 𝑗 ), we express the probability of ℱ being true as:</p><formula xml:id="formula_8">𝑝(ℱ |𝑧 𝑖 , 𝑧 𝑗 ) = ((𝜙 𝑟 1 (𝑧 𝑖 , 𝑧 𝑗 ) ⋅ (1 − 𝜙 𝑟 2 (𝑧 𝑖 , 𝑧 𝑗 ))) + ((1 − 𝜙 𝑟 1 (𝑧 𝑖 , 𝑧 𝑗 )) ⋅ 𝜙 𝑟 2 (𝑧 𝑖 , 𝑧 𝑗 )) + ((1 − 𝜙 𝑟 1 (𝑧 𝑖 , 𝑧 𝑗 )) ⋅ (1 − 𝜙 𝑟 2 (𝑧 𝑖 , 𝑧 𝑗 ))).<label>(5)</label></formula><p>Finally, since ℱ should hold for all input combinations, we heavily penalise violations by using a binary cross-entropy loss 
between ℱ and the expected outcome:</p><formula xml:id="formula_9">𝐻 𝑇 𝑟𝑢𝑒 (𝑝(ℱ )) = − 1 𝑁 ∑ 𝑧 𝑖 ,𝑧 𝑗 ∈𝑍 1 ⋅ log 𝑝(ℱ |𝑧 𝑖 , 𝑧 𝑗 ),<label>(6)</label></formula><p>where 𝑍 is the latent space, as we can compute this score for any samples from this space <ref type="foot" target="#foot_2">3</ref> and 𝑁 is a normalising constant, equal to the number of 𝑧 𝑖 and 𝑧 𝑗 sample pairs used in the calculation. We refer to this measure as Con-A, since we use it to measure consistency across multiple relations.</p><p>To provide a deeper understanding about how relation-decoders collectively interact with their inputs, we use a gradient evaluation to see whether models respond similarly to changes in their input. For a set of relations, we define the gradient-conformity (GC) of relation 𝑟 𝑖 against all others by the following cosine-similarity:</p><formula xml:id="formula_10">𝐺𝐶 = | 𝑑 𝑖 ⊤ 𝑑 𝑗 / (‖𝑑 𝑖 ‖ 2 ‖𝑑 𝑗 ‖ 2 ) | where 𝑑 𝑖 = 𝑑𝜙 𝑟 𝑖 /𝑑𝑧 𝑐 | 𝑧 𝑐 =𝑧 𝑐 𝑠 and 𝑑 𝑗 = 𝑑𝜙 𝑟 𝑗 /𝑑𝑧 𝑐 | 𝑧 𝑐 =𝑧 𝑐 𝑠 , ∀𝑖 ≠ 𝑗<label>(7)</label></formula><p>where | ⋅ | denotes the absolute value of the operand and 𝑧 𝑐 is the concatenation of each relation-decoder's inputs, with gradients evaluated at reference inputs 𝑧 𝑐 𝑠 . GC will be 1 if gradients are aligned and zero if orthogonal <ref type="foot" target="#foot_3">4</ref> .</p></div>
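For the two-relation isGreater/isLess example (valid states (T,F), (F,T) and (F,F)), Con-A and GC can be sketched as follows; the function names and NumPy formulation are ours:

```python
import numpy as np

def con_a(phi1, phi2, eps=1e-7):
    """Con-A (Eq. 6) for the two-relation system: the cross-entropy between
    p(F|z_i, z_j) and True, averaged over N sampled pairs. phi1, phi2 are
    arrays of decoder outputs evaluated on the same (z_i, z_j) pairs."""
    p_F = (phi1 * (1 - phi2)            # (T, F)
           + (1 - phi1) * phi2          # (F, T)
           + (1 - phi1) * (1 - phi2))   # (F, F); equivalently 1 - phi1*phi2
    return -np.mean(np.log(np.clip(p_F, eps, 1.0)))

def gradient_conformity(d_i, d_j, eps=1e-12):
    """GC (Eq. 7): absolute cosine similarity between the gradients of two
    relation-decoders w.r.t. their concatenated inputs z_c."""
    return abs(d_i.dot(d_j)) / (np.linalg.norm(d_i) * np.linalg.norm(d_j) + eps)
```

A perfectly consistent pair of decoders (never both firing) gives Con-A of 0, while jointly confident contradictions are penalised heavily; GC is 1 for parallel or anti-parallel gradients and 0 for orthogonal ones.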
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results</head><p>This section presents results for the PRT task on a range of relation-decoder models. In the source domain, we learn a system of binary relations: ℛ ={isSuccessor (S), isPredecessor (P), isGreater (G), isEqual (E), isLess (L)}, on digits represented in MNIST images, alongside a 𝛽-VAE. In the target domain, we take the pretrained S relation as a fixed-parameter guide for a new 𝛽-VAE applied to BlockStacks images (see Appendix A for BlockStacks image examples), and then evaluate PRT accuracy on the held-out G, E, L and P relations. Relation-decoder models compared here are: TransR <ref type="bibr" target="#b28">[29]</ref>, HolE <ref type="bibr" target="#b29">[30]</ref>, NTN <ref type="bibr" target="#b2">[3]</ref>, our proposed DC and a basic neural-network baseline, NN. NN is a simple four-layer (𝑙 in , 𝑙 1 , 𝑙 2 , 𝑙 out ) neural network with layer sizes 𝑙 in = 2𝑑 𝑧 , 𝑙 1 = 2𝑑 𝑧 and 𝑙 2 = 𝑑 𝑧 , with ReLU activations. The final output layer 𝑙 out is a single value passed through a sigmoid function, to bound the output to [0,1]. Further model details are provided in Appendix C.</p><p>We vary 𝛽 only in the source domain, ranging across values {1, 4, 8, 12}, but fix it in the target domain. 𝜆 is fixed in both domains (see Appendix C.3 for further details on hyperparameter settings). For Con-A and GC measures, we produce encodings for three data splits: data-embeddings, where all inputs are encodings of a domain's test data; interpolation, where we obtain an empirical mean and variance for the domain's data-embeddings and sample from a corresponding Gaussian distribution; and extrapolation, where we sample from regions strictly outside the data-embeddings region.</p><p>Figure <ref type="figure" target="#fig_2">2</ref>-top provides relation-decoder prediction accuracy in both the source MNIST (left) and target BlockStacks (right) domains. 
Key observations are that DC produces excellent PRT performance, whilst NN, NTN and HolE all see some degradation from their source accuracies. TransR seems to maintain a similar accuracy profile. We include 𝛽's impact on these performances in Figure <ref type="figure" target="#fig_2">2</ref>-bottom. In the target domain, PRT performance is significantly impacted by 𝛽 in all models, but 𝛽 has little effect in the source domain. Additionally, TransR has a strong positive correlation with 𝛽, whereas NN, NTN and HolE produce the best PRT performance with intermediate disentanglement pressure. To interrogate further how 𝛽 affects each model, we provide: (Figure <ref type="figure" target="#fig_3">3</ref>-top) mean relation Con-A referenced to both source (left) and target (right) domain embeddings; and (Figure <ref type="figure" target="#fig_3">3</ref>-bottom) source domain referenced GC measures for each model. In the left and bottom plots, blue (left group), green (middle group) and red (right group) show results for the data-embeddings, interpolation and extrapolation regions of latent space, respectively. From the source domain Con-A results, we note that DC shows excellent consistency across relations in all regions. Most other models have worse interpolation and extrapolation consistency. Increasing 𝛽 appears to give some improvement for all but HolE, but there are indications that this trend does not persist at the largest value, 𝛽 = 12. Interestingly, Con-A values for target data-embeddings (right) are notably worse than for source data-embeddings, with values closer to those for interpolation or extrapolation in the source domain. For GC, DC performance is close to 1 for all 𝛽 with no discernible change. All other models show a weaker GC with positive correlation between GC and 𝛽. TransR and NN achieve significantly higher GC than NTN and HolE.</p></div>
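The NN baseline described above can be sketched as follows; the initialisation scale and function names are ours, and a pair of embeddings is concatenated before the forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_nn_baseline(d_z):
    """Weights for the four-layer NN baseline: input size 2*d_z (a pair of
    d_z-dimensional embeddings), hidden sizes 2*d_z and d_z, sigmoid output."""
    sizes = [2 * d_z, 2 * d_z, d_z, 1]
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def nn_baseline(params, z_i, z_j):
    """Relation prediction in [0, 1] for the embedding pair (z_i, z_j)."""
    h = np.concatenate([z_i, z_j])
    for W, b in params[:-1]:
        h = relu(h @ W + b)        # ReLU hidden layers
    W, b = params[-1]
    return sigmoid(h @ W + b)[0]   # single sigmoid output unit
```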
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Key Experimental Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.1.">Does good source task accuracy lead to successful PRT?</head><p>Since we transfer pretrained models from source to target domain and ensure that the target encoder, 𝜓 𝑡 𝑒𝑛𝑐 , fits its encodings to S, we might expect relation-decoding performance to be the same in both domains. However, PRT performance varies significantly across models, despite DC, NN, NTN and HolE all performing close to 100% accuracy, and TransR above 80%, across all relations in the source domain, and despite all models achieving similar prediction accuracy (or better, in the case of DC) on the guide relation S. It is firstly evident that DC is successful at PRT, sustaining approximately 100% accuracy across all held-out relations. NN achieves mostly good performance, with greater degradation across the P and E relations. Although HolE and NTN both achieve good PRT for P, there is increasing degradation across the E and G, L relations. TransR is able to achieve strong relative performance, where PRT accuracy per relation is comparable to what was possible in the source domain. These results indicate that source accuracy alone is not enough to determine whether models will be successful at PRT.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.2.">How does 𝛽 affect Con-A and GC and how does this impact model coherence?</head><p>To provide an overview of how increased disentanglement pressure affects each model, we can firstly compare how 𝛽 affects model performance in both source and target domain. Figure <ref type="figure" target="#fig_2">2</ref>-bottom demonstrates that, although relation prediction accuracies for most models either do not respond, or respond negatively, to increases in 𝛽 in the source domain, PRT behaviour differs significantly across models: DC shows no discernible change, whilst NN, NTN and HolE all show a parabolic response with a maximum PRT around 𝛽 = 8; TransR shows a general positive correlation but with diminishing returns above 𝛽 = 8. To gain further insight into the role of disentanglement pressure, it is necessary to look at how each model's intrinsic behaviour responds to 𝛽 changes.</p><p>First, we attempt to expose the relationship between 𝛽 and consistency and whether this has any effect on PRT performance. As Figure <ref type="figure" target="#fig_3">3</ref>-top shows, DC clearly outperforms all other models on Con-A, and this coincides with better PRT performance. The next best performing model on Con-A in the source domain is also the next best on PRT performance. In most cases Con-A degrades for all models when moving from data-embeddings to interpolation and extrapolation, but the degree of degradation changes depending on the model. Interestingly, across all models, their target Con-A is notably close to that of interpolation or extrapolation in the source domain analysis. This suggests that guiding 𝜓 𝑡 𝑒𝑛𝑐 to fit relation S produces data embeddings that lie in the interpolation or extrapolation regions with respect to MNIST embeddings. In turn, a relation-decoder model's ability to retain consistency over regions of latent space beyond where MNIST embeddings are found leads to improved PRT. 
These findings provide compelling evidence in support of our claim that consistency across relations is important for PRT performance.</p><p>Secondly, we examine how gradient-conformity affects PRT performance. To achieve successful PRT, fitting the target encoder to a single pretrained relation should lead to embeddings that are structured correctly with respect to the other pretrained relations. For this to be possible, there must be a degree of conformity between how each model computes its system of relations. As an extreme case, suppose we have a two-dimensional latent representation, with two relations that are each calculated using entirely different dimensions of latent space. By fitting an encoder to one of these relations, there is no guarantee that the latent dimension required by the other relation receives the necessary guidance. DC shows excellent and stable GC values (near 1) across all conditions. This is by design, as the use of per-relation masks ensures that if masks match for any two relations, then their gradients will be either parallel or anti-parallel. Excluding HolE, all remaining models show a positive correlation between GC and 𝛽, and it appears that models with either higher GC values, or 𝛽 response, typically perform better at PRT. Together, this provides tentative evidence that GC is important to model coherence, as measured by PRT performance. The absence of a monotonic benefit of GC on PRT may be because Con-A shows no further interpolation or extrapolation gains for 𝛽 &gt; 8.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>We provide a comprehensive analysis of relation-decoder characteristics when learning the system of relations that together define the semantics of a concept. We then compare these characteristics with a Partial Relation Transfer task setting, which determines whether, given logical constraints between relations, fitting embeddings to one relation-decoder leads to embeddings that satisfy all other relations in terms of their logical consistency and accuracy.</p><p>Our results demonstrate that model consistency, and possibly gradient-conformity, across different regions of input space together determine whether a set of relation-decoders have learned a consistent and coherent notion of a given concept, in this case ordinality. These measures make it possible to check whether a set of relation-decoders have indeed learned a transferable concept, or if they are limited to a single data domain and property.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. BlockStacks dataset description</head><p>The BlockStacks dataset consists of 12,000 images (200×200 pixels but resized in code to 128×128) of individual block stacks, of varying height (between 1 and 10 blocks), block colors (uniformly sampled from options: {gray, blue, green, brown, purple, cyan, yellow}) and position (uniformly sampled from 𝑥, 𝑦 range (-3,-3) to (3,3)), but with the requirement that each instance consists of a single red block at a random height (see Figure <ref type="figure" target="#fig_4">4</ref> for example images). These were rendered using the CLEVR rendering agent with the help of code from <ref type="bibr" target="#b30">[31]</ref>. The dataset is divided into 9000:1500:1500 train, validation and test splits.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Explanation of the 𝛽-VAE</head><p>The VAE is derived by introducing an approximate posterior 𝑞 𝛼 (𝑍|𝑋), from which a lower bound (commonly referred to as the Evidence Lower BOund (ELBO)) on the true marginal log 𝑝 𝜃 (𝑋) can be obtained by using Jensen's inequality <ref type="bibr" target="#b27">[28]</ref>. The VAE maximises the log-probability by maximising this lower bound, given by:</p><formula xml:id="formula_11">ℒ 𝐸𝐿𝐵𝑂 𝛽-VAE = 𝔼 𝑞 𝛼 (𝑍|𝑋) [log 𝑝 𝜃 (𝑋|𝑍)] − 𝛽𝐷 𝐾𝐿 (𝑞 𝛼 (𝑍|𝑋)‖𝑝 𝜃 (𝑍)),<label>(8)</label></formula><p>where 𝑞 𝛼 (𝑍|𝑋) is the approximate posterior, typically modelled as a neural network encoder with parameters 𝛼. Similarly, 𝑝 𝜃 (𝑋|𝑍) is modelled as a decoder with parameters 𝜃 and is calculated as a Monte Carlo estimation. A reparameterization trick is used to enable differentiation through an otherwise non-differentiable sampling from 𝑞 𝛼 (𝑍|𝑋) (see <ref type="bibr" target="#b27">[28]</ref>). In the 𝛽-VAE <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b14">15]</ref>, an additional 𝛽 scalar hyperparameter was added, as it was found to influence disentanglement through stronger distribution matching pressure with respect to the prior 𝑝 𝜃 (𝑍), where this prior is typically set to an isotropic zero-mean Gaussian 𝒩 (0, 𝟙). When 𝛽 = 1 we obtain the standard VAE objective <ref type="bibr" target="#b27">[28]</ref>.</p></div>
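Under the usual diagonal-Gaussian posterior and isotropic zero-mean Gaussian prior, the KL term of Eq. 8 has a closed form, giving the following sketch of the (negative) 𝛽-VAE objective as a loss to minimise; function names are ours:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Closed-form D_KL(N(mu, diag(sigma^2)) || N(0, I)) for a diagonal
    Gaussian posterior, summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def beta_vae_loss(recon_nll, mu, logvar, beta=1.0):
    """Negative ELBO of Eq. 8: reconstruction negative log-likelihood plus
    the beta-weighted KL to the prior; beta = 1 recovers the standard VAE."""
    return recon_nll + beta * kl_to_standard_normal(mu, logvar)
```

The KL term vanishes exactly when the posterior matches the prior (mu = 0, logvar = 0), which is the distribution-matching pressure that 𝛽 amplifies.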
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Model Descriptions</head><p>In this section we provide model details for each relation-decoder that we use and the VAE architecture that we employ for each data set.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.1. Relation Decoder implementations</head><p>TransR:</p><formula xml:id="formula_12">𝜙 TransR 𝑟 (𝑧 𝑖 , 𝑧 𝑗 ) = ‖ℎ 𝑟 + 𝑟 − 𝑡 𝑟 ‖ 2</formula><p>with ℎ 𝑟 = 𝑀 𝑟 𝑧 𝑖 and 𝑡 𝑟 = 𝑀 𝑟 𝑧 𝑗 . As we want to obtain a [0,1] output, we modify TransR to 𝜙 TransR+ 𝑟 = 𝜎 (𝑐 − 𝜙 TransR 𝑟 ), where 𝜎 is the sigmoid function and 𝑐 is a scalar which ensures that when 𝜙 TransR 𝑟 (𝑧 𝑖 , 𝑧 𝑗 ) = 0, then 𝜙 TransR+ 𝑟 (𝑧 𝑖 , 𝑧 𝑗 ) ≈ 1. In all experiments we set 𝑐 = 10.</p><p>NTN (modified version from <ref type="bibr" target="#b31">[32,</ref><ref type="bibr" target="#b32">33]</ref>):</p><formula xml:id="formula_13">𝜙 𝑟 (𝑧 0 , … , 𝑧 𝑛 ) = 𝜎(𝑢 ⊤ 𝑟 [tanh(𝑧 𝑐⊤ 𝑀 𝑟 𝑧 𝑐 + 𝑉 𝑟 𝑧 𝑐 + 𝑏 𝑟 )])<label>(9)</label></formula><p>where 𝑢 𝑟 ∈ ℝ 𝑘 , 𝑀 𝑟 ∈ ℝ (𝑛−1)⋅𝑑 𝑧 ×(𝑛−1)⋅𝑑 𝑧 ×𝑘 , 𝑉 𝑟 ∈ ℝ 𝑘×(𝑛−1)⋅𝑑 𝑧 and 𝑏 𝑟 ∈ ℝ 𝑘 . The only hyperparameter to consider is 𝑘, which controls the NTN's capacity; in all experiments we set 𝑘 = 1. Here 𝑧 𝑐 is the concatenation of the inputs 𝑧 0 , … , 𝑧 𝑛 , which was introduced in <ref type="bibr" target="#b31">[32,</ref><ref type="bibr" target="#b32">33]</ref>. In contrast, the original NTN (see <ref type="bibr" target="#b2">[3]</ref>) is only applicable to binary relations and does not include the outer sigmoid.</p><p>HolE:</p><formula xml:id="formula_14">𝜙 HolE 𝑟 (𝑧 𝑖 , 𝑧 𝑗 ) = 𝜎 (𝑟 ⊤ (𝑧 𝑖 ⋆ 𝑧 𝑗 ))</formula><p>where ⋆ ∶ ℝ 𝑑 × ℝ 𝑑 → ℝ 𝑑 denotes the circular correlation operator, given by</p><formula xml:id="formula_15">[𝑧 𝑖 ⋆ 𝑧 𝑗 ] 𝑘 = 𝑑−1 ∑ 𝑚=0 𝑧 𝑖,𝑚 𝑧 𝑗,(𝑘+𝑚) mod 𝑑</formula></div>
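The circular correlation in the HolE score can be computed directly from its definition. Below is a minimal numpy sketch (the function names are ours); for large 𝑑 the direct sum is usually replaced by the equivalent FFT formulation 𝑧 𝑖 ⋆ 𝑧 𝑗 = ℱ⁻¹(ℱ(𝑧 𝑖 )* ⊙ ℱ(𝑧 𝑗 )).

```python
import numpy as np

def circular_correlation(a, b):
    """[a * b]_k = sum_m a_m * b_{(k+m) mod d}, computed directly."""
    d = len(a)
    return np.array([sum(a[m] * b[(k + m) % d] for m in range(d))
                     for k in range(d)])

def hole_score(r, z_i, z_j):
    """sigmoid(r^T (z_i * z_j)): the sigmoid-wrapped HolE score."""
    s = r @ circular_correlation(z_i, z_j)
    return 1.0 / (1.0 + np.exp(-s))
```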
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.2. VAE configuration</head><p>In all representation learning experiments, we use a 𝛽-VAE trained for 300,000 steps, following accepted practice from <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b21">22]</ref>.</p><p>The encoder-decoder model parameters are given in Table <ref type="table" target="#tab_1">1</ref>; we include the model configurations used for both the MNIST and BlockStacks datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.3. ℒ 𝑗𝑜𝑖𝑛𝑡 configuration</head><p>In the source domain, we vary 𝛽 over the values {1, 4, 8, 12} and fix 𝜆 = 10 3 . In the target domain, we fix 𝛽 = 10 −4 and 𝜆 = 10 −2 , normalise the ℒ 𝐸𝐿𝐵𝑂 𝛽-VAE reconstruction term by a factor 1/√(𝐻 ⋅ 𝑊 ⋅ 𝐶), for height 𝐻, width 𝑊 and color channels 𝐶, and normalise ℒ (𝜓 𝑡 𝑒𝑛𝑐 , 𝒩 (0, 𝟙)) by a factor 1/𝑑 𝑧 , for latent representation size 𝑑 𝑧 .</p></div>
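The two normalisation factors above are simple functions of the image shape and latent size; the helper below only illustrates these scalings (its name is ours, and the full composition of ℒ 𝑗𝑜𝑖𝑛𝑡 is defined in the main text).

```python
import math

def normalization_factors(H, W, C, d_z):
    """Target-domain scaling factors (illustrative): the reconstruction
    term is divided by sqrt(H*W*C) and the encoder-prior matching term
    by the latent size d_z."""
    return 1.0 / math.sqrt(H * W * C), 1.0 / d_z
```

For the 128×128×3 BlockStacks inputs with a 10-dimensional latent, this gives 1/√49152 and 1/10 respectively.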
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Supplementary Results</head><p>Figure <ref type="figure" target="#fig_5">5</ref> and Figure <ref type="figure" target="#fig_6">6</ref> provide additional results for Con-I (individual consistency scores for individual relation properties, covering transitivity, asymmetry and reflexivity) and Con-A, configured on the same data splits as described in the main text. These results cover variants of the DC and NN models.</p><p>For the DC-CCS variant, 𝜎 is the sigmoid function and 𝜂 is a scalar value. This modification enables a cliff-like shape for 𝜙 † 𝑟 , such that it can output close to 1 for a wider vector-difference range. Note that all distribution forms are unnormalized so that they cover the interval [0,1].</p><p>The NN variants vary layer depth and size, but all use a common input layer of size 𝑙 in = 2𝑑 𝑧 . NN2 is a three-layer neural network with hidden layer size 𝑑 𝑧 , and NN3 is a four-layer neural network which is the same as NN but has a pre-final layer of size 𝑑 𝑧 , thereby omitting the 2𝑑 𝑧 -wide hidden layer which enables a pairwise comparison between each input dimension. NN1-sig is the same as NN but employs sigmoid activations instead of ReLUs. NN-DC is again the same as the NN from the main text, but includes an additional 𝜙 † 𝑟 -type node that can compute relative differences between inputs in the same way as DC. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. How does each model impact the retention of domain-dependent information?</head><p>Figure <ref type="figure" target="#fig_7">7</ref> shows results for BlockStacks overall block height prediction accuracy when training on fixed encodings of each block stack, after isSuccessor relation-decoder alignment has been applied. Note that 𝛽 is fixed in the target domain, so the only moving parts are the pretrained models, which are trained with varied source 𝛽 values. Note also that DC has an unfair advantage here, as the steered fitting approach allows more flexibility in the VAE learning phase; for this reason the result is only included in the appendix. Since we are interested in capturing general representations that encode both domain-dependent and -independent information, we use each target encoder 𝜓 𝑡 𝑒𝑛𝑐 obtained from each PRT experiment and produce encodings for the full BlockStacks test set. The resulting encodings are then divided into new train and test subsets, used to train both a scikit-learn linear regressor and a Support Vector Machine regressor with an RBF kernel <ref type="bibr" target="#b33">[34]</ref>. We present the resulting Mean Squared Errors (MSE) in Figure <ref type="figure" target="#fig_7">7</ref>, with Ordinary Least Squares (OLS) (a) and Support Vector Regression (SVR) (b).</p><p>There are a number of noteworthy details: firstly, DC shows no dependence on 𝛽 and leads to a lower MSE across all settings; secondly, excluding DC, for all models we observe an optimum MSE at 𝛽 = 8, with TransR reaching DC MSE performance for OLS and NN doing the same for SVR. These results indicate that lower MSE can be obtained by using non-linear regression, which suggests that, to some degree, the block stack height factor is not encoded linearly, regardless of the selected model. 
Next, by contrasting with Figure <ref type="figure" target="#fig_3">3</ref>-bottom, these results suggest that models with higher GC lead to embeddings that are more amenable to domain-specific factor prediction. However, the parabolic trend, where increasing 𝛽 to 12 leads to an increase in error, is in agreement with Figure <ref type="figure" target="#fig_2">2</ref>-bottom-right, which showed that most models do not improve at PRT for the largest 𝛽.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Depiction of the architecture we use for PRT. In this diagram, we show how the initial relation learning is performed on the source MNIST dataset. Moving to the target domain involves using 𝜓 𝑡 𝑒𝑛𝑐/𝑑𝑒𝑐</figDesc><graphic coords="4,89.29,84.20,416.56,177.04" type="bitmap" /></figure>
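The probing protocol above can be sketched for the linear case with a numpy-only OLS probe (the paper uses scikit-learn's linear regressor and an RBF-kernel SVR; this sketch, with names of our choosing, covers only the linear probe):

```python
import numpy as np

def ols_height_probe(Z_train, h_train, Z_test, h_test):
    """Fit an OLS probe on fixed embeddings Z to predict stack height
    and return the held-out Mean Squared Error."""
    # Append a bias column and solve the least-squares problem.
    A = np.hstack([Z_train, np.ones((len(Z_train), 1))])
    w, *_ = np.linalg.lstsq(A, h_train, rcond=None)
    A_test = np.hstack([Z_test, np.ones((len(Z_test), 1))])
    pred = A_test @ w
    return float(np.mean((pred - h_test) ** 2))
```

A low probe MSE indicates the height factor is (linearly) decodable from the frozen embeddings; the gap to a non-linear probe measures how non-linearly the factor is encoded.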
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>-bottom. Barring DC which has little discernible change in either</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: [Top] Relation-decoder prediction accuracy per relation and model, in the source (left) and target domains. Relations are abbreviated on the 𝑥-axis by { S: isSuccessor, P: isPredecessor, E: isEqual, G: isGreater, L: isLess }, with a red highlight identifying which relation is included as a guide for 𝜓 𝑡 𝑒𝑛𝑐 . [Bottom] 𝛽 impact profiles for each relation-decoder model, aggregated across all relations in the source (left) domain and aggregated only for held-out relations in the target (right) domain. In all cases, higher values are better.</figDesc><graphic coords="7,89.29,324.25,416.70,110.03" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: [Top] Con-A values for each relation-decoder model, referenced to source (left) and target (right) domains (lower values better).[Bottom] GC values for each relation decoder (higher values better). In all plots, darker color shades denote higher values of 𝛽, corresponding to greater disentanglement pressure from the 𝛽-VAE. In top-left and bottom plots, blue, green and red groups show results for data-embeddings, interpolation and extrapolation embeddings respectively (see main text for details).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Example of two BlockStacks data set images.</figDesc><graphic coords="13,89.29,84.19,416.71,201.31" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Consistency values for individual relation properties (Con-I), covering: transitivity, reflexivity and asymmetry. Values are for variants of DC and NN relation-decoder models, referenced to source (MNIST) domain (lower values better). In all plots, darker color shades denote higher values of 𝛽 (in range {1, 4, 8, 12}), corresponding to greater disentanglement pressure from the 𝛽-VAE. Blue, green and red groups show results for data-embeddings, interpolation and extrapolation embeddings respectively (see main text for details on these data splits).</figDesc><graphic coords="16,89.29,280.54,416.69,92.44" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Con-A values for variants of DC and NN relation-decoder models, referenced to source (MNIST) domain (lower values better). In all plots, darker color shades denote higher values of 𝛽 (in range {1, 4, 8, 12}), corresponding to greater disentanglement pressure from the 𝛽-VAE. Blue, green and red groups show results for data-embeddings, interpolation and extrapolation embeddings respectively (see main text for details on these data splits).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Analysis of domain-specific information retention by the 𝛽-VAE when using different relationdecoders for ordinality relation decoding. We attempt to predict the overall BlockStacks stack height on the final fixed embeddings obtained after isSuccessor relation-decoder alignment.</figDesc><graphic coords="17,89.29,84.19,416.68,78.64" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>NN: a simple four-layer neural network with hidden layer sizes 𝑙 in = 2𝑑 𝑧 , 𝑙 1 = 2𝑑 𝑧 and 𝑙 2 = 𝑑 𝑧 , with ReLU activations, for latent representations of size 𝑑 𝑧 . The final output layer, 𝑙 out , is a single value passed through a sigmoid function, to cap the output within [0,1].</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>Specification of our 𝛽-VAE encoder and decoder model parameters, for both 28×28 (top) and 128×128 (bottom) size input data. I: Input channels, O: Output channels, K: Kernel size, S: Stride, P: Padding, A: Activation</figDesc><table><row><cell>Encoder</cell><cell>Decoder</cell></row><row><cell>Input: 28 × 28 × 𝑁 𝐶 = 1</cell><cell>Input: ℝ 10</cell></row><row><cell>Layer_ID ; I ; O ; K ; S ; P ; A</cell><cell>Layer_ID ; Num Nodes : In -Out ; A</cell></row><row><cell>Conv2d_1 ; 𝑁 𝐶 ; 32 ; 4 × 4 ; 2 ; 1 ; ReLU</cell><cell>FC_z ; 10 -144 ; ReLU</cell></row><row><cell>Conv2d_2 ; 32 ; 32 ; 4 × 4 ; 2 ; 1 ; ReLU</cell><cell>FC_z_mu ; 144 -576 ; ReLU</cell></row><row><cell>Conv2d_3 ; 32 ; 64 ; 3 × 3 ; 2 ; 1 ; ReLU</cell><cell>Layer_ID ; I ; O ; K ; S ; P ; A</cell></row><row><cell>Conv2d_4 ; 64 ; 64 ; 2 × 2 ; 2 ; 1 ; ReLU</cell><cell>UpConv2d_1 ; 64 ; 64 ; 2 × 2 ; 2 ; 1 ; ReLU</cell></row><row><cell>Layer_ID ; Num Nodes : In -Out ; A</cell><cell>UpConv2d_2 ; 64 ; 32 ; 3 × 3 ; 2 ; 1 ; ReLU</cell></row><row><cell>FC_z ; 576 -144 ; ReLU</cell><cell>UpConv2d_3 ; 32 ; 32 ; 4 × 4 ; 2 ; 1 ; ReLU</cell></row><row><cell>FC_z_mu ; 144 -10 ; None</cell><cell>UpConv2d_4 ; 32 ; 𝑁 𝐶 ; 4 × 4 ; 2 ; 1 ; Sigmoid</cell></row><row><cell>FC_z_logvar ; 144 -10 ; None</cell><cell></cell></row><row><cell>Encoder</cell><cell>Decoder</cell></row><row><cell>Input: 128 × 128 × 𝑁 𝐶 = 3</cell><cell>Input: ℝ 10</cell></row><row><cell>Layer_ID ; I ; O ; K ; S ; P ; A</cell><cell>Layer_ID ; Num Nodes : In -Out ; A</cell></row><row><cell>Conv2d_1 ; 𝑁 𝐶 ; 32 ; 4 × 4 ; 2 ; 1 ; ReLU</cell><cell>FC_z ; 10 -256 ; ReLU</cell></row><row><cell>Conv2d_2 ; 32 ; 32 ; 4 × 4 ; 2 ; 1 ; ReLU</cell><cell>FC_z_mu ; 256 -1024 ; ReLU</cell></row><row><cell>Conv2d_3 ; 32 ; 64 ; 4 × 4 ; 2 ; 1 ; ReLU</cell><cell>Layer_ID ; I ; O ; K ; S ; P ; A</cell></row><row><cell>Conv2d_4 ; 32 ; 64 ; 4 × 4 ; 2 ; 1 ; ReLU</cell><cell>UpConv2d_1 ; 64 ; 64 ; 4 × 4 ; 2 ; 1 ; ReLU</cell></row><row><cell>Conv2d_5 ; 64 ; 64 ; 4 × 4 ; 2 ; 1 ; ReLU</cell><cell>UpConv2d_2 ; 64 ; 32 ; 4 × 4 ; 2 ; 1 ; ReLU</cell></row><row><cell>Layer_ID ; Num Nodes : In -Out ; A</cell><cell>UpConv2d_3 ; 32 ; 32 ; 4 × 4 ; 2 ; 1 ; ReLU</cell></row><row><cell>FC_z ; 1024 -256 ; ReLU</cell><cell>UpConv2d_4 ; 32 ; 32 ; 4 × 4 ; 2 ; 1 ; ReLU</cell></row><row><cell>FC_z_mu ; 256 -10 ; None</cell><cell>UpConv2d_5 ; 32 ; 𝑁 𝐶 ; 4 × 4 ; 2 ; 1 ; Sigmoid</cell></row><row><cell>FC_z_logvar ; 256 -10 ; None</cell><cell></cell></row></table><note>DC variants include: DC-Basic uses the same 𝜙 ‡ 𝑟 as DC, but a 𝜙 † 𝑟 similar to that of [25], including the dynamic 𝑢 mask and 𝑏 offset; DC-Gaus again uses the same 𝜙 † 𝑟 but a Gaussian function for 𝜙 ‡ 𝑟 ; DC-Cauchy uses a Cauchy distribution form for 𝜙 † 𝑟 and a Cauchy cumulative distribution function for 𝜙 ‡ 𝑟 ; and finally DC-CCS employs a modified Cauchy distribution for 𝜙 † 𝑟 , via 𝜎 (𝜂(2 ⋅ 𝜙 DC-Cauchy,† 𝑟 − 1)).</note></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Further analysis of the performance of BlockStacks embeddings on domain-dependent tasks can be found in Appendix E.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">In the main text we report results for this DC model, but we can use any function that has the required characteristics for 𝜙 † and 𝜙 ‡ . We include results for other versions in Appendix D.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">In practice, as we cannot include every encoding combination, we provide an estimate.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">We can evaluate against this measure for arbitrary samples from 𝑍.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework</title>
		<author>
			<persName><forename type="first">I</forename><surname>Higgins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Matthey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Burgess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Glorot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Botvinick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mohamed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lerchner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">5th International Conference on Learning Representations</title>
				<meeting><address><addrLine>{ICLR}, Toulon, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><surname>Higgins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Pfau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Racaniere</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Matthey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Rezende</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lerchner</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1812.02230</idno>
		<ptr target="http://arxiv.org/abs/1812.02230" />
		<title level="m">Towards a Definition of Disentangled Representations</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Reasoning With Neural Tensor Networks for Knowledge Base Completion</title>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="926" to="934" />
		</imprint>
	</monogr>
	<note>arXiv:1301.3618v2</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Complex Embeddings for Simple Link Prediction</title>
		<author>
			<persName><forename type="first">T</forename><surname>Trouillon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Welbl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Riedel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">É</forename><surname>Gaussier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Bouchard</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1606.06357</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 33nd International Conference on Machine Learning</title>
				<meeting>the 33nd International Conference on Machine Learning<address><addrLine>{ICML}, New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="2071" to="2080" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">On inductive abilities of latent factor models for relational learning</title>
		<author>
			<persName><forename type="first">T</forename><surname>Trouillon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">É</forename><surname>Gaussier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">R</forename><surname>Dance</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Bouchard</surname></persName>
		</author>
		<idno type="DOI">10.1613/jair.1.11305</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Artificial Intelligence Research</title>
		<imprint>
			<biblScope unit="volume">64</biblScope>
			<biblScope unit="page" from="21" to="53" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note>arXiv:1709.05666</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Translating Embeddings for Modeling Multi-relational Data</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bordes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Usunier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Garcia-Duran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Yakhnenko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">C</forename><forename type="middle">J C</forename><surname>Burges</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Welling</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Z</forename><surname>Ghahramani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</editor>
		<meeting><address><addrLine>Lake Tahoe, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="2787" to="2795" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A review of relational machine learning for knowledge graphs</title>
		<author>
			<persName><forename type="first">M</forename><surname>Nickel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Murphy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Tresp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Gabrilovich</surname></persName>
		</author>
		<idno type="DOI">10.1109/JPROC.2015.2483592</idno>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the IEEE</title>
		<imprint>
			<biblScope unit="volume">104</biblScope>
			<biblScope unit="page" from="11" to="33" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note>arXiv:1503.00759</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Knowledge graph embedding: A survey of approaches and applications</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Guo</surname></persName>
		</author>
		<idno type="DOI">10.1109/TKDE.2017.2754499</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page" from="2724" to="2743" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A Survey on Knowledge Graph Embedding: Approaches, Applications and Benchmarks</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">N</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Guo</surname></persName>
		</author>
		<idno type="DOI">10.3390/electronics9050750</idno>
	</analytic>
	<monogr>
		<title level="j">Electronics</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="1" to="29" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Simple embedding for link prediction in knowledge graphs</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Kazemi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Poole</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="page" from="4284" to="4295" />
			<date type="published" when="2018-12">2018</date>
		</imprint>
	</monogr>
	<note>arXiv:1802.04868</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Boxe: A box embedding model for knowledge base completion</title>
		<author>
			<persName><forename type="first">R</forename><surname>Abboud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">İ</forename><forename type="middle">İ</forename><surname>Ceylan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lukasiewicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Salvatori</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper/2020/hash/6dbbe6abe5f14af882ff977fc3f35501-Abstract.html" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Larochelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Ranzato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Hadsell</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Balcan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Lin</surname></persName>
		</editor>
		<meeting><address><addrLine>NeurIPS; virtual</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">December 6-12, 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Modeling Relational Data with Graph Convolutional Networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Schlichtkrull</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">N</forename><surname>Kipf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bloem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Van Den Berg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Titov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Welling</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-319-93417-4_38</idno>
		<idno type="arXiv">arXiv:1703.06103</idno>
	</analytic>
	<monogr>
		<title level="j">LNCS</title>
		<imprint>
			<biblScope unit="volume">10843</biblScope>
			<biblScope unit="page" from="593" to="607" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Representation learning: A review and new perspectives</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Courville</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Vincent</surname></persName>
		</author>
		<idno type="DOI">10.1109/TPAMI.2013.50</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="1798" to="1828" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
	<note>arXiv:1206.5538</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Infogan: Interpretable representation learning by information maximizing generative adversarial nets</title>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Duan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Houthooft</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schulman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Abbeel</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper/2016/hash/7c9d0b1f96aebd7b5eca8c3edaa19ebb-Abstract.html" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016</title>
				<editor>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Lee</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Sugiyama</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">U</forename><surname>Luxburg</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Guyon</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Garnett</surname></persName>
		</editor>
		<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">December 5-10, 2016. 2016</date>
			<biblScope unit="page" from="2172" to="2180" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Understanding disentangling in 𝛽-VAE</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">P</forename><surname>Burgess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Higgins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Matthey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Watters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Desjardins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lerchner</surname></persName>
		</author>
		<ptr target="http://arxiv.org/abs/1804.03599" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 30</title>
				<meeting><address><addrLine>NIPS, Long Beach, CA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Isolating Sources of Disentanglement in Variational Autoencoders</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">T Q</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">B</forename><surname>Grosse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Duvenaud</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems</title>
				<meeting><address><addrLine>Montreal, Quebec, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="2615" to="2625" />
		</imprint>
	</monogr>
	<note>arXiv:1802.04942</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Learning Deep Disentangled Embeddings With the F-Statistic Loss</title>
		<author>
			<persName><forename type="first">K</forename><surname>Ridgeway</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">C</forename><surname>Mozer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems</title>
				<meeting><address><addrLine>Montreal, Quebec, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="185" to="194" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">A framework for the quantitative evaluation of disentangled representations</title>
		<author>
			<persName><forename type="first">C</forename><surname>Eastwood</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">K I</forename><surname>Williams</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">6th International Conference on Learning Representations</title>
				<meeting><address><addrLine>ICLR, Vancouver, BC, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Variational inference of disentangled latent concepts from unlabeled observations</title>
		<author>
			<persName><forename type="first">A</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sattigeri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Balakrishnan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">6th International Conference on Learning Representations</title>
				<meeting><address><addrLine>ICLR, Vancouver, BC, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note>arXiv:1711.00848</note>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations</title>
		<author>
			<persName><forename type="first">F</forename><surname>Locatello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lucic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Rätsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Schölkopf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Bachem</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 36th International Conference on Machine Learning</title>
				<meeting>the 36th International Conference on Machine Learning<address><addrLine>ICML, Long Beach, California, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="4114" to="4124" />
		</imprint>
	</monogr>
	<note>arXiv:1811.12359v4</note>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">F</forename><surname>Locatello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Poole</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Rätsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Schölkopf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Bachem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tschannen</surname></persName>
		</author>
		<idno>CoRR abs/2002.0</idno>
		<title level="m">Weakly-Supervised Disentanglement Without Compromises</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Improving Generalization for Abstract Reasoning Tasks Using Disentangled Feature Representations</title>
		<author>
			<persName><forename type="first">X</forename><surname>Steenbrugge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Leroux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Verbelen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Dhoedt</surname></persName>
		</author>
		<ptr target="http://arxiv.org/abs/1811.04784" />
	</analytic>
	<monogr>
		<title level="m">Neural Information Processing Systems (NeurIPS) Workshop on Relational Representation Learning</title>
				<meeting><address><addrLine>Montreal, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Are Disentangled Representations Helpful for Abstract Visual Reasoning?</title>
		<author>
			<persName><forename type="first">S</forename><surname>Van Steenkiste</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Locatello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Bachem</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems</title>
				<meeting><address><addrLine>Vancouver, BC, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="14222" to="14235" />
		</imprint>
	</monogr>
	<note>arXiv:1905.12506</note>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">When crowds hold privileges: Bayesian unsupervised representation learning with oracle constraints</title>
		<author>
			<persName><forename type="first">T</forename><surname>Karaletsos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Belongie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Rätsch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">4th International Conference on Learning Representations</title>
				<meeting><address><addrLine>ICLR, San Juan, Puerto Rico</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="1" to="16" />
		</imprint>
	</monogr>
	<note>arXiv:1506.05011</note>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Weakly Supervised Disentanglement by Pairwise Similarities</title>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Batmanghelich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 34th AAAI Conference on Artificial Intelligence</title>
				<meeting>the 34th AAAI Conference on Artificial Intelligence<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>AAAI</publisher>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note>arXiv:1906.01044</note>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<title level="m" type="main">Robust ordinal VAE: employing noisy pairwise comparisons for disentanglement</title>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Batmanghelich</surname></persName>
		</author>
		<idno>CoRR abs/1910.05898</idno>
		<ptr target="http://arxiv.org/abs/1910.05898" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><surname>Redko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Habrard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Morvant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sebban</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bennani</surname></persName>
		</author>
		<title level="m">Advances in Domain Adaptation Theory</title>
				<imprint>
			<publisher>Elsevier</publisher>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Auto-Encoding Variational Bayes</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Welling</surname></persName>
		</author>
		<idno>arXiv:1312.6114</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2nd International Conference on Learning Representations</title>
				<meeting>the 2nd International Conference on Learning Representations<address><addrLine>Banff, Alberta, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Learning entity and relation embeddings for knowledge graph completion</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhu</surname></persName>
		</author>
		<ptr target="http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9571" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence</title>
				<editor>
			<persName><forename type="first">B</forename><surname>Bonet</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Koenig</surname></persName>
		</editor>
		<meeting>the Twenty-Ninth AAAI Conference on Artificial Intelligence<address><addrLine>Austin, Texas, USA</addrLine></address></meeting>
		<imprint>
			<publisher>AAAI Press</publisher>
			<date type="published" when="2015">January 25-30, 2015</date>
			<biblScope unit="page" from="2181" to="2187" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Holographic embeddings of knowledge graphs</title>
		<author>
			<persName><forename type="first">M</forename><surname>Nickel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Rosasco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">A</forename><surname>Poggio</surname></persName>
		</author>
		<ptr target="http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12484" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Schuurmans</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><forename type="middle">P</forename><surname>Wellman</surname></persName>
		</editor>
		<meeting>the Thirtieth AAAI Conference on Artificial Intelligence<address><addrLine>Phoenix, Arizona, USA</addrLine></address></meeting>
		<imprint>
			<publisher>AAAI Press</publisher>
			<date type="published" when="2016">February 12-17, 2016</date>
			<biblScope unit="page" from="1955" to="1961" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Asai</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1812.01818</idno>
		<title level="m">Photo-Realistic Blocksworld Dataset</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Logic Tensor Networks for Semantic Image Interpretation</title>
		<author>
			<persName><forename type="first">I</forename><surname>Donadello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Serafini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>d&apos;Avila Garcez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence</title>
				<meeting>the Twenty-Sixth International Joint Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1596" to="1602" />
		</imprint>
	</monogr>
	<note>arXiv:1705.08968</note>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Logic tensor networks: Deep learning and logical reasoning from data and knowledge</title>
		<author>
			<persName><forename type="first">L</forename><surname>Serafini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">D</forename><surname>Garcez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 11th International Workshop on Neural-Symbolic Learning and Reasoning (NeSy&apos;16) co-located with the Joint Multi-Conference on Human-Level Artificial Intelligence (HLAI 2016)</title>
				<meeting>the 11th International Workshop on Neural-Symbolic Learning and Reasoning (NeSy&apos;16) co-located with the Joint Multi-Conference on Human-Level Artificial Intelligence (HLAI 2016)<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note>arXiv:1606.04422</note>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Scikit-learn: Machine learning in Python</title>
		<author>
			<persName><forename type="first">F</forename><surname>Pedregosa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Varoquaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gramfort</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Michel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Thirion</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Grisel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Blondel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Prettenhofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Weiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Dubourg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Vanderplas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Passos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cournapeau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brucher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Perrot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Duchesnay</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="2825" to="2830" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
