<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Training Multimodal Systems for Classification with Multiple Objectives</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jason Armitage</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shramana Thakur</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rishi Tripathi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jens Lehmann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Maleshkova</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer IAIS</institution>
          ,
          <addr-line>Dresden</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Bonn</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We learn about the world from a diverse range of sensory information. Automated systems lack this ability as investigation has centred on processing information presented in a single form. Adapting architectures to learn from multiple modalities creates the potential to learn rich representations of the world - but current multimodal systems deliver only marginal improvements on unimodal approaches. Neural networks learn sampling noise during training with the result that performance on unseen data is degraded. This research introduces a second objective over the multimodal fusion process learned with variational inference. Regularisation methods are implemented in the inner training loop to control variance and the modular structure stabilises performance as additional neurons are added to layers. This framework is evaluated on a multilabel classification task with textual and visual inputs to demonstrate the potential for multiple objectives and probabilistic methods to lower variance and improve generalisation.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>Multimodal Data</kwd>
        <kwd>Probabilistic Methods</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Human experience of the world is rooted in our ability to process and integrate
information present in diverse perceptual modalities [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Multimodal approaches
to machine learning are motivated by this ability and aim to develop rich
representations that combine information from multiple sources [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Consider a
machine learning system that models events in the world by processing inputs from
online media sources. Representations of these events take the form of text,
images, video, and audio. Systems that are able to process signals from a range of
these inputs learn models that are more complete descriptions of the represented
events with resulting benefits for inference and predictions [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Researchers have
proposed related classifiers for performing event detection [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], source prediction
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and activity recognition [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        Multimodal machine learning presents a suite of methods for leveraging
diverse data - but the development of systems that generalise to unseen samples
leads to challenges arising both in practice and from the underlying theory of
machine learning. Limited data resources are the most pressing concern in the
first category. Data acquisition for multimodal systems is complicated by the
requirement for combinations of samples in each input modality [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In the
absence of large-scale data, neural networks learn sampling noise in the training
data and report low scores on unseen samples [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Additional modalities also
inflate parameter counts with the outcome that multimodal systems report high
accuracy during training and low accuracy at test time [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Additional hidden
layers can boost performance during training but introduce the requirement to
prevent the interaction of parameters across the model from slowing convergence
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Multimodal fusion combines representations from constituent modalities into
a single embedding. In recent years, the deployment of neural networks to
generate fused embeddings has resulted in state-of-the-art performance on
classification tasks on textual and visual samples. In theory, multimodal fusion methods
capture information present in the input representations and produce outputs
with complementary information. Comparison with unimodal classifiers
demonstrates that the introduction of additional modalities yields only modest
performance gains [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In addition to overfitting, limitations on available data are
acute when tasks require images or video.
      </p>
      <p>Main contributions. We propose and build a novel approach to
multimodal classification that introduces a second objective to learn fused embeddings
trained with variational inference. To our knowledge, this use of multiple
objectives - where one function is learned with a method from inverse probability - is
unique in the research on multimodal representation learning. We go on to show
that the range of methods for calibrating parameter updates developed within
the latter approach offsets the overfitting associated with multimodal fusion.
The benefits of these proposals are demonstrated empirically by adapting an
existing end-to-end architecture to perform multilabel classification on a dataset
of 25k samples of paired images and text. F-scores on classifying unseen samples
measure the contribution of introducing a second probabilistic
objective and related regularisation methods to multimodal classification tasks.</p>
      <p>Structure of the analysis. We start with an outline of the use case
identified for multimodal representation learning followed by a detailed specification
of our proposed framework and related methods. The evaluation section presents
topline results for the system and an ablation on regularisation methods in
variational inference. Section 4 highlights the existing research informing our work
and we conclude by summarising the main findings.</p>
    </sec>
    <sec id="sec-2">
      <title>Use Case</title>
      <p>
        Learning on multiple modalities presents opportunities to generate rich
representations for enhancing performance on existing tasks and enables new
applications [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This section introduces a form of classification task where samples
are presented both in natural language and images. Systems that learn on these
modalities are applied to a range of use cases related to archived and online
media [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">5,4,3</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>Multilabel Classification on Multiple Modalities</title>
        <p>
          Label prediction underlies approaches to information retrieval [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and classification [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] - and enables the downstream tasks of document retrieval [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], textual
and visual entailment [
          <xref ref-type="bibr" rid="ref14 ref15">14,15</xref>
          ], and fact validation [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Assigning samples to a
potential subset of multiple labels also characterises challenges in unimodal [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]
and multimodal [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] real-world applications.
        </p>
        <p>
          Multilabel classification on image and text inputs forms a benchmark task
in the research on bimodal learning [
          <xref ref-type="bibr" rid="ref18 ref19">18,19</xref>
          ] and is also used here to assess our
proposed approach to multimodal fusion. In creating the MM-IMDb dataset for
movie genre classification, the authors were addressing a shortfall in training data
to conduct multimodal classification [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. The task in MM-IMDb is an instance of
multilabel prediction over multiple modalities where titles have an average of 2.48
classes and the system undertakes a series of independent classifications. Metrics
are computed by comparing outputs Y with target labels from the set D. The
authors propose an architecture (referenced below as the GMU baseline) with
gates to control information learned from modalities to perform classification.
We build a version of this system in PyTorch and include results as a benchmark
(see Figure 2 and Table 1).
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Dataset</title>
        <p>The MM-IMDb dataset comprises samples for 25,959 movies assigned to one
or more of 23 genre classes. Inputs processed in predicting class labels for each
sample are a text summary averaging 92.5 words and an image poster. An
additional 50 metadata fields of structured text are excluded from the multilabel
prediction task to enable assessment of systems on natural language and images.</p>
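        <p>The split described below can be sketched on synthetic arrays with the shapes reported for the dataset (300-dimension text embeddings, 4096-dimension image embeddings, 23 genre labels); the array names and values here are illustrative stand-ins, not the released HDF5 keys:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000

# Synthetic stand-ins with the shapes described above
text_emb = rng.normal(size=(n_samples, 300)).astype(np.float32)    # word2vec
image_emb = rng.normal(size=(n_samples, 4096)).astype(np.float32)  # VGG16
labels = (rng.random((n_samples, 23)) < 0.1).astype(np.float32)    # one-hot genres

# 70% for training/cross-validation, 30% held out for evaluation
split = int(0.7 * n_samples)
perm = rng.permutation(n_samples)
train_idx, test_idx = perm[:split], perm[split:]
train = (text_emb[train_idx], image_emb[train_idx], labels[train_idx])
test = (text_emb[test_idx], image_emb[test_idx], labels[test_idx])
```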
        <p>
          Systems are evaluated on a processed version of the dataset available from
the authors' institution *. We extracted four columns from MM-IMDb: FileID,
Genres (in one-hot encoded format), VGG16 image embeddings, word2vec text
embeddings. Text embeddings are 300-dimension word2vec representations and
image embeddings are 4096-dimension representations of features extracted
by Arevalo et al. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. Image embeddings, text embeddings, and one-hot-encoded
true labels are stored as separate tensors ahead of training. Training and
crossvalidation are performed with 70% of the data and systems are evaluated on the
remaining 30% of samples.
* http://lisi1.unal.edu.co/mmimdb/multimodal_imdb.hdf5
High variance is a core challenge to system performance at test time in
multimodal classification tasks [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. Arevalo et al. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] note the improvements that
regularisation methods contribute to the architecture proposed for conducting
classification on the MM-IMDb dataset. As a starting point, we examine the
effects of excluding batchnorm [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and constraining weight updates to an upper
bound (max-norm) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] on system performance. Validation accuracy curves in
Figure 2 run to one epoch after the maximum weighted F1 score (see Section
4) observed for different versions of the baseline system. The negative impact
of variance is visible after only a few epochs when these methods are excluded
from the system.
We have examined the importance of regularisation in the use case above and
continue with our proposal to provide additional controls for calibrating updates
to a set of parameters θ. In this section, we introduce a classification framework
with multimodal fusion comprised of two modules trained with separate loss
functions and a single computation of gradients w.r.t. inputs. This framework is
the basis for our investigation into mitigating variance with the aim of improving
classification performance over multiple modalities and is referred to as PM+MO
below.
A multilabel classification framework with bimodal inputs is a function h(x)
that takes pairs of samples (x_t, x_i) = ((x_t1, x_i1), (x_t2, x_i2), ..., (x_tn, x_in)), where t and
i denote text and image representations respectively. The resulting multimodal
representations are mapped to a subset S ⊆ D of labels d_i and each output is
classified y = (y_1, y_2, ..., y_n) ∈ {0, 1}. In the proposed framework, the two
objectives in h(f_1(x), f_2(z)) are computed sequentially with a single step of gradient
computations. Module A learns the function f_1(x) for multimodal fusion with a
variational inference framework and ELBO as the objective. Module C conducts
multilabel classification f_2(z) on the outputs of A by optimising a standard loss
described directly below.
        </p>
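        <p>The sequential computation of the two objectives in h(f_1(x), f_2(z)) can be sketched as a toy forward pass. The weights, the Gaussian KL term standing in for the negative ELBO on module A, and the toy labels are all illustrative assumptions, not the paper's implementation:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x_t, x_i = rng.normal(size=8), rng.normal(size=8)

def f1(x_t, x_i, w=0.5):
    """Module A stand-in: blend the two modality vectors into one embedding."""
    return w * x_t + (1.0 - w) * x_i

def f2(z, w_c):
    """Module C stand-in: sigmoid scores over 3 labels from the embedding."""
    return np.clip(1.0 / (1.0 + np.exp(-(w_c @ z))), 1e-7, 1.0 - 1e-7)

# First objective: a KL term (Gaussian posterior vs unit prior) stands in
# for the variational loss on module A
z = f1(x_t, x_i)
mu, sigma = z.mean(), z.std()
loss_a = 0.5 * (mu**2 + sigma**2 - 1.0 - 2.0 * np.log(sigma))

# Second objective: binary cross-entropy on module C's outputs
w_c = rng.normal(size=(3, 8))
p = f2(z, w_c)
y = np.array([1.0, 0.0, 1.0])
loss_c = -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```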
        <p>
          Three wide hidden layers are the basis of the fused embedding module A.
As with Arevalo et al. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], input embeddings v_t and v_i are assigned to linear
functions and hyperbolic tangent tanh(u_i^l) activations where u is the hyperbolic
angle. In our proposal, weights and biases of hidden layers [w_ij^l, b_i^l] are random
variables drawn from a Laplace distribution θ_j^l ~ L(μ_j, σ_j) with mean μ_j and
variance σ_j optimised during training. Outputs from the unimodal embedding
layers are fused using concatenation (1) and mixing (2) operations:
        </p>
        <p>
          v_j^cat = [v_j^t,o ; v_j^i,o]   (1)
z_j = v_j^cat ⊙ v_j^i,o + (1 − v_j^cat) ⊙ v_j^t,o   (2)
Loss on A is computed with stochastic variational inference using the variants of
ELBO detailed in Section 3.2. Multimodal embeddings v_m form the inputs for the
classifier module C. We align with Arevalo et al. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] in implementing a multilayer
perceptron with maxout activation max_j(w_ij^l x + b_i^l) as proposed by Goodfellow et
al. [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. A wide hidden layer receives v_m and the max of parameters for v_[m,o] are
taken during activation. Binary cross-entropy combines sigmoid activation with
cross-entropy loss to assign a probability for each class to outputs (y_1, y_2, ..., y_n) ∈
{0, 1}, summing the results Σ_{c=1}^{M} y_j,c · log(p_j,c).
        </p>
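        <p>A minimal numpy sketch of the concatenation and mixing operations followed by the binary cross-entropy sum; the sigmoid gate computed from the concatenation via a random projection is an illustrative assumption (the paper learns these values inside wide layers):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
v_t = np.tanh(rng.normal(size=4))   # unimodal text embedding output
v_i = np.tanh(rng.normal(size=4))   # unimodal image embedding output

# (1) concatenation of the unimodal outputs
v_cat = np.concatenate([v_t, v_i])

# (2) mixing: a sigmoid gate derived from the concatenation blends the
# image and text features elementwise (W is a random stand-in projection)
W = rng.normal(size=(4, 8))
g = 1.0 / (1.0 + np.exp(-(W @ v_cat)))
z = g * v_i + (1.0 - g) * v_t

# Binary cross-entropy over M = 3 classes, summing y_{j,c} * log(p_{j,c})
# terms with the complementary term for negative labels
p = np.clip(1.0 / (1.0 + np.exp(-rng.normal(size=3))), 1e-7, 1.0 - 1e-7)
y = np.array([1.0, 0.0, 1.0])
bce = -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```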
        <p>
          Regularisation methods recommended by Arevalo et al. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] - and retained in
our framework - consist of batchnorm to learn γẑ + β for μ[z] and σ[z] on batches,
max-norm to constrain weight updates ||w_j^l||, and dropout with maxout
activation in C. Both modules are optimised with variants of the Adam algorithm
incorporating regularisation. Adam with gradient clipping in A - as implemented
in Pyro PPL [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] - is run on each parameter during steps of variational inference.
In the case of C, AdamW was used in place of Adam after initial testing. This
algorithm acts on Loshchilov and Hutter's proposal to replace L2 regularisation
with decoupled weight decay when Adam is the optimiser [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ].
        </p>
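        <p>The distinction AdamW acts on can be shown in a single simplified update step: folding the decay into the gradient (L2) changes the adaptively normalised step, while decoupled decay shrinks the weight directly. Sign-normalised moments stand in for Adam's running averages, and all values are toy:</p>

```python
import numpy as np

w = np.array([1.0, -2.0])
grad = np.array([0.5, 0.5])
lr, wd, eps = 0.1, 0.01, 1e-8

# L2 regularisation: decay enters the (sign-normalised) moment estimates
g_l2 = grad + wd * w
w_l2 = w - lr * g_l2 / (np.sqrt(g_l2**2) + eps)

# Decoupled weight decay (AdamW-style): moments see only the raw gradient
# and the decay term is applied separately to the weight
w_adamw = w - lr * grad / (np.sqrt(grad**2) + eps) - lr * wd * w
```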
      </sec>
      <sec id="sec-2-3">
        <title>Multimodal Fusion with Variational Inference</title>
        <p>
          Multimodal embeddings are learned by inferring [w_ij^l, b_i^l] for the layers of A using
variational inference. In each case, we assume that the posterior p(θ_j^l | X, Y) is
drawn from a family of Laplace distributions Q. Computing the integrals for p
is intractable and so we infer an approximate candidate from Q by selecting q_i
with the lowest KL divergence from p [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. The posterior expression is reduced
to p(θ_j^l | x) and defined as:
p(θ_j^l | x) = p(θ_j^l, x) / p(x)   (3)
Minimising the KL divergence is equivalent to maximising the evidence lower bound (ELBO) [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]:
KL(q(θ_j^l) || p(θ_j^l | x)) = −(E_q[log p(θ_j^l, x)] − E_q[log q(θ_j^l)]) + log p(x)   (4)
ELBO = E_q[log p(θ_j^l, x)] − E_q[log q(θ_j^l)]   (5)
        </p>
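        <p>The ELBO expectation E_q[log p(θ, x)] − E_q[log q(θ)] can be estimated by Monte Carlo with a Laplace variational distribution, as sketched below; the toy Gaussian joint p(θ, x) is an assumption chosen only to make the example computable:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def log_q(theta, loc=0.1, scale=0.5):
    """Log density of the Laplace variational distribution q."""
    return -np.abs(theta - loc) / scale - np.log(2.0 * scale)

def log_joint(theta, x):
    """Toy log p(theta, x): standard normal prior plus Gaussian likelihood."""
    log_prior = -0.5 * theta**2 - 0.5 * np.log(2.0 * np.pi)
    log_lik = np.sum(-0.5 * (x - theta) ** 2 - 0.5 * np.log(2.0 * np.pi))
    return log_prior + log_lik

# Draw samples from q and average log p(theta, x) - log q(theta)
x = rng.normal(loc=0.2, size=20)
samples = rng.laplace(loc=0.1, scale=0.5, size=5000)
elbo = np.mean([log_joint(t, x) - log_q(t) for t in samples])
```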
        <sec id="sec-2-3-1">
          <title>Eq[logq( jl )]</title>
          <p>(1)
(2)
(3)
(4)
where Eq is the expected value under q. In practice the parameters are stochastic
gradients sampled from the optimal variational distribution.</p>
          <p>Three variants of ELBO are evaluated in trained versions of the PM+MO
framework. The rst implementation (ELBOv1) samples s from qi and computes
the expected value in a basic form:</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>Eqs(x) [logp( jl ; x)</title>
          <p>
            logqs(x)]:
(6)
ELBOv2 [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ] uses the Rao-Blackwellization strategy proposed by Ranganath et
al. to reduce variance when estimating gradients by replacing random variables
with conditional expectations of the variables [
            <xref ref-type="bibr" rid="ref27">27</xref>
            ]. The third variant ( KL)
also addresses variance in gradient estimation by including a term to limit the
regularising in uence of the KL term at initiation - and then scaling up the level
as training progresses [
            <xref ref-type="bibr" rid="ref28">28</xref>
            ]. L1 and L2 norms in the training steps of variational
inference present additional controls to regulate parameter updates.
4
          </p>
        </sec>
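        <p>The behaviour of the third variant can be sketched as a simple schedule for the multiplier on the KL term; the linear ramp and the 0.2 floor mirror the setting reported in the evaluation, but the exact schedule used in training is an assumption:</p>

```python
# Floor plus linear ramp to 1.0 over a warm-up period
def kl_scale(step, warmup_steps, floor=0.2):
    """Multiplier applied to the KL term at a given training step."""
    if step >= warmup_steps:
        return 1.0
    return floor + (1.0 - floor) * step / warmup_steps

schedule = [kl_scale(s, warmup_steps=10) for s in (0, 5, 10, 20)]
```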
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>
        An analysis of the PM+MO framework comprises training and testing system
variants on the task of multilabel classification on text and images from the
MM-IMDb dataset. System performance is measured with micro F1, macro F1,
weighted F1, and samples F1. F1 is a standard metric for measuring accuracy
in multiclass classification tasks [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] and is computed as the harmonic mean of precision
and recall. The samples variant is defined as:
f_sample = (1/N) Σ_{i=1}^{N} 2|ŷ_i ∩ y_i| / (|ŷ_i| + |y_i|)   (7)
where N is the number of samples, ŷ_i is the tuple of predictions, and y_i the
target labels. Each of the
four methods is an average of F-scores computed in the following ways: per
sample (samples), across all system outputs (micro), by genre (macro), or by genre
and with a weighted average on positive samples for each label. Performance at
system level is reported for all of these metrics and weighted F1 is referenced
in comparisons between systems in the text.
      </p>
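      <p>The samples-F1 metric can be computed directly from binary label arrays; in the toy example below each sample shares one label with the prediction, so every per-sample score is 2/3:</p>

```python
import numpy as np

def samples_f1(y_pred, y_true):
    """Samples F1 over binary arrays of shape (N, num_labels)."""
    inter = np.sum(y_pred * y_true, axis=1)                  # |y_hat ∩ y|
    denom = np.sum(y_pred, axis=1) + np.sum(y_true, axis=1)  # |y_hat| + |y|
    return np.mean(2.0 * inter / denom)

y_pred = np.array([[1, 1, 0], [0, 1, 0]])
y_true = np.array([[1, 0, 0], [0, 1, 1]])
score = samples_f1(y_pred, y_true)  # each sample scores 2/3
```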
      <sec id="sec-3-1">
        <title>System Configuration</title>
        <p>Systems in the evaluation are all trained on a single Tesla K80 GPU and with
a batch size of 512. Priors for the weights and biases of each layer in builds
with variational inference are modeled using Laplace distributions initialised at
μ = 0.1 and σ = 0.01. Parameters for these distributions are learned during
gradient steps with the three variants of ELBO detailed above. In the version
with KL scaling, the scaling term is set at 0.2 following tests in the
range (0.1, 1.0). L1 and L2 updates to parameters are set by a separate coefficient
of 0.1 in all tests.</p>
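        <p>Initialising the layer priors with the values above can be sketched with numpy's Laplace sampler standing in for the Pyro distribution objects used in the actual builds (the 3000×300 shape is illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(42)
# One wide layer's worth of prior samples: location 0.1, scale 0.01
prior_weights = rng.laplace(loc=0.1, scale=0.01, size=(3000, 300))
```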
      </sec>
      <sec id="sec-3-2">
        <title>Results</title>
        <p>Experiments aim to measure the impact of training with multiple objectives,
variational inference, and probabilistic regularisation methods when conducting
multimodal fusion for classification tasks. Assessment starts with a comparison
of the best performing version of the PM+MO framework - PM+MO (KL+L2)
- with the GMU baseline. An ablation analysis on several PM+MO variants
provides a granular analysis of regularisation methods associated with variational
inference. Reported numbers are the means of scores calculated over five
complete cycles of training and testing.</p>
        <p>
          Topline results for our proposed framework and the GMU baseline are
presented in Table 1. Hyperparameter settings were optimised for each system
with differences in learning rate (PM+MO=0.005, GMU=0.001) and dropout
(PM+MO=0.9, GMU=0.7). A version of the PM+MO framework with a fused
embedding module including ELBO with KL scaling and L2 regularisation
- and wide layers with 3000 neurons - scored higher on
all F-scores than the GMU baseline (weighted F1 = +0.009). The baseline
combines a Gated Multimodal Unit and a simple classifier with maxout
activation - and is trained end-to-end with a single binary cross-entropy objective.
Regularisation and hyperparameter settings conform to specifications shared in
the publication [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] and repository. Code conversion from Theano to PyTorch is
a contributor to the difference in scores for this build of the GMU system against
those stated in the original research (weighted F1 = 0.617).
        </p>
        <p>
          A version of our framework with 1024 neurons in each linear layer completes
the topline analysis. Lower scores for this approach underline the benefits of
training with wide layers when regularisation offsets variance. As a final check
of the benefits of training with a combination of variational inference and wide
layers, we tested a version of the GMU baseline with wide layers and observed
lower accuracy than the reported system. Training time per epoch on a single GPU
for the best performing PM+MO system is 5.25 secs compared to 1.10 secs for
the GMU baseline. Total training time for the former is still low at 2m11s - but
extended training times are significant considerations in large-scale data regimes
[
          <xref ref-type="bibr" rid="ref30">30</xref>
          ].
        </p>
        <p>Measurements of accuracy on individual classes are presented in Figure 3.
The most performant PM+MO system reported higher weighted F1 scores
in relation to the GMU baseline for 15 of the 23 movie genres. Classification
accuracy matched or exceeded the baseline on all genres where weighted F1
for the latter system was less than 0.5.</p>
        <p>Scores for several versions of the PM+MO framework are presented in Table
2 with the objective of comparing the impact of regularisation strategies.
Hyperparameter settings are uniform across all runs with the exception of the specific
methods noted in rows. ELBO versions incorporating methods for managing
variance outperform basic implementations of ELBO (i.e. ELBOv1). KL scaling with
L2 regularisation delivers a marginal improvement (weighted F1 = +0.003)
on the same configuration with Rao-Blackwellization (ELBOv2+L2).
Supplementary L2 norm penalties on parameter updates boost accuracy on all
configurations. A build with multiple objectives and excluding variational inference
(M+MO) returns the lowest weighted F1 (−0.015 w.r.t. PM+MO (KL+L2)).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Related Work</title>
      <sec id="sec-4-1">
        <title>Multimodal Representation Learning</title>
        <p>
          Researchers have investigated the role of neural networks in combining
representations from multiple modalities to perform end tasks for several decades
[
          <xref ref-type="bibr" rid="ref31">31</xref>
          ]. Multimodal fusion is deployed in classification tasks when all constituent
modalities are present during training and inference [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ]. Coordinated
embedding methods are an alternative for these tasks [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Separation between
vectors is retained in these approaches by projecting the textual and visual
representations into a common d-dimensional space and introducing a constraint
[
          <xref ref-type="bibr" rid="ref33">33</xref>
          ]. Fusion-based learning results in a single output vector: one advantage of
retaining this approach in our framework is to facilitate transfer between
modules. Multimodal embeddings are also learned by Silberer and Lapata to
perform word similarity and object classification [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ]. The process proposed for
this and related methods [
          <xref ref-type="bibr" rid="ref34 ref35">34,35</xref>
          ] differs from the method in our system by
including Singular Value Decomposition (SVD) to integrate constituent embeddings.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Multiple Modules and Objectives</title>
        <p>
          System architectures composed of multiple modules form a foundational area in
the research on neural networks. A primary objective in this literature is the
construction of classification frameworks that generalise to unseen samples [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ].
Auda et al. [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] detailed several approaches to decomposing tasks and designed
a classifier with multiple modules to solve components for separate sub-tasks.
In this case, a voting layer acted as a constraint on outputs from individual
components [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ]. Early implementations of modular architectures for task
decomposition were trained with a single objective - or were separated into distinct
models. Secondary losses are implemented during training in the related areas of
representation learning and transfer learning. Our system approximates Zhang
et al.'s proposal to introduce an auxiliary objective and module into a
classification framework [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ]. The method selection in this research differs from ours in
implementing unsupervised learning for the auxiliary components. Du et al. [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ]
proposed a system for measuring cosine similarity between auxiliary and main
losses when the former contributes to the latter [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ]. In contrast to our work,
positive transfer with multiple losses is applied in instances where source and
target tasks share related objectives.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>Probabilistic Deep Learning</title>
        <p>
          Probabilistic methods in this research extend an approach to machine learning
where the assessment of architectures is based on inverse probability [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ]. Here
the plausibility of the model - or in our case, the parameters in each layer - is
computed w.r.t. the data. Variational inference is a non-deterministic method
that replaces elements in probabilistic inference with approximations when the
computation of integrals is intractable. A family of distributions is placed over
the model parameters and the candidate distribution with the lowest KL
divergence from the true posterior is selected [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. ELBO formulates this minimisation
as optimisation and rewards candidate distributions that maximise both p(z|x)
and a spread of uncertainty. The ELBO term in our system extends
optimisation with simple operations to regularise parameter updates. Ma et al. [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ]
modify ELBO with a supplementary regularisation term to improve
representation learning - although the objective of this technique is to reward diversity
in the selection of candidate distributions [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ]. In contrast to our proposals,
enhancements or substitutes for ELBO in this and other research are centred on
variational autoencoder (VAE) architectures [
          <xref ref-type="bibr" rid="ref42 ref43">42,43</xref>
          ]. The selection of Laplace
distributions in our system is informed by the ability of these distributions to
model data with a high level of heterogeneity [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ]. Stochastic variational
inference and related methods are implemented in our system using Pyro PPL
[
          <xref ref-type="bibr" rid="ref23">23</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>Regularisation Methods</title>
        <p>
          Our framework retains means for reducing variance when conducting multimodal
fusion proposed by Arevalo et al. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. Batch normalisation was introduced to
minimise the impact of changes in parameters on the distributions of activations
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Goodfellow et al. describe the maxout activation function as an averaging
technique in neural network-based architectures that complements dropout [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].
In initial testing, we verified the performance gain from selecting maxout
activation in the classifier module and retained it in the PM+MO architecture.
Contributions that describe interactions between parameters and optimiser
algorithms [
          <xref ref-type="bibr" rid="ref24 ref45">24,45</xref>
          ] supported our decision to test different forms of managing weight
updates. The context differs as we extend these techniques to training
representations with variational inference.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this research, we have demonstrated that a framework of sub-modules trained
with variational inference as one of multiple objectives leads to improvements
in performance on multimodal classification. The proposed framework supports
wide layers and higher learning rates when compared with systems trained with
a single objective. Improvements in generalisation over multiple-objective
systems that exclude variational inference are also demonstrated. An evaluation of
regularisation methods associated with variational inference underlines the
advantages of probabilistic approaches in extending options to calibrate parameter
updates during training and offset overfitting in multimodal systems. We plan to
train on additional modalities and extend probabilistic methods in
representation learning as further contributions to the research on improving generalisation
in multimodal systems.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>The project leading to this publication has received funding from the
European Union's Horizon 2020 research and innovation programme under the Marie
Sklodowska-Curie grant agreement No 812997.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Sepulcre, J., Sabuncu, M.R., Yeo, T.B., Liu, H., Johnson, K.A.: Stepwise Connectivity of the Modal Cortex Reveals the Multimodal Organization of the Human Brain. Journal of Neuroscience 32(31), 10649–10661 (Aug 2012), http://www.jneurosci.org/cgi/doi/10.1523/JNEUROSCI.0759-12.2012</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Bruni, E., Tran, N.K., Baroni, M.: Multimodal Distributional Semantics. Journal of Artificial Intelligence Research 47 (2013)</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Baltrusaitis, T., Ahuja, C., Morency, L.P.: Multimodal Machine Learning: A Survey and Taxonomy. arXiv:1705.09406 [cs] (Aug 2017), http://arxiv.org/abs/1705.09406</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Petkos, G., Papadopoulos, S., Kompatsiaris, I.: Social event detection using multimodal clustering and integrating supervisory signals. In: ICMR '12: Proceedings of the 2nd ACM International Conference on Multimedia Retrieval (Jun 2012), https://dl.acm.org/doi/abs/10.1145/2324796.2324825</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Ramisa, A.: Multimodal News Article Analysis. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pp. 5136–5140. International Joint Conferences on Artificial Intelligence Organization, Melbourne, Australia (Aug 2017), https://www.ijcai.org/proceedings/2017/737</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Yang, X., Ramesh, P., Chitta, R., Madhvanath, S., Bernal, E.A., Luo, J.: Deep Multimodal Representation Learning from Temporal Data. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5066–5074. IEEE, Honolulu, HI (Jul 2017), http://ieeexplore.ieee.org/document/8100021/</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Amato, G., Behrmann, M., Bimbot, F., Caramiaux, B., Falchi, F., Garcia, A., Geurts, J., Gibert, J., Gravier, G., Holken, H., Koenitz, H., Lefebvre, S., Liutkus, A., Lotte, F., Perkis, A., Redondo, R., Turrin, E., Vieville, T., Vincent, E.: AI in the media and creative industries. arXiv:1905.04175 [cs] (May 2019), http://arxiv.org/abs/1905.04175</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A Simple Way to Prevent Neural Networks from Overfitting. p. 30</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Wang, W., Tran, D., Feiszli, M.: What Makes Training Multi-Modal Classification Networks Hard? arXiv:1905.12681 [cs] (Dec 2019), http://arxiv.org/abs/1905.12681</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Ioffe, S., Szegedy, C.: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167 [cs] (Mar 2015), http://arxiv.org/abs/1502.03167</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Chen, H.: Machine learning for information retrieval: Neural networks, symbolic learning, and genetic algorithms. Journal of the American Society for Information Science 46(3), 194–216 (1995), https://asistdl.onlinelibrary.wiley.com/doi/abs/10.1002/%28SICI%291097-4571%28199504%2946%3A3%3C194%3A%3AAID-ASI4%3E3.0.CO%3B2-S</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. Aggarwal, C.C., Zhai, C.: A Survey of Text Classification Algorithms. In: Aggarwal, C.C., Zhai, C. (eds.) Mining Text Data, pp. 163–222. Springer US, Boston, MA (2012), https://doi.org/10.1007/978-1-4614-3223-4_6</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. Hinton, G.E., Salakhutdinov, R.R.: Reducing the Dimensionality of Data with Neural Networks. Science 313(5786), 504–507 (Jul 2006), https://science.sciencemag.org/content/313/5786/504</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. Marelli, M., Menini, S., Baroni, M., Bentivogli, L., Bernardi, R., Zamparelli, R.: A SICK cure for the evaluation of compositional distributional semantic models. p. 8</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. Xie, N., Lai, F., Doran, D., Kadav, A.: Visual Entailment: A Novel Task for Fine-Grained Image Understanding. arXiv:1901.06706 [cs] (Jan 2019), http://arxiv.org/abs/1901.06706</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>16. Lehmann, J., Gerber, D., Morsey, M., Ngonga Ngomo, A.C.: DeFacto - Deep Fact Validation. In: Cudre-Mauroux, P., Heflin, J., Sirin, E., Tudorache, T., Euzenat, J., Hauswirth, M., Parreira, J.X., Hendler, J., Schreiber, G., Bernstein, A., Blomqvist, E. (eds.) The Semantic Web – ISWC 2012, pp. 312–327. Lecture Notes in Computer Science, Springer, Berlin, Heidelberg (2012)</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>17. Nam, J., Kim, Y.B., Mencia, E.L., Park, S., Sarikaya, R., Furnkranz, J.: Learning Context-Dependent Label Permutations for Multi-Label Classification. p. 10</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>18. Kiela, D., Grave, E., Joulin, A., Mikolov, T.: Efficient Large-Scale Multi-Modal Classification. arXiv:1802.02892 [cs] (Feb 2018), http://arxiv.org/abs/1802.02892</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>19. Pang, Y., Ma, Z., Yuan, Y., Li, X., Wang, K.: Multimodal learning for multi-label image classification. In: 2011 18th IEEE International Conference on Image Processing, pp. 1797–1800. IEEE, Brussels, Belgium (Sep 2011), http://ieeexplore.ieee.org/document/6115811/</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>20. Arevalo, J., Solorio, T., Montes-y Gomez, M., Gonzalez, F.A.: Gated Multimodal Units for Information Fusion. arXiv:1702.01992 [cs, stat] (Feb 2017), http://arxiv.org/abs/1702.01992</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>21. Radu, V., Tong, C., Bhattacharya, S., Lane, N.D., Mascolo, C., Marina, M.K., Kawsar, F.: Multimodal Deep Learning for Activity and Context Recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1(4), 1–27 (Jan 2018), http://dl.acm.org/citation.cfm?doid=3178157.3161174</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>22. Goodfellow, I.J., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout Networks. arXiv:1302.4389 [cs, stat] (Sep 2013), http://arxiv.org/abs/1302.4389</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>23. Bingham, E., Chen, J.P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T., Singh, R., Szerlip, P., Horsfall, P., Goodman, N.D.: Pyro: Deep Universal Probabilistic Programming. arXiv:1810.09538 [cs, stat] (Oct 2018), http://arxiv.org/abs/1810.09538</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>24. Loshchilov, I., Hutter, F.: Decoupled Weight Decay Regularization. arXiv:1711.05101 [cs, math] (Jan 2019), http://arxiv.org/abs/1711.05101</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>25. Wingate, D., Weber, T.: Automated Variational Inference in Probabilistic Programming. arXiv:1301.1299 [cs, stat] (Jan 2013), http://arxiv.org/abs/1301.1299</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>26. Minka, T.: Divergence Measures and Message Passing (Jan 2005), https://www.microsoft.com/en-us/research/publication/divergence-measures-and-message-passing/</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Ranganath</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gerrish</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          :
          <article-title>Black Box Variational Inference</article-title>
          .
          <source>arXiv:1401.0118 [cs, stat]</source>
          (
          <year>Dec 2013</year>
          ), http://arxiv.org/abs/1401.0118, arXiv:1401.0118
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Sønderby</surname>
            ,
            <given-names>C.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raiko</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maaløe</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sønderby</surname>
            ,
            <given-names>S.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winther</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Ladder Variational Autoencoders</article-title>
          . arXiv:1602.02282 [cs, stat] (May
          <year>2016</year>
          ), http://arxiv.org/abs/1602.02282, arXiv:1602.02282
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Madjarov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kocev</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gjorgjevikj</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Džeroski</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>An extensive experimental comparison of methods for multi-label learning</article-title>
          .
          <source>Pattern Recognit.</source>
          <volume>45</volume>
          (
          <issue>9</issue>
          ),
          <fpage>3084</fpage>
          –
          <lpage>3104</lpage>
          (
          <year>2012</year>
          ), http://dblp.uni-trier.de/db/journals/pr/pr45.html#MadjarovKGD12
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pouransari</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chao</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adya</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Serrano</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gimnicher</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walsh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Democratizing Production-Scale Distributed Deep Learning</article-title>
          . arXiv:1811.00143 [cs] (
          <year>Nov 2018</year>
          ), http://arxiv.org/abs/1811.00143, arXiv:1811.00143
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Yuhas</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldstein</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sejnowski</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Integration of acoustic and visual speech signals using neural networks</article-title>
          .
          <source>IEEE Communications Magazine</source>
          <volume>27</volume>
          (
          <year>1989</year>
          ), https://dl.acm.org/doi/10.1109/35.41402
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Ngiam</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khosla</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nam</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Multimodal Deep Learning</article-title>
          (
          <year>2011</year>
          ), https://people.csail.mit.edu/khosla/papers/icml2011_ngiam.pdf
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Frome</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shlens</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>DeViSE: A Deep Visual-Semantic Embedding Model</article-title>
          . In:
          <string-name>
            <surname>Burges</surname>
            ,
            <given-names>C.J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welling</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghahramani</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weinberger</surname>
            ,
            <given-names>K.Q.</given-names>
          </string-name>
          (eds.)
          <source>Advances in Neural Information Processing Systems</source>
          <volume>26</volume>
          , pp.
          <fpage>2121</fpage>
          –
          <lpage>2129</lpage>
          . Curran Associates, Inc. (
          <year>2013</year>
          ), http://papers.nips.cc/paper/5204-devise-a-deep-visual-semantic-embedding-model.pdf
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Silberer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lapata</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Learning Grounded Meaning Representations with Autoencoders</article-title>
          .
          . In:
          <source>ACL (1)</source>
          . pp.
          <fpage>721</fpage>
          –
          <lpage>732</lpage>
          . The Association for Computer Linguistics (
          <year>2014</year>
          ), http://dblp.uni-trier.de/db/conf/acl/acl2014-1.html#SilbererL14
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <surname>Bruni</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boleda</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baroni</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>N.K.</given-names>
          </string-name>
          :
          <article-title>Distributional Semantics in Technicolor</article-title>
          . In:
          <source>Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          . pp.
          <fpage>136</fpage>
          –
          <lpage>145</lpage>
          . Association for Computational Linguistics, Jeju Island, Korea (Jul
          <year>2012</year>
          ), https://www.aclweb.org/anthology/P12-1015
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Augmenting Supervised Neural Networks with Unsupervised Objectives for Large-scale Image Classification</article-title>
          .
          <source>arXiv:1606.06582 [cs]</source>
          (Jun
          <year>2016</year>
          ), http://arxiv.org/abs/1606.06582, arXiv:1606.06582
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          37.
          <string-name>
            <surname>Auda</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kamel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Modular Neural Network Classifiers: A Comparative Study</article-title>
          .
          <source>Journal of Intelligent and Robotic Systems</source>
          (
          <year>1998</year>
          ), https://dl.acm.org/doi/10.1023/A%3A1007925203918
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          38.
          <string-name>
            <surname>Auda</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kamel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raafat</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>A new neural network structure with cooperative modules</article-title>
          . In:
          <source>Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN'94)</source>
          . vol.
          <volume>3</volume>
          , pp.
          <fpage>1301</fpage>
          –
          <lpage>1306</lpage>
          vol. 3. Orlando, FL (Jun
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          39.
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Czarnecki</surname>
            ,
            <given-names>W.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jayakumar</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pascanu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lakshminarayanan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Adapting Auxiliary Losses Using Gradient Similarity</article-title>
          . arXiv:1812.02224 [cs, stat] (
          <year>Dec 2018</year>
          ), http://arxiv.org/abs/1812.02224, arXiv:1812.02224
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          40.
          <string-name>
            <surname>MacKay</surname>
            ,
            <given-names>D.J.C.</given-names>
          </string-name>
          :
          <article-title>Bayesian methods for adaptive models</article-title>
          .
          <source>Ph.D. thesis</source>
          , California Institute of Technology (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          41.
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>MAE: Mutual Posterior-Divergence Regularization for Variational AutoEncoders</article-title>
          . arXiv:1901.01498 [cs, stat] (
          <year>Jan 2019</year>
          ), http://arxiv.org/abs/1901.01498, arXiv:1901.01498
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          42.
          <string-name>
            <surname>Alemi</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poole</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dillon</surname>
            ,
            <given-names>J.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saurous</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murphy</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Fixing a Broken ELBO</article-title>
          . arXiv:1711.00464 [cs, stat] (
          <year>Feb 2018</year>
          ), http://arxiv.org/abs/1711.00464, arXiv:1711.00464
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          43.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Houthooft</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schulman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abbeel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets</article-title>
          . arXiv:1606.03657 [cs, stat] (
          <year>Jun 2016</year>
          ), http://arxiv.org/abs/1606.03657, arXiv:1606.03657
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          44.
          <string-name>
            <surname>Geraci</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Borja</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          :
          <article-title>Notebook: The Laplace distribution</article-title>
          .
          <source>Significance</source>
          <volume>15</volume>
          (
          <issue>5</issue>
          ),
          <fpage>10</fpage>
          –
          <lpage>11</lpage>
          (Oct
          <year>2018</year>
          ), https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2018.01185.x
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          45.
          <string-name>
            <surname>Hanson</surname>
            ,
            <given-names>S.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pratt</surname>
          </string-name>
          ,
          <given-names>L.Y.</given-names>
          :
          <article-title>Comparing Biases for Minimal Network Construction with Back-Propagation</article-title>
          . In: Touretzky, D.S. (ed.)
          <source>Advances in Neural Information Processing Systems</source>
          <volume>1</volume>
          , pp.
          <fpage>177</fpage>
          –
          <lpage>185</lpage>
          . Morgan-Kaufmann (
          <year>1989</year>
          ), http://papers.nips.cc/paper/156-comparing-biases-for-minimal-network-construction-with-back-propagation.pdf
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>