<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Partial Convolution Based Multimodal Autoencoder for Art Investigation</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Xianghui</forename><surname>Xie</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Engineering Technology</orgName>
								<orgName type="institution">KU Leuven</orgName>
								<address>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Laurens</forename><surname>Meeus</surname></persName>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Department of Telecommunications and Information Processing</orgName>
								<orgName type="department" key="dep2">TELIN-GAIM</orgName>
								<orgName type="institution">Ghent University</orgName>
								<address>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Aleksandra</forename><surname>Pižurica</surname></persName>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Department of Telecommunications and Information Processing</orgName>
								<orgName type="department" key="dep2">TELIN-GAIM</orgName>
								<orgName type="institution">Ghent University</orgName>
								<address>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Partial Convolution Based Multimodal Autoencoder for Art Investigation</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">E237161A3676BC12B8D817DC9F323209</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T11:02+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Autoencoder</term>
					<term>Partial convolution</term>
					<term>Multimodal data</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Autoencoders have been widely used in applications with limited annotations to extract features in an unsupervised manner, preprocessing the data to be used in machine learning models. This is especially helpful in image processing for art investigation, where annotated data is scarce and difficult to collect. We introduce a structural similarity index based loss function to train the autoencoder for image data. By extending the recently developed partial convolution to partial deconvolution, we construct a fully partial convolutional autoencoder (FP-CAE) and adapt it to multimodal data, typically utilized in art investigation. Experimental results on images of the Ghent Altarpiece show that our method significantly suppresses edge artifacts and improves the overall reconstruction performance. The proposed FP-CAE can be used for data preprocessing in craquelure detection and other art investigation tasks in future studies.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Art investigation aims at developing and applying technologies to facilitate the research and conservation of artworks. Typical research topics include craquelure detection, paint loss detection and virtual reconstruction. In recent years, deep learning has shown great potential in computer vision tasks, which has attracted researchers to apply deep learning methods to art investigation. However, existing studies mainly rely on fully supervised learning, which requires large amounts of annotated data, a requirement that is hard to meet in art investigation.</p><p>Using autoencoders as a data preprocessor for feature extraction is very common in deep learning when only a limited amount of annotated data is available. Autoencoders can be trained in an unsupervised manner so that they learn to extract the most important features from a particular dataset, e.g. paintings by the same artist. After unsupervised training, the latent vector of the autoencoder can be used as the input of models for the art investigation tasks. Since these models are applied on a compressed representation of the input, they can be of lower complexity and contain fewer parameters. Accordingly, less annotated data is needed to train these models to the same performance level. In this way, autoencoders can be a powerful tool for art investigation.</p><p>Besides photography, other sensors are commonly employed in art investigation in order to acquire more information about a particular object. Our methods are applied on images acquired in the ongoing restoration project of the Ghent Altarpiece <ref type="bibr" target="#b8">[9]</ref>, where at least five modalities are obtained: macro-photography before and during treatment (RGB, three color channels each), infrared reflectography (IRR, single channel), X-ray (single channel) and ultraviolet fluorescence (UVF, three color channels). Researchers have used these multimodal data for craquelure <ref type="bibr" target="#b22">[23]</ref> and paint loss detection <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b15">16]</ref>, although the performance of these methods is restricted by the availability of annotations. Images from two different sensors of the painting the Prophet Zachary can be found in figure <ref type="figure" target="#fig_0">1</ref>. In both images, craquelure and regions of paint loss are visible. A disparity between the modalities can reveal overpainted regions, e.g. in the red rectangles. To achieve good data preprocessing for art investigation tasks, the autoencoder must be able to extract both inter- and intramodal features.</p><p>To assess the quality of this encoding with respect to the compression factor, the reconstruction performance is commonly analyzed. To improve the reconstruction performance, we make three main contributions: a fully partial convolutional autoencoder, a structural similarity (SSIM) index based loss function, and separated inputs for a multimodal autoencoder. Firstly, we generalize partial convolutions <ref type="bibr" target="#b13">[14]</ref> and extend them to partial deconvolutions. As a result, we construct a novel fully partial convolutional autoencoder (FP-CAE) which significantly reduces edge artifacts in the reconstructed images. When training the autoencoder, we introduce an SSIM-based loss function to maximize the structural similarity between the input and reconstructed images. Finally, we investigate two strategies to improve the extraction of inter- and intramodal information from multimodal data.</p><p>The paper is organized as follows: we briefly review autoencoder designs, variations of CNNs, existing studies in art investigation, and multimodal data processing in section 2. Our generalization and extension of partial convolutions, the proposed model structure and the loss function are explained in section 3. The experiments and results are discussed in section 4. Finally, section 5 concludes the paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related work</head><p>In this section, we first review a few recent studies in art investigation as well as autoencoders, and then discuss some variations of autoencoder structures. Finally, we briefly summarize relevant work in multimodal data processing.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Art investigation and autoencoder</head><p>Some recent studies have adopted deep learning methods for art investigation tasks. A U-net like structure was used in <ref type="bibr" target="#b15">[16]</ref> to detect paint loss, while Sizyakin et al. proposed to combine morphological filtering with a CNN for crack detection <ref type="bibr" target="#b22">[23]</ref>. Existing deep learning models used in art investigation are based on supervised learning, which is constrained by the limited annotations. Therefore, more research exploring unsupervised or semi-supervised learning, such as using autoencoders, is needed to improve these methods.</p><p>Autoencoders have been applied in fields such as medical image processing where annotations are limited <ref type="bibr" target="#b4">[5]</ref>. When training an autoencoder, the mean squared error is typically employed as the loss function <ref type="bibr" target="#b3">[4]</ref>, <ref type="bibr" target="#b4">[5]</ref>. However, Snell et al. have shown that an SSIM-based loss function can achieve better performance than the MSE loss for images <ref type="bibr" target="#b23">[24]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Autoencoder model structure</head><p>Various autoencoder structures exist for different applications. The variational autoencoder <ref type="bibr" target="#b11">[12]</ref> is a stochastic autoencoder that is especially popular in generative models <ref type="bibr" target="#b16">[17]</ref>, <ref type="bibr" target="#b18">[19]</ref>, <ref type="bibr" target="#b20">[21]</ref>. Another category, the deterministic autoencoder, has been widely used for feature extraction and reconstruction. Stacked autoencoders have, for instance, been used to reduce the noise in input data <ref type="bibr" target="#b21">[22]</ref>.</p><p>In deep learning models, convolutional neural networks have proven more effective than fully connected networks. Since convolution is implemented by sliding kernels along the input, one challenge researchers have to deal with when applying CNNs is preserving the border information. Carlo et al. proposed to use extra filters to explicitly learn the border information <ref type="bibr" target="#b6">[7]</ref>. However, the number of filters and parameters increases quickly with the kernel size, which limits the applicability of this approach for large kernels and for settings that require fast computation. Another widely used technique to cope with border information in convolution is padding. Zero padding <ref type="bibr" target="#b12">[13]</ref>, reflection padding and duplication padding are the most common padding methods. All of these introduce artificial values at the border that do not necessarily correspond to the real values outside the border, which leads to edge artifacts. Liu et al. proposed using partial convolution for image inpainting tasks <ref type="bibr" target="#b13">[14]</ref>. In their method, appropriate scaling is applied to counterbalance the varying number of valid inputs. Since zero padding can be regarded as a special case of missing values, by defining the input region to be non-holes and the zero-padded region to be holes, partial convolution based padding has been used to reduce edge artifacts <ref type="bibr" target="#b14">[15]</ref>. Their results suggest that partial convolution can indeed improve the segmentation accuracy near the edges.</p><p>Transposed convolution, or deconvolution, has been used as the basic building block of convolutional decoders <ref type="bibr" target="#b0">[1]</ref>, <ref type="bibr" target="#b9">[10]</ref>. The most common implementation of deconvolution first stretches the input feature by inserting zeros between the input units and then applies the kernel to the stretched input with stride 1 <ref type="bibr" target="#b2">[3]</ref>. Since zero insertions are used, checkerboard artifacts are easily introduced <ref type="bibr" target="#b17">[18]</ref>. As an alternative, Ronneberger et al. used upsampling <ref type="bibr" target="#b19">[20]</ref> to build the decoder. However, upsampling also introduces artificial values when interpolating the input feature, which leads to other kinds of artifacts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Multimodal data processing</head><p>Multimodal data processing has attracted increasing attention in recent years as more and more correlated data from different sensors are collected. Cadena et al. proposed separating depth sensor data, images and semantics in the input and combining the encoded features in latent space to predict depth <ref type="bibr" target="#b1">[2]</ref>. Jaques et al. investigated the possibility of combining data from text, numbers, location, time and surveys for mood prediction <ref type="bibr" target="#b7">[8]</ref>. Canonical correlation analysis based intra- and intermodal information learning was introduced in <ref type="bibr" target="#b24">[25]</ref> for the RGB-D object recognition task. In art investigation, Meeus et al. stacked all modalities together for paint loss detection <ref type="bibr" target="#b15">[16]</ref>. Given the many possible data sources, how to effectively combine different modalities to capture correlated information remains an open research question.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Method</head><p>In this section, we first illustrate how we extend the partial convolution and then explain the structure of our multimodal autoencoder as well as the proposed loss function.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Extending Partial convolution</head><p>General method for implementing partial convolution From the definition of partial convolution with zero padding <ref type="bibr" target="#b13">[14]</ref>, <ref type="bibr" target="#b14">[15]</ref>, we generalize a method for implementing partial convolution, see figure <ref type="figure" target="#fig_1">2</ref>. Given the input feature X, the trainable kernel W and bias b (if not zero) of the current layer, two all-ones matrices 1 X and 1 W , with the same shapes as X and W respectively, are generated. A convolutional operation such as Conv1D, Conv2D or Conv2DTranspose is applied to X with kernel W, yielding Z. The same convolutional operation is also applied to 1 X with kernel 1 W , yielding a non-scaled mask M. Instead of calculating the L 1 norm of the all-ones matrix as in <ref type="bibr" target="#b13">[14]</ref>, we take the maximum value of M as the numerator so that the minimum value of the scale factor is one. This way we ensure that the convolution result is unchanged in regions where all elements are valid inputs. In the extreme case where all elements in the region covered by the kernel are zeros, the scale factor and bias are set to zero. Finally, Z is multiplied element-wise with the scale factor R. Bias and non-linearity can be applied after this multiplication.</p></div>
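The flowchart above can be sketched in plain NumPy. This is a minimal illustration under our own naming (`conv2d` and `partial_conv2d` are hypothetical helpers with explicit loops), not the implementation used in the paper:

```python
import numpy as np

def conv2d(x, w):
    # plain 'valid' 2D cross-correlation with explicit loops
    kh, kw = w.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def partial_conv2d(x, w, b=0.0, pad=1):
    # zero padding is treated as "holes" (invalid inputs); bias handling simplified
    xp = np.pad(x, pad)
    ones_x = np.pad(np.ones_like(x, dtype=float), pad)   # 1_X
    ones_w = np.ones_like(w, dtype=float)                # 1_W
    z = conv2d(xp, w)                                    # raw output Z
    m = conv2d(ones_x, ones_w)                           # non-scaled mask M
    # scale factor R = max(M)/M, at least one; zero where no valid inputs
    r = np.where(m > 0, m.max() / np.maximum(m, 1e-12), 0.0)
    return z * r + b
```

With an all-ones 4 × 4 input and 3 × 3 kernel, the rescaling compensates the zero-padded border, so every output entry equals the interior value 9.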
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Partial deconvolution</head><p>The zero insertion and padding used in deconvolution can be regarded as missing input values, so partial convolution can be applied. Let X be the input feature of the current deconvolution layer and 1 X an all-ones matrix with the same shape as X. X ext and 1 ext are the stretched and zero-padded results of X and 1 X respectively. When a kernel W is applied to a local region X ext (i,j) of the input feature, the partial deconvolution result is:</p><formula xml:id="formula_0">z (i,j) = W T (X ext (i,j) ⊙ 1 ext (i,j) ) r (i,j) + b = W T X ext (i,j) r (i,j) + b.<label>(1)</label></formula><p>The scale factor r (i,j) is defined as:</p><formula xml:id="formula_1">r (i,j) = max(M) / M (i,j) ,<label>(2)</label></formula><p>where M is the deconvolution result of 1 ext with the all-ones kernel 1 W , which has the same shape as W. A visualization of our partial deconvolution can be found in figure <ref type="figure" target="#fig_2">3</ref>. In this example, both the input feature X and the kernel W are 3 × 3 all-ones matrices. The input feature is first stretched to a 5 × 5 matrix and becomes a 7 × 7 matrix after padding. The normal deconvolution result is shown in figure <ref type="figure" target="#fig_2">3c</ref> while our partial deconvolution result is shown in figure <ref type="figure" target="#fig_2">3e</ref>. By multiplying with the scale factor r (i,j) , an appropriate adjustment is applied to input regions with varying numbers of valid elements. The partial deconvolution thereby smooths out the variation in output values and thus suppresses edge artifacts.</p></div>
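The 3 × 3 all-ones example from figure 3 can be reproduced with a short NumPy sketch (our own helper names; stride 2 and single-pixel zero padding assumed, bias omitted):

```python
import numpy as np

def conv2d_valid(x, w):
    # plain 'valid' 2D cross-correlation
    kh, kw = w.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def stretch_and_pad(x, pad):
    # insert one zero between input units (stride-2 transposed conv), then zero-pad
    s = np.zeros((2 * x.shape[0] - 1, 2 * x.shape[1] - 1))
    s[::2, ::2] = x
    return np.pad(s, pad)

def partial_deconv2d(x, w, pad=1):
    x_ext = stretch_and_pad(x, pad)                     # X_ext
    ones_ext = stretch_and_pad(np.ones_like(x), pad)    # 1_ext
    z = conv2d_valid(x_ext, w)                          # normal deconvolution
    m = conv2d_valid(ones_ext, np.ones_like(w))         # mask M in Eq. (2)
    r = np.where(m > 0, m.max() / np.maximum(m, 1e-12), 0.0)
    return z * r                                        # Eq. (1) with b = 0
```

For the all-ones input and kernel, the normal deconvolution output varies between 1 and 4 depending on how many valid elements fall under the kernel, whereas the scaled partial deconvolution output is uniformly 4, illustrating the smoothing effect described above.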
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Multimodal autoencoder</head><p>We propose two autoencoder architectures to cope with the multimodal data: a stacked input and a separated input autoencoder. The main difference between these two structures is the strategy for combining the different modalities in the input. Our model structure for the stacked input autoencoder is a fully convolutional neural network, see figure <ref type="figure" target="#fig_3">4</ref>. Images from the different modalities are stacked together as a single input for the autoencoder. To reduce edge artifacts, different versions of the model are tested by replacing the convolution and deconvolution layers with partial convolution, partial deconvolution or upsampling layers. For this model, the input shape is 32 × 32 × 11 while the latent vector shape is 3 × 3 × 80, so the data compression ratio is 15.6.</p><p>For the separated input autoencoder, each modality has its own encoder to extract the important intra-modal features, illustrated in figure <ref type="figure">5</ref>. The encoded features are combined either by an addition or a concatenation layer. These combined features are then given to a convolutional layer to learn the inter-modal information. Finally, the learned inter-modal features are distributed to a decoder for each modality to reconstruct the multimodal images. The encoders and decoders of the separated input autoencoder have the same structure as the stacked input autoencoder; only the input channel depth changes. The output of the convolution layer for the inter-modal features is used for later art investigation tasks. Therefore, the dimension of the latent vector is again 3 × 3 × 80 and the compression ratio stays 15.6.</p></div>
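As a quick sanity check of the stated dimensions (shapes from the text; the per-modality channel split is from section 1):

```python
import numpy as np

# Channel bookkeeping for the stacked input model:
# RGB before + RGB during treatment (3 + 3), IRR (1), X-ray (1), UVF (3)
input_shape = (32, 32, 3 + 3 + 1 + 1 + 3)   # -> (32, 32, 11)
latent_shape = (3, 3, 80)

# 32*32*11 = 11264 input elements vs 3*3*80 = 720 latent elements
compression_ratio = np.prod(input_shape) / np.prod(latent_shape)  # ~15.6
```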
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Loss function</head><p>SSIM is widely used as a metric to compare the similarity of two images. The single scale SSIM consists of three components: luminance (L), contrast (C) and structure (S). With µ and σ 2 denoting the average and variance operators respectively, they are defined as</p><formula xml:id="formula_3">L(x, y) = (2µ x µ y + C 1 ) / (µ 2 x + µ 2 y + C 1 ), C(x, y) = (2σ x σ y + C 2 ) / (σ 2 x + σ 2 y + C 2 ), S(x, y) = (σ xy + C 3 ) / (σ x σ y + C 3 ).</formula><p>The SSIM score is calculated by combining these three functions:</p><formula xml:id="formula_4">SSIM (x, y) = L(x, y) α C(x, y) β S(x, y) γ .<label>(3)</label></formula><p>As usually α = β = γ = 1 and C 3 = C 2 / 2, the SSIM can be rewritten as:</p><formula xml:id="formula_5">SSIM (x, y) = (2µ x µ y + C 1 )(2σ xy + C 2 ) / ((µ 2 x + µ 2 y + C 1 )(σ 2 x + σ 2 y + C 2 ))<label>(4)</label></formula><p>Fig. <ref type="figure">5</ref>: Separated input autoencoder model structure. The different modalities are separated. Intra-modal information is learned by the encoders and decoders while inter-modal information is extracted by the convolution layer. The combination layer can be either an addition or a concatenation layer.</p><p>By definition, the SSIM score ranges from −1 to 1 and equals one only when the two images are identical. For the proposed loss function, the logarithm is applied to a shifted and rescaled SSIM, in order to penalize a low SSIM more:</p><formula xml:id="formula_6">Loss = −log((SSIM + 1) / 2)<label>(5)</label></formula><p>The more similar two images are, the smaller the loss. Given the properties of the logarithm, when the SSIM is small the loss value and gradient are high, which pushes larger model update steps. When the SSIM is close to one, the gradients become smaller and the model optimizes the parameters in a more stable way. Since the SSIM is only defined for grey-scale images, we convert images with three color channels, such as the RGB and UVF images, to grey scale before calculating their SSIM. For the multimodal autoencoder, the final loss is the mean of the losses over all modalities.</p></div>
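A minimal NumPy sketch of Eq. (4) and the loss in Eq. (5), assuming a single global window over the whole image rather than the sliding-window SSIM typically used in practice, and images scaled to [0, 1]:

```python
import numpy as np

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # single-window SSIM (Eq. 4) for grey-scale images in [0, 1]
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return (((2 * mx * my + c1) * (2 * cov + c2))
            / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

def ssim_loss(x, y):
    # Eq. (5): -log((SSIM + 1) / 2); zero for identical images,
    # large value and gradient when the SSIM is low
    return -np.log((ssim(x, y) + 1.0) / 2.0)
```

For identical images the loss is zero, and it grows as the reconstruction diverges from the input, matching the behaviour described above.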
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Results and Discussion</head><p>All our models are applied on multimodal acquisitions of two panels from the Ghent Altarpiece: John the Evangelist and the Prophet Zachary <ref type="bibr" target="#b8">[9]</ref>. The five modalities mentioned in section 1 are used, totalling 11 color channels. For each painting, we first divide the full image into two roughly equal parts. One part is used as training data while the other is used for testing. Each part is then further cropped into small squares to match the input dimension of our model. Horizontal, vertical and diagonal flips are randomly applied to the patches. This way, around 2.6 million images and 2 million images are available for training and testing respectively. We started our experiments by testing the performance of stacked input autoencoders, i.e. all modalities are stacked before being given to the model. The kernel size, stride, and input and output dimensions of the different stacked input autoencoders are the same as illustrated in figure <ref type="figure" target="#fig_3">4</ref>; the convolution layers are replaced by partial convolution, upsampling or deconvolution layers depending on the model configuration. As a baseline, we first train an autoencoder whose layers are normal convolution and deconvolution layers (normal AE ). For the second model, the convolutional layers in the encoder are replaced with partial convolution layers while the decoder remains the same (PEN + NDE ). The third model has the same encoder as the second model, and the deconvolution layers are replaced with upsampling and partial convolution layers (PEN + UPDE ); nearest-neighbour interpolation is used for the upsampling. The last model is constructed by replacing all normal convolution and deconvolution layers with partial convolution and deconvolution layers, which becomes our fully partial convolutional autoencoder (FP-CAE ). The Adam optimizer <ref type="bibr" target="#b10">[11]</ref> was used to optimize the parameters. The learning rate for the baseline model was set to 6e−4 without decay. However, with partial convolution layers we found that a higher learning rate is needed to achieve good performance. The learning rate for the other three models is 12e−4 with 3e−5 decay. All models are trained until convergence.</p><p>The stacked input fully partial convolutional autoencoder was used as the basic unit to construct the separated input autoencoders. We first combined the different encoded features by concatenating them and then applying a convolution layer (Concatenation FP-CAE ), as illustrated in figure <ref type="figure">5</ref>. In the second multimodal autoencoder, the combination layer is an addition layer (Addition FP-CAE ). The learning rate for both models is 8e−4 with 4e−5 decay.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Stacked input autoencoder</head><p>The average testing SSIM score of the different models is shown in table <ref type="table">1</ref>. It can be seen that our FP-CAE performs best among the four models, while the model with the commonly used upsampling layers performs worst. A visual comparison of some test samples can be found in figure <ref type="figure" target="#fig_5">7</ref>. The most severe edge artifacts occur in the normal AE. Replacing only the normal convolutions with partial convolutions (PEN + NDE ) reduces some artifacts, but the overall performance drops. PEN + UPDE removes most visual artifacts, but the reconstruction performance drops considerably and corner artifacts become dominant. When all normal layers are replaced with their partial substitutes (FP-CAE ), the reconstruction performance slightly increases and most artifacts are suppressed.</p><p>Table <ref type="table">1</ref>: The average SSIM for stacked input autoencoders. Our FP-CAE is better than all the other models. The improvement of SSIM in our model comes from the suppression of edge artifacts.</p><p>Despite the visible reduction of edge artifacts, the numerical difference between our FP-CAE and the normal autoencoder is relatively small. This is because the edges account for only a small proportion of the full image; improving only the edges while keeping most of the interior unchanged does not lead to a significant improvement of the overall SSIM score. In order to evaluate the actual improvement of the partial deconvolution on the edges, we calculate the SSIM score in local regions and plot it with respect to the distance to the edge of the patch. The distance is a Manhattan distance: suppose the width and height of the image are l and the window size used to crop the image is w. With x and y the spatial coordinates of a pixel in an image patch, the coordinates of the four corners of the cropped window are (x 1 , y 1 ), (x 1 , y 2 ), (x 2 , y 1 ), (x 2 , y 2 ). The distance of this cropped window to the edge is defined as:</p><formula xml:id="formula_7">d = min(x 1 , l − x 2 ) + min(y 1 , l − y 2 )<label>(6)</label></formula><p>The smaller the distance, the closer the cropped window is to the four corners.</p><p>For locations with the same distance, the SSIM is averaged. The graph in figure <ref type="figure" target="#fig_4">6</ref> shows that our fully partial convolutional autoencoder always outperforms the other models, i.e. our FP-CAE not only reduces edge artifacts but also increases the overall performance. As the difference between PEN + NDE and normal AE is very small, we conclude that the biggest performance increase is due to our proposed deconvolution layers. The reconstruction of PEN + UPDE in the four corners (d = 0) is the worst among all models, which is consistent with the visualization in figure <ref type="figure" target="#fig_5">7</ref>. This result clearly shows that partial deconvolution improves the reconstruction performance on the edges.</p></div>
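Eq. (6) is a one-liner in code; the function name and argument order are ours, with (x1, y1) and (x2, y2) the opposite corners of the cropped window inside a patch of side l:

```python
def edge_distance(x1, y1, x2, y2, l):
    # Eq. (6): Manhattan-style distance of a cropped window to the
    # nearest patch corner; d = 0 at the corners of the patch
    return min(x1, l - x2) + min(y1, l - y2)
```

For a 32 × 32 patch with window size 8, a corner window (0, 0)-(8, 8) has d = 0, while a centered window (12, 12)-(20, 20) has the maximal d = 24.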
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Separated input autoencoder</head><p>The average SSIM for the separated multimodal input models is shown in table <ref type="table" target="#tab_0">2</ref>. Compared with the stacked input FP-CAE, both separated autoencoders show a significant improvement. Some visualizations of testing samples can be found in figure <ref type="figure" target="#fig_6">8</ref>. The visualization also suggests a better reconstruction on the edges. However, the difference between concatenation based and addition based combination is very small: the concatenation FP-CAE only shows a 0.36% improvement with respect to the addition FP-CAE. Given that the concatenation model has more parameters (996,043) than the addition model (893,643), we cannot conclude that one method outperforms the other. More studies will be needed to further investigate different combination strategies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>We showed that autoencoders can be a powerful tool for feature extraction as a data preprocessing step in art investigation tasks, where annotations are typically very limited. To achieve good feature extraction, the reconstruction performance of the autoencoder is maximized. In this study, we generalized the implementation of the partial convolution operation and extended it to partial deconvolution, which becomes the basic building block of our fully partial convolutional autoencoder (FP-CAE). In partial convolution and deconvolution, an appropriate scale factor is applied to the normal convolution output to counterbalance the varying number of valid inputs, which smooths the output and reduces artifacts. Results suggest that our partial deconvolution layers in the decoder significantly reduce the artifacts on the edges while avoiding deterioration of the inner regions. This way, the reconstruction performance of our FP-CAE outperforms, both visually and numerically, other autoencoder models with normal layers. During training, we introduced an SSIM-based loss function, which is effective in maximizing the structural similarity between the original and reconstructed images. Finally, we showed that the reconstruction performance of the autoencoder can be further improved by separating the different modalities in the encoder and decoder and combining them in latent space. Results indicate that the performance difference between concatenating and summing the latent vectors is small; more studies are needed to compare various combination strategies. In future studies the proposed FP-CAE can be used for craquelure detection, inpainting, overpainting detection or other art investigation tasks.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 :</head><label>1</label><figDesc>Fig. 1: Different modalities of the panel the Prophet Zachary. 
(a) RGB and (b) Infrared reflectography image. Different types of degradations become visible in these images. Intermodal information through the variations in different modalities can be utilized. Image copyright: Ghent, Kathedrale Kerkfabriek, Lukasweb; photo courtesy of KIK-IRPA, Brussels.</figDesc><graphic coords="2,134.77,115.83,166.00,178.92" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 :</head><label>2</label><figDesc>Fig. 2: The general flowchart for implementing partial convolution. The convolution operation can be 1D convolution, 2D convolution, transposed convolution etc.</figDesc><graphic coords="5,186.64,115.83,242.07,255.16" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 3 :</head><label>3</label><figDesc>Fig. 3: Visualization of our partial deconvolution. (a). An input feature and filter matrix. (b). The stretched and zero-padded input feature. (c) Output of normal deconvolution. (d) The scale factor r (i,j) . (e). Output of our partial deconvolution. Our partial deconvolution smooths the output of a normal deconvolution by multiplying with the appropriate scale factor based on the varying amount of valid inputs.</figDesc><graphic coords="6,286.93,146.35,55.33,57.15" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 4 :</head><label>4</label><figDesc>Fig. 4: Model structure of stacked input autoencoder. All image modalities are stacked together in the input layer so the input channel depth is 11.</figDesc><graphic coords="7,134.77,115.84,345.83,80.01" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Fig. 6 :</head><label>6</label><figDesc>Fig. 6: Evaluating the effect of suppressing edge artifacts. (a) The definition of distance. (b) Visualization of local SSIM with cropping window size 8 with respect to distance from edge.</figDesc><graphic coords="10,134.77,224.66,117.58,102.52" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Fig. 7 :</head><label>7</label><figDesc>Fig. 7: Visualization of some test images. The best reconstruction SSIM score is in black. The first column is the ground truth. Images in the second to fifth columns are reconstructions from the different models. The partial convolution layer (third column) does not help improve edge artifacts, while the upsampling layer (fourth column) causes severe artifacts in the corners and reduces the overall reconstruction quality. The partial deconvolution layers in our fully partial autoencoder (last column) improve the reconstruction on the edges and hence slightly increase the overall SSIM. Image copyright: Ghent, Kathedrale Kerkfabriek, Lukasweb; photo courtesy of KIK-IRPA, Brussels.</figDesc><graphic coords="11,137.27,435.48,65.71,64.86" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Fig. 8 :</head><label>8</label><figDesc>Fig. 8: Visualization of test images from the separate input autoencoders; the best SSIM score is in bold. The SSIM scores of both separate input models are better than that of the stacked input model, and reconstruction on the edges is improved. The difference between the two combination strategies is very small, and neither consistently outperforms the other. Image copyright: Ghent, Kathedrale Kerkfabriek, Lukasweb; photo courtesy of KIK-IRPA, Brussels.</figDesc><graphic coords="13,240.43,453.28,65.71,64.99" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 2 :</head><label>2</label><figDesc>The average SSIM for the separate input autoencoders. Both separate input models are better than the stacked input model, but the difference between the two separate models is very small.</figDesc><table><row><cell></cell><cell cols="3">Stacked input Concatenation Addition</cell></row><row><cell>Model</cell><cell>FP-CAE</cell><cell>FP-CAE</cell><cell>FP-CAE</cell></row><row><cell>SSIM</cell><cell>0.9377</cell><cell>0.9469</cell><cell>0.9450</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Bigdeli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zwicker</surname></persName>
		</author>
	<idno type="arXiv">arXiv:1703.09964</idno>
		<ptr target="http://arxiv.org/abs/1703.09964" />
		<title level="m">Image Restoration using Autoencoding Priors</title>
				<imprint>
			<date type="published" when="2017-03">Mar 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Multi-modal Auto-Encoders as Joint Estimators for Robotics Scene Understanding</title>
		<author>
			<persName><forename type="first">C</forename><surname>Cadena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Reid</surname></persName>
		</author>
		<idno type="DOI">10.15607/RSS.2016.XII.041</idno>
		<ptr target="http://www.roboticsproceedings.org/rss12/p41.pdf" />
	</analytic>
	<monogr>
		<title level="m">Robotics: Science and Systems XII. Robotics</title>
				<imprint>
			<publisher>Science and Systems Foundation</publisher>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">A guide to convolution arithmetic for deep learning</title>
		<author>
			<persName><forename type="first">V</forename><surname>Dumoulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Visin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1603.07285</idno>
		<ptr target="http://arxiv.org/abs/1603.07285" />
		<imprint>
			<date type="published" when="2016-03">Mar 2016</date>
		</imprint>
	</monogr>
	<note>cs, stat</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Cross-modal Retrieval with Correspondence Autoencoder</title>
		<author>
			<persName><forename type="first">F</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Li</surname></persName>
		</author>
		<idno type="DOI">10.1145/2647868.2654902</idno>
		<ptr target="http://doi.acm.org/10.1145/2647868.2654902" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 22nd ACM International Conference on Multimedia</title>
				<meeting>the 22nd ACM International Conference on Multimedia<address><addrLine>New York, NY, USA; Orlando, Florida, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="7" to="16" />
		</imprint>
	</monogr>
	<note>MM &apos;14</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Medical image denoising using convolutional denoising autoencoders</title>
		<author>
			<persName><forename type="first">L</forename><surname>Gondara</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICDMW.2016.0041</idno>
		<idno type="arXiv">arXiv:1608.04667</idno>
		<ptr target="http://arxiv.org/abs/1608.04667" />
	</analytic>
	<monogr>
		<title level="m">IEEE 16th International Conference on Data Mining Workshops (ICDMW)</title>
				<imprint>
			<date type="published" when="2016-12">Dec 2016</date>
			<biblScope unit="page" from="241" to="246" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Paint loss detection via kernel sparse representation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Meeus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Cornelis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Devolder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Martens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pizurica</surname></persName>
		</author>
		<ptr target="https://ip4ai.ugent.be/" />
	</analytic>
	<monogr>
		<title level="m">Image Processing for Art Investigation (IP4AI) : proceedings</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="24" to="26" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Learning on the Edge: Explicit Boundary Handling in CNNs</title>
		<author>
			<persName><forename type="first">C</forename><surname>Innamorati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ritschel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Weyrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">J</forename><surname>Mitra</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1805.03106</idno>
		<ptr target="http://arxiv.org/abs/1805.03106" />
		<imprint>
			<date type="published" when="2018-05">May 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Multimodal autoencoder: A deep learning approach to filling in missing sensor data and enabling better mood prediction</title>
		<author>
			<persName><forename type="first">N</forename><surname>Jaques</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Taylor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Picard</surname></persName>
		</author>
		<idno type="DOI">10.1109/ACII.2017.8273601</idno>
		<ptr target="https://doi.org/10.1109/ACII.2017.8273601" />
	</analytic>
	<monogr>
		<title level="m">Seventh International Conference on Affective Computing and Intelligent Interaction (ACII)</title>
				<imprint>
			<date type="published" when="2017-10">Oct 2017</date>
			<biblScope unit="page" from="202" to="208" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><surname>KIK-IRPA</surname></persName>
		</author>
		<title level="m">Closer to van eyck: The ghent altarpiece</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Denoising auto-encoder based image enhancement for high resolution sonar image</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yu</surname></persName>
		</author>
		<idno type="DOI">10.1109/UT.2017.7890316</idno>
		<ptr target="https://doi.org/10.1109/UT.2017.7890316" />
	</analytic>
	<monogr>
		<title level="j">IEEE Underwater Technology</title>
		<imprint>
			<biblScope unit="page" from="1" to="5" />
			<date type="published" when="2017-02">Feb 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Adam: A Method for Stochastic Optimization</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ba</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1412.6980</idno>
		<ptr target="http://arxiv.org/abs/1412.6980" />
		<imprint>
			<date type="published" when="2014-12">Dec 2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Auto-Encoding Variational Bayes</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Welling</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1312.6114</idno>
		<ptr target="http://arxiv.org/abs/1312.6114" />
		<imprint>
			<date type="published" when="2013-12">Dec 2013</date>
		</imprint>
	</monogr>
	<note>cs, stat</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">ImageNet classification with deep convolutional neural networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Krizhevsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">E</forename><surname>Hinton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Image Inpainting for Irregular Holes Using Partial Convolutions</title>
		<author>
			<persName><forename type="first">G</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">A</forename><surname>Reda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">J</forename><surname>Shih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Catanzaro</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1804.07723</idno>
		<ptr target="http://arxiv.org/abs/1804.07723" />
		<imprint>
			<date type="published" when="2018-04">Apr 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Partial Convolution based Padding</title>
		<author>
			<persName><forename type="first">G</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">J</forename><surname>Shih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">A</forename><surname>Reda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Sapra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Catanzaro</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1811.11718</idno>
		<ptr target="http://arxiv.org/abs/1811.11718" />
		<imprint>
			<date type="published" when="2018-11">Nov 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Deep learning for paint loss detection: A case study on the Ghent Altarpiece</title>
		<author>
			<persName><forename type="first">L</forename><surname>Meeus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Devolder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Martens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pizurica</surname></persName>
		</author>
		<ptr target="https://www.ip4ai.ugent.be/IP4AI2018_proceedings.pdf" />
	</analytic>
	<monogr>
		<title level="m">Image Processing for Art Investigation (IP4AI)</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="30" to="32" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks</title>
		<author>
			<persName><forename type="first">L</forename><surname>Mescheder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Nowozin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Geiger</surname></persName>
		</author>
		<ptr target="http://dl.acm.org/citation.cfm?id=3305890.3305928" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 34th International Conference on Machine Learning - Volume 70</title>
				<meeting>the 34th International Conference on Machine Learning - Volume 70<address><addrLine>Sydney, NSW, Australia</addrLine></address></meeting>
		<imprint>
			<publisher>JMLR</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="2391" to="2400" />
		</imprint>
	</monogr>
	<note>ICML&apos;17</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Deconvolution and checkerboard artifacts</title>
		<author>
			<persName><forename type="first">A</forename><surname>Odena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Dumoulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Olah</surname></persName>
		</author>
		<idno type="DOI">10.23915/distill.00003</idno>
		<ptr target="http://distill.pub/2016/deconv-checkerboard" />
		<imprint>
			<date type="published" when="2016">2016</date>
			<publisher>Distill</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Razavi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">V D</forename><surname>Oord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1906.00446</idno>
		<ptr target="http://arxiv.org/abs/1906.00446" />
		<title level="m">Generating Diverse High-Fidelity Images with VQ-VAE-2</title>
				<imprint>
			<date type="published" when="2019-06">Jun 2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">U-Net: Convolutional Networks for Biomedical Image Segmentation</title>
		<author>
			<persName><forename type="first">O</forename><surname>Ronneberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fischer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Brox</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1505.04597</idno>
		<ptr target="http://arxiv.org/abs/1505.04597" />
		<imprint>
			<date type="published" when="2015-05">May 2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Variational Approaches for Auto-Encoding Generative Adversarial Networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Rosca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Lakshminarayanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Warde-Farley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mohamed</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1706.04987</idno>
		<ptr target="http://arxiv.org/abs/1706.04987" />
		<imprint>
			<date type="published" when="2017-06">Jun 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Stacked Autoencoders for Unsupervised Feature Learning and Multiple Organ Detection in a Pilot Study Using 4d Patient Data</title>
		<author>
			<persName><forename type="first">H</forename><surname>Shin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Orton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Collins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Doran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">O</forename><surname>Leach</surname></persName>
		</author>
		<idno type="DOI">10.1109/TPAMI.2012.277</idno>
		<ptr target="https://doi.org/10.1109/TPAMI.2012.277" />
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page" from="1930" to="1943" />
			<date type="published" when="2013-08">Aug 2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">A deep learning approach to crack detection in panel paintings</title>
		<author>
			<persName><forename type="first">R</forename><surname>Sizyakin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Cornelis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Meeus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Martens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Voronin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pižurica</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page">3</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<title level="m" type="main">Learning to Generate Images with Perceptual Similarity Metrics</title>
		<author>
			<persName><forename type="first">J</forename><surname>Snell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Ridgeway</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Liao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">D</forename><surname>Roads</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">C</forename><surname>Mozer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">S</forename><surname>Zemel</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1511.06409</idno>
		<ptr target="http://arxiv.org/abs/1511.06409" />
		<imprint>
			<date type="published" when="2015-11">Nov 2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Large-Margin Multi-Modal Deep Learning for RGB-D Object Recognition</title>
		<author>
			<persName><forename type="first">A</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Cham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.1109/TMM.2015.2476655</idno>
		<ptr target="https://doi.org/10.1109/TMM.2015.2476655" />
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Multimedia</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="issue">11</biblScope>
			<biblScope unit="page" from="1887" to="1898" />
			<date type="published" when="2015-11">Nov 2015</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
