Depth-Aware Arbitrary Style Transfer Using Instance Normalization

Victor Kitov1,2[0000−0002−3198−5792], Konstantin Kozlovtsev1, and Margarita Mishustina1

1 Lomonosov Moscow State University, Moscow, Russia
2 Plekhanov Russian University of Economics, Moscow, Russia
v.v.kitov@yandex.ru, ko-sova@yandex.ru, margarita_mishustina_112@mail.ru
https://victorkitov.github.io

Abstract. Style transfer is the process of rendering an image containing some content in the style of another image, which represents the style. Recent work of Liu et al. (2017) shows that the traditional style transfer methods of Gatys et al. (2016) and Johnson et al. (2016) fail to reproduce the depth of the content image, which is critical for human perception. They suggest preserving the depth map with an additional regularizer in the optimized loss function. However, these traditional methods are either computationally inefficient or require training a separate neural network for each style. The AdaIN method of Huang et al. (2017) transfers arbitrary styles efficiently without training a separate model but is not able to reproduce the depth map of the content image. We propose an extension to this method that preserves the depth map by applying spatially variable stylization strength. Qualitative analysis and the results of a user evaluation study indicate that the proposed method provides better stylizations than the original AdaIN style transfer method.

Keywords: Image Processing, Image Generation, Depth Estimation, Instance Normalization.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The problem of rendering an image (called the content image) in a particular style is known as style transfer and is a long-studied problem in computer vision. Early approaches [5,16,14] used algorithms with hand-engineered features designed to impose particular styles. In 2016 Gatys et al. [2] proposed an algorithm that imposes an arbitrary style, taken from a user-defined style image, on an arbitrary content image by using image representations obtained with deep convolutional networks. However, their method required a computationally expensive optimization in the space of images, taking several minutes to process a single image of moderate resolution on powerful GPUs. Ulyanov et al. [17] and Johnson et al. [7] proposed real-time style transfer algorithms that pass the content image through a pretrained fully convolutional transformer network. Their methods required training a separate transformation network for each new style. The work of Liu et al. (2017) [10] highlighted an issue with traditional style transfer methods: they fail to reproduce the depth map of the content image, which is critical for human perception of the result. To address this issue, they extended the traditional methods [2] and [7] with a regularizer forcing preservation of the depth map of the content image. This yielded a significant improvement in rendering quality but remained computationally demanding, requiring either solving a high-dimensional optimization problem for each content-style pair or fitting a separate transformer network for each style. Later architectures, such as AdaIN [6] and others ([3], [8]), allow transferring arbitrary styles without training a separate network but lose rendering quality because they fail to preserve the depth map of the content image.
In this work a depth-aware extension of the AdaIN method (DA-AdaIN for short) is proposed that preserves the depth map of the content image during stylization, as shown in fig. 1d, by applying the style with spatially variable strength: closer regions, representing the foreground, are stylized less, and more distant regions, representing the background, are stylized more.

Fig. 1. Comparison of AdaIN and the proposed Depth-Aware AdaIN method.

Since style transfer does not yet have conventional objective quality criteria, we judge which method is better by means of aggregated user preferences. Qualitative analysis suggests that the proposed modification improves rendering quality. A user study confirms that the proposed algorithm gives better results on average.

The remainder of the paper is organized as follows. In section 2 two recent depth estimation methods are compared and the better one is selected for later analysis. Section 3 describes the standard AdaIN method and our proposed modification. Section 4 provides a qualitative analysis of the proposed method, its dependence on major parameters and the results of a user study in which AdaIN and the proposed method are compared. Finally, section 5 concludes.

2 Depth estimation

2.1 Methods

Monocular depth estimation is the problem of finding a depth map D ∈ R^{W×H} for an arbitrary color image I ∈ R^{W×H×3}. Since I is a color image, it has three channels, standing for red, green and blue color intensities. D is a single-channel image with the same width W and height H as I, where D(x, y) is the distance of pixel I(x, y) to the camera. For the purposes of style transfer we are interested in discriminating central objects, which are closer to the camera, from background objects, which are more distant. So the absolute accuracy of depth prediction is less important than the relative accuracy.

We compare two recent methods for monocular depth estimation - monodepth2 [4] and MiDaS [13] - using the official public implementations of both. MiDaS is a supervised model, in contrast to monodepth2, which is self-supervised, meaning that it did not use true depth values during training. Monodepth2 was trained only on the KITTI 2015 dataset [11], which covers outdoor scenes captured by a car-mounted camera. MiDaS was trained on five different datasets, covering indoor and outdoor scenes with static and dynamic objects in various contexts. MiDaS has a single realization, whereas monodepth2 has nine: mono-640x192, stereo-640x192, mono+stereo-640x192, mono-1024x320, stereo-1024x320, mono+stereo-1024x320, mono-no-pt-640x192, stereo-no-pt-640x192, mono+stereo-no-pt-640x192. They differ in the training data used (mono, stereo or both), the resolution of the training data and the weight initialization.

2.2 Comparison

Since style transfer may be applied to arbitrary images, we need a depth estimation method that is robust across different types of scenes. A qualitative check on random images shows a significant superiority of the MiDaS method, as can be seen in fig. 2.

Fig. 2. Qualitative comparison of MiDaS [13] (2nd column) and different realizations of monodepth2 [4] (columns 3-11). MiDaS is robust to different scenes, whereas monodepth2 generalizes poorly to non-road objects.

To compare the methods quantitatively we apply them to the test subset of the DIW dataset [1], which contains diverse kinds of images. Neither of the methods used this dataset for training. The test subset contains 73983 images with sparse labels: for each image two point locations are randomly selected, and an indicator specifies whether the first location is closer to or farther from the camera than the second. We apply each depth prediction method to every image and check whether the relative depth indicator is predicted correctly.
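The check itself is simple. The following is a minimal sketch of it, assuming each image comes with a single annotated point pair as described above, and that the predicted map follows the depth convention (larger values are farther from the camera); for inverse-depth (proximity) outputs such as MiDaS the ordering is flipped via a flag. The helper name and the annotation format are illustrative, not the actual DIW loader.

```python
def ordinal_accuracy(pred_depth_maps, annotations, larger_is_farther=True):
    """Fraction of DIW-style point pairs whose depth ordering is predicted correctly.

    pred_depth_maps: list of 2-D arrays (e.g. numpy), one predicted map per image.
    annotations: list of tuples ((y1, x1), (y2, x2), closer_first), where
                 closer_first is True if point 1 is annotated as closer to the camera.
    larger_is_farther: set to False for inverse-depth (proximity) outputs such as MiDaS.
    """
    correct = 0
    for depth, ((y1, x1), (y2, x2), closer_first) in zip(pred_depth_maps, annotations):
        d1, d2 = depth[y1, x1], depth[y2, x2]
        if not larger_is_farther:          # proximity map: invert the ordering
            d1, d2 = -d1, -d2
        pred_closer_first = d1 < d2        # smaller depth value = closer to the camera
        correct += int(pred_closer_first == closer_first)
    return correct / len(annotations)
```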
Accuracy results are reported in Table 1. These results confirm that MiDaS is the more accurate depth estimation model for images of a general kind, so we use this method in the later analysis. This is an expected result, since MiDaS was trained in a supervised way on a number of diverse datasets.

Table 1. Relative depth prediction accuracy for MiDaS [13] and different realizations of monodepth2 [4] on the DIW test dataset [1].

Method                      Accuracy
MiDaS                       0.87
mono+stereo-1024x320        0.69
mono+stereo-640x192         0.70
mono+stereo-no-pt-640x192   0.65
mono-1024x320               0.70
mono-640x192                0.71
mono-no-pt-640x192          0.65
stereo-1024x320             0.67
stereo-640x192              0.66
stereo-no-pt-640x192        0.62

3 Style transfer

3.1 AdaIN Method

The AdaIN method [6] is a recent powerful style transfer method that stylizes any content image I_c with any style image I_s in real time, without complex optimization. The stylization result \hat{I} is obtained as

\hat{I} = g(\mathrm{AdaIN}(f(I_c), f(I_s))),

where f(·) is an encoder (the first few layers of VGG-19 [15]) and g(·) is a corresponding decoder, trained to match the encoder so that it produces good stylizations for a representative set of content images (MS COCO [9]) and style images (WikiArt [12]) according to a loss function that is a weighted combination of a content preservation loss (matching inner VGG-19 representations) and a style preservation loss (matching means and standard deviations of inner representations). For details we refer to the original paper [6].

AdaIN(x, y) is a variant of instance normalization [17] in which the instance normalization parameters are taken from the style image representation. Define the encoder representations

x = f(I_c), x ∈ R^{C×H_c×W_c},    y = f(I_s), y ∈ R^{C×H_s×W_s}.

Then AdaIN(x, y) ∈ R^{C×H_c×W_c} is defined as

\mathrm{AdaIN}(x, y)_{cij} = \sigma_c(y)\,\frac{x_{cij} - \mu_c(x)}{\sigma_c(x)} + \mu_c(y),    (1)

\mu_c(x) = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} x_{cij}, \qquad \sigma_c(x) = \sqrt{\frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\left(x_{cij} - \mu_c(x)\right)^2},    (2)

i = 1, 2, \dots, H_c; \quad j = 1, 2, \dots, W_c; \quad c = 1, 2, \dots, C.    (3)
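To make Eqs. (1)-(2) concrete, here is a minimal PyTorch sketch of the AdaIN operation (an illustration, not the authors' released implementation); the small eps added for numerical stability is an assumption and does not appear in Eq. (1).

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """AdaIN as in Eqs. (1)-(2): normalize the content features per channel,
    then rescale and shift them with the style features' per-channel statistics.

    content_feat: (N, C, Hc, Wc) encoder features of the content image, f(Ic)
    style_feat:   (N, C, Hs, Ws) encoder features of the style image,   f(Is)
    """
    mu_c = content_feat.mean(dim=(2, 3), keepdim=True)                       # mu_c(x)
    var_c = content_feat.var(dim=(2, 3), keepdim=True, unbiased=False)       # 1/(HW) as in Eq. (2)
    mu_s = style_feat.mean(dim=(2, 3), keepdim=True)                         # mu_c(y)
    var_s = style_feat.var(dim=(2, 3), keepdim=True, unbiased=False)
    sigma_c = (var_c + eps).sqrt()      # eps for numerical stability (not in the paper)
    sigma_s = (var_s + eps).sqrt()
    return sigma_s * (content_feat - mu_c) / sigma_c + mu_s
```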
3.2 Proposed Extension

The standard AdaIN method applies the style uniformly across the content image. To improve the rendering quality of style transfer we propose to apply the style with different strength in different regions of the content image, depending on their proximity to the camera. Closer regions we consider foreground, which needs to be preserved more, so we stylize it less; conversely, more distant regions we consider background, which can be stylized more.

Uniform stylization strength can be controlled by a hyperparameter α ∈ [0, 1] in the following formula:

\hat{I} = g\big(\alpha f(I_c) + (1 - \alpha)\,\mathrm{AdaIN}(f(I_c), f(I_s))\big),    (4)

since f(I_c) is the original unmodified content encoder representation, whereas AdaIN(f(I_c), f(I_s)) is the fully stylized encoder representation. Since we are interested in spatially variable strength control, we apply the modified formula

\hat{I} = g\big(P \odot f(I_c) + (1 - P) \odot \mathrm{AdaIN}(f(I_c), f(I_s))\big),    (5)

where P ∈ R^{H_c×W_c} is the stylization strength map, with a strength value for each spatial position of the content encoder representation, and ⊙ denotes element-wise multiplication repeated for every channel: {P ⊙ F}_{cij} = P_{ij} F_{cij}.

Algorithm 1 shows the steps for computing the stylization strength map P in formula (5). The MiDaS algorithm produces a proximity map straight away, so for it steps 1-2 are omitted. Max, min and mean operations are taken over all spatial positions and produce a scalar.

Algorithm 1. Stylization strength map estimation.
Input: content image I_c, monocular depth estimation algorithm, size H_c × W_c of the content encoder representation f(·), offset ε ≥ 0, prominence β ≥ 0.
Output: stylization strength map P.
1: Get depth map D for content image I_c
2: Get proximity map P = max D − D
3: Rescale P to content encoder representation size H_c × W_c
4: P := (P − min P)/(max P − min P)
5: P := P − mean P
6: P := 1/(1 + exp(−βP))
7: P := min{P, 1 − ε}

Step 4 ensures that the proximity values are spread over the [0, 1] interval. Step 6 controls the contrast of the map via hyperparameter β: higher β corresponds to more prominent changes around the mean, and β = 0 makes the map constant, reducing the proposed algorithm to uniform stylization as in standard AdaIN. Step 7 constrains the proximity map from above by 1 − ε. Hyperparameter ε controls the minimal offset from the camera to regions of the image.

For stylization the pre-trained AdaIN encoder/decoder [15,6] and the pre-trained depth network [13] are used. A computational advantage of our method is that it is learning-free: given a pretrained encoder, decoder and depth estimation network, the method does not require additional training for new styles. We name our algorithm Depth-Aware Adaptive Instance Normalization (DA-AdaIN for short).
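To make Algorithm 1 and Eq. (5) concrete, the sketch below shows one way the strength map and the depth-aware blend could be implemented in PyTorch. It is an illustration under stated assumptions, not the authors' reference code: a proximity map (larger = closer, e.g. an inverse-depth output such as MiDaS) is assumed to be given, the default values beta=20 and eps=0.15 follow the recommendations in the paper, and adain refers to the helper sketched in Sec. 3.1.

```python
import torch
import torch.nn.functional as F

def strength_map(proximity: torch.Tensor, size, beta: float = 20.0, eps: float = 0.15) -> torch.Tensor:
    """Algorithm 1 (steps 3-7): turn a 2-D proximity map (larger = closer)
    into a stylization strength map P of shape (1, 1, Hc, Wc)."""
    p = proximity[None, None]                                                  # (1, 1, H, W)
    p = F.interpolate(p, size=size, mode="bilinear", align_corners=False)      # step 3: rescale
    p = (p - p.min()) / (p.max() - p.min() + 1e-8)                             # step 4: spread to [0, 1]
    p = p - p.mean()                                                           # step 5: center around the mean
    p = torch.sigmoid(beta * p)                                                # step 6: contrast via beta
    return torch.clamp(p, max=1.0 - eps)                                       # step 7: cap by 1 - eps

def da_adain(content_feat, style_feat, proximity, beta=20.0, eps=0.15):
    """Eq. (5): blend content features and AdaIN-stylized features with
    spatially variable weights; closer regions keep more of the content."""
    p = strength_map(proximity, content_feat.shape[-2:], beta, eps)
    stylized = adain(content_feat, style_feat)        # adain() as sketched in Sec. 3.1
    return p * content_feat + (1.0 - p) * stylized    # pass the result through the decoder g(...)
```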
4 Style Transfer Evaluation

4.1 Dependence on major parameters

The proposed algorithm has two hyperparameters: β > 0 controls the prominence of the proximity map around its mean value, and ε ∈ [0, 1] controls the minimal offset of the image regions from the camera. To study the impact of these parameters on the stylization result we use the content and style images shown in fig. 3.

Fig. 3. Content and style images used in the parameter study.

Fig. 4 shows how the style transfer output depends on the contrast parameter β. Higher values increase the contrast (spread) of the proximity map values, while ensuring that they stay inside the [0, 1] interval.

Fig. 4. Style transfer result depending on depth contrast parameter β, ε = 0.

Fig. 5 shows the dependence of the style transfer results on the proximity offset parameter ε. The lower this offset is, the closer proximity values may approach one in certain regions, forcing more content reconstruction and less style transfer in those regions. Higher values of ε ensure that all image regions keep a certain distance from the camera, guaranteeing a higher minimal amount of stylization.

Fig. 5. Style transfer result depending on proximity offset parameter ε, β = 20.

4.2 Qualitative Comparison with AdaIN

Side-by-side comparisons of style transfer results produced by the standard AdaIN method and the proposed DA-AdaIN modification are visualized in fig. 6. For DA-AdaIN we used ε = 0.15 and β = 20. The comparisons show that the proposed method is capable of detecting closer objects and highlighting them by applying the style with smaller strength. Closer objects are generally more important for the viewer, and this strategy preserves them better through less pronounced stylization, which improves the rendering. However, if this approach is applied too strongly, it may create a noticeable disagreement between foreground and background rendering, as can be seen in the last row of fig. 6, where a large proximity contrast between the foreground (the dog) and the background (the grass) makes the foreground look too photorealistic in a strongly stylized context. To alleviate this issue we suggest increasing the offset ε or decreasing the contrast β.

Fig. 6. Comparison of style transfer results for AdaIN and the proposed DA-AdaIN method.

4.3 User Evaluation Study

Procedure. To provide a more general comparison of the style transfer methods we conducted a user evaluation study, in which 18 users were asked to complete a survey. The survey consisted of 20 image pairs, corresponding to stylizations by the AdaIN and DA-AdaIN methods presented in random order, and for each pair the users had to select the stylization they liked more. In total 360 responses were collected. For a set of different style and content images all possible stylizations were generated. The contents were selected to contain objects at different proximities to the camera; otherwise the results of the two methods are indistinguishable. A random subset of 20 results was selected for the survey. Contents were resized so that their smaller side is 1000 pixels, and styles were resized so that their smaller side is 300 pixels. We did not tell the respondents anything about the depth preservation concept or the details of our algorithm. Although the performance of our method could be further improved by fine-tuning the parameters ε and β for each individual image, we fixed them to generally reasonable values ε = 0.15 and β = 20, to put the proposed method on an equal footing with the baseline.

Results. The results of the user evaluation study are presented in Table 2. Our method is preferred moderately more often than the existing AdaIN method, and this difference is statistically significant at the 99% confidence level according to an exact binomial test.

Table 2. Results of the user evaluation study.

Experiment                                Ours vs AdaIN
# image pairs                             20
# respondents                             18
# responses                               360
# votes for proposed method               207
proportion of votes for proposed method   57.5%
std. deviation of proportion              2.6%
p-value (exact binomial test)             0.0026

Discussion. During the study it was found that DA-AdaIN was not very sensitive to proximity variability, so the contrast had to be additionally increased by step 6 of Algorithm 1. For contents with objects very close to the camera, the proximity of those objects became close to one and they received almost no stylization, so the offset in step 7 of Algorithm 1 was introduced to maintain a certain guaranteed level of stylization for all parts of the image. These modifications ensured better rendering quality on average. We recommend β = 20 and ε = 0.15. For a particular content and style pair the result may be improved further by manual tuning of the β and ε parameters.
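As a quick check of the significance figures in Table 2 (207 of 360 votes for the proposed method), the sketch below recomputes the test with SciPy. The paper does not state the alternative hypothesis, so a one-sided exact binomial test against the no-preference null is assumed here; this assumption appears consistent with the reported p-value of 0.0026.

```python
from scipy.stats import binomtest

votes_for_proposed, total = 207, 360   # values from Table 2

# Exact binomial test against the no-preference null p = 0.5;
# the one-sided alternative is an assumption, not stated in the paper.
result = binomtest(votes_for_proposed, n=total, p=0.5, alternative="greater")

proportion = votes_for_proposed / total                     # 0.575
std_dev = (proportion * (1 - proportion) / total) ** 0.5    # ~0.026
print(f"proportion={proportion:.3f}, std={std_dev:.3f}, p-value={result.pvalue:.4f}")
```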
5 Conclusion

An extension of the AdaIN method that preserves depth information from the content image is proposed. All other benefits of AdaIN are preserved, namely fast real-time stylization and the ability to transfer an arbitrary style at inference time without additional training of the model. Qualitative analysis reveals that the proposed method is capable of preserving information about the proximity of objects in the stylized image, and the results of the user evaluation study confirm that depth preservation is important for users, making them prefer our method more often than the conventional AdaIN method.

References

1. Chen, W., Fu, Z., Yang, D., Deng, J.: Single-image depth perception in the wild. In: Advances in Neural Information Processing Systems. pp. 730-738 (2016)
2. Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2414-2423 (2016)
3. Ghiasi, G., Lee, H., Kudlur, M., Dumoulin, V., Shlens, J.: Exploring the structure of a real-time, arbitrary neural artistic stylization network. arXiv preprint arXiv:1705.06830 (2017)
4. Godard, C., Mac Aodha, O., Firman, M., Brostow, G.J.: Digging into self-supervised monocular depth prediction (October 2019)
5. Gooch, B., Gooch, A.: Non-photorealistic rendering. AK Peters/CRC Press (2001)
6. Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1501-1510 (2017)
7. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: European Conference on Computer Vision. pp. 694-711. Springer (2016)
8. Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H.: Universal style transfer via feature transforms. In: Advances in Neural Information Processing Systems. pp. 386-396 (2017)
9. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. pp. 740-755. Springer (2014)
10. Liu, X.C., Cheng, M.M., Lai, Y.K., Rosin, P.L.: Depth-aware neural style transfer. In: Proceedings of the Symposium on Non-Photorealistic Animation and Rendering. p. 4. ACM (2017)
11. Menze, M., Geiger, A.: Object scene flow for autonomous vehicles. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3061-3070 (2015)
12. Nichol, K.: Painter by numbers, WikiArt (2016)
13. Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. arXiv preprint arXiv:1907.01341 (2019)
14. Rosin, P., Collomosse, J.: Image and Video-Based Artistic Stylisation, vol. 42. Springer Science & Business Media (2012)
15. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
16. Strothotte, T., Schlechtweg, S.: Non-photorealistic computer graphics: modeling, rendering, and animation. Morgan Kaufmann (2002)
17. Ulyanov, D., Vedaldi, A., Lempitsky, V.: Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022 (2016)