<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Creating 3D Diorama from Single Image with Deep Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martin Vejbora</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Šikudová</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Mathematics and Physics, Charles University</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Creating 3D scenes is a time-consuming task that requires experience with modeling software. This paper presents a novel approach that combines neural models for panoptic segmentation and monocular depth estimation to construct dioramas. While previous research has explored generating dioramas from single images, to the best of our knowledge, there is no research utilizing deep learning techniques for the task. This paper provides an analysis of existing approaches to diorama generation. We then describe the construction of the diorama, where objects identified by segmentation are separated into distinct images with transparent backgrounds. These images are then placed in a 3D scene, arranged to reflect the estimated depth of each object. We also address several challenges that had to be overcome. Specifically, we employed fine-tuning to address the limitations of the available depth model when applied to outdoor scenes. Our method has been implemented as an add-on for the open-source 3D software Blender, utilizing neural models in the ONNX format for depth and segmentation inferences.</p>
      </abstract>
      <kwd-group>
        <kwd>deep learning</kwd>
        <kwd>diorama</kwd>
        <kwd>Blender</kwd>
        <kwd>panoptic segmentation</kwd>
        <kwd>monocular depth estimation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Creating 3D environments in modeling software can be a repetitive and time-consuming task. However, lower-quality models are usually sufficient for assets in the background and further away from the camera. This is where the use of automated tools can come in handy.</p>
      <p>This paper focuses on creating dioramas, which are sets of planes placed in a 3D scene to evoke the perception of depth. They are computationally cheap to render since they do not use any complex mesh, making them suitable for background assets. Dioramas work best when the camera is facing them, moving slightly, and viewing the diorama from slightly different angles. The effect breaks when the diorama is viewed from the side.</p>
      <p>Previous works used traditional machine learning techniques to create dioramas, limiting their usage to hazy input images, outdoor scenes, or images with zero or one vanishing point. Moreover, their implementations were either not published or are now outdated and no longer functional, making them impractical to use.</p>
      <p>We study the utilization of deep learning to automate the process of creating dioramas. Our implementation uses a pre-trained state-of-the-art model for panoptic segmentation and a competitive model for depth estimation that we fine-tune for outdoor scenes. The selected models are powerful yet small enough to run on standard computers or notebooks. Since research in deep learning has progressed rapidly in recent years, we pay attention to designing the implementation so that better models can easily be used in the future.</p>
      <p>We implement our method as an add-on for the free 3D software Blender<sup>1</sup>, which supports all three major platforms: Linux, Windows, and Mac. The add-on strives to be easy to use: the user selects an input image, and the add-on automatically creates a diorama from it without any further manual steps. Even though the quality of the resulting diorama varies based on the input image, our approach has weaker constraints on the input images than the previous works.</p>
      <p>This paper is structured as follows. Section 2 provides an overview of existing work on automatic diorama creation. Section 3 discusses the framework and models used, with a focus on fine-tuning the depth model. The same section then covers the implementation of the add-on and the most significant design choices. Section 4 compares the results of the original and fine-tuned depth models and shows the visual appearance of the diorama. Furthermore, it discusses the strengths and weaknesses of our solution. Finally, Section 5 summarizes what was achieved and outlines potential areas for future work.</p>
      <p>2. Related Work</p>
      <p>Based on the research of human depth perception, Assa and Wolf [<xref ref-type="bibr" rid="ref1">1</xref>] define depth cues, including partial occlusion, texture density analysis, depth of focus, atmospheric scattering, and object height in the visual field. They use segmentation to obtain 10-20 major segments per image and smaller patches called superpixels. They estimate relative depth differences among objects by comparing depth cues between superpixels on the borders or inside of bigger segments.</p>
      <p>Having defined a new viewing point, the authors render a novel image that occludes certain parts of the original image. They also use image completion techniques to inpaint the previously occluded areas that become visible. Their approach yields the best results for outdoor scenes with minimal regular patterns or straight lines.</p>
      <p>Similarly to the previous approach, Make3D [<xref ref-type="bibr" rid="ref2">2</xref>] uses segment patches and defines depth cues both within and between these patches. Instead of estimating a depth map, the authors build a 3D mesh from planes to represent the scene from an input image. They train a Markov Random Field (MRF) to model relationships between adjacent patches. The MRF infers locations and rotations of segment planes in three-dimensional space. This inference is conditioned on over 500 local features computed from each patch, along with various relationships computed between patches. These inter-patch relationships involve advanced edge detection or estimation of co-planarity and co-linearity. Since the output of the algorithm is a whole textured mesh, it allows easy synthesis of novel views.</p>
      <p>The research paper PEEP: Perceptually Enhanced Exploration of Pictures [<xref ref-type="bibr" rid="ref3">3</xref>] focuses on images with zero or one vanishing point. PEEP maps an image to 5 planes forming a pyramidal frustum to achieve a plausible 3D effect. Similarly to the previous approaches, the first step obtains segmentation patches. These patches correspond to planes in three-dimensional space. A graph-cut strategy on the patches is used to fit points representing the frustum. For images with zero or one vanishing point, the authors claim their result is visually more plausible, even though geometrically less precise, than the one created by Make3D [<xref ref-type="bibr" rid="ref2">2</xref>].</p>
      <p>Zhao et al. [<xref ref-type="bibr" rid="ref4">4</xref>] limit their depth estimation to a single depth cue, atmospheric scattering. They use the Dark Channel Prior dehazing algorithm to compute depth. The authors cluster the depth and radiance outputs of the dehazing process, obtaining approximately five segments per image. After estimating the depth and segmentation, the position and orientation of segment planes are computed. Segmented alpha planes are placed behind each other to form the resulting diorama.</p>
      <p>A significant portion of the article is dedicated to enhancing the visual appeal of segmented images. The authors blend the alpha channel of segment edges to create smoother transitions between planes. Additionally, areas of the photographed scene that were not visible in the original image are filled with inpainting. Prior to the actual inpainting, a few border pixels of the segment edge are removed using erosion to prevent misclassified pixels from affecting the inpainting algorithm. These misclassified pixels often have colors different from the color of the main object within the segment.</p>
      <p>The main drawback of the described algorithm is that it can only be applied to hazy images. This limitation comes from the depth estimation algorithm used.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Proposed Solution</title>
      <p>All the approaches described in Section 2 use some form of segmentation and depth estimation. With recent advancements in neural networks, state-of-the-art solutions for both tasks now use deep learning. However, to the best of our knowledge, no publicly available research exists in which the authors create a diorama using deep learning.</p>
      <p>While monocular depth estimation is an unambiguous task, image segmentation is usually categorized into one of three main tasks: instance, semantic, and panoptic segmentation. In instance segmentation, the objective is to identify all instances of given object classes and to determine masks for individual objects. Semantic segmentation assigns a label category to every pixel in an image while not distinguishing multiple instances of a class. Panoptic segmentation, proposed by Kirillov et al. [<xref ref-type="bibr" rid="ref5">5</xref>], unifies semantic and instance segmentation by introducing two types of objects: things and stuff. Things include countable objects like cars or people, where each instance needs a distinct label. Stuff refers to uncountable or amorphous regions like grass or sky, where it is not possible or desired to distinguish individual instances. Similarly to semantic segmentation, panoptic segmentation labels all image pixels, which makes it the most suitable for our use case. A comprehensive survey of methods for all three segmentation tasks can be found in [<xref ref-type="bibr" rid="ref6">6</xref>].</p>
      <p>Our algorithm takes the input image, cuts objects detected by panoptic segmentation into separate images, and places these images behind one another based on their average depth predicted by the monocular depth estimation model. This approach is illustrated in Figure 1.</p>
      <p>[Figure 1 labels: Segmentation Model; Depth Model]</p>
      <p>Re-implementing a state-of-the-art model based on its research paper can be challenging. Also, with transformer neural networks rapidly developing, new state-of-the-art models for datasets like ADE20K [<xref ref-type="bibr" rid="ref7">7</xref>] or NYUv2 [<xref ref-type="bibr" rid="ref8">8</xref>] appear even multiple times a year. Therefore, better models will likely be available for our tasks in the future, requiring us to re-implement the code again.</p>
      <p>For these reasons, we have decided to use a high-level framework called HuggingFace<sup>2</sup> that contains implemented models, including pre-trained weights that can be downloaded from the HuggingFace hub<sup>3</sup>. HuggingFace has a large community, well-documented code, and a lot of online resources. At the time of writing, it had over 80,000 stars on GitHub. Most of its models are implemented in PyTorch, but some also have a TensorFlow or JAX version.</p>
      <p>HuggingFace contains very capable models for both of our tasks. The best panoptic model is OneFormer [<xref ref-type="bibr" rid="ref9">9</xref>], the state-of-the-art model for panoptic segmentation on the ADE20K dataset according to the paperswithcode.com ranking<sup>4</sup> at the time of implementing our add-on. The best depth estimation model from HuggingFace is GLPN [<xref ref-type="bibr" rid="ref10">10</xref>], ranked 7th on the NYU v2 dataset<sup>5</sup>. We describe the details of these models in the following sections.</p>
      <sec id="sec-2-1">
        <title>3.1. Panoptic Segmentation</title>
        <p>Jain et al. [<xref ref-type="bibr" rid="ref9">9</xref>] introduced OneFormer, a model that unifies instance, semantic, and panoptic segmentation tasks. OneFormer achieves state-of-the-art results on all three tasks after being trained only once, on all three simultaneously.</p>
        <p>OneFormer takes two inputs: an RGB image and a text token. The token determines whether OneFormer executes instance, semantic, or panoptic segmentation. The model’s architecture is based on Mask2Former [<xref ref-type="bibr" rid="ref11">11</xref>], and it consists of three main parts: an encoder-decoder backbone for extracting hierarchical features from the input image, a query module that computes object queries from the input image, and a decoder head that predicts the class and mask for each object query.</p>
        <p>To extract multi-scale features from the input image, OneFormer uses a Swin [<xref ref-type="bibr" rid="ref12">12</xref>] backbone encoder and a multi-scale deformable transformer [<xref ref-type="bibr" rid="ref13">13</xref>] as a pixel decoder. The pixel decoder leverages a deformable attention module that limits attention to a local neighborhood, mimicking the inductive bias of convolutions. Like a typical hierarchical decoder, it gradually upsamples the backbone features with the aid of skip connections from the encoder layers of corresponding spatial resolutions. The pixel decoder extracts features at 1/4, 1/8, 1/16, and 1/32 of the input resolution.</p>
        <p>The query formulation module combines the task type input "the task is {task}" in a 2-layer transformer with 1/4-scale features from the pixel decoder to generate query tokens Q. Each query token represents a potential object or segment in the input image. These query tokens are later passed to the transformer decoder, which verifies that they correspond to an actual object, classifies them, and creates a mask for them.</p>
        <p>The last part of OneFormer’s architecture is the transformer decoder with classification and mask heads. The inputs of the transformer decoder are the object queries Q, which are repeatedly combined with multi-scale features from the pixel decoder. The transformer decoder consists of a masked cross-attention, followed by a self-attention and a feed-forward network, repeated <italic>L</italic> times for each of the 1/8, 1/16, and 1/32 pixel feature scales. The resulting features are then passed to the classification and mask heads.</p>
        <p>The classification head predicts a class or no-object for each query token. The mask head, on the other hand, computes a binary mask using pixel features at 1/4 of the resolution of the original image.</p>
      </sec>
      <sec id="sec-2-2">
        <p><sup>2</sup> https://huggingface.co/</p>
        <p><sup>3</sup> https://huggingface.co/models</p>
        <p><sup>4</sup> https://paperswithcode.com/sota/panoptic-segmentation-on-ade20k-val</p>
        <p><sup>5</sup> https://paperswithcode.com/sota/monocular-depth-estimation-on-nyu-depth-v2</p>
      </sec>
      <sec id="sec-2-3">
        <p>HuggingFace contains versions of OneFormer with a Swin backbone [<xref ref-type="bibr" rid="ref12">12</xref>] pre-trained on the Cityscapes [<xref ref-type="bibr" rid="ref14">14</xref>], ADE20K [<xref ref-type="bibr" rid="ref7">7</xref>], and COCO [<xref ref-type="bibr" rid="ref15">15</xref>] datasets. Figure 2 compares them on an indoor and an outdoor scene. We observe that the Cityscapes version yields competitive results on outdoor scenes but does not work at all on indoor scenes (Figure 2d). This is because the dataset contains annotations for only 30 classes, all related to autonomous driving. On the other hand, models trained on ADE20K and COCO produce comparable results, likely due to the similar structure of both datasets. We choose the COCO version because of its more permissive license.</p>
      </sec>
      <sec id="sec-2-4">
        <p>GLPN uses a hierarchical transformer encoder from SegFormer [<xref ref-type="bibr" rid="ref16">16</xref>] to extract features from the input image at four different resolution levels. Each level’s encoder block comprises several reduced self-attention and MLP-Conv-MLP modules with residual skip connections. The final component of each encoder block is the patch embedding layer, which employs overlapped convolution with stride to reduce the spatial shape of hierarchical features while increasing the number of channels.</p>
        <p>A lightweight decoder is connected to the encoder on multiple resolution layers through a Selective Feature Fusion (SFF) module. This module enhances global features with fine details of the local structures that may have been lost in the later encoder steps. SFFs connect the encoder with the decoder, allowing the decoder to access both the global path from the encoder and the local path through the skip connections. SFF computes a two-channel attention map where the input global features are multiplied by one channel and the local features by the other. These multiplications are element-wise along the channel dimension. Finally, the resulting scaled global and local features are added element-wise.</p>
        <p>GLPN applies a sigmoid as the last step, which scales the depth output to the range [0, 1]. The result is multiplied by the desired maximal depth in meters, which is specific to each dataset.</p>
      </sec>
      <sec id="sec-2-5">
        <p>HuggingFace offers two versions of GLPN, pre-trained on either the NYUv2 [<xref ref-type="bibr" rid="ref8">8</xref>] or the KITTI [<xref ref-type="bibr" rid="ref17">17</xref>] dataset. A comparison of their inference is shown in Figure 3.</p>
        <p>We choose the NYUv2 version as it produces more consistent depth maps. The KITTI-trained model produces artifacts, such as a brighter stripe at the top part of the indoor scene in Figure 3e or inconsistent depth estimates of buildings in the left part of the outdoor scene in Figure 3b. The top parts of the higher apartment building and the smaller buildings are estimated to be closer (darker values) than the lower parts at the same real distance. We attribute these artifacts to the structure of the KITTI dataset, which only contains images captured from a car, so the model does not generalize well to varying scenes. For instance, almost all KITTI images have sky at the top, and the sky does not have any valid depth values. Thus, the model cannot learn anything there.</p>
      </sec>
      <sec id="sec-2-6">
        <sec id="sec-2-6-1">
          <title>3.2.1. Fine-tuning</title>
        </sec>
      </sec>
      <sec id="sec-2-7">
        <p>As shown in Figure 3, the selected model trained on NYUv2 performs well on indoor scenes, and despite never seeing any depth-annotated outdoor scenes, it generalizes surprisingly well to them. However, there are still some inconsistencies. For example, in Figure 3c, the two smaller buildings are estimated to be further away than the high apartment building behind them. To improve the quality of our diorama on outdoor scenes, we decide to fine-tune the model on the DIODE dataset [<xref ref-type="bibr" rid="ref18">18</xref>], which contains both indoor and outdoor scenes. It contains around 17,000 outdoor images and almost 9,000 indoor images, with all depth maps obtained using a laser scanner. Figure 4 shows an example of RGB images, depth maps, and binary validity masks, which mark invalid depth values in black.</p>
        <p>where $d_i = \log y_i - \log y_i^*$, $y$ denotes a predicted depth map, $y^*$ a ground truth, $n$ the total number of pixels, and $i$ a pixel index. The authors show this metric is invariant to the global scale of the predicted and ground truth depth maps for $\lambda = 1$ [...] obtained them after experimenting with various settings.</p>
        <p>Table 1: Analysis of the training split of the DIODE dataset [<xref ref-type="bibr" rid="ref18">18</xref>], presenting minimum, average, maximum, and quantile values. Columns: Minimum, Average, Maximum, and the quantiles .001, .01, .10, .25, .50, .75, .90, .95, .99, .999.</p>
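        <p>For reference, the scale-invariant log error introduced in [<xref ref-type="bibr" rid="ref19">19</xref>] is commonly written as shown below, with $d_i = \log y_i - \log y_i^*$ the per-pixel log difference between prediction $y$ and ground truth $y^*$ over $n$ pixels. This is a sketch of the usual textbook form; whether the authors use exactly this variant (for example, with or without a square root around the expression) is an assumption.</p>

```latex
D(y, y^*) = \frac{1}{n} \sum_{i} d_i^2 \;-\; \frac{\lambda}{n^2} \Bigl( \sum_{i} d_i \Bigr)^2,
\qquad d_i = \log y_i - \log y_i^*
```

        <p>Multiplying the prediction by a constant adds the same offset to every $d_i$. With $\lambda = 1$ the expression equals the variance of the $d_i$, so the offset cancels; with $\lambda = 0$ it reduces to the mean squared log error, which does depend on the absolute scale.</p>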
        <p>
          First, we need to adjust the depth range predicted
by the GLPN model. The available pre-trained model
outputs values in the 10-meter range as it was trained
on the NYUv2 dataset, which has a maximal distance
of 10 meters. On the contrary, our DIODE dataset was
obtained using a laser scanner with a maximal range
of 350 meters. We compensate for this difference by
adjusting the final scale, which multiplies the output of
the decoder sigmoid. A straightforward choice would be
to multiply the result by 350. For example, the authors
of the dataset also construct their baseline model [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]
to output depth values from 0 to 350 meters. However,
we achieve slightly better results using a smaller range,
and we argue it is sufficient. When analyzing the official
training split of the dataset, we found that more than
99.9% of the depth values of the joint indoor and outdoor
parts are smaller than 150 meters. Thus, we can safely
use 150 as the maximum depth value without limiting the
model too much. We hypothesize that this utilizes the
output range better, as well as the slope of the sigmoid
for computing gradients. The full analysis of the training
split of the DIODE dataset can be found in Table 1.
        </p>
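        <p>The two ideas in this paragraph can be sketched as follows: pick the maximum depth from a high quantile of the training depths, then rescale the sigmoid output by that maximum. The function names <monospace>quantile</monospace> and <monospace>scaled_sigmoid</monospace> and the sample depths are hypothetical and only illustrate the reasoning; they are not the authors' code or data.</p>

```python
import math


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list, q between 0 and 1."""
    pos = q * (len(sorted_values) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def scaled_sigmoid(logit, max_depth):
    """Map a raw decoder output to a depth between 0 and max_depth meters."""
    return max_depth / (1.0 + math.exp(-logit))


# Hypothetical depth samples in meters (not taken from DIODE).
depths = sorted([0.8, 2.5, 4.0, 7.5, 12.0, 30.0, 55.0, 90.0, 140.0, 300.0])

# The paper inspects the .999 quantile of the real data; a toy list this
# small is better illustrated with a lower quantile.
q90 = quantile(depths, 0.90)

# The same raw output maps to different depths depending on the chosen range.
for max_depth in (10.0, 150.0, 350.0):
    print(max_depth, scaled_sigmoid(0.0, max_depth))
```

        <p>A smaller maximum spreads the useful depth values over a larger part of the sigmoid's output range, which matches the authors' hypothesis for why 150 works slightly better than 350.</p>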
        <p>During training, when we feed the model with images,
we use data augmentation to improve its generalization
capabilities. We limit the augmentation techniques to a
subset of those used in the original paper. Specifically,
we apply horizontal flipping with a 50% probability and
make random adjustments to brightness (±0.2), contrast
(±0.2), hue (±0.2), and saturation (±0.3).</p>
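        <p>A minimal sketch of how such augmentation parameters might be drawn, using only the Python standard library. The ranges come from the text; the helper names, the additive interpretation of the brightness range, and the choice to flip the depth map together with the image while applying color changes to the image only are assumptions for illustration.</p>

```python
import random


def sample_augmentation(rng):
    """Draw one set of augmentation parameters (ranges taken from the text)."""
    return {
        "flip": rng.random() > 0.5,  # horizontal flip with about 50% probability
        "brightness": rng.uniform(-0.2, 0.2),
        "contrast": rng.uniform(-0.2, 0.2),
        "hue": rng.uniform(-0.2, 0.2),
        "saturation": rng.uniform(-0.3, 0.3),
    }


def hflip(image, depth):
    """Mirror image and depth map together so that pixels stay aligned."""
    return [row[::-1] for row in image], [row[::-1] for row in depth]


def adjust_brightness(image, delta):
    """Shift intensities by delta and clamp them to the range 0..1."""
    return [[min(1.0, max(0.0, v + delta)) for v in row] for row in image]


rng = random.Random(0)
params = sample_augmentation(rng)

image = [[0.1, 0.5, 0.9]]   # toy one-row grayscale image
depth = [[1.0, 2.0, 3.0]]   # matching toy depth map in meters

if params["flip"]:
    image, depth = hflip(image, depth)
# Photometric changes touch the image only; depth values stay untouched.
image = adjust_brightness(image, params["brightness"])
```

        <p>In a real training loop the same geometric transform has to be applied to the image, the depth map, and the validity mask, otherwise the supervision no longer lines up with the pixels.</p>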
        <p>In the original research paper, the authors use training images with a resolution of 576×448. We have conducted experiments with various resolutions, including the original sizes of NYUv2 (640×480) and DIODE (1024×768) images. However, we have found that changing the resolution does not significantly impact the accuracy. As a result, we use a resolution of 640×480 for most of our experiments. We attribute this robustness to the fact that the SegFormer [<xref ref-type="bibr" rid="ref16">16</xref>] backbone in GLPN does not rely on fixed positional encodings concatenated to the input patches. A different image resolution only changes the number of patches but not the encoded value of an input patch.</p>
        <p>For most of our experiments, we use a batch size of 8, the maximum batch size at which training on 640×480 images fits in the 24 GB memory of the NVIDIA Titan RTX GPU that we mainly use.</p>
        <p>In the original paper, the authors use the polynomial learning rate schedule with a factor of 0.9, which increases the learning rate from $3 \times 10^{-5}$ to $1 \times 10^{-4}$ in the first half of training and then decreases it from $1 \times 10^{-4}$ to $3 \times 10^{-5}$ in the second half. Accordingly, we employ a learning rate schedule where the learning rate first increases and then decreases. We use a standard PyTorch implementation of the 1cycle learning rate policy with a peak learning rate of $1 \times 10^{-4}$.</p>
        <p>3.3. Blender Add-on</p>
        <p>In this section, we describe the implementation of our Blender add-on, the design decisions we have made, and some adjustments that the add-on makes to improve the visual appearance of the final result.</p>
        <p>Blender<sup>6</sup> is a powerful open-source software for 3D graphics released under the GNU General Public License (GPL). It supports a wide range of graphics-related tasks, including modeling, still image rendering, and animation creation. Blender is a cross-platform application that can be run on Linux, Windows, and Mac computers. Although Blender is mostly developed in C++, it also provides a Python API, allowing add-ons to be developed.</p>
        <sec id="sec-2-7-1">
          <title>3.3.1. Model Deployment</title>
          <p>Both our HuggingFace models, GLPN and OneFormer, are implemented in PyTorch. While trained models can run in a native PyTorch environment, using it for production has some disadvantages.</p>
          <p>Firstly, users of our add-on would need to download the large PyTorch Python module, which can take up several gigabytes of disk space. On a testing Windows machine, the installed PyTorch occupies about 1 GB, and the size increases to 4 GB for the GPU version with CUDA support. Another disadvantage is that PyTorch natively uses eager execution mode, which is convenient for developing models but slower than graph mode, where a computational graph of all operations is constructed before execution, allowing for powerful optimizations.</p>
          <p>To solve some of these issues, PyTorch offers TorchScript, a statically typed subset of the Python language that is better suited for deployment. Models implemented in PyTorch can be converted into a computational graph in TorchScript format using the torch.jit.trace() or torch.jit.script() methods. The TorchScript graph representation can be compiled just before execution and run using PyTorch JIT, which further optimizes models using runtime information. There are also ahead-of-time compilers for TorchScript, such as the TensorRT compiler for NVIDIA GPUs.</p>
          <p>Machine learning models can also be deployed in the Open Neural Network Exchange (ONNX) format. ONNX was created as a format for interoperability between different frameworks. Both PyTorch and TensorFlow offer methods for converting models to the ONNX format. PyTorch models are converted using the torch.onnx.export() method.</p>
          <p>Multiple runtimes exist for models in ONNX; the most used one is onnxruntime, maintained by Microsoft. It enables models to run on Windows, Linux, and Mac, as well as in a web browser or on mobile devices. With all its dependencies, onnxruntime only requires around 600 MB for the GPU version and 150 MB for the CPU-only version. onnxruntime also provides options for optimizing models, including quantization, which reduces computation precision, making models smaller and faster.</p>
          <p>We have chosen to use ONNX in our add-on because it is easy to convert models into this format and it allows us to experiment with optimizing inference speed in the future. Using ONNX also enables us to decouple the add-on code from PyTorch, so if a better model becomes available, we can simply convert it to ONNX even if it is implemented in TensorFlow. The new model could then be used in the add-on without needing to refactor the code or requiring users to install another runtime.</p>
        </sec>
        <sec id="sec-2-7-2">
          <title>3.3.2. Creating a Diorama</title>
          <p>The workflow of the add-on begins with the user selecting an input image. Depth and segmentation models in the ONNX format are then used to perform inferences. An example of the input image, along with the depth map and the panoptic segmentation mask generated inside the add-on, is shown in Figure 5.</p>
          <p>Figure 5: Input image with inferences of deep learning models inside the add-on. (a) Input image. (b) Segmentation mask. (c) Depth map.</p>
        </sec>
      </sec>
      <sec id="sec-2-8">
        <p><sup>6</sup> https://www.blender.org/</p>
      </sec>
      <sec id="sec-2-9">
        <p>The input image is cut along the segment borders, resulting in a set of images where each contains only one object from the panoptic map, and the rest of the pixels are transparent. A 2D plane is spawned in a Blender scene for each segmented image, and the planes are textured with the segmented images. The planes are positioned behind each other, based on the average depth of their segments, and scaled to match the camera’s perspective. The more distant planes appear larger, creating a sense of depth, as shown in Figure 6a.</p>
        <p>Figure 6: (a) No inpainting. (b) Inpainted holes from the foreground objects.</p>
        <p>3.3.3. Cutout Inpainting</p>
        <p>Figure 6a shows that the basic diorama created as described above has artifacts that disrupt the depth perception. The most noticeable artifacts are the holes from foreground objects when the diorama is viewed from an angle. We address this issue with inpainting, which fills in missing parts of the image based on existing parts. We experiment in the add-on with multiple inpainting methods; however, the best results are usually achieved with the inpainting algorithm available in Blender’s compositor.</p>
        <p>Blender’s inpainting algorithm starts at an image edge and gradually spreads the color of the edge to the more distant pixels. However, inpainting performed straight from the edge is prone to artifacts, as shown in Figure 7, where dark pixels from a segmented mountain are inpainted into the sky. Especially at complex boundaries, edges very often contain pixels from the occluding objects. To avoid these artifacts, we apply erosion to remove pixels from the borders of segments before computing the inpainting. In our add-on, the width of the removed edges is empirically set to 3 pixels. Figure 8 shows a comparison between inpainting with and without erosion, while an example of the final diorama with inpainting applied is shown in Figure 6b.</p>
        <sec id="sec-2-9-1">
          <title>3.3.4. Depth of the Sky</title>
          <p>Another issue that arises in our diorama creation is the incorrect depth estimation of the sky. We can see in Figure 5c that the sky is estimated to be closer than the building, as it has darker values in the depth map. This happens because the distance of the sky cannot be learned from real-world datasets. The distance of the sky cannot be measured, and it would effectively need to be infinite compared to other distances in the image. The DIODE dataset [<xref ref-type="bibr" rid="ref18">18</xref>], which we use for fine-tuning, is no exception, and it contains masks indicating invalid depth values for the sky in ground-truth maps.</p>
          <p>We use the segmentation model already available in our add-on to address this issue. The panoptic mask contains the classified pixels of the sky if it is present in the input image. We then move the plane with the sky segment behind all other segments to create a more realistic representation of the scene.</p>
        </sec>
      </sec>
      <sec id="sec-2-10">
        <title>4. Results and Discussion</title>
        <p>In this section, we discuss the performance of our fine-tuned GLPN model and compare its results with the original model trained on the NYUv2 dataset [<xref ref-type="bibr" rid="ref8">8</xref>]. We also analyze the dioramas created by our solution and discuss its strengths and weaknesses.</p>
        <p>4.1. Results of Fine-tuning</p>
        <p>
          We use the official validation split of the DIODE
dataset [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], which contains both indoor and outdoor
scenes, to compare the two depth models. A challenge
with the comparison is that the original pre-trained
version of GLPN outputs depth values between 0 and 10
meters, while DIODE’s measured depth values go up to
350 meters. To select the right method for the
comparison, we hypothesize that the trained GLPN model has
either a dominant notion of absolute depth or a dominant
notion of relative distances. If the dominant notion is
absolute depth, the model would correctly estimate the
depth of objects closer than 10 meters and return the
maximum value for everything further away. On the
other hand, if the dominant notion is relative distances,
the model would estimate which objects are closer than
others correctly, even in outdoor scenes. We can see in
Figure 9 that the second option is prevalent. For example,
the outdoor scene in Figure 9b shows a building that is
further than 10 meters and still not estimated as the
maximal depth. Moreover, Figure 9h shows a street nearly
a hundred meters long, for which the NYUv2 model
correctly estimates the relative order of objects as
well as the direction of the depth gradient along the street.
It only compresses the range, estimating a large depth
change over the first few meters and a smaller change further away.
        </p>
        <p>
          Based on this observation, we suggest comparing the
models using the scale-invariant log (SILog) metric with
λ = 1.0 (SILog1.0) [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. With this setting, the metric is invariant to the scale of the
predicted and ground-truth depth maps, which allows
for a fair comparison of models trained on datasets with
different ranges. This is in contrast to training, where we
used SILog with λ = 0.5 to jointly learn relative depth
relations together with absolute depth values.
        </p>
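        <p>To make the role of λ concrete, the following minimal sketch computes the scale-invariant log error in the form given by Eigen et al. [<xref ref-type="bibr" rid="ref19">19</xref>], D = mean(d²) − λ·mean(d)² with d = log(prediction) − log(ground truth). It is not the evaluation code used in this work: the function name silog, the flat-list input, and the synthetic depth values are assumptions made for illustration, and published implementations often additionally take a square root or rescale the result.</p>
        <preformat><![CDATA[
```python
import math
import random


def silog(pred, gt, lam=1.0):
    """Scale-invariant log error over two flat sequences of depths.

    Pixels whose prediction or ground truth is not positive are skipped,
    which mirrors masking of invalid depth values (for example the sky).
    """
    d = [math.log(p) - math.log(g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    n = len(d)
    mean_sq = sum(x * x for x in d) / n
    mean = sum(d) / n
    return mean_sq - lam * mean * mean


# Synthetic data (an assumption of this sketch, not DIODE measurements):
# ground truth up to 300 m and a prediction with 20 % multiplicative noise.
random.seed(0)
gt = [random.uniform(1.0, 300.0) for _ in range(1000)]
pred = [g * random.uniform(0.8, 1.2) for g in gt]

# The same prediction squeezed into a much smaller output range.
squashed = [p / 30.0 for p in pred]

base = silog(pred, gt, lam=1.0)
rescaled = silog(squashed, gt, lam=1.0)   # unchanged: lam = 1 ignores scale
mixed = silog(squashed, gt, lam=0.5)      # larger: lam = 0.5 penalizes scale
```
]]></preformat>
        <p>With λ = 1.0 a global rescaling of the prediction leaves the value unchanged, whereas with λ = 0.5 the same rescaling is penalized, which matches the distinction drawn above between the evaluation and training settings.</p>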
        <p>Table 2 provides results for several metrics, including
SILog1.0. We can see that our fine-tuning significantly
improved SILog1.0 on outdoor scenes. This result is also
supported by the observations from Figure 9. The
ground-truth map in Figure 9a shows that the right part of the
building is further away than the left part; however, the
NYUv2-trained model estimates the whole building as
approximately the same distance in Figure 9b, while our
fine-tuned model estimates it more accurately in Figure 9c.</p>
        <p>On the other hand, SILog1.0 on indoor scenes improved
by a smaller margin which supports the selection of this
metric for comparison since we know that the original
model was already well-trained for indoor scenes. Again,
the indoor scene in Figure 9 supports the similarity of
indoor SILog1.0 metrics by showing that both models
estimate the depth similarly.</p>
        <p>To assess whether the improvement in the SILog1.0 metric from
Table 2 is statistically significant, we applied the
Wilcoxon signed-rank test to the metric values
obtained from the original and fine-tuned models. The
p-values were computed separately for indoor scenes,
for outdoor scenes, and for both sets of scenes
together. The tests on all three sets confirmed that the
improvement in SILog1.0 achieved through our fine-tuning
is statistically significant at the level of α = 0.001.</p>
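        <p>As an illustration of the paired test described above, the sketch below implements a two-sided Wilcoxon signed-rank test with the normal approximation, using only the Python standard library. The text does not state which implementation was used, so the function name wilcoxon_signed_rank and the synthetic per-image metric values are assumptions of this example. In practice a library routine such as scipy.stats.wilcoxon would normally be preferred.</p>
        <preformat><![CDATA[
```python
import math
import random


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (w_plus, p_value). Zero differences are dropped and tied
    absolute differences receive their average rank.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0  # no non-zero differences: nothing to test

    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    var = n * (n + 1) * (2 * n + 1) / 24.0 - tie_term / 48.0
    z = (w_plus - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return w_plus, p_value


# Synthetic per-image metric values (made up for this sketch): the second
# model is slightly better (lower error) on every image.
random.seed(1)
original = [random.uniform(0.2, 0.6) for _ in range(200)]
fine_tuned = [v - random.uniform(0.0, 0.15) for v in original]

w_plus, p_value = wilcoxon_signed_rank(original, fine_tuned)
significant = p_value < 0.001  # significance level used in the text
```
]]></preformat>
        <p>The test is paired: each image contributes one difference between the two models, so it should be run on per-image metric values rather than on dataset averages.</p>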
        <p>Moreover, we can see from Table 2 that other metrics
also improved with fine-tuning, but comparing them is
not fair due to the different output scales. Interestingly,
the original model achieved better absolute relative
difference (AbsRel) on indoor scenes. We hypothesize that
this indicates that the original model works slightly
better on near objects, as AbsRel penalizes the errors in
depth estimation more for close objects by dividing the
estimation difference by the ground-truth values. Since
more than 95% of values in the training split’s indoor part
are smaller than 10 meters (as shown in Table 1), errors
from not estimating distances beyond 10 meters are not
significant.</p>
        <p>4.2. Evaluation of Dioramas</p>
        <p>Comparing dioramas presents a challenge due to the
absence of a definitive ground truth. Therefore, in this
section, we focus on outlining the strengths and
weaknesses of our approach.</p>
        <p>Our solution for creating dioramas works decently for
various types of scenes. Particularly, the effect is
enhanced if there are multiple objects in the foreground
that can pop up from the background. We have also
identified that it works well for outdoor scenes with a
clear depth separation between objects, as shown in
Figures 10b and 10d.</p>
        <p>However, our solution has some limitations related
to the used deep learning models or to the method of
constructing the diorama. Firstly, an object with varying
depth, such as a wall that is partially in the foreground
and partially in the background, has to be placed in a
single depth-plane in our diorama, which can result in an
incorrect position relative to other objects in the scene, as
shown in Figure 10h. The rear wall is segmented together
with the side walls and therefore placed incorrectly in
front of the chairs and tables. This is a general limitation
of all plane-based dioramas. In some cases, rotating the
planes in the diorama according to the depth gradient
of each segment could improve the issue. However, this
would not always help, as with the walls of the room in
our example.</p>
        <p>To improve the diorama from Figure 10h, we would
need first to split the walls into separate segments. This
brings us to the next issue related to the definition of the
panoptic segmentation task. Models are not trained to
segment instances of objects from the stuff category,
such as walls, roads, or vegetation, so all instances from
one category end up in the same plane in our diorama.
Figure 10f shows an example of this issue, depicting an
ancient tomb where our segmentation model marks
almost everything as a single wall object.</p>
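        <p>The single-plane limitation discussed above can be illustrated with a toy example. The sketch below assigns each segment one representative depth and orders the planes from back to front. Using the median as the plane position, the function name plane_depths, and the tiny hand-made depth and segment maps are all assumptions of this example, and it does not reproduce the add-on's actual placement rule.</p>
        <preformat><![CDATA[
```python
from statistics import median


def plane_depths(depth, segments):
    """Summarise the depth of every segment in a panoptic mask.

    Returns {segment_id: (median_depth, min_depth, max_depth)}. Taking the
    median as the plane position is an assumption of this sketch.
    """
    values = {}
    for depth_row, seg_row in zip(depth, segments):
        for z, seg_id in zip(depth_row, seg_row):
            values.setdefault(seg_id, []).append(z)
    return {s: (median(v), min(v), max(v)) for s, v in values.items()}


# Toy room, depths in meters. Segment 1 is a single "wall" segment made of
# near side walls (1-2 m) and a rear wall (9 m); segment 2 is a chair at 6 m.
depth = [
    [1.0, 2.0, 9.0, 9.0, 2.0, 1.0],
    [1.0, 2.0, 6.0, 6.0, 2.0, 1.0],
    [1.0, 2.0, 6.0, 6.0, 2.0, 1.0],
]
segments = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 2, 2, 1, 1],
    [1, 1, 2, 2, 1, 1],
]

stats = plane_depths(depth, segments)

# Planes sorted from the farthest to the nearest representative depth.
back_to_front = sorted(stats, key=lambda s: stats[s][0], reverse=True)
```
]]></preformat>
        <p>In this toy scene the wall segment spans 1 m to 9 m but receives a single plane at 2 m, so the whole wall, including its rear part, ends up in front of the chair at 6 m. Splitting the wall into separate segments would give the rear wall its own plane behind the chair.</p>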
        <p>Lastly, our solution struggles with segmentation of
objects with complex shapes, as seen in Figure 11, where
tree branches without leaves are segmented together with
the surrounding sky pixels. This limitation is due to the
resolution of the used panoptic model and thus, it may
be improved with better models in the future.</p>
        <p>
          Despite these limitations, our solution is usable in
many cases, and we believe it can already be helpful
in a graphics workflow. Compared to previous works
described in Section 2, our solution is less restrictive in
terms of the general input image requirements. For
example, dioramas created by Assa and Wolf [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] work mostly
on outdoor scenes without regular patterns and straight
lines. PEEP [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] is limited to images with zero or one
vanishing point because it fits frustums to the images, and
Zhao et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] restrict their solution to hazy images only.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion and Future Work</title>
      <p>We started this paper by providing an overview of
existing methods for automatic diorama construction and
identifying their limitations. Then, we decided to use
recent deep-learning models to overcome some of those
limitations. We reviewed the fundamentals of current
deep learning models for monocular depth estimation
and panoptic segmentation and selected a suitable,
well-developed framework with transformer-based models.
We chose a state-of-the-art model for panoptic
segmentation and a competitive model for depth estimation which
we fine-tuned to improve performance on outdoor scenes.
Furthermore, we investigated ways of deploying deep
learning models and we selected the ONNX format as
the most suitable for future updates.</p>
      <p>Even though the resulting add-on has some limitations,
using deep learning to create dioramas is a promising
approach. Overall, we believe our implementation is
already a useful tool for creating dioramas in Blender,
and we expect to continuously improve it.</p>
      <p>One option for future work is to focus on improving
all the small adjustments made to the cutout images,
as the visual quality of dioramas depends on them
significantly. For example, the current inpainting method
simply spreads the color of the edge pixels into the holes.
Our algorithm would benefit from a more advanced
inpainting algorithm, such as one based on deep learning.</p>
      <p>While we showed in this paper that separate depth
and segmentation models can be used for generating
dioramas, it remains an open question for future research
whether deep learning can be used for an end-to-end solution.</p>
      <p>The quality of dioramas is closely linked to how users
perceive 3D information from them, which is inherently
subjective. Thus, conducting a user study to compare our
method with prior research would be beneficial.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <title>The work was supported by grant number SVV-20209/260699.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Assa</surname>
          </string-name>
          , L. Wolf,
          <article-title>Diorama construction from a single image</article-title>
          ,
          <source>Computer Graphics Forum</source>
          <volume>26</volume>
          (
          <year>2007</year>
          )
          <fpage>599</fpage>
          -
          <lpage>608</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <article-title>Make3d: Learning 3d scene structure from a single still image</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>31</volume>
          (
          <year>2009</year>
          )
          <fpage>824</fpage>
          -
          <lpage>840</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Agus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Villanueva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pintore</surname>
          </string-name>
          , E. Gobbetti, PEEP: Perceptually Enhanced Exploration of Pictures, in: M.
          <string-name>
            <surname>Hullin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Stamminger</surname>
          </string-name>
          , T. Weinkauf (Eds.), Vision, Modeling &amp; Visualization, The Eurographics Association,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hansard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cavallaro</surname>
          </string-name>
          ,
          <article-title>Pop-up modelling of hazy scenes</article-title>
          , in: V.
          <string-name>
            <surname>Murino</surname>
          </string-name>
          , E. Puppo (Eds.),
          <source>Image Analysis and Processing - ICIAP 2015</source>
          , Springer International Publishing, Cham,
          <year>2015</year>
          , pp.
          <fpage>306</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kirillov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rother</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollar</surname>
          </string-name>
          ,
          <article-title>Panoptic segmentation</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Minaee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Boykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Porikli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kehtarnavaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Terzopoulos</surname>
          </string-name>
          ,
          <article-title>Image segmentation using deep learning: A survey</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>44</volume>
          (
          <year>2022</year>
          )
          <fpage>3523</fpage>
          -
          <lpage>3542</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/TPAMI.2021.3059968</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Puig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fidler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barriuso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          ,
          <article-title>Scene parsing through ade20k dataset</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Silberman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hoiem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kohli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <article-title>Indoor segmentation and support inference from rgbd images</article-title>
          , in: A.
          <string-name>
            <surname>Fitzgibbon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Lazebnik</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Sato</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          Schmid (Eds.),
          <source>Computer Vision - ECCV 2012</source>
          , Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2012</year>
          , pp.
          <fpage>746</fpage>
          -
          <lpage>760</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hassani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Orlov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <article-title>Oneformer: One transformer to rule universal image segmentation</article-title>
          ,
          <source>CoRR</source>
          abs/2211.06220 (
          <year>2022</year>
          ). arXiv:2211.06220.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Joo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Global-local path networks for monocular depth estimation with vertical cutdepth</article-title>
          ,
          <source>CoRR</source>
          abs/2201.07436 (
          <year>2022</year>
          ). arXiv:2201.07436.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Cheng</surname>
          </string-name>
          , I. Misra,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Schwing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kirillov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girdhar</surname>
          </string-name>
          ,
          <article-title>Masked-attention mask transformer for universal image segmentation</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1290</fpage>
          -
          <lpage>1299</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Swin transformer: Hierarchical vision transformer using shifted windows</article-title>
          ,
          <source>2021 IEEE/CVF International Conference on Computer Vision</source>
          (ICCV) (
          <year>2021</year>
          )
          <fpage>9992</fpage>
          -
          <lpage>10002</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <article-title>Deformable DETR: Deformable transformers for end-to-end object detection</article-title>
          ,
          <source>CoRR</source>
          abs/2010.04159 (
          <year>2020</year>
          ). arXiv:2010.04159.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cordts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Omran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rehfeld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Enzweiler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Benenson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Franke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schiele</surname>
          </string-name>
          ,
          <article-title>The cityscapes dataset for semantic urban scene understanding</article-title>
          ,
          <source>in: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>T.-Y. Lin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Maire</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Belongie</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Hays</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Ramanan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dollár</surname>
            ,
            <given-names>C. L.</given-names>
          </string-name>
          <string-name>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <article-title>Microsoft coco: Common objects in context</article-title>
          , in: D.
          <string-name>
            <surname>Fleet</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Pajdla</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Schiele</surname>
          </string-name>
          , T. Tuytelaars (Eds.),
          <source>Computer Vision - ECCV 2014</source>
          , Springer International Publishing, Cham,
          <year>2014</year>
          , pp.
          <fpage>740</fpage>
          -
          <lpage>755</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>E.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anandkumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Alvarez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>Segformer: Simple and efficient design for semantic segmentation with transformers</article-title>
          ,
          <source>in: Neural Information Processing Systems (NeurIPS)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Geiger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Urtasun</surname>
          </string-name>
          ,
          <article-title>Vision meets robotics: The kitti dataset</article-title>
          ,
          <source>International Journal of Robotics Research (IJRR)</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>I.</given-names>
            <surname>Vasiljevic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kolkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Daniele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mostajabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Walter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Shakhnarovich</surname>
          </string-name>
          ,
          <article-title>DIODE: A Dense Indoor and Outdoor DEpth Dataset</article-title>
          ,
          <source>CoRR</source>
          abs/1908.00463 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Eigen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Puhrsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <article-title>Depth map prediction from a single image using a multi-scale deep network</article-title>
          , in: Z.
          <string-name>
            <surname>Ghahramani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Welling</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Cortes</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Lawrence</surname>
          </string-name>
          , K. Weinberger (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>27</volume>
          ,
          <publisher-name>Curran Associates, Inc.</publisher-name>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>