Creating 3D Diorama from Single Image with Deep Learning

Martin Vejbora

Elena Šikudová

Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic

Creating 3D scenes is a time-consuming task that requires experience with modeling software. This paper presents a novel approach that combines neural models for panoptic segmentation and monocular depth estimation to construct dioramas. While previous research has explored generating dioramas from single images, to the best of our knowledge, there is no research utilizing deep learning techniques for the task. This paper provides an analysis of existing approaches to diorama generation. We then describe the construction of the diorama, where objects identified by segmentation are separated into distinct images with transparent backgrounds. These images are then placed in a 3D scene, arranged to reflect the estimated depth of each object. We also address several challenges that had to be overcome. Specifically, we employed fine-tuning to address the limitations of the available depth model when applied to outdoor scenes. Our method has been implemented as an add-on for the open-source 3D software Blender, utilizing neural models in the ONNX format for depth and segmentation inferences.

Keywords: deep learning, diorama, Blender, panoptic segmentation, monocular depth estimation

1. Introduction

Creating 3D environments in modeling software can be a repetitive and time-consuming task. However, lower-quality models are usually sufficient for assets in the background and further away from the camera. This is where the use of automated tools can come in handy.

This paper focuses on creating dioramas, which are sets of planes placed in a 3D scene to evoke the perception of depth. They are computationally cheap to render since they do not use any complex mesh, making them suitable for background assets. Dioramas work best when the camera is facing them, moving slightly, and viewing the diorama from slightly different angles. The effect breaks when the diorama is viewed from the side.

Previous works used traditional machine learning techniques to create dioramas, limiting their usage to hazy input images, outdoor scenes, or images with zero or one vanishing point. Moreover, their implementations were either not published or are now outdated and no longer functional, making them impractical to use.

We study the utilization of deep learning to automate the process of creating dioramas. Our implementation uses a pre-trained state-of-the-art model for panoptic segmentation and a competitive model for depth estimation that we fine-tune for outdoor scenes. The selected models are powerful yet small enough to run on standard computers or notebooks. Since research in deep learning has progressed rapidly in recent years, we design the implementation so that better models can easily be used in the future.

We implement our method as an add-on for the free 3D software Blender, which supports all three major platforms: Linux, Windows, and Mac. The add-on strives to be easy to use: the user selects an input image, and the add-on automatically creates a diorama from it without any further manual steps. Even though the quality of the resulting diorama varies based on the input image, our approach places weaker constraints on the input images than previous works.

This paper is structured as follows. Section 2 provides an overview of existing work on automatic diorama creation. Section 3 discusses the framework and models used, with a focus on fine-tuning the depth model. The same section then covers the implementation of the add-on and the most significant design choices. Section 4 compares the results of the original and fine-tuned depth models and shows the visual appearance of the diorama. Furthermore, it discusses the strengths and weaknesses of our solution. Finally, Section 5 summarizes what was achieved and outlines potential areas for future work.

2. Related Work

Based on the research of human depth perception, Assa and Wolf [1] define depth cues, including partial occlusion, texture density analysis, depth of focus, atmospheric scattering, and object height in the visual field. They utilize segmentation to obtain 10-20 major segments per image and smaller patches called superpixels. They estimate relative depth differences among objects by comparing depth cues between superpixels on borders or inside of bigger segments.

Having defined a new viewing point, the authors render a novel image that occludes certain parts of the original image. They also use image completion techniques to inpaint the previously occluded areas which become visible. Their approach yields the best results for outdoor scenes with minimal regular patterns or straight lines.

Similarly to the previous approach, Make3D [2] uses segment patches and defines depth cues both within and between these patches. Instead of estimating a depth map, the authors build a 3D mesh from planes to represent a scene from an input image. They train a Markov Random Field (MRF) to model relationships between adjacent patches. The MRF infers locations and rotations of segment planes in a three-dimensional space. This inference is conditioned by over 500 local features computed from each patch, along with various relationships computed between patches. These inter-patch relationships involve advanced edge detection or estimation of co-planarity and co-linearity. Since the output of the algorithm is a whole textured mesh, it allows easy synthesis of novel views.

The research paper called PEEP: Perceptually Enhanced Exploration of Pictures [3] focuses on images with zero or one vanishing point. PEEP maps an image to 5 planes forming a pyramidal frustum to achieve a plausible 3D effect. Similarly to the previous approaches, the first step obtains segmentation patches. These patches correspond to planes in three-dimensional space. A graph-cut strategy on patches is used to fit points representing the frustum. If we limit ourselves to images with zero or one vanishing point, the authors claim their result is visually more plausible even though geometrically less precise than the one created by Make3D [2].

Zhao et al. [4] limit their depth estimation to a single depth cue: atmospheric scattering. They use the Dark Channel Prior dehazing algorithm to compute depth in their research. The authors cluster depth and radiance outputs of the dehazing process, obtaining approximately five segments per image. After estimating the depth and segmentation, the position and orientation of segment planes are computed. Segmented alpha planes are placed behind each other to form a resulting diorama.

A significant portion of the article is dedicated to enhancing the visual appeal of segmented images. The authors blend the alpha channel of segment edges to create smoother transitions between planes. Additionally, areas of the photographed scene that were not visible in the original image are filled with inpainting. Prior to the actual inpainting, a few border pixels of the segment edge are removed using erosion to prevent misclassified pixels from affecting the inpainting algorithm. These misclassified pixels often have colors different from the color of the main object within the segment.

The main drawback of the described algorithm is that it can only be applied to hazy images. This limitation comes from the used depth estimation algorithm.

3. Proposed Solution

All the approaches described in Section 2 use some form of segmentation and depth estimation. With recent advancements in neural networks, state-of-the-art solutions for both tasks now use deep learning. However, to our best knowledge, no publicly available research exists where authors would create a diorama using deep learning.

While monocular depth estimation is an unambiguous task, image segmentation is usually categorized into one of three main tasks: instance, semantic and panoptic segmentation. In instance segmentation, the objective is to identify all instances of given object classes and to determine masks for individual objects. Semantic segmentation assigns a label category to every pixel in an image while not distinguishing multiple instances of a class. Panoptic segmentation, proposed by Kirillov et al. [5], unifies semantic and instance segmentation by introducing two types of objects: things and stuff. Things include countable objects like cars or people, where each instance needs a distinct label. Stuff refers to uncountable or amorphous regions like grass or sky, where it is not possible or desired to distinguish individual instances. Similarly to semantic segmentation, panoptic segmentation labels all image pixels, which makes it the most suitable for our use case. A comprehensive survey of methods for all three segmentation tasks can be found in [6].

Our algorithm takes the input image, cuts objects detected by panoptic segmentation into separate images, and places these images relatively behind each other based on their average depth predicted by the monocular depth estimation model. This approach is illustrated in Figure 1.

[Figure 1: overview of the approach, with blocks labeled Segmentation Model and Depth Model.]

Re-implementing a state-of-the-art model based on its research paper can be challenging. Also, with transformer neural networks rapidly developing, new state-of-the-art models for datasets like ADE20K [7] or NYUv2 [8] appear even multiple times a year. Therefore, better models will likely be available for our tasks in the future, requiring us to re-implement the code again.

For these reasons, we have decided to use a high-level framework called HuggingFace² that contains implemented models, including pre-trained weights that can be downloaded from the HuggingFace hub³. HuggingFace has a large community, well-documented code, and a lot of online resources. At the time of writing, it had over 80,000 stars on GitHub. Most of its models are implemented in PyTorch, but some also have a TensorFlow or JAX version.

HuggingFace contains very capable models for both of our tasks. The best panoptic model is OneFormer [9], the state-of-the-art model for panoptic segmentation on the ADE20K dataset according to the paperswithcode.com ranking⁴ at the time of implementing our add-on. The best depth estimation model from HuggingFace is GLPN [10], ranked 7th on the NYU v2 dataset⁵. We describe the details of these models in the following sections.

3.1. Panoptic Segmentation

Jain et al. [9] introduced OneFormer, a model that unifies instance, semantic, and panoptic segmentation tasks. OneFormer achieves state-of-the-art results on all three tasks after training only once, simultaneously.

OneFormer takes two inputs, an RGB image and a text token. The token determines whether OneFormer executes instance, semantic, or panoptic segmentation. The model's architecture is based on Mask2Former [11], and it consists of three main parts: an encoder-decoder backbone for extracting hierarchical features from the input image, a query module that computes object queries from the input image, and a decoder head that predicts the class and mask for each object query.

To extract multi-scale features from the input image, OneFormer uses a Swin [12] backbone encoder and a multi-scale deformable transformer [13] as a pixel decoder. The pixel decoder leverages a deformable attention module that limits attention to a local surrounding, mimicking the inductive bias of convolutions. Like a typical hierarchical decoder, it gradually upsamples the backbone features with the aid of skip connections from the encoder layers of corresponding spatial resolutions. The pixel decoder extracts features at 1/4, 1/8, 1/16, and 1/32 of the input resolution.

The query formulation module combines the task type input "the task is {task}" in a 2-layer transformer with 1/4 scale features from the pixel decoder to generate query tokens Q. Each query token represents a potential object or segment in the input image. These query tokens are later passed to the transformer decoder, which verifies that they correspond to an actual object, classifies them, and creates a mask for them.

The last part of OneFormer's architecture is the transformer decoder with classification and mask heads. The input of the transformer decoder are object queries Q, which are repeatedly combined with multi-scale features from the pixel decoder. The transformer decoder consists of a masked cross-attention, followed by a self-attention, and a feed-forward network, repeated multiple times for each of the 1/8, 1/16, and 1/32 pixel feature scales. The resulting features are then passed to the classification and mask heads.

The classification head predicts a class or no-object for each query token. The mask head, on the other hand, computes a binary mask using pixel features at 1/4 resolution of the original image.

² https://huggingface.co/
³ https://huggingface.co/models
⁴ https://paperswithcode.com/sota/panoptic-segmentation-on-ade20k-val
⁵ https://paperswithcode.com/sota/monocular-depth-estimation-on-nyu-depth-v2

HuggingFace contains versions of OneFormer with Swin backbone [12] pre-trained on the Cityscapes [14], ADE20K [7], and COCO [15] datasets. Figure 2 compares them on an indoor and outdoor scene. We observe that the Cityscapes version yields competitive results on outdoor scenes but does not work at all on indoor scenes (Figure 2d). That is due to the dataset's structure containing annotations for 30 classes related only to autonomous driving. On the other hand, models trained on ADE20K and COCO produce comparable results, likely due to the similar structure of both datasets. We choose the COCO version because of its more permissive license.

3.2. Monocular Depth Estimation

GLPN uses a hierarchical transformer encoder from SegFormer [16] to extract features from the input image at four different resolution levels. Each level's encoder block comprises several reduced self-attention and MLP-Conv-MLP modules with residual skip connections. The final component of each encoder block is the patch embedding layer, which employs overlapped convolution with stride to reduce the spatial shape of hierarchical features while increasing the number of channels.

A lightweight decoder is connected to the encoder on multiple resolution layers through a Selective Feature Fusion (SFF) module. This module enhances global features with fine details of the local structures that may have been lost in the latter encoder steps. SFFs connect the encoder with the decoder, allowing the decoder to access both the global path from the encoder and the local path through the skip connections. SFF computes a two-channel attention map where the input global features are multiplied by one channel and the local features by the other. These multiplications are element-wise along the channel dimension. Finally, the resulting scaled global and local features are added element-wise.

GLPN applies sigmoid as the last step, which scales the depth output to the range [0, 1]. The result is multiplied by the desired maximal depth in meters, which is specific for each dataset.

HuggingFace offers two versions of GLPN, pre-trained on either the NYUv2 [8] or the KITTI [17] dataset. A comparison of their inference is shown in Figure 3.
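The fusion performed by the Selective Feature Fusion module of GLPN can be illustrated with a minimal sketch. It assumes that the two-channel attention map has already been computed and uses plain nested lists in place of tensors. The function name `selective_feature_fusion` and the toy shapes are illustrative assumptions, not taken from the GLPN code.

```python
def selective_feature_fusion(global_feat, local_feat, attention):
    """Fuse two feature maps with a two-channel attention map.

    global_feat, local_feat: nested lists shaped [channels][height][width].
    attention: nested list shaped [2][height][width]; channel 0 weights the
    global features and channel 1 the local ones (assumed channel order).
    """
    fused = []
    for g_plane, l_plane in zip(global_feat, local_feat):
        plane = []
        for y, (g_row, l_row) in enumerate(zip(g_plane, l_plane)):
            row = []
            for x, (g, l) in enumerate(zip(g_row, l_row)):
                # The same attention value is shared by all feature channels.
                row.append(g * attention[0][y][x] + l * attention[1][y][x])
            plane.append(row)
        fused.append(plane)
    return fused


# Toy example: one feature channel on a 1x2 grid.
fused = selective_feature_fusion(
    [[[2.0, 4.0]]],                  # global features
    [[[10.0, 20.0]]],                # local features
    [[[1.0, 0.5]], [[0.0, 0.5]]],    # attention map
)
```

In the first pixel only the global feature passes through, while in the second pixel both paths contribute equally.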

We choose the NYUv2 version as it produces more consistent depth maps. The KITTI-trained model produces artifacts, such as a brighter stripe at the top part of the indoor scene in Figure 3e or inconsistent depth estimates of buildings in the left part of the outdoor scene in Figure 3b. The top parts of the higher apartment building and the smaller buildings are estimated to be closer (darker values) than the lower parts of the same real distance. We attribute these artifacts to the structure of the KITTI dataset, which only contains images captured from a car, so the model does not generalize well on varying scenes. For instance, almost all KITTI images have a sky at the top, and a sky does not have any valid depth values. Thus, the model cannot learn anything there.

3.2.1. Fine-tuning

As shown in Figure 3, the selected model trained on NYUv2 performs well on indoor scenes, and despite never seeing any depth-annotated outdoor scenes, it generalizes surprisingly well on them. However, there are still some inconsistencies. For example, in Figure 3c, the two smaller buildings are estimated to be further away than the high apartment building behind them. To improve the quality of our diorama on outdoor scenes, we decide to fine-tune the model on the DIODE dataset [18], which contains both indoor and outdoor scenes. It contains around 17,000 outdoor images and almost 9,000 indoor images, with all depth maps obtained using a laser scanner. Figure 4 shows an example of RGB images, depth maps, and binary validity masks which mark invalid depth values by black color.

[...] where $d_i = \log y_i - \log y_i^*$, $y$ denotes a predicted depth map, $y^*$ a ground truth, $n$ a total number of pixels and $i$ an index of a pixel. The authors show this metric is invariant to the global scale of the predicted and ground truth depth maps for $\lambda = 1$.

[...] obtained them after experimenting with various settings.

Table 1: Analysis of the training split of the DIODE dataset [18], presenting minimum, average, maximum, and quantile values. Columns: Minimum, Average, Maximum, and the quantiles .001, .01, .10, .25, .50, .75, .90, .95, .99, .999.

First, we need to adjust the depth range predicted by the GLPN model. The available pre-trained model outputs values in the 10-meter range as it was trained on the NYUv2 dataset, which has a maximal distance of 10 meters. On the contrary, our DIODE dataset was obtained using a laser scanner with a maximal range of 350 meters. We compensate for this difference by adjusting the final scale, which multiplies the output of the decoder sigmoid. A straightforward choice would be to multiply the result by 350. For example, the authors of the dataset also construct their baseline model [18] to output depth values from 0 to 350 meters. However, we achieve slightly better results using a smaller range, and we argue it is sufficient. When analyzing the official training split of the dataset, we found that more than 99.9% of the depth values of the joint indoor and outdoor parts are smaller than 150 meters. Thus, we can safely use 150 as the maximum depth value without limiting the model too much. We hypothesize that this utilizes the output range better, as well as the slope of the sigmoid for computing gradients. The full analysis of the training split of the DIODE dataset can be found in Table 1.
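The reasoning behind the output scale can be sketched in a few lines. Only the sigmoid-times-maximum-depth scaling and the 99.9% criterion come from the text. The helper names `coverage` and `scaled_depth` and the toy depth values are assumptions made for illustration.

```python
import math


def coverage(depths, max_depth):
    """Fraction of valid depth values that are smaller than max_depth."""
    return sum(d < max_depth for d in depths) / len(depths)


def scaled_depth(logit, max_depth):
    """Map a raw decoder output to meters: sigmoid, then scale by max_depth."""
    # Numerically stable sigmoid that avoids overflow for large |logit|.
    if logit >= 0:
        s = 1.0 / (1.0 + math.exp(-logit))
    else:
        e = math.exp(logit)
        s = e / (1.0 + e)
    return max_depth * s


# Toy depth values in meters (hypothetical, not DIODE statistics).
toy_depths = [1.5, 3.0, 8.0, 40.0, 120.0, 149.0, 300.0, 12.0]
share_below_150 = coverage(toy_depths, 150.0)

# The same logit maps to different distances under the two output scales.
indoor = scaled_depth(0.0, 10.0)    # NYUv2-style 10 m range
joint = scaled_depth(0.0, 150.0)    # range used for fine-tuning on DIODE
```

With a maximum of 150 instead of 350, a given change of the sigmoid output corresponds to a smaller change in meters, which is one way to read the hypothesis that the smaller range uses the output interval better.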

During training, when we feed the model with images, we use data augmentation to improve its generalization capabilities. We limit the augmentation techniques to a subset of those used in the original paper. Specifically, we apply horizontal flipping with a 50% probability and make random adjustments to brightness (±0.2), contrast (±0.2), hue (±0.2), and saturation (±0.3).
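A minimal sketch of two of these augmentations follows. It assumes that the depth map is flipped together with the image so that pixels stay aligned, and that the brightness adjustment is a multiplicative factor drawn uniformly from [1 - 0.2, 1 + 0.2]; the text only gives the ±0.2 range, so the exact form is an assumption. The function name `augment` and the nested-list image representation are illustrative.

```python
import random


def augment(image, depth, rng, flip_prob=0.5, brightness=0.2):
    """Apply a joint horizontal flip and a brightness jitter.

    image: rows of (r, g, b) tuples with values in [0, 1].
    depth: rows of floats with the same height and width as image.
    rng: a random.Random instance, so that results are reproducible.
    """
    if rng.random() < flip_prob:
        # Flip both inputs so every pixel keeps its depth value.
        image = [row[::-1] for row in image]
        depth = [row[::-1] for row in depth]
    # Brightness changes only the image; depth values stay untouched.
    factor = 1.0 + rng.uniform(-brightness, brightness)
    image = [
        [tuple(min(1.0, max(0.0, c * factor)) for c in pixel) for pixel in row]
        for row in image
    ]
    return image, depth


rng = random.Random(0)
img = [[(0.1, 0.2, 0.3), (0.4, 0.5, 0.6)]]
dep = [[1.0, 2.0]]
aug_img, aug_dep = augment(img, dep, rng)
```

Contrast, hue, and saturation adjustments would follow the same pattern and are omitted to keep the sketch short.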

In the original research paper, the authors use training images with the resolution of 576×448. We have conducted experiments with various resolutions, including the original sizes of NYUv2 (640×480) and DIODE (1024×768) images. However, we have found that changing the resolution does not significantly impact the accuracy. As a result, we use a resolution of 640×480 for most of our experiments. We attribute this robustness to the fact that the SegFormer [16] backbone in GLPN does not rely on fixed positional encodings concatenated to the input patches. The variable resolution of the images only changes the number of patches but not the encoded value of the input patch.

For most of our experiments, we use a batch size of 8, the maximum batch size where the training of 640×480 images fits in the 24GB memory of the NVIDIA Titan RTX GPU that we mainly use.

In the original paper, the authors use the polynomial learning rate schedule with a factor of 0.9, which increases the learning rate from $3 \times 10^{-5}$ to $1 \times 10^{-4}$ in the first half of training and then decreases it from $1 \times 10^{-4}$ to $3 \times 10^{-5}$ in the second half. Accordingly, we employ a learning rate schedule where the learning rate first increases and then decreases. We use a standard PyTorch implementation of the 1cycle learning rate policy with a peak learning rate of $1 \times 10^{-4}$.

3.3. Blender Add-on

In this section, we describe the implementation of our Blender add-on, the design decisions we have made, and some adjustments that the add-on does to improve the visual appearance of the final result.

Blender⁶ is a powerful open-source software for 3D graphics released under the GNU General Public License (GPL). It supports a wide range of graphics-related tasks, including modeling, still image rendering, and animation creation. Blender is a cross-platform application that can be run on Linux, Windows, and Mac computers. Although Blender is mostly developed in C++, it also provides a Python API, allowing add-ons to be developed.

⁶ https://www.blender.org/

3.3.1. Model Deployment

Both our HuggingFace models, GLPN and OneFormer, are implemented in PyTorch. While trained models can run in a native PyTorch environment, using it for production has some disadvantages.

Firstly, users of our add-on would need to download the large PyTorch Python module, which can take up several gigabytes of disk space. On a testing Windows machine, the installed PyTorch occupies about 1 GB, and the size increases to 4 GB for the GPU version with CUDA support. Another disadvantage is that PyTorch natively uses eager execution mode, which is convenient for developing models, but it is slower compared to a graph mode, where a computational graph of all operations is constructed before execution, allowing for powerful optimizations.

To solve some of these issues, PyTorch offers TorchScript, a statically typed subset of the Python language that's better suited for deployment. Models implemented in PyTorch can be converted into a computational graph in TorchScript format using the torch.jit.trace() or torch.jit.script() methods. The TorchScript graph representation can be compiled just before execution and run using PyTorch JIT, which further optimizes models using runtime information. There are also ahead-of-time compilers for TorchScript, such as the TensorRT compiler for NVIDIA GPUs.

Machine learning models can also be deployed in Open Neural Network Exchange (ONNX) format. ONNX was created as a format for interoperability between different frameworks. Both PyTorch and TensorFlow offer methods for converting models to the ONNX format. PyTorch models are converted using the torch.onnx.export() method.

Multiple runtimes exist for models in ONNX; the most used one is onnxruntime, maintained by Microsoft. It enables models to run on Windows, Linux, and Mac, as well as in a web browser or on mobile devices. With all its dependencies, onnxruntime only requires around 600 MB for the GPU version and 150 MB for the CPU-only version. onnxruntime also provides options for optimizing models, including quantization which reduces computation precision, making models smaller and faster.

We have chosen to use ONNX in our add-on because it is easy to convert models into this format and allows us to experiment with optimizing inference speed in the future. Using ONNX also enables us to decouple the add-on code from PyTorch, so if a better model becomes available, we can simply convert it to ONNX even if it is implemented in TensorFlow. Then the new model could be used in the add-on without needing to refactor the code or requiring users to install another runtime.

3.3.2. Creating a Diorama

The workflow of the add-on begins with the user selecting an input image. Depth and segmentation models in the ONNX format are then used to perform inferences. An example of the input image, along with the depth map and the panoptic segmentation mask generated inside the add-on, is shown in Figure 5.

Figure 5: Input image with inferences of deep learning models inside the add-on. (a) Input image. (b) Segmentation mask. (c) Depth map.

The input image is cut along the segment borders, resulting in a set of images where each contains only one object from the panoptic map, and the rest of the pixels are transparent. A 2D plane is spawned in a Blender scene for each segmented image, and the planes are textured with the segmented images. The planes are positioned behind each other, based on the average depth of their segments, and scaled to match the camera's perspective. The more distant planes appear larger, creating a sense of depth, as shown in Figure 6a.
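The cutting and placement steps can be sketched without Blender. The sketch assumes a pinhole camera looking along the depth axis, so that scaling a plane in proportion to its distance keeps its projected size unchanged. The `Layer` structure, the function names, and the choice of the nearest plane as the unscaled reference are illustrative assumptions rather than the add-on's actual code.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    segment_id: int
    distance: float  # average depth of the segment
    scale: float     # uniform plane scale relative to the nearest plane


def cut_layer(image, panoptic, segment_id):
    """Return an RGBA copy of image that is transparent outside one segment."""
    return [
        [(r, g, b, 255) if seg == segment_id else (0, 0, 0, 0)
         for (r, g, b), seg in zip(img_row, seg_row)]
        for img_row, seg_row in zip(image, panoptic)
    ]


def average_depths(panoptic, depth):
    """Mean depth per segment id; both inputs are same-sized 2D lists."""
    sums, counts = {}, {}
    for seg_row, depth_row in zip(panoptic, depth):
        for seg, d in zip(seg_row, depth_row):
            sums[seg] = sums.get(seg, 0.0) + d
            counts[seg] = counts.get(seg, 0) + 1
    return {seg: sums[seg] / counts[seg] for seg in sums}


def layout(panoptic, depth):
    """Order layers front to back and scale them proportionally to distance."""
    means = average_depths(panoptic, depth)
    nearest = min(means.values())
    if nearest <= 0:
        raise ValueError("average depths must be positive")
    layers = [Layer(seg, d, d / nearest) for seg, d in means.items()]
    return sorted(layers, key=lambda layer: layer.distance)


# Toy 2x2 example with two segments.
panoptic_map = [[1, 1], [2, 2]]
depth_map = [[2.0, 2.0], [4.0, 8.0]]
layers = layout(panoptic_map, depth_map)
```

Here segment 2 lies three times further away than segment 1 on average, so its plane is scaled by a factor of three to cover the same part of the camera view.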

Figure 6: (a) No inpainting. (b) Inpainted holes from the foreground objects.

3.3.3. Cutout Inpainting

Figure 6a shows that the basic diorama created as described above has artifacts that disrupt the depth perception. The most noticeable artifacts are the holes from foreground objects when the diorama is viewed from an angle. We address this issue with inpainting, which fills in missing parts of the image based on existing parts. We experiment in the add-on with multiple inpainting methods; however, the best results are usually achieved with the inpainting algorithm available in Blender's compositor.

Blender's inpainting algorithm starts at an image edge and gradually spreads the color of the edge to the more distant pixels. However, inpainting performed straight from the edge is prone to artifacts, as shown in Figure 7 where dark pixels from a segmented mountain are inpainted into the sky. Especially at complex boundaries, it happens very often that edges contain pixels from the occluding objects. To avoid these artifacts, we apply erosion to remove pixels from the borders of segments before computing the inpainting. In our add-on, the width of the removed edges is empirically set to 3 pixels. Figure 8 shows a comparison between inpainting with and without erosion, while an example of the final diorama with inpainting applied is shown in Figure 6b.

3.3.4. Depth of the Sky

Another issue that arises in our diorama creation is the incorrect depth estimation of the sky. We can see in Figure 5c that the sky is estimated to be closer than the building as it has darker values in the depth map. This happens because the distance of the sky cannot be learned from real-world datasets. The distance of the sky cannot be measured, and it would effectively need to be infinity compared to other distances in the image. The DIODE dataset [18], which we use for fine-tuning, is not an exception, and it contains masks indicating invalid depth values for the sky in ground-truth maps.

We use a segmentation model already available in our add-on to address this issue. The panoptic mask contains the classified pixels of the sky if it is present in the input image. We then move the plane with the sky segment behind all other segments to create a more realistic representation of the scene.

4. Results and Discussion

In this section, we discuss the performance of our fine-tuned GLPN model and compare its results with the original model trained on the NYUv2 dataset [8]. We also analyze the dioramas created by our solution and discuss its strengths and weaknesses.

4.1. Results of Fine-tuning

We use the official validation split of the DIODE dataset [18], which contains both indoor and outdoor scenes, to compare the two depth models. A challenge with the comparison is that the original pre-trained version of GLPN outputs depth values between 0 and 10 meters, while DIODE's measured depth values go up to 350 meters. To select the right method for the comparison, we hypothesize that the trained GLPN model has either a dominant notion of absolute depth or a dominant notion of relative distances. If the dominant notion is absolute depth, the model would correctly estimate the depth of objects closer than 10 meters and return the maximum value for everything further away. On the other hand, if the dominant notion is relative distances, the model would estimate which objects are closer than others correctly, even in outdoor scenes. We can see in Figure 9 that the second option is prevalent. For example, the outdoor scene in Figure 9b shows a building that is further than 10 meters and still not estimated as the maximal depth. Moreover, Figure 9h shows a street nearly a hundred meters long, and the NYUv2 model is able to estimate the relative relations of objects correctly, as well as the direction of the depth gradient on the street, only that it estimated a big depth change in the first few meters and then a smaller change further away.

Based on this observation, we suggest comparing the models using the scale-invariant log scale metric with $\lambda = 1.0$ (SILog1.0) [19]. It is invariant to the scale of the predicted and ground truth depth maps, which allows for a fair comparison of models trained on datasets with different ranges. This is in contrast to training, where we used SILog with $\lambda = 0.5$ to jointly learn relative depth relations together with absolute depth values.
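The scale invariance can be checked numerically. The sketch below assumes the commonly used form of the metric from [19], $\frac{1}{n}\sum_i d_i^2 - \frac{\lambda}{n^2}\left(\sum_i d_i\right)^2$ with $d_i = \log y_i - \log y_i^*$, without any additional square root or scaling factor. The function name `silog` and the toy values are illustrative, and invalid pixels are assumed to have been removed beforehand.

```python
import math


def silog(pred, target, lam=1.0):
    """Scale-invariant log error over paired positive depth values."""
    d = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(d)
    return sum(x * x for x in d) / n - lam * sum(d) ** 2 / (n * n)


target = [1.0, 3.0, 5.0, 40.0]      # toy ground-truth depths in meters
pred = [1.2, 2.5, 6.0, 30.0]        # toy prediction
pred_rescaled = [p * 15.0 for p in pred]  # same prediction on another scale

# With lam = 1.0 a global rescaling of the prediction does not change the
# error, while with lam = 0.5 the absolute scale is penalized as well.
invariant_gap = abs(silog(pred, target, 1.0) - silog(pred_rescaled, target, 1.0))
training_gap = abs(silog(pred, target, 0.5) - silog(pred_rescaled, target, 0.5))
```

This is why a model trained for a 10-meter range and a model trained for a 150-meter range can be compared with $\lambda = 1.0$, whereas $\lambda = 0.5$ still rewards correct absolute distances during training.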

Table 2 provides results for several metrics, including SILog1.0. We can see that our fine-tuning significantly improved SILog1.0 on outdoor scenes. This result is also supported by the observations from Figure 9. The ground-truth map in Figure 9a shows that the right part of the building is further away than the left part; however, the NYUv2-trained model estimates the whole building as approximately the same distance in Figure 9b, while our fine-tuned model estimates it more correctly in Figure 9c.

On the other hand, SILog1.0 on indoor scenes improved by a smaller margin, which supports the selection of this metric for comparison since we know that the original model was already well-trained for indoor scenes. Again, the indoor scene in Figure 9 supports the similarity of indoor SILog1.0 metrics by showing that both models estimate the depth similarly.

To assess if the improvement in the SILog1.0 metric from Table 2 is statistically significant, we computed the Wilcoxon signed-rank test between the metric values obtained from the original and fine-tuned models. The p-values were computed separately for indoor and outdoor scenes as well as for the indoor and outdoor scenes together. The tests on all three sets confirmed that the improvement in SILog1.0 achieved through our fine-tuning is statistically significant on the level of $\alpha = 0.001$.
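For readers unfamiliar with the test, the following is a simplified sketch of a two-sided Wilcoxon signed-rank test on paired per-image metric values. It uses the normal approximation without tie or continuity correction, which is only reasonable for larger samples; in practice a library routine such as `scipy.stats.wilcoxon` would be the natural choice. The function name and the toy numbers are assumptions and do not reproduce the values behind Table 2.

```python
import math
from statistics import NormalDist


def wilcoxon_signed_rank(x, y):
    """Return (W, p) for paired samples using the normal approximation."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    p = min(1.0, 2 * NormalDist().cdf((w - mean) / sd))
    return w, p


# Toy per-image errors: the second model is better on every image.
errors_original = [float(i) for i in range(1, 21)]
errors_finetuned = [0.0] * 20
w_stat, p_value = wilcoxon_signed_rank(errors_original, errors_finetuned)
significant = p_value < 0.001
```

The test only uses the signs and ranks of the paired differences, so it does not assume that the per-image metric values are normally distributed.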

Moreover, we can see from Table 2 that other metrics also improved with fine-tuning, but comparing them is not fair due to the different output scales. Interestingly, the original model achieved better absolute relative difference (AbsRel) on indoor scenes. We hypothesize that this indicates that the original model works slightly better on near objects, as AbsRel penalizes the errors in depth estimation more for close objects by dividing the estimation difference by the ground-truth values. Since more than 95% of values in the training split's indoor part are smaller than 10 meters (as shown in Table 1), errors from not estimating distances beyond 10 meters are not significant.

4.2. Evaluation of Dioramas

Comparing dioramas presents a challenge due to the absence of a definitive ground truth. Therefore, in this section, we focus on outlining the strengths and weaknesses of our approach.

Our solution for creating dioramas works decently for various types of scenes. Particularly, the effect is enhanced if there are multiple objects in the foreground that can pop up from the background. We have also identified that it works well for outdoor scenes with a clear depth separation between objects, as shown in Figures 10b and 10d.

However, our solution has some limitations related to the used deep learning models or to the method of constructing the diorama. Firstly, an object with varying depth, such as a wall that is partially in the foreground and partially in the background, has to be placed in a single depth-plane in our diorama, which can result in an incorrect position relative to other objects in the scene, as shown in Figure 10h. The rear wall is segmented together with the side walls and therefore placed incorrectly in front of the chairs and tables. This is a general limitation of all plane-based dioramas. In some cases, rotating the planes in the diorama according to the depth gradient of each segment could improve the issue. However, this would not always help, as with the walls of the room in our example.

To improve the diorama from Figure 10h, we would need first to split the walls into separate segments. This brings us to the next issue related to the definition of the panoptic segmentation task. Models are not trained to segment instances of objects from the stuff category, such as walls, roads, or vegetation, so all instances from one category end up in the same plane in our diorama. Figure 10f shows an example of this issue, depicting an ancient tomb where our segmentation model marks almost everything as a single wall object.
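The plane-rotation idea mentioned above (rotating a plane according to the depth gradient of its segment) could be prototyped by fitting a plane to the depth values inside a segment mask. The sketch below is an illustration only and is not part of the described add-on: the function names, the least-squares formulation, and the `pixel_size` parameter (scene units per pixel, needed to turn a per-pixel gradient into an angle) are assumptions.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_depth_plane(depth, mask):
    """Least-squares fit of depth ~ a*x + b*y + c over the masked pixels.

    depth: 2D list of depth values (rows of equal length).
    mask:  2D list of booleans with the same shape (True = pixel in segment).
    Returns (a, b, c); (a, b) is the depth gradient per pixel.
    """
    n = sx = sy = sz = sxx = sxy = syy = sxz = syz = 0.0
    for y, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for x, (z, inside) in enumerate(zip(depth_row, mask_row)):
            if inside:
                n += 1
                sx += x
                sy += y
                sz += z
                sxx += x * x
                sxy += x * y
                syy += y * y
                sxz += x * z
                syz += y * z

    # Normal equations of the least-squares problem, solved by Cramer's rule.
    matrix = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxz, syz, sz]
    determinant = det3(matrix)
    if abs(determinant) < 1e-9:
        raise ValueError("segment is too small or degenerate for a plane fit")

    coefficients = []
    for column in range(3):
        replaced = [row[:] for row in matrix]
        for r in range(3):
            replaced[r][column] = rhs[r]
        coefficients.append(det3(replaced) / determinant)
    return tuple(coefficients)


def plane_tilt_angles(a, b, pixel_size):
    """Convert a per-pixel depth gradient into tilt angles in radians.

    pixel_size is the (assumed) width of one pixel in the same units as depth.
    """
    return math.atan(a / pixel_size), math.atan(b / pixel_size)


# Toy segment: depth grows by 0.5 units per pixel from left to right.
toy_depth = [[0.5 * x + 2.0 for x in range(4)] for _ in range(4)]
toy_mask = [[True] * 4 for _ in range(4)]
a, b, c = fit_depth_plane(toy_depth, toy_mask)
print("gradient:", (a, b), "offset:", c)
print("tilt angles:", plane_tilt_angles(a, b, pixel_size=1.0))
```

As noted in the text, a single tilted plane would still not represent a segment that covers several differently oriented surfaces, such as all walls of a room.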

Lastly, our solution struggles with the segmentation of objects with complex shapes, as seen in Figure 11, where tree branches without leaves are segmented together with the surrounding sky pixels. This limitation is due to the resolution of the panoptic model used, and thus it may be improved with better models in the future.

Despite these limitations, our solution is usable in many cases, and we believe it can already be helpful in a graphics workflow. Compared to previous works described in Section 2, our solution is less restrictive in terms of the general input image requirements. For example, dioramas created by Assa and Wolf [1] work mostly on outdoor scenes without regular patterns and straight lines. PEEP [3] is limited to images with zero or one vanishing point because it fits frustums to the images, and Zhao et al. [4] restrict their solution to hazy images only.

5. Conclusion and Future Work

We started this paper by providing an overview of existing methods for automatic diorama construction and identifying their limitations. We then decided to use recent deep learning models to overcome some of those limitations. We reviewed the fundamentals of current deep learning models for monocular depth estimation and panoptic segmentation and selected a suitable, well-developed framework with transformer-based models. We chose a state-of-the-art model for panoptic segmentation and a competitive model for depth estimation, which we fine-tuned to improve performance on outdoor scenes. Furthermore, we investigated ways of deploying deep learning models and selected the ONNX format as the most suitable for future updates.

Even though the resulting add-on has some limitations, using deep learning to create dioramas is a promising approach. Overall, we believe our implementation is already a useful tool for creating dioramas in Blender, and we expect to continuously improve it.

One option for future work is to focus on improving all the small adjustments made to the cutout images, as the visual quality of dioramas depends on them significantly. For example, the current inpainting method simply spreads the color of the edge pixels into the holes. Our algorithm would benefit from a more advanced inpainting algorithm, such as one based on deep learning.
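For readers unfamiliar with this kind of simple inpainting, the following sketch shows one possible reading of "spreading the color of the edge pixels into the holes": a breadth-first fill in which every missing pixel takes the color of the nearest known pixel. It is an assumption-based illustration, not the add-on's actual code; the function name `spread_edge_colors`, the 4-connected neighborhood, and the list-of-lists image representation are hypothetical choices.

```python
from collections import deque


def spread_edge_colors(image, hole):
    """Fill hole pixels by propagating colors outward from the hole boundary.

    image: 2D list of color values (any type, e.g. RGB tuples).
    hole:  2D list of booleans with the same shape (True = missing pixel).
    Returns a new image; the input is left unchanged. If every pixel is a
    hole, there is nothing to propagate and the copy is returned as is.
    """
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    known = [[not hole[y][x] for x in range(width)] for y in range(height)]

    def neighbors(y, x):
        # 4-connected neighbors that lie inside the image.
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < height and 0 <= nx < width:
                yield ny, nx

    # Seed the queue with known pixels that touch at least one hole pixel.
    queue = deque(
        (y, x)
        for y in range(height)
        for x in range(width)
        if known[y][x] and any(not known[ny][nx] for ny, nx in neighbors(y, x))
    )

    # Breadth-first propagation: each hole pixel copies the color of the
    # first already-colored neighbor that reaches it.
    while queue:
        y, x = queue.popleft()
        for ny, nx in neighbors(y, x):
            if not known[ny][nx]:
                result[ny][nx] = result[y][x]
                known[ny][nx] = True
                queue.append((ny, nx))
    return result


# Toy example: one row with a red pixel, three missing pixels, and a blue pixel.
RED, BLUE = (255, 0, 0), (0, 0, 255)
row_image = [[RED, None, None, None, BLUE]]
row_hole = [[False, True, True, True, False]]
print(spread_edge_colors(row_image, row_hole))
```

Such a fill produces flat, streaky regions in large holes, which is the kind of artifact a learned inpainting model would be expected to reduce.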

While we showed in this paper that separate depth and segmentation models can be used for generating dioramas, it remains an open question for future research whether deep learning can be used for an end-to-end solution.

The quality of dioramas is closely linked to how users perceive 3D information from them, which is inherently subjective. Thus, conducting a user study to compare our method with prior research would be beneficial.

Acknowledgments The work was supported by grant number SVV-20209/260699.

[1] Assa, L. Wolf, Diorama construction from a single image, Computer Graphics Forum 26 (2007) 599-608.

[2] Saxena, Sun, A. Y. Ng, Make3d: Learning 3d scene structure from a single still image, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2009) 824-840.

[3] Agus, A. J. Villanueva, Pintore, E. Gobbetti, PEEP: Perceptually Enhanced Exploration of Pictures, in: M. Hullin, M. Stamminger, T. Weinkauf (Eds.), Vision, Modeling & Visualization, The Eurographics Association, 2016.

[4] Zhao, Hansard, Cavallaro, Pop-up modelling of hazy scenes, in: V. Murino, E. Puppo (Eds.), Image Analysis and Processing - ICIAP 2015, Springer International Publishing, Cham, 2015, pp. 306-318.

[5] Kirillov, He, Girshick, Rother, Dollar, Panoptic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[6] Minaee, Boykov, Porikli, Plaza, Kehtarnavaz, Terzopoulos, Image segmentation using deep learning: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (2022) 3523-3542. doi:10.1109/TPAMI.2021.3059968.

[7] Zhou, Zhao, Puig, Fidler, Barriuso, Torralba, Scene parsing through ade20k dataset, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[8] Silberman, Hoiem, Kohli, Fergus, Indoor segmentation and support inference from rgbd images, in: A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, C. Schmid (Eds.), Computer Vision - ECCV 2012, Springer Berlin Heidelberg, Berlin, Heidelberg, 2012, pp. 746-760.

[9] Jain, Li, Chiu, Hassani, Orlov, Shi, Oneformer: One transformer to rule universal image segmentation, CoRR abs/2211.06220 (2022). arXiv:2211.06220.

[10] Kim, Ga, Ahn, Joo, Chun, Kim, Global-local path networks for monocular depth estimation with vertical cutdepth, CoRR abs/2201.07436 (2022). arXiv:2201.07436.

[11] Cheng, I. Misra, A. G. Schwing, Kirillov, Girdhar, Masked-attention mask transformer for universal image segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 1290-1299.

[12] Liu, Lin, Cao, Hu, Wei, Zhang, Lin, Guo, Swin transformer: Hierarchical vision transformer using shifted windows, 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021) 9992-10002.

[13] Zhu, Su, Lu, Li, Wang, Dai, Deformable DETR: deformable transformers for end-to-end object detection, CoRR abs/2010.04159 (2020). arXiv:2010.04159.

[14] Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth, Schiele, The cityscapes dataset for semantic urban scene understanding, in: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[15] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (Eds.), Computer Vision - ECCV 2014, Springer International Publishing, Cham, 2014, pp. 740-755.

[16] Xie, Wang, Yu, Anandkumar, J. M. Alvarez, Luo, Segformer: Simple and efficient design for semantic segmentation with transformers, in: Neural Information Processing Systems (NeurIPS), 2021.

[17] Geiger, Lenz, Stiller, Urtasun, Vision meets robotics: The kitti dataset, International Journal of Robotics Research (IJRR) (2013).

[18] Vasiljevic, Kolkin, S. Zhang, Luo, Wang, F. Z. Dai, A. F. Daniele, Mostajabi, Basart, M. R. Walter, G. Shakhnarovich, DIODE: A Dense Indoor and Outdoor DEpth Dataset, CoRR abs/1908.00463 (2019).

[19] Eigen, Puhrsch, Fergus, Depth map prediction from a single image using a multi-scale deep network, in: Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, K. Weinberger (Eds.), Advances in Neural Information Processing Systems, volume 27, Curran Associates, Inc., 2014.