<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Visualizing Motion of Natural Objects by Deep Learning Optical Flow Estimation in an Omnidirectional Image for Virtual Sightseeing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Motoki Kakuho</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Norihiko Kawai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Graduate School of Information Science, Osaka Institute of Technology</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Services using omnidirectional images have become increasingly popular. For example, Google Street View enables users to view the scenery of a location online without physically visiting it. However, the use of still images limits the sense of presence. This study proposes a method that focuses on natural elements such as water, sky, and trees within a single omnidirectional image and utilizes deep learning to reproduce their motion in 3D space, generating omnidirectional videos. Experiments demonstrate the effectiveness of the proposed method by comparing results with conventional methods.</p>
      </abstract>
      <kwd-group>
        <kwd>Omnidirectional Image</kwd>
        <kwd>Video Generation</kwd>
        <kwd>Motion Reproduction</kwd>
        <kwd>Virtual Sightseeing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Virtual sightseeing services using omnidirectional images
have been increasing. For example, TOWNWARP [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and AirPano [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] enable users to enjoy the scenery of famous
tourist spots and cities as videos online without physically
visiting them. Additionally, there are studies that
combine virtual tourism with education by synthesizing
virtual objects into omnidirectional images. For example,
CoSpaces [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] provides functions to place virtual objects
such as information boards, explanatory text, and human
avatars in virtual environments created with omnidirectional
images, which can support various types of learning.
For ruins tourism, an application has been developed
that allows users to learn and enjoy the scenery of the
past of a historical site not only as VR at arbitrary locations
but also as Indirect AR at the site by synthesizing
virtual buildings that existed in the past into omnidirectional
images and presenting them to the user [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. While such services allow users to virtually experience and
learn various places around the world, users cannot view
locations other than famous tourist spots chosen by the
content creators.
      </p>
      <p>
        In contrast, Google Street View [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is an example of services that allow users to explore any location
worldwide. However, since this service presents still images,
it lacks the sense of presence. One solution to this
issue is to record videos from fixed points while traveling
around the world. However, this method would require
a significant amount of time for collecting video data.
      </p>
      <p>
        To solve this problem, we propose a method that
focuses on natural objects such as water, sky, and trees
within a single omnidirectional image and reproduces
their motion to generate omnidirectional videos for
highly realistic virtual sightseeing at arbitrary locations.
In the proposed method, for water surface and sky regions,
a part of the target omnidirectional image is converted
into a perspective projection image, and the optical
flow of the water surface and sky is estimated by a deep
learning-based approach. The optical flow is then transformed
into the motion in 3D space and projected back
onto the omnidirectional image, reproducing the motion
of the water and sky in the omnidirectional image. For
trees, the optical flow is obtained from a reference video
in the perspective projection, converted into the motion
on vertical 3D planes, and then applied to the omnidirectional
image. Semantic segmentation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is also used to
clearly separate the sky, water surface, and tree regions.
This process generates an omnidirectional video where
motion is reproduced only in the regions of water, sky,
and trees.
      </p>
    </sec>
    <sec id="sec-related">
      <title>2. Related Work</title>
      <p>
        Various studies have been conducted on converting still
images into videos by moving objects in them. Among
these, there are studies [
        <xref ref-type="bibr" rid="ref10 ref7 ref8 ref9">7, 8, 9, 10</xref>
        ] that focus on the
movement of natural objects. For instance, Creating Fluid
Animation from a Single Image using Video Database [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
generates high-quality animations by efficiently
assigning target images using a Markov Random Field (MRF)
and leveraging a fluid video database. Another example
uses machine learning with neural networks. However,
these methods deal with images in perspective projection.
      </p>
      <p>
        Therefore, for example, if the method [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is applied to
omnidirectional images in equirectangular projection,
the generated motion appears unnatural because the
model is trained on perspective projection images. It also
suffers from parameter dependency, causing motion in
regions where no motion should occur. In addition, even
though the left and right edges of the omnidirectional
image are connected, the conventional methods do not
take this into account. Therefore, when looking around
the omnidirectional image as a perspective projection
image, we can observe a misaligned border in the texture
at the edges.
      </p>
      <p>
        To address these problems, in our previous study [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], we
reproduced the motion of the sky and water surface in
omnidirectional images by assuming that the sky and
water surface could be expressed by straight-line motion
on a plane. The method proposed in this study is an
extended version of that work, reproducing more natural motion
by using optical flow estimated by deep learning and
also reproducing the motion of trees.
      </p>
    </sec>
    <sec id="sec-2">
      <title>3. Proposed Method</title>
      <sec id="sec-2-1">
        <title>3.1. Overview</title>
        <p>
          The flow of the proposed method is as follows. First,
(1) we input an omnidirectional landscape image
containing sky, water, or trees, as shown in Figure
1(a). This study assumes that omnidirectional images
are generated by equirectangular projection so that the
bottom pixel is in the direction of gravity obtained from
the accelerometer in the camera. Next, (2) we apply the
semantic segmentation [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] to the image to divide it into
regions such as water surface, sky, trees, and others as
shown in Figure 1(b). From the segmented image, we
generate a mask image that masks all objects above the
horizon except the sky area, as shown in Figure 1(c). Here,
due to inaccuracies near the boundaries of the semantic
segmentation, the mask regions are expanded to fully
include objects except the sky area. Next, (3) using the
generated mask image, we generate an image in which all
areas above the horizon have sky textures by inpainting
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], as shown in Figure 1(d).
        </p>
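        <p>To make steps (2) and (3) concrete, the following Python sketch builds such a mask from a per-pixel label map and expands it before inpainting. The label map, the sky class id, and the inpaint_sky wrapper are placeholders standing in for the semantic segmentation [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and the inpainting network [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]; it is an illustrative sketch under those assumptions, not the authors' implementation.</p>
        <preformat>
import cv2
import numpy as np

def build_sky_mask(labels: np.ndarray, sky_id: int, dilate_px: int = 15) -> np.ndarray:
    """Mask every non-sky pixel above the horizon of an equirectangular image.

    labels : (H, W) per-pixel class ids from semantic segmentation (assumed).
    sky_id : class id that the segmentation model assigns to "sky" (assumed).
    """
    h, w = labels.shape
    mask = np.zeros((h, w), dtype=np.uint8)
    upper = labels[: h // 2, :]                 # rows above the horizon
    mask[: h // 2, :][upper != sky_id] = 255    # mask everything that is not sky
    # Expand the mask to absorb inaccuracies near segmentation boundaries.
    kernel = np.ones((dilate_px, dilate_px), np.uint8)
    return cv2.dilate(mask, kernel)

# Hypothetical wrapper around the inpainting network; plug in the actual model.
def inpaint_sky(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    raise NotImplementedError
        </preformat>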
        <p>
          Next, (4) we generate the motion of the water surface,
sky and trees by copying pixel values using calculated
optical flows. The motions of the water surface and sky are
calculated by estimating the motion in 3D space based on
the deep learning-based optical flow estimation [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The
tree motion is calculated by acquiring the motion from a
perspective projection video of the trees and reproducing
the motion in 3D space.
        </p>
        <p>Figure 1: (a) Example of input image; (b) semantic segmentation; (c) mask image; (d) inpainting result.</p>
        <p>Finally, (5) we combine the videos of each region
generated in (4) with the input image, using the segmented
image shown in Figure 1(b), to generate a video in
which only the water surface, sky, and trees move. Here,
alpha blending is performed at the boundary of the mask
to reduce the unnaturalness at the boundary between
the moving and static regions, as sketched below. The following sections
describe the details of motion generation in Step (4).</p>
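        <p>As an illustration of this alpha blending, the following sketch feathers the region mask and blends each animated frame with the static input image; the feathering width is an assumed value, not one reported in the paper.</p>
        <preformat>
import cv2
import numpy as np

def composite(static_img: np.ndarray, moving_img: np.ndarray,
              region_mask: np.ndarray, feather_px: int = 21) -> np.ndarray:
    """Blend an animated region (water, sky, or trees) into the input image.

    region_mask is 255 inside the moving region and 0 elsewhere; its border
    is feathered so the transition between moving and static pixels is smooth.
    """
    alpha = region_mask.astype(np.float32) / 255.0
    alpha = cv2.GaussianBlur(alpha, (feather_px, feather_px), 0)   # soft boundary
    alpha = alpha[..., None]                                       # (H, W, 1)
    out = alpha * moving_img.astype(np.float32) \
        + (1.0 - alpha) * static_img.astype(np.float32)
    return out.astype(np.uint8)
        </preformat>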
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Generation of Water Motion</title>
        <p>For the motion of water, we assume that the water is
moving along a planar surface in 3D space. This 3D
motion on a plane is represented as a 2D optical flow on
the omnidirectional image.</p>
        <p>Specifically, we first define the coordinate system for
the omnidirectional image and the plane. As shown in
Figure 2, the position (X, Y, Z) on the water surface
corresponding to pixel (u1, v1) of the omnidirectional image
is determined in a coordinate system whose origin is the
center of the sphere corresponding to the omnidirectional
image, as follows:</p>
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math>\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} Z \tan\dfrac{\pi v_1}{h} \cos\dfrac{2\pi u_1}{w} \\ Z \tan\dfrac{\pi v_1}{h} \sin\dfrac{2\pi u_1}{w} \\ Z \end{bmatrix},</tex-math>
        </disp-formula>
        <p>where w and h are the width and height of the
omnidirectional image, and Z is a negative constant representing
the height of the water surface.</p>
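        <p>A minimal sketch of equation (1) follows; the angle and sign conventions are those of the reconstruction above and should be treated as assumptions.</p>
        <preformat>
import numpy as np

def equirect_pixel_to_plane(u1: float, v1: float, w: int, h: int, Z: float):
    """Project pixel (u1, v1) of a w x h equirectangular image onto the
    horizontal plane at height Z (negative for the water surface), eq. (1)."""
    theta = 2.0 * np.pi * u1 / w      # azimuth around the vertical axis
    r = Z * np.tan(np.pi * v1 / h)    # positive for pixels below the horizon
    return np.array([r * np.cos(theta), r * np.sin(theta), Z])
        </preformat>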
        <p>
          In this coordinate system, we compute the flow (fx, fy)
at the water surface. First, as shown in Figure 3, a part of
the omnidirectional image is extracted as a perspective
projection image so that its horizon is at the center of
the image height. The optical flow is then estimated by
the deep learning-based method in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. As illustrated
in Figure 3, both the original pixel and the pixel after
moving based on the flow are projected onto the plane
at height Z using the focal lengths fu, fv and the image
center cu, cv of the perspective projection image. The
3D coordinates after projecting the pixel (u, v) onto the
plane are calculated as follows:
          <disp-formula id="eq2">
            <label>(2)</label>
            <tex-math>\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} \dfrac{Z f_v (u - c_u)}{f_u (v - c_v)} \\ \dfrac{Z f_v}{v - c_v} \\ Z \end{bmatrix}.</tex-math>
          </disp-formula>
        </p>
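        <p>The back-projection of equation (2) and the resulting flow on the plane can be sketched as follows; the axis conventions follow the reconstruction above and are assumptions rather than the authors' exact formulation.</p>
        <preformat>
import numpy as np

def perspective_pixel_to_plane(u, v, fu, fv, cu, cv, Z):
    """Back-project pixel (u, v) of the perspective projection image onto the
    horizontal plane at height Z, following equation (2) as reconstructed."""
    X = Z * fv * (u - cu) / (fu * (v - cv))
    Y = Z * fv / (v - cv)
    return np.array([X, Y, Z])

def plane_flow(u, v, du, dv, fu, fv, cu, cv, Z):
    """3D flow on the plane: difference between the projections of the pixel
    before and after displacement by the estimated optical flow (du, dv)."""
    p0 = perspective_pixel_to_plane(u, v, fu, fv, cu, cv, Z)
    p1 = perspective_pixel_to_plane(u + du, v + dv, fu, fv, cu, cv, Z)
    return p1 - p0    # (f_x, f_y, 0): the vertical component cancels
        </preformat>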
        <p>Next, the flow map on the plane is determined from
the differences between the respective projected 3D
coordinates. This process is performed on the pixels in the
lower half of the perspective projection image. However,
the motion calculated on the plane here only corresponds
to a part of the lower part of the omnidirectional image.
To handle the entire water area in the omnidirectional
image, this study assumes that the motion of the water
at any given location is similar. Since the region projected
from the perspective projection image is trapezoidal,
as shown by the red outline in Figure 3, the
flow map of the region is extracted, scaled, interpolated,
and shifted to align with the square region projected from
the lower half of the omnidirectional image, as shown in
Figure 4.</p>
        <p>Next, as shown in Figure 2, the (X, Y) coordinates on
the plane obtained by equation (1) are shifted by the flow
(fx, fy) and projected onto the surface of the sphere. The pixel
(u2, v2) in the omnidirectional image corresponding to
the shifted coordinate (X + fx, Y + fy, Z) on the horizontal
plane is determined as follows:
          <disp-formula id="eq3">
            <label>(3)</label>
            <tex-math>\begin{bmatrix} u_2 \\ v_2 \end{bmatrix} = \begin{bmatrix} \dfrac{w}{2\pi} \tan^{-1}\dfrac{Y + f_y}{X + f_x} \\ \dfrac{h}{\pi} \cos^{-1}\dfrac{Z}{\sqrt{(X + f_x)^2 + (Y + f_y)^2 + Z^2}} \end{bmatrix}.</tex-math>
          </disp-formula>
        </p>
        <p>Finally, the difference between the transformed pixel
(u2, v2) and the original pixel (u1, v1) is calculated as the
optical flow on the omnidirectional image. By
performing this process for all pixels, the optical flows
for the entire water surface on the omnidirectional image
are obtained. Based on these optical flows, pixel values are
copied to generate an image in which the water surface has
moved. This process is repeated for each frame, and a
video is generated by combining all the frames.</p>
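        <p>Putting equations (1) and (3) together, the transfer of the plane flow onto the omnidirectional image can be sketched per pixel as below, again under the reconstructed conventions; fx and fy stand for values read from the aligned flow map of Figure 4.</p>
        <preformat>
import numpy as np

def omnidirectional_flow(u1, v1, fx, fy, w, h, Z):
    """2D flow at pixel (u1, v1) of the equirectangular image: shift the
    plane point of eq. (1) by the plane flow (fx, fy) and re-project it
    onto the image with eq. (3), both as reconstructed above."""
    # Equation (1): pixel to point (X, Y, Z) on the water plane.
    theta = 2.0 * np.pi * u1 / w
    r = Z * np.tan(np.pi * v1 / h)
    X, Y = r * np.cos(theta), r * np.sin(theta)
    # Shift by the plane flow and re-project onto the sphere, eq. (3).
    Xs, Ys = X + fx, Y + fy
    u2 = w / (2.0 * np.pi) * (np.arctan2(Ys, Xs) % (2.0 * np.pi))
    v2 = h / np.pi * np.arccos(Z / np.sqrt(Xs**2 + Ys**2 + Z**2))
    # A full implementation would also wrap u2 - u1 across the left/right seam.
    return u2 - u1, v2 - v1
        </preformat>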
      </sec>
      <sec id="sec-2-3">
        <title>3.3. Generation of Sky Motion</title>
        <p>
          For the motion of the sky, assuming that clouds in the sky
move on a plane in 3D space above the scene, the motion
is estimated in the same manner as for the water. The
optical flows in the upper part of the perspective projection
image in Figure 3 are estimated by the deep learning-based
method [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], the motion is projected onto the plane,
and the motion of the upper part of the
omnidirectional image is finally determined by re-projecting the
motion on the plane onto the sphere representing the
omnidirectional image.
        </p>
        <p>Note that, as described in Section 3.1, inpainting replaces all
areas above the horizon other than the sky with plausible
sky texture. Even when the flow comes from behind buildings,
the generated texture is copied, so the motion of the sky
can be reproduced.</p>
      </sec>
      <sec id="sec-2-4">
        <title>3.4. Generation of Tree Motion</title>
        <p>For the motion of trees, rather than assuming a single
plane as for the water and sky, we assume that, as shown in
Figure 5, they move, for each image column, on a vertical
plane perpendicular to the radial line from the center of
the sphere to the sphere surface at height 0. As for the sky
and water, this 3D motion is expressed as a 2D optical
flow on the omnidirectional image.</p>
        <p>
          Specifically, a reference video is first input and the
optical flows are estimated by the Farnebäck method [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
The flow map is resized to match the tree region. Next,
the 2D coordinates of the input image in the mask region
for trees are converted into 3D coordinates as follows:
          <disp-formula id="eq4">
            <label>(4)</label>
            <tex-math>\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} \cos\dfrac{2\pi u_1}{w} \\ \sin\dfrac{2\pi u_1}{w} \\ \dfrac{1}{\tan\dfrac{\pi v_1}{h}} \end{bmatrix}.</tex-math>
          </disp-formula>
        </p>
        <p>The 3D coordinate is shifted on the vertical plane based
on the flow map, and the shifted 3D coordinate is
re-projected onto the sphere. The flow on the omnidirectional
image is determined from the original and the re-projected
pixels. By repeating this process for the number of frames
in the reference video, a video with the tree motion is
generated.</p>
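        <p>A minimal sketch of the flow estimation from the reference tree video, using OpenCV's implementation of the Farnebäck method [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], is shown below; the parameter values are common defaults, not values reported in the paper.</p>
        <preformat>
import cv2

def reference_tree_flows(video_path: str):
    """Dense optical flow between consecutive frames of the reference video."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    flows = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)    # (H, W, 2): per-pixel (du, dv)
        prev_gray = gray
    cap.release()
    return flows
        </preformat>
        <p>Each flow map would then be resized to the tree mask region and applied on the vertical plane via equation (4).</p>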
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments and Discussions</title>
      <sec id="sec-3-1">
        <title>4.1. Experimental Settings</title>
        <p>
          We conducted experiments to generate a video from a
single omnidirectional image. As input, we used an
image captured with the 360° camera RICOH THETA Z1
and an image obtained from Google Street View, which
were resized to a resolution of 1600 × 800. We used the
image captured with the 360° camera as Case 1, and the
image obtained from Google Street View as Case 2. In the
experiments, we set the height of the planes representing
the sky and water along the Z-axis to 2 and -2,
respectively. We set the focal lengths fu, fv and the image center
cu, cv of the perspective projection image, which has a
resolution of 384 × 384, to 192. We obtained the motion of the
trees from the reference video as shown in Figure 6. The
generated video consisted of 199 frames. Additionally,
in Case 2, we compared the results with those obtained
by directly applying the conventional method [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] to the
equirectangular omnidirectional image. The following
sections describe the experiments for Cases 1 and 2 in
turn.
        </p>
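        <p>For reference, the settings above can be collected into a small configuration block; the variable names are illustrative, while the values are those reported in this section.</p>
        <preformat>
# Experimental settings of Section 4.1 (variable names are illustrative).
OMNI_SIZE = (1600, 800)       # equirectangular resolution (width, height)
Z_SKY, Z_WATER = 2.0, -2.0    # plane heights along the Z-axis
PERSP_SIZE = (384, 384)       # perspective projection image resolution
FU = FV = CU = CV = 192.0     # focal lengths and image center
NUM_FRAMES = 199              # length of the generated video
        </preformat>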
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Experimental Results</title>
        <sec id="sec-3-2-1">
          <title>4.2.1. Result of Case 1</title>
          <p>Figure 7 shows the omnidirectional images (the 60th and 120th
frames) generated by the proposed method in Case 1. Figures 8
and 9 show the results of converting these frames into
perspective projection images in a specific direction. In
this experiment, we converted the omnidirectional image
into the perspective projection image as shown in
Figure 10(a). Figure 10(b) shows the calculated optical
flow at the 30th frame. From this flow map, we generated
the flow maps of the water and sky planes as shown in
Figures 10(c) and (d). In these figures, the angle of motion
is represented by hue, the relative magnitude of the
motion is represented by brightness, and the saturation
is fixed at 1.</p>
          <p>Figure 10: (a) Perspective projection image; (b) flow of the perspective projection image; (c) flow of the water plane; (d) flow of the sky plane.</p>
          <p>From these experimental results, we can observe that
the sky moves naturally in the sky region, and we can
also feel perspective because the clouds just above us
move faster than those in the distance. As for the water,
we can see that the water surface moves in various
directions, successfully representing waves. In the flow
map of the water plane in Figure 10(c), we can observe
various hues between green and yellow, and the brightness
also varies, indicating that the complex motion of the
water is well represented. In contrast, the flow of the sky
shows less variation in hue compared to the water,
confirming that it moves in a mostly consistent direction.
Regarding the trees, although the motion of the trees in
the reference video is reflected in the omnidirectional
image, the tree regions still show some unnatural motion.</p>
          <p>One remaining issue is illustrated in Figure 11,
where the sun is also segmented as part of the
sky, making it move in the same way as the clouds. One
solution is to extract the sun from the sky by developing
a new semantic segmentation method and then to keep
the sun in its original position.</p>
          <p>Figure 11: (a) Omnidirectional image; (b) perspective projection.</p>
          <p>Furthermore, while this study focuses on animating
natural objects, many tourist spots also have moving
man-made objects such as cars and flags. If these objects
are not properly animated, the realism of the video is
reduced. We should develop a method for animating
man-made objects, further enhancing the realism of the
video.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>In this study, we proposed a method for generating videos
with motion of natural objects from a single omnidirectional
image by combining deep learning-based optical flow
estimation with motion modeling in 3D space for virtual
sightseeing. Through experiments, we confirmed that the
proposed method is effective. However, while the water and
sky regions moved naturally, the tree regions still showed
some unnatural motion. In future work, we will introduce
deep learning for the motion of trees as well.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgment</title>
      <p>This research was partially supported by JSPS KAKENHI
JP23K21689.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V. T.</given-names>
            <surname>Consortium</surname>
          </string-name>
          , TOWNWARP,
          <year>2024</year>
          . URL: https:// townwarp.net/,
          <source>last accessed: September</source>
          <volume>25</volume>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] AirPano, Airpano,
          <year>2024</year>
          . URL: https://www.airpano. com/,
          <source>last accessed: September</source>
          <volume>25</volume>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Valero-Franco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Berns</surname>
          </string-name>
          ,
          <article-title>A virtual reality app created with cospaces: Student perceptions and attitudes, in: Ethical Considerations of Virtual Reality in the College Classroom</article-title>
          , 1st ed.,
          <source>Routledge</source>
          ,
          <year>2023</year>
          , p.
          <fpage>16</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Suganuma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Oda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nakayama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nishikawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Paul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kawai</surname>
          </string-name>
          ,
          <article-title>Integrated system of augmented and virtual reality for ruins tourism</article-title>
          ,
          <source>in: Proceedings of NICOGRAPH International</source>
          <year>2023</year>
          ,
          <year>2023</year>
          , p.
          <fpage>85</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Google</surname>
          </string-name>
          , Google street view,
          <year>2024</year>
          . URL: https://www. google.co.jp/maps, last accessed:
          <source>July</source>
          <volume>17</volume>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lambert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Sener</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hays</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Koltun</surname>
          </string-name>
          ,
          <article-title>MSeg: A composite dataset for multi-domain semantic segmentation</article-title>
          ,
          <source>in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.-Y.</given-names>
            <surname>Chuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Goldman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. C.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Curless</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Salesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Szeliski</surname>
          </string-name>
          ,
          <article-title>Animating pictures with stochastic motion textures</article-title>
          ,
          <source>ACM Transactions on Graphics</source>
          <volume>24</volume>
          (
          <year>2005</year>
          )
          <fpage>853</fpage>
          -
          <lpage>860</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Okabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Anjyo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Igarashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-P.</given-names>
            <surname>Seidel</surname>
          </string-name>
          ,
          <article-title>Animating pictures of fluid using video examples</article-title>
          ,
          <source>Computer Graphics Forum</source>
          <volume>28</volume>
          (
          <year>2009</year>
          )
          <fpage>677</fpage>
          -
          <lpage>686</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Okabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Anjyo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Onai</surname>
          </string-name>
          ,
          <article-title>Creating fluid animation from a single image using video database</article-title>
          ,
          <source>Computer Graphics Forum</source>
          <volume>30</volume>
          (
          <year>2011</year>
          )
          <fpage>1973</fpage>
          -
          <lpage>1982</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Endo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kanamori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kuriyama</surname>
          </string-name>
          ,
          <article-title>Animating landscape: Self-supervised learning of decoupled motion and appearance for single-image video synthesis</article-title>
          ,
          <source>ACM Transactions on Graphics</source>
          <volume>38</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kakuho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ikebayashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kawai</surname>
          </string-name>
          ,
          <article-title>Motion reproduction of sky and water surface from an omnidirectional still image</article-title>
          ,
          <source>in: Proceedings of IEEE Global Conference on Consumer Electronics</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>150</fpage>
          -
          <lpage>151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Free-form image inpainting with gated convolution</article-title>
          ,
          <source>in: Proceedings of IEEE International Conference on Computer Vision</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Farnebäck</surname>
          </string-name>
          ,
          <article-title>Two-frame motion estimation based on polynomial expansion</article-title>
          ,
          <source>in: Proceedings of Scandinavian Conference on Image Analysis (SCIA</source>
          <year>2003</year>
          ), volume
          <volume>2749</volume>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>