3D Reconstruction of Gastrointestinal Regions from Single Images

Bilal Ahmad*, Pål Anders Floor, Ivar Farup and Milan Kresović
Department of Computer Science, Norwegian University of Science & Technology, 2815 Gjøvik, Norway

The 11th Colour and Visual Computing Symposium, September 08–09, 2022, Gjøvik, Norway
*Corresponding author: bilal.ahmad@ntnu.no (B. Ahmad); paal.anders.floor@ntnu.no (P. A. Floor); ivar.farup@ntnu.no (I. Farup); milank@stud.ntnu.no (M. Kresović)
ORCID: 0000-0001-8811-0404 (B. Ahmad); 0000-0001-6328-7414 (P. A. Floor); 0000-0003-3473-1138 (I. Farup); 0000-0002-1031-1668 (M. Kresović)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
3D shape reconstruction from images is one of the problems under investigation in the field of computer vision. Shape-from-shading (SfS) is an important approach which requires the reflectance properties of the surface and the position of the light source to infer the 3D shape. SfS is usually tested in medical applications without ground-truth data being available, which makes the conclusions dubious. In this article, SfS is applied to synthetic gastrointestinal regions, and a precise comparison is made between the recovered shape and the ground-truth data by measuring the depth error and the correlation between them. Results show that SfS can recover the shapes quite well if penalized correctly.

Keywords
3D reconstruction, Capsule endoscopy, Shape-from-shading

1. Introduction
With the advancement of the medical field, the current trend is to make surgery ever less invasive. This implies smaller and smaller cuts in the patient's skin, which do not give surgeons a direct view of their work; they only leave enough space for small cameras to be introduced into the patient's body. If the resulting images are of poor quality [1], the surgeon's work becomes even harder. 3D reconstruction can be helpful in such cases to better diagnose, visualize or analyze the areas of interest.

3D reconstruction is an inverse problem which can be solved by applying different techniques to the images [2]. It is vital to obtain information about the 3D structure or the scene's depth, since most tasks are carried out in the 3D world. Depth estimation uses various approaches or algorithms to obtain the spatial information of an object, or to acquire the distances of all points in the scene with respect to a specific chosen point. Vision-based depth estimation methods are generally classified into different categories. Some methods rely on special devices for depth estimation [3]; examples are ultrasonic and optical time-of-flight estimation, in which an energy beam is first transmitted and then the reflected energy is detected [4]. Other methods do not make use of any artificial source of energy; natural outdoor scenes fall into this category, and various monocular image-based techniques such as texture gradient analysis and photometric methods are used. Yet other methods hinge on the motion or on multiple relative positions of the camera [5]. 3D reconstruction has numerous applications in robotics, medical applications including diagnostics, video surveillance and monitoring, etc. [6].
Shape-from-shading (SfS) is one of the many computer vision techniques to reconstruct the 3D shape of an object. It is distinct from other methods because it requires only one image for 3D reconstruction. SfS consists of two steps. In the first step, a reflection model is developed based on the reflectance properties of the surface and the positions of the camera and the light source. In the second step, a numerical scheme is designed to solve the image irradiance equation (IIE), which is formulated either with partial differential equations (PDEs) or with optimization methods. SfS was first discussed by Horn and Brooks [7], who developed an iterative scheme based on a nonlinear first-order PDE by relating the 3D shape to the intensity variation in its image. Kimmel et al. [8] solved the SfS problem using the fast marching method. Tankus et al. re-examined the SfS problem by solving the IIE under perspective projection so that it could be applied to a broader set of real-world cases. Wu et al. [9] also solved the IIE under perspective projection, with multiple light sources around the camera center.

In real-world applications, SfS is useful in situations where only one shot of the scene is available. One recent application of SfS is capsule endoscopy [10], where the positions of the light sources, which are essential for this method, are usually known. Additionally, the rapid movement of the capsule in certain areas of the gastrointestinal (GI) tract makes SfS a preferable choice, because those areas might be captured only once. In recent years, SfS has been applied to endoscopic images for 3D reconstruction [11, 12]. Although the results seem promising, the conclusions are still uncertain because SfS methods are mostly applied without ground-truth data being available.

In this paper, a precise comparison is made between the recovered 3D shape and the ground-truth data of synthetic models of GI regions developed by [13]. The models are imported in Blender (https://www.blender.org/) and then modified for a true comparison. SfS is implemented with anisotropic diffusion as a smoothness constraint to preserve details in the recovered geometry. This is a novel aspect of this work, since the smoothness constraint is typically an L2 regularizer, which is unable to preserve edges. The depth error and the correlation between the recovered shape and the ground truth are then measured to assess the quality of the 3D reconstruction.

The remainder of this article is organized as follows. Section 2 explains the perspective SfS model with anisotropic diffusion. Results are compared and discussed in Section 3, and Section 4 concludes the article.

2. Point Light Source Perspective SfS Model
This section briefly explains the SfS model under a point light source and perspective projection, where the light source is placed at the center of the camera projection, as shown in Figure 1.

Figure 1: Perspective shape-from-shading model with the light source at the camera projection center O. The camera coordinate system (x, y, z) is centered at O; the z-axis is parallel to the optical axis, pointing towards the image plane.

Under the assumption of a diffuse surface, the radiance emitted by a surface element S can be computed according to Lambert's cosine law and the inverse-square fall-off law of a point light source [9],

$$R(\tilde{x},\tilde{y},z,p,q) = I_O\,\rho\,\frac{\mathbf{n}(\tilde{x},\tilde{y},z,p,q)\cdot\mathbf{l}(\tilde{x},\tilde{y},z)}{r(\tilde{x},\tilde{y},z)^{2}}, \qquad (1)$$

where I_O is the light intensity, ρ is the surface albedo, and p = ∂z/∂x̃ and q = ∂z/∂ỹ are the components of the surface gradient. n is the surface unit normal and l is a unit vector representing the direction of the light ray incident at point S; the factor 1/r² is the inverse-square distance fall-off of the isotropic point light. The light source is considered to be at the camera center, but the model can easily be extended to multiple point light sources not necessarily at the center [9].
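As a concrete reading of Equation (1), the following minimal NumPy sketch evaluates the reflectance map on the image grid, assuming the surface points S and unit normals n have already been computed from the current depth estimate. The function name and the clamping of back-facing points are illustrative choices, not part of the original formulation.

```python
import numpy as np

def reflectance_map(points, normals, I0=1.0, rho=1.0):
    """Evaluate Eq. (1): R = I0 * rho * (n . l) / r^2 for a Lambertian surface
    lit by a point source at the camera centre (the origin).

    points  : (H, W, 3) surface points S in camera coordinates.
    normals : (H, W, 3) unit surface normals n.
    """
    r2 = np.sum(points ** 2, axis=-1)                  # squared distance r^2 to the light
    l = -points / np.sqrt(r2)[..., None]               # unit vector from S towards the light
    n_dot_l = np.clip(np.sum(normals * l, axis=-1), 0.0, None)  # clamp back-facing points
    return I0 * rho * n_dot_l / r2
```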
The surface normal n can be represented in terms of the partial derivatives of the depth z with respect to x and y [7]:

$$\mathbf{n} = \frac{[-\partial z/\partial x,\; -\partial z/\partial y,\; 1]}{\sqrt{(\partial z/\partial x)^{2} + (\partial z/\partial y)^{2} + 1}}, \qquad (2)$$

where (x, y, z) are camera coordinates. Under perspective projection we have

$$x = \tilde{x}\,\frac{z}{F}, \qquad y = \tilde{y}\,\frac{z}{F}, \qquad (3)$$

where F is the focal length, (x̃, ỹ) are image coordinates, and the camera is pointing in the negative z-direction, as depicted in Figure 1. According to Horn and Brooks [7], the IIE is

$$R(\tilde{x},\tilde{y},z,p,q) = I(\tilde{x},\tilde{y}). \qquad (4)$$

Equation (4) is solved to estimate z by minimizing the difference between the image irradiance I(x̃, ỹ) and the reflectance map R(x̃, ỹ, z, p, q). The optimization is done on the depth z, whereas p and q are updated by taking the gradient of the updated z. The relevant optimization problem is given by

$$\arg\min_{z} E(z) = \lambda\, e_i(z) + (1 - \lambda)\, e_s(z), \qquad (5)$$

where e_i is the irradiance error, e_s represents the smoothness constraint, and λ is the weighting factor between e_i and e_s. e_i(z) can be computed over the image domain Ω ⊂ ℝ² as

$$e_i(z) = \int_{\Omega} \big(I(\tilde{x},\tilde{y}) - R(\tilde{x},\tilde{y},z,p,q)\big)^{2}\, d\Omega. \qquad (6)$$

e_s(z) is handled with anisotropic diffusion [14], a non-linear, space-variant technique used to reduce noise on the surface without smoothing edges, lines or other details that are important for interpreting the surface. It is combined with Equation (6), and the result is solved with gradient descent. A small time step Δt is introduced to ensure stability for higher values of λ. To impose anisotropic diffusion as a smoothness constraint, a 2 × 2 structure tensor is first derived from the gradient of the depth z,

$$S_{i,j} = \frac{\partial z}{\partial x_i}\,\frac{\partial z}{\partial x_j}, \qquad (7)$$

with (x₁, x₂) = (x, y). Afterwards, the corresponding eigenvalues (λ₊, λ₋) and eigenvectors (θ₊, θ₋) are derived as in [15]. From (λ₊, λ₋) and (θ₊, θ₋), the diffusion tensor D is derived as

$$\mathbf{D} = \frac{\partial \psi}{\partial \lambda_{+}}\,\theta_{+}\theta_{+}^{T} + \frac{\partial \psi}{\partial \lambda_{-}}\,\theta_{-}\theta_{-}^{T}. \qquad (8)$$

In terms of (λ₊, λ₋), the Lagrangian density ψ can be written as [14]

$$e_s(z) = \int_{\Omega} \psi(\lambda_{+}, \lambda_{-})\, d\Omega. \qquad (9)$$

Equations (6) and (9) are combined in Equation (5), which can then be written as

$$\arg\min_{z} E(z) = \int_{\Omega} \big(\lambda (I - R)^{2} + (1 - \lambda)\,\psi(\lambda_{+}, \lambda_{-})\big)\, d\Omega. \qquad (10)$$

The solution to Equation (10) is given by the Euler-Lagrange PDE

$$\lambda (I - R)\,\frac{\partial R}{\partial z} + (1 - \lambda)\,\nabla \cdot (\mathbf{D} \nabla z) = 0, \qquad (11)$$

which we solve numerically by

$$\frac{\partial z}{\partial t} = \nabla \cdot (\mathbf{D} \nabla z) + \frac{\lambda}{1 - \lambda}\,(I - R)\,\frac{\partial R}{\partial z}. \qquad (12)$$

For simplicity, and in order to linearize the diffusion tensor so that it can be computed once and for all, the structure tensor used in Equation (8) is derived from the gray-scale image υ(x̃, ỹ).
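To make the numerical scheme concrete, below is a minimal NumPy sketch of the linearized diffusion tensor and of one explicit update step of Equation (12). This is a sketch under stated assumptions, not the authors' implementation: the diffusivities ∂ψ/∂λ± are not given in the text, so a Perona-Malik-type function is assumed purely for illustration, and the reflectance map R and its derivative ∂R/∂z are assumed to be computed elsewhere (for instance with the reflectance sketch above).

```python
import numpy as np

def build_diffusion_tensor(gray, k=0.05):
    """Diffusion tensor D (Eqs. (7)-(8)), linearized by computing the structure
    tensor once from the gray-scale image. The diffusivities dpsi/dlambda are
    not specified in the text; a Perona-Malik-type function 1/(1 + lambda/k^2)
    is assumed here purely for illustration."""
    gy, gx = np.gradient(gray)
    S = np.array([[gx * gx, gx * gy],
                  [gy * gx, gy * gy]])                 # 2x2 structure tensor per pixel
    S = np.moveaxis(S, (0, 1), (-2, -1))               # shape (H, W, 2, 2)
    evals, theta = np.linalg.eigh(S)                   # eigenvalues/eigenvectors per pixel
    g = 1.0 / (1.0 + evals / k ** 2)                   # assumed diffusivities
    D = np.einsum('...k,...ik,...jk->...ij', g, theta, theta)
    return np.moveaxis(D, (-2, -1), (0, 1))            # shape (2, 2, H, W)

def sfs_update_step(z, I, R, dR_dz, D, lam, dt):
    """One explicit step of Eq. (12):
    z <- z + dt * ( div(D grad z) + lam/(1-lam) * (I - R) * dR/dz ).
    R and dR_dz must be evaluated for the current z by the caller."""
    zy, zx = np.gradient(z)                            # grad z (row, column order)
    fx = D[0, 0] * zx + D[0, 1] * zy                   # flux D * grad z
    fy = D[1, 0] * zx + D[1, 1] * zy
    div = np.gradient(fx, axis=1) + np.gradient(fy, axis=0)
    return z + dt * (div + lam / (1.0 - lam) * (I - R) * dR_dz)
```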
3. Results & Discussions

3.1. Ground Truth Models
The shape-from-shading algorithm is tested on different areas of synthetic GI regions [13]. The model is imported in Blender to render images of different areas of the model. The highlighted regions used for 3D reconstruction, along with the model, are shown in Figure 2. Blender is chosen not only to construct a ground-truth scenario but also to control different parameters, such as the light intensity and the focal length F, which are needed for 3D reconstruction using SfS.

Figure 2: GI model.

An environment similar to Figure 1 is created. The camera is placed at (0, 0, 0) and its focal length F is set to 25 mm. A point light source is placed at the camera center; it is selected to imitate the illumination mechanism of pillcams, which have four light sources around the camera center. The GI model is then cut into different regions of interest (ROIs) and placed under the camera at z < 0. The material property of the model is set to Diffuse BSDF with a constant albedo ρ = 1.

Figure 3: Rendered images of the GI regions: (a) ROI 1, (b) ROI 2, (c) ROI 3.

For a true comparison between the reconstructed surface and the ground truth, the respective regions are modified using the Python API in Blender. When a model is placed under a perspective camera, some occluded vertices/areas are not viewed by the camera. Therefore, it is necessary to remove all the occluded vertices and build a model consisting only of those vertices which are inside the camera frustum and viewed by the camera. The modified model is then exported in obj format and finally imported into MATLAB. These ground-truth models are shown in Figures 4(a), 4(c) and 4(e).

3.2. Assessment Criteria
In order to evaluate the quality of the 3D reconstruction, the reconstructed surfaces are compared with the ground truth by measuring the depth error and the correlation. These measures are chosen to assess different features of the reconstructed surfaces. The depth error is chosen because it correctly evaluates the geometric deformation of the reconstructed shape. The correlation is chosen because it evaluates the shape of the reconstructed surface independently of scale and position. The correlation is computed by estimating the variance and covariance of the recovered shape and the ground-truth data, whereas the geometric deformation is investigated by measuring the average depth error (e_d) between the recovered shapes and the ground truth, given by

$$e_d = \frac{1}{|\Omega|} \sum_{i,j \in \Omega} \left| \frac{\hat{D}_{i,j} - D_{i,j}}{D_{i,j}} \right|, \qquad (13)$$

where D is the ground truth, D̂ is the recovered 3D shape, and Ω represents the region of the 3D model considered for error estimation.

Figure 4: Ground truth (GT) comparison with recovered shapes (RS): (a) GT 1, (b) RS 1, (c) GT 2, (d) RS 2, (e) GT 3, (f) RS 3.

Table 1
Comparison between recovered shape and ground truth model.

    GI Region    Depth Error    Correlation
    ROI 1        0.4470         0.8883
    ROI 2        0.3878         0.7927
    ROI 3        0.3726         0.8725

3.3. Image Irradiance from Rendered Images
Images of size 100 × 100 of each model are rendered, as shown in Figure 3. The irradiance I(x̃, ỹ) falling on the camera sensor is related to the gray-scale image υ(x̃, ỹ) via the camera response function r(·) [9],

$$I(\tilde{x},\tilde{y}) = \frac{r^{-1}[\upsilon(\tilde{x},\tilde{y})]}{M(\tilde{x},\tilde{y})}, \qquad (14)$$

where M(x̃, ỹ) is the anisotropy of the light source. Point lights are perfectly isotropic by definition, so M(x̃, ỹ) = 1. The images are saved in the Portable Network Graphics (PNG) file format and, therefore, the image irradiance is just the gamma correction (γ = 2.2) of the gray-scale image, i.e.,

$$I(\tilde{x},\tilde{y}) = \upsilon^{\gamma}(\tilde{x},\tilde{y}). \qquad (15)$$

I(x̃, ỹ) is also converted from pixel units to physical units in order to have correspondence between I(x̃, ỹ) and R. The conversion to physical units is given by

$$I_p(\tilde{x},\tilde{y}) = \frac{I(\tilde{x},\tilde{y}) - \min I(\tilde{x},\tilde{y})}{\max I(\tilde{x},\tilde{y}) - \min I(\tilde{x},\tilde{y})} \times \left( \frac{I_O \cos\theta_1}{r_1^{2}} - \frac{I_O \cos\theta_2}{r_2^{2}} \right) + \frac{I_O \cos\theta_2}{r_2^{2}}, \qquad (16)$$

where I_p(x̃, ỹ) represents the physical value of the image irradiance, and (θ₁, r₁) and (θ₂, r₂) decide the upper and lower bounds of I_p(x̃, ỹ). (θ₁, θ₂) are the angles between the surface normal and the light ray at the maximum and minimum point on the surface, respectively, and (r₁, r₂) are the distances from the light source to the maximum and minimum point on the surface, respectively. These points are chosen by identifying the brightest and dimmest lit areas in the ground-truth model and then computing the corresponding angles and distances from the light source.
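The conversion of Section 3.3 can be sketched in a few lines. The snippet below follows Equations (15) and (16), assuming the rendered gray-scale image is already loaded as an array with values in [0, 1]; the function name and the argument layout are illustrative.

```python
import numpy as np

def physical_irradiance(gray, I0, theta1, r1, theta2, r2, gamma=2.2):
    """Eqs. (15)-(16): gamma-correct the PNG gray-scale image and rescale it
    to physical units between the brightest and dimmest irradiance values.

    gray          : gray-scale image with values in [0, 1].
    (theta1, r1)  : incidence angle and light distance at the brightest point.
    (theta2, r2)  : incidence angle and light distance at the dimmest point.
    """
    I = gray ** gamma                                  # Eq. (15)
    I = (I - I.min()) / (I.max() - I.min())            # normalise to [0, 1]
    upper = I0 * np.cos(theta1) / r1 ** 2              # upper bound of I_p
    lower = I0 * np.cos(theta2) / r2 ** 2              # lower bound of I_p
    return I * (upper - lower) + lower                 # Eq. (16)
```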
3.4. 3D Reconstruction
A flat surface is given as the initial condition in all three cases in order to test the robustness of the method. The initial reflectance map is then computed using Equation (1), and updated z values are calculated by solving Equation (12). The value of λ differs between cases and is chosen empirically in our experiments, but more weight is given to the irradiance term to obtain a better reconstruction. The 3D reconstructions of the different GI areas are shown in Figures 4(b), 4(d) and 4(f). The correlation and depth error between the recovered shapes and the ground-truth models are computed for all three regions and reported in Table 1. Due to occlusion and dim light in the lower areas of the GI regions, some parts of the recovered surfaces were smoothed out and therefore could not be recovered precisely. In spite of that, the results are quite plausible, since a high correlation and a low depth error are attained in all three cases. Certain simplifying assumptions are made because the authors are interested in a proof of concept of using SfS to reconstruct complex GI geometry. However, starting from a flat surface as the initial condition and still reaching the solution is quite reassuring for applying the SfS technique to real capsule images.
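The correlation and depth error reported in Table 1 reduce to a few lines of code. The sketch below implements the average depth error of Equation (13) and a correlation coefficient (assumed here to be the Pearson coefficient, matching the variance/covariance description in Section 3.2), with a purely synthetic usage example; the array values are illustrative only, not data from the experiments.

```python
import numpy as np

def depth_error(d_rec, d_gt):
    """Average relative depth error e_d of Eq. (13) over the evaluated region."""
    return np.mean(np.abs((d_rec - d_gt) / d_gt))

def correlation(d_rec, d_gt):
    """Correlation between recovered and ground-truth depth: shape agreement
    independent of scale and position (Pearson coefficient assumed)."""
    return float(np.corrcoef(d_rec.ravel(), d_gt.ravel())[0, 1])

# Hypothetical usage with synthetic depth maps (real use would pass the
# recovered surface and the ground-truth model exported from Blender):
d_gt = -1.0 - np.random.rand(100, 100)     # illustrative ground-truth depth
d_rec = 1.1 * d_gt + 0.05                  # illustrative recovered depth
print(depth_error(d_rec, d_gt), correlation(d_rec, d_gt))
```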
4. Conclusion and Future Work
In this article, a near-light-source perspective SfS method is applied to different GI regions. Given a reflection model, a numerical scheme is formulated with anisotropic diffusion. The shape of each region is recovered and then compared with the ground truth by measuring the average depth error and correlation. The results show that SfS can handle complex geometries if penalized correctly. In future work, the SfS method will be applied to real capsule endoscopic images, where we will have to deal with different textures, specularities, occlusions and distorted images. In addition, the brightest and dimmest image points in physical units will be estimated to obtain the right scale between I(x̃, ỹ) and R. Radiometric calibration will be needed to compute the image irradiance, and the intensity of the light sources will also be measured. All of this will be essential to correctly apply the SfS technique to real capsule images.

Acknowledgments
Funding was provided by the Research Council of Norway under the project CAPSULE, no. 300031.

References
[1] D. E. Yung, J. N. Plevris, R. Leenhardt, X. Dray, A. Koulaouzidis, E. S. B. R. W. Group, et al., Poor quality of small bowel capsule endoscopy images has a significant negative effect in the diagnosis of small bowel malignancy, Clinical and Experimental Gastroenterology 13 (2020) 475.
[2] H. Ham, J. Wesley, H. Hendra, Computer vision based 3D reconstruction: A review, International Journal of Electrical and Computer Engineering 9 (2019) 2394.
[3] X.-b. Lai, H.-s. Wang, Y.-h. Xu, A real-time range finding system with binocular stereo vision, International Journal of Advanced Robotic Systems 9 (2012) 26.
[4] J. Steckel, H. Peremans, BatSLAM: Simultaneous localization and mapping using biomimetic sonar, PLoS ONE 8 (2013) e54076.
[5] P. A. Floor, I. Farup, M. Pedersen, 3D reconstruction of the human colon from capsule endoscope video (accepted), in: Colour and Visual Computing Symposium (CVCS), 2022.
[6] A. Koulaouzidis, D. K. Iakovidis, D. E. Yung, E. Mazomenos, F. Bianchi, A. Karagyris, G. Dimas, D. Stoyanov, H. Thorlacius, E. Toth, et al., Novel experimental and software methods for image reconstruction and localization in capsule endoscopy, Endoscopy International Open 6 (2018) E205–E210.
[7] B. K. Horn, M. J. Brooks, The variational approach to shape from shading, Computer Vision, Graphics, and Image Processing 33 (1986) 174–208.
[8] R. Kimmel, J. A. Sethian, Optimal algorithm for shape from shading and path planning, Journal of Mathematical Imaging and Vision 14 (2001) 237–244.
[9] C. Wu, S. G. Narasimhan, B. Jaramaz, A multi-image shape-from-shading framework for near-lighting perspective endoscopes, International Journal of Computer Vision 86 (2010) 211–228.
[10] G. Iddan, G. Meron, A. Glukhovsky, P. Swain, Wireless capsule endoscopy, Nature 405 (2000) 417–417.
[11] A. Koulaouzidis, A. Karargyris, Three-dimensional image reconstruction in capsule endoscopy, World Journal of Gastroenterology 18 (2012) 4086.
[12] V. S. Prasath, I. N. Figueiredo, P. N. Figueiredo, K. Palaniappan, Mucosal region detection and 3D reconstruction in wireless capsule endoscopy videos using active contours, in: 2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, 2012, pp. 4014–4017.
[13] K. İncetan, I. O. Celik, A. Obeid, G. I. Gokceler, K. B. Ozyoruk, Y. Almalioglu, R. J. Chen, F. Mahmood, H. Gilbert, N. J. Durr, et al., VR-Caps: A virtual environment for capsule endoscopy, Medical Image Analysis 70 (2021) 101990.
[14] D. Tschumperlé, R. Deriche, Vector-valued image regularization with PDEs: A common framework for different applications, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005) 506–517.
[15] G. Sapiro, D. L. Ringach, Anisotropic diffusion of multivalued images with applications to color filtering, IEEE Transactions on Image Processing 5 (1996) 1582–1586.