3D Reconstruction of Gastrointestinal Regions from Single Images

Bilal Ahmad*, Pål Anders Floor, Ivar Farup and Milan Kresović
Department of Computer Science, Norwegian University of Science & Technology, 2815 Gjøvik, Norway

The 11th Colour and Visual Computing Symposium, September 08–09, 2022, Gjøvik, Norway
*Corresponding author: bilal.ahmad@ntnu.no (B. Ahmad); paal.anders.floor@ntnu.no (P. A. Floor); ivar.farup@ntnu.no (I. Farup); milank@stud.ntnu.no (M. Kresović)
ORCID: 0000-0001-8811-0404 (B. Ahmad); 0000-0001-6328-7414 (P. A. Floor); 0000-0003-3473-1138 (I. Farup); 0000-0002-1031-1668 (M. Kresović)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
3D shape reconstruction from images is one of the problems under investigation in the field of computer vision. Shape-from-shading (SfS) is an important approach which requires the reflectance properties of the surface and the position of the light source to infer the 3D shape. SfS is usually tested in medical applications without ground-truth data being available, which makes the conclusions dubious. In this article, SfS is applied to synthetic gastrointestinal regions, and a precise comparison is made between the recovered shape and the ground-truth data by measuring the depth error and the correlation between them. Results show that SfS can recover the shapes quite well if penalized correctly.

Keywords
3D reconstruction, Capsule endoscopy, Shape-from-shading

1. Introduction
With the advancement of the medical field, the current trend is to make surgery ever less invasive. This implies smaller and smaller cuts in the patient's skin, which do not give surgeons a direct view of their work; they only leave enough space for small cameras to be introduced into the patient's body. If the resulting images are of poor quality [1], the surgeon's work becomes even harder. 3D reconstruction can be helpful in such cases to better diagnose, visualize or analyze the areas of interest.

3D reconstruction is an inverse problem which can be solved by applying different techniques to the images [2]. It is vital to obtain information about the 3D structure or the scene's depth, since most tasks are carried out in the 3D world. Depth estimation uses various approaches or algorithms to obtain the spatial information of an object, or to acquire the distances of all points in the scene with respect to a specific chosen point. Vision-based depth estimation methods are generally classified into different categories. Some methods rely on special devices for depth estimation [3]; examples are ultrasonic and optical time-of-flight estimation, in which an energy beam is first transmitted and then the reflected energy is detected [4]. Other methods do not make use of any artificial source of energy; natural outdoor scenes fall into this category, and various monocular image-based techniques such as texture gradient analysis and photometric methods are used. Yet other methods hinge on the motion or on multiple relative positions of the camera [5]. 3D reconstruction has numerous applications in robotics, medical applications including diagnostics, video surveillance and monitoring, etc. [6].
Shape-from-shading (SfS) is one of the many computer vision techniques to reconstruct the 3D shape of an object. It is distinct from other methods because it requires only one image for 3D reconstruction. SfS consists of two steps. In the first step, a reflection model is developed based on the reflectance properties of the surface and the positions of the camera and the light source. In the second step, a numerical scheme is designed to solve the image irradiance equation (IIE), which is formulated either with partial differential equations (PDEs) or with optimization methods. SfS was first discussed by Horn and Brooks [7], who developed an iterative scheme based on a nonlinear first-order PDE by relating the 3D shape to the intensity variation in its image. Kimmel et al. [8] solved the SfS problem using the fast marching method. Tankus et al. re-examined the SfS problem by solving the IIE under perspective projection so that it could be applied to a broader set of real-world cases. Wu et al. [9] also solved the IIE under perspective projection, with multiple light sources around the camera center.

In real-world applications, SfS is useful in situations where only one shot of the scene is available. One recent application of SfS is capsule endoscopy [10], where the positions of the light sources, which are essential for this method, are usually known. Additionally, the rapid movement of the capsule in certain areas of the gastrointestinal (GI) tract makes SfS a preferable choice, because those areas might be captured only once. In recent years, SfS has been applied to endoscopic images for 3D reconstruction [11, 12]. Although the results seem promising, the conclusions are still uncertain because SfS methods are mostly applied without ground-truth data being available.

In this paper, a precise comparison is made between the recovered 3D shape and the ground-truth data of synthetic models of GI regions developed by [13]. The models are imported in Blender (https://www.blender.org/) and then modified for a true comparison. SfS is implemented with anisotropic diffusion as a smoothness constraint to preserve details in the recovered geometry. This is a novel aspect of this work, since the smoothness constraint is typically an L2 regularizer, which is unable to preserve edges. The depth error and the correlation between the recovered shape and the ground truth are then measured to assess the quality of the 3D reconstruction.

The remainder of this article is organized as follows. Section 2 explains the perspective SfS model with anisotropic diffusion. Results are compared and discussed in Section 3, and Section 4 concludes the article.

2. Point Light Source Perspective SfS Model
This section briefly explains the SfS model under a point light source and perspective projection, where the light source is placed at the center of the camera projection, as shown in Figure 1.

Figure 1: Perspective shape-from-shading model with the light source at the camera projection center O. The camera coordinate system (x, y, z) is centered at O; the z-axis is parallel to the optical axis, pointing towards the image plane.

Under the assumption of a diffuse surface, the radiance emitted by a surface element S can be computed according to Lambert's cosine law and the inverse-square fall-off law of a point light source [9],

$$R(\tilde{x},\tilde{y},z,p,q) = I_O\,\rho\,\frac{\mathbf{n}(\tilde{x},\tilde{y},z,p,q)\cdot\mathbf{l}(\tilde{x},\tilde{y},z)}{r(\tilde{x},\tilde{y},z)^{2}}, \qquad (1)$$

where I_O is the light intensity, ρ is the surface albedo, and p = ∂z/∂x̃ and q = ∂z/∂ỹ are the components of the surface gradient. n is the surface unit normal and l is a unit vector representing the direction of the light ray incident at point S; the factor 1/r² is the inverse-square distance fall-off of the isotropic point light. The light source is considered to be at the camera center, but the model can easily be extended to multiple point light sources not necessarily at the center [9].
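As a concrete reading of Equation (1), the following minimal NumPy sketch evaluates the reflectance map on the image grid, assuming the surface points S and unit normals n have already been computed from the current depth estimate. The function name and the clamping of back-facing points are illustrative choices, not part of the original formulation.

```python
import numpy as np

def reflectance_map(points, normals, I0=1.0, rho=1.0):
    """Evaluate Eq. (1): R = I0 * rho * (n . l) / r^2 for a Lambertian surface
    lit by a point source at the camera centre (the origin).

    points  : (H, W, 3) surface points S in camera coordinates.
    normals : (H, W, 3) unit surface normals n.
    """
    r2 = np.sum(points ** 2, axis=-1)                  # squared distance r^2 to the light
    l = -points / np.sqrt(r2)[..., None]               # unit vector from S towards the light
    n_dot_l = np.clip(np.sum(normals * l, axis=-1), 0.0, None)  # clamp back-facing points
    return I0 * rho * n_dot_l / r2
```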
The surface normal n can be represented in terms of the partial derivatives of the depth z with respect to x and y [7]:

$$\mathbf{n} = \frac{[-\partial z/\partial x,\; -\partial z/\partial y,\; 1]}{\sqrt{(\partial z/\partial x)^{2} + (\partial z/\partial y)^{2} + 1}}, \qquad (2)$$

where (x, y, z) are camera coordinates. Under perspective projection we have

$$x = \tilde{x}\,\frac{z}{F}, \qquad y = \tilde{y}\,\frac{z}{F}, \qquad (3)$$

where F is the focal length, (x̃, ỹ) are image coordinates, and the camera is pointing in the negative z-direction, as depicted in Figure 1. According to Horn and Brooks [7], the IIE is

$$R(\tilde{x},\tilde{y},z,p,q) = I(\tilde{x},\tilde{y}). \qquad (4)$$

Equation (4) is solved to estimate z by minimizing the difference between the image irradiance I(x̃, ỹ) and the reflectance map R(x̃, ỹ, z, p, q). The optimization is done on the depth z, whereas p and q are updated by taking the gradient of the updated z. The relevant optimization problem is given by

$$\arg\min_{z} E(z) = \lambda\, e_i(z) + (1 - \lambda)\, e_s(z), \qquad (5)$$

where e_i is the irradiance error, e_s represents the smoothness constraint, and λ is the weighting factor between e_i and e_s. e_i(z) can be computed over the image domain Ω ⊂ ℝ² as

$$e_i(z) = \int_{\Omega} \big(I(\tilde{x},\tilde{y}) - R(\tilde{x},\tilde{y},z,p,q)\big)^{2}\, d\Omega. \qquad (6)$$

e_s(z) is handled with anisotropic diffusion [14], a non-linear, space-variant technique used to reduce noise on the surface without smoothing edges, lines or other details that are important for interpreting the surface. It is combined with Equation (6), and the result is solved with gradient descent. A small time step Δt is introduced to ensure stability for higher values of λ. To impose anisotropic diffusion as a smoothness constraint, a 2 × 2 structure tensor is first derived from the gradient of the depth z,

$$S_{i,j} = \frac{\partial z}{\partial x_i}\,\frac{\partial z}{\partial x_j}, \qquad (7)$$

with (x₁, x₂) = (x, y). Afterwards, the corresponding eigenvalues (λ₊, λ₋) and eigenvectors (θ₊, θ₋) are derived as in [15]. From (λ₊, λ₋) and (θ₊, θ₋), the diffusion tensor D is derived as

$$\mathbf{D} = \frac{\partial \psi}{\partial \lambda_{+}}\,\theta_{+}\theta_{+}^{T} + \frac{\partial \psi}{\partial \lambda_{-}}\,\theta_{-}\theta_{-}^{T}. \qquad (8)$$

In terms of (λ₊, λ₋), the Lagrangian density ψ can be written as [14]

$$e_s(z) = \int_{\Omega} \psi(\lambda_{+}, \lambda_{-})\, d\Omega. \qquad (9)$$

Equations (6) and (9) are combined in Equation (5), which can then be written as

$$\arg\min_{z} E(z) = \int_{\Omega} \big(\lambda (I - R)^{2} + (1 - \lambda)\,\psi(\lambda_{+}, \lambda_{-})\big)\, d\Omega. \qquad (10)$$

The solution to Equation (10) is given by the Euler-Lagrange PDE

$$\lambda (I - R)\,\frac{\partial R}{\partial z} + (1 - \lambda)\,\nabla \cdot (\mathbf{D} \nabla z) = 0, \qquad (11)$$

which we solve numerically by

$$\frac{\partial z}{\partial t} = \nabla \cdot (\mathbf{D} \nabla z) + \frac{\lambda}{1 - \lambda}\,(I - R)\,\frac{\partial R}{\partial z}. \qquad (12)$$

For simplicity, and in order to linearize the diffusion tensor so that it can be computed once and for all, the structure tensor used in Equation (8) is derived from the gray-scale image υ(x̃, ỹ).
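To make the numerical scheme concrete, below is a minimal NumPy sketch of the linearized diffusion tensor and of one explicit update step of Equation (12). This is a sketch under stated assumptions, not the authors' implementation: the diffusivities ∂ψ/∂λ± are not given in the text, so a Perona-Malik-type function is assumed purely for illustration, and the reflectance map R and its derivative ∂R/∂z are assumed to be computed elsewhere (for instance with the reflectance sketch above).

```python
import numpy as np

def build_diffusion_tensor(gray, k=0.05):
    """Diffusion tensor D (Eqs. (7)-(8)), linearized by computing the structure
    tensor once from the gray-scale image. The diffusivities dpsi/dlambda are
    not specified in the text; a Perona-Malik-type function 1/(1 + lambda/k^2)
    is assumed here purely for illustration."""
    gy, gx = np.gradient(gray)
    S = np.array([[gx * gx, gx * gy],
                  [gy * gx, gy * gy]])                 # 2x2 structure tensor per pixel
    S = np.moveaxis(S, (0, 1), (-2, -1))               # shape (H, W, 2, 2)
    evals, theta = np.linalg.eigh(S)                   # eigenvalues/eigenvectors per pixel
    g = 1.0 / (1.0 + evals / k ** 2)                   # assumed diffusivities
    D = np.einsum('...k,...ik,...jk->...ij', g, theta, theta)
    return np.moveaxis(D, (-2, -1), (0, 1))            # shape (2, 2, H, W)

def sfs_update_step(z, I, R, dR_dz, D, lam, dt):
    """One explicit step of Eq. (12):
    z <- z + dt * ( div(D grad z) + lam/(1-lam) * (I - R) * dR/dz ).
    R and dR_dz must be evaluated for the current z by the caller."""
    zy, zx = np.gradient(z)                            # grad z (row, column order)
    fx = D[0, 0] * zx + D[0, 1] * zy                   # flux D * grad z
    fy = D[1, 0] * zx + D[1, 1] * zy
    div = np.gradient(fx, axis=1) + np.gradient(fy, axis=0)
    return z + dt * (div + lam / (1.0 - lam) * (I - R) * dR_dz)
```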
3. Results & Discussions

3.1. Ground Truth Models
The shape-from-shading algorithm is tested on different areas of synthetic GI regions [13]. The model is imported in Blender to render images of different areas of the model. The highlighted regions used for 3D reconstruction, along with the model, are shown in Figure 2. Blender is chosen not only to construct a ground-truth scenario but also to control different parameters, such as the light intensity and the focal length F, which are needed for 3D reconstruction using SfS.

Figure 2: GI model.

An environment similar to Figure 1 is created. The camera is placed at (0, 0, 0) and its focal length F is set to 25 mm. A point light source is placed at the camera center; it is selected to imitate the illumination mechanism of pillcams, which have four light sources around the camera center. The GI model is then cut into different regions of interest (ROIs) and placed under the camera at z < 0. The material property of the model is set to Diffuse BSDF with a constant albedo ρ = 1.

Figure 3: Rendered images of the GI regions: (a) ROI 1, (b) ROI 2, (c) ROI 3.

For a true comparison between the reconstructed surface and the ground truth, the respective regions are modified using the Python API in Blender. When a model is placed under a perspective camera, some occluded vertices/areas are not viewed by the camera. Therefore, it is necessary to remove all the occluded vertices and build a model consisting only of those vertices which are inside the camera frustum and viewed by the camera. The modified model is then exported in obj format and finally imported into MATLAB. These ground-truth models are shown in Figures 4(a), 4(c) and 4(e).

3.2. Assessment Criteria
In order to evaluate the quality of the 3D reconstruction, the reconstructed surfaces are compared with the ground truth by measuring the depth error and the correlation. These measures are chosen to assess different features of the reconstructed surfaces. The depth error is chosen because it correctly evaluates the geometric deformation of the reconstructed shape. The correlation is chosen because it evaluates the shape of the reconstructed surface independently of scale and position. The correlation is computed by estimating the variance and covariance of the recovered shape and the ground-truth data, whereas the geometric deformation is investigated by measuring the average depth error (e_d) between the recovered shapes and the ground truth, given by

$$e_d = \frac{1}{|\Omega|} \sum_{i,j \in \Omega} \left| \frac{\hat{D}_{i,j} - D_{i,j}}{D_{i,j}} \right|, \qquad (13)$$

where D is the ground truth, D̂ is the recovered 3D shape, and Ω represents the region of the 3D model considered for error estimation.

Figure 4: Ground truth (GT) comparison with recovered shapes (RS): (a) GT 1, (b) RS 1, (c) GT 2, (d) RS 2, (e) GT 3, (f) RS 3.

Table 1
Comparison between recovered shape and ground truth model.

    GI Region    Depth Error    Correlation
    ROI 1        0.4470         0.8883
    ROI 2        0.3878         0.7927
    ROI 3        0.3726         0.8725

3.3. Image Irradiance from Rendered Images
Images of size 100 × 100 of each model are rendered, as shown in Figure 3. The irradiance I(x̃, ỹ) falling on the camera sensor is related to the gray-scale image υ(x̃, ỹ) via the camera response function r(·) [9],

$$I(\tilde{x},\tilde{y}) = \frac{r^{-1}[\upsilon(\tilde{x},\tilde{y})]}{M(\tilde{x},\tilde{y})}, \qquad (14)$$

where M(x̃, ỹ) is the anisotropy of the light source. Point lights are perfectly isotropic by definition, so M(x̃, ỹ) = 1. The images are saved in the Portable Network Graphics (PNG) file format and, therefore, the image irradiance is just the gamma correction (γ = 2.2) of the gray-scale image, i.e.,

$$I(\tilde{x},\tilde{y}) = \upsilon^{\gamma}(\tilde{x},\tilde{y}). \qquad (15)$$

I(x̃, ỹ) is also converted from pixel units to physical units in order to have correspondence between I(x̃, ỹ) and R. The conversion to physical units is given by

$$I_p(\tilde{x},\tilde{y}) = \frac{I(\tilde{x},\tilde{y}) - \min I(\tilde{x},\tilde{y})}{\max I(\tilde{x},\tilde{y}) - \min I(\tilde{x},\tilde{y})} \times \left( \frac{I_O \cos\theta_1}{r_1^{2}} - \frac{I_O \cos\theta_2}{r_2^{2}} \right) + \frac{I_O \cos\theta_2}{r_2^{2}}, \qquad (16)$$

where I_p(x̃, ỹ) represents the physical value of the image irradiance, and (θ₁, r₁) and (θ₂, r₂) decide the upper and lower bounds of I_p(x̃, ỹ). (θ₁, θ₂) are the angles between the surface normal and the light ray at the maximum and minimum point on the surface, respectively, and (r₁, r₂) are the distances from the light source to the maximum and minimum point on the surface, respectively. These points are chosen by identifying the brightest and dimmest lit areas in the ground-truth model and then computing the corresponding angles and distances from the light source.
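The conversion of Section 3.3 can be sketched in a few lines. The snippet below follows Equations (15) and (16), assuming the rendered gray-scale image is already loaded as an array with values in [0, 1]; the function name and the argument layout are illustrative.

```python
import numpy as np

def physical_irradiance(gray, I0, theta1, r1, theta2, r2, gamma=2.2):
    """Eqs. (15)-(16): gamma-correct the PNG gray-scale image and rescale it
    to physical units between the brightest and dimmest irradiance values.

    gray          : gray-scale image with values in [0, 1].
    (theta1, r1)  : incidence angle and light distance at the brightest point.
    (theta2, r2)  : incidence angle and light distance at the dimmest point.
    """
    I = gray ** gamma                                  # Eq. (15)
    I = (I - I.min()) / (I.max() - I.min())            # normalise to [0, 1]
    upper = I0 * np.cos(theta1) / r1 ** 2              # upper bound of I_p
    lower = I0 * np.cos(theta2) / r2 ** 2              # lower bound of I_p
    return I * (upper - lower) + lower                 # Eq. (16)
```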
3.4. 3D Reconstruction
A flat surface is given as the initial condition in all three cases in order to test the robustness of the method. The initial reflectance map is then computed using Equation (1), and updated z values are calculated by solving Equation (12). The value of λ differs between cases and is chosen empirically in our experiments, but more weight is given to the irradiance term to obtain a better reconstruction. The 3D reconstructions of the different GI areas are shown in Figures 4(b), 4(d) and 4(f). The correlation and depth error between the recovered shapes and the ground-truth models are computed for all three regions and reported in Table 1. Due to occlusion and dim light in the lower areas of the GI regions, some parts of the recovered surfaces were smoothed out and therefore could not be recovered precisely. In spite of that, the results are quite plausible, since a high correlation and a low depth error are attained in all three cases. Certain simplifying assumptions are made because the authors are interested in a proof of concept of using SfS to reconstruct complex GI geometry. However, starting from a flat surface as the initial condition and still reaching the solution is quite reassuring for applying the SfS technique to real capsule images.
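The correlation and depth error reported in Table 1 reduce to a few lines of code. The sketch below implements the average depth error of Equation (13) and a correlation coefficient (assumed here to be the Pearson coefficient, matching the variance/covariance description in Section 3.2), with a purely synthetic usage example; the array values are illustrative only, not data from the experiments.

```python
import numpy as np

def depth_error(d_rec, d_gt):
    """Average relative depth error e_d of Eq. (13) over the evaluated region."""
    return np.mean(np.abs((d_rec - d_gt) / d_gt))

def correlation(d_rec, d_gt):
    """Correlation between recovered and ground-truth depth: shape agreement
    independent of scale and position (Pearson coefficient assumed)."""
    return float(np.corrcoef(d_rec.ravel(), d_gt.ravel())[0, 1])

# Hypothetical usage with synthetic depth maps (real use would pass the
# recovered surface and the ground-truth model exported from Blender):
d_gt = -1.0 - np.random.rand(100, 100)     # illustrative ground-truth depth
d_rec = 1.1 * d_gt + 0.05                  # illustrative recovered depth
print(depth_error(d_rec, d_gt), correlation(d_rec, d_gt))
```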
4. Conclusion and Future Work
In this article, a near-light-source perspective SfS method is applied to different GI regions. Given a reflection model, a numerical scheme is formulated with anisotropic diffusion. The shape of each region is recovered and then compared with the ground truth by measuring the average depth error and correlation. The results show that SfS can handle complex geometries if penalized correctly. In future work, the SfS method will be applied to real capsule endoscopic images, where we will have to deal with different textures, specularities, occlusions and distorted images. In addition, the brightest and dimmest image points in physical units will be estimated to obtain the right scale between I(x̃, ỹ) and R. Radiometric calibration will be needed to compute the image irradiance, and the intensity of the light sources will also be measured. All of this will be essential to correctly apply the SfS technique to real capsule images.

Acknowledgments
Funding was provided by the Research Council of Norway under the project CAPSULE, no. 300031.

References
[1] D. E. Yung, J. N. Plevris, R. Leenhardt, X. Dray, A. Koulaouzidis, E. S. B. R. W. Group, et al., Poor quality of small bowel capsule endoscopy images has a significant negative effect in the diagnosis of small bowel malignancy, Clinical and Experimental Gastroenterology 13 (2020) 475.
[2] H. Ham, J. Wesley, H. Hendra, Computer vision based 3D reconstruction: A review, International Journal of Electrical and Computer Engineering 9 (2019) 2394.
[3] X.-b. Lai, H.-s. Wang, Y.-h. Xu, A real-time range finding system with binocular stereo vision, International Journal of Advanced Robotic Systems 9 (2012) 26.
[4] J. Steckel, H. Peremans, BatSLAM: Simultaneous localization and mapping using biomimetic sonar, PLoS ONE 8 (2013) e54076.
[5] P. A. Floor, I. Farup, M. Pedersen, 3D reconstruction of the human colon from capsule endoscope video (accepted), in: Colour and Visual Computing Symposium (CVCS), 2022.
[6] A. Koulaouzidis, D. K. Iakovidis, D. E. Yung, E. Mazomenos, F. Bianchi, A. Karagyris, G. Dimas, D. Stoyanov, H. Thorlacius, E. Toth, et al., Novel experimental and software methods for image reconstruction and localization in capsule endoscopy, Endoscopy International Open 6 (2018) E205–E210.
[7] B. K. Horn, M. J. Brooks, The variational approach to shape from shading, Computer Vision, Graphics, and Image Processing 33 (1986) 174–208.
[8] R. Kimmel, J. A. Sethian, Optimal algorithm for shape from shading and path planning, Journal of Mathematical Imaging and Vision 14 (2001) 237–244.
[9] C. Wu, S. G. Narasimhan, B. Jaramaz, A multi-image shape-from-shading framework for near-lighting perspective endoscopes, International Journal of Computer Vision 86 (2010) 211–228.
[10] G. Iddan, G. Meron, A. Glukhovsky, P. Swain, Wireless capsule endoscopy, Nature 405 (2000) 417–417.
[11] A. Koulaouzidis, A. Karargyris, Three-dimensional image reconstruction in capsule endoscopy, World Journal of Gastroenterology 18 (2012) 4086.
[12] V. S. Prasath, I. N. Figueiredo, P. N. Figueiredo, K. Palaniappan, Mucosal region detection and 3D reconstruction in wireless capsule endoscopy videos using active contours, in: 2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, 2012, pp. 4014–4017.
[13] K. İncetan, I. O. Celik, A. Obeid, G. I. Gokceler, K. B. Ozyoruk, Y. Almalioglu, R. J. Chen, F. Mahmood, H. Gilbert, N. J. Durr, et al., VR-Caps: A virtual environment for capsule endoscopy, Medical Image Analysis 70 (2021) 101990.
[14] D. Tschumperlé, R. Deriche, Vector-valued image regularization with PDEs: A common framework for different applications, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005) 506–517.
[15] G. Sapiro, D. L. Ringach, Anisotropic diffusion of multivalued images with applications to color filtering, IEEE Transactions on Image Processing 5 (1996) 1582–1586.