<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>3D Reconstruction of the Human Colon from Capsule Endoscope Video⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pål Anders Floor</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivar Farup</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marius Pedersen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Norwegian University of Science and Technology (NTNU)</institution>
          ,
          <addr-line>Teknologivegen 22, 2815 Gjøvik</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper we investigate the possibility of constructing 3D models of longer sections of the human colon using image sequences obtained from wireless capsule endoscope (WCE) video, to provide enhanced viewing for gastroenterologists. As images from WCE contain severe distortions and artifacts that are non-ideal for 3D reconstruction algorithms, the problem is difficult to attack. However, recent developments of virtual graphics-based models of the human gastrointestinal system, where most of the distortions and artifacts can be enabled or disabled, make it possible to determine how each factor disturbs such algorithms individually. In this paper we disable distortions and artifacts in order to determine whether longer sections of the human intestinal environment are at all feasible to reconstruct. Through simulation we show that this is possible using structure from motion and simultaneous localization and mapping (SLAM).</p>
      </abstract>
      <kwd-group>
        <kwd>3D reconstruction</kwd>
        <kwd>capsule endoscopy</kwd>
        <kwd>structure from motion</kwd>
        <kwd>SLAM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Severe diseases in the gastrointestinal (GI) system, like Crohn’s disease, inflammatory bowel
disease, and cancer, reduce many people’s quality of life. One way to detect such diseases
at an early stage, making them more likely to combat, is to make screening of the GI system
a common procedure beyond a certain age. However, fear of pain and difficulties caused by
endoscopy is a major factor limiting the number of people who get screened without clear
symptoms [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        The wireless capsule endoscope (WCE) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which is a pill-sized capsule that the patient
swallows, is a good alternative for preventive screening, as it avoids the above-mentioned
problem and is capable of reaching all parts of the GI system. The WCE carries one or several
cameras on board, recording video while travelling through the GI system. However, current
standard WCEs have significantly lower resolution and frame rate than typical endoscopes, and
the images contain more severe noise and distortions. Further, the video is usually over eight
hours long, making it challenging for gastroenterologists to detect pathologies in the intestinal
wall. With increasing demand for intestinal screening, tools that make the gastroenterologists’
workload less demanding, and thereby reduce time use per patient, are needed.
      </p>
      <p>
        One method that can help gastroenterologists is a 3D model enhancing pathologies in the
intestinal wall, making them easier to detect. A 3D model may also prove useful for planning
of treatment. The inspiration for this approach comes from the positive feedback of using 3D
reconstruction in gastrointestinal endoscopy [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Currently, 3D models are obtained through
CT scanning, which is expensive and may expose the patient to unnecessary radiation. Therefore,
we will investigate the construction of 3D models based solely on WCE images.
      </p>
      <p>
        There are at least two methods that may be applied in order to reconstruct the 3D structure of
a scene based on WCE images: 1) Direct methods, like shape from shading (SfS), which recover
3D structure based on geometric reasoning about how light is reflected off relevant surfaces [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
here the GI wall. 2) Feature-based methods, like structure from motion (SfM), which recover 3D
shapes from features captured in multiple views of the same scene [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. SfS can reconstruct 3D
shapes from only one image, while SfM needs at least two images.
      </p>
      <p>
        With many images of the same (rigid) scene available, SfM can provide accurate 3D
reconstruction. However, this is not necessarily easy to obtain from WCE images for the
following reasons: i) Sometimes only one image is available, due to rapid movement of the WCE
or debris in the intestine. ii) SfM assumes rigid motion, which is sometimes violated due to
muscle contractions and peristalsis. iii) Sometimes the WCE position does not change enough
from frame to frame to avoid degeneracies. In case i) single-image techniques, like SfS, have to
be applied [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In case ii) one can apply non-rigid SfM (NR-SfM) [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] taking non-rigid scene
movement into account. In case iii) a simultaneous localization and mapping (SLAM) [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]
approach may be applied.
      </p>
      <p>SLAM uses the fact that both camera position and 3D structure are obtained from SfM, and
applies SfM locally among so-called key frames, which are frames with significantly different
poses. Therefore, SLAM is potentially able to detect and ignore frames that may lead to
degeneracies.</p>
      <p>
        Another problem is that WCE images are highly corrupted. Examples are debris in the
intestinal fluids, specular reflections, motion blur, heavy lens distortion, chromatic aberrations,
compression artefacts, etc. All of these factors make it hard, if at all possible, to design algorithms
for accurate 3D reconstruction, as it is difficult to single out how each of these corruptions
affects the reconstruction individually. However, with recent developments of virtual
graphics-based models of the human GI system, where most of the distortions and artifacts can be enabled
and disabled, it is possible to dissect the problem and determine how each factor disturbs the
reconstruction individually. One such model is VR-CAPS [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which is a realistic-looking
artificial GI system built from CT scans of humans, where most corruptions in the WCE
imaging process are also modelled. Further, their GI model can easily be exported and dissected in
3D modelling applications like Blender1, thereby providing a ground truth for evaluation of 3D
reconstruction algorithms, something which is hard to obtain for real WCE.
      </p>
      <p>
        In this paper we conduct a feasibility study using SfM and SLAM for 3D reconstruction of
longer sections of the human colon in an ideal situation where most of the distortions mentioned
above are turned off. We will use typical WCE image resolution and frame rate. This enables us
to conclude whether 3D reconstruction using a feature-based approach is at all possible for the unusual
and repetitive geometry of a typical colon. If the conclusion is negative, there is no point in
pursuing this problem further. We will first investigate SfM to gain basic knowledge, then use
this knowledge to investigate a SLAM approach, named ORB-SLAM [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which is a fast and
accurate approach for monocular cameras.
      </p>
      <p>In Section 2 the problem formulation is given and the existing methods we apply for our
experiments are described. In Section 3, which contains the novel results of this paper, we
experiment with SfM and ORB-SLAM to obtain 3D models of human colon segments generated
in VR-CAPS. Summary, conclusion and future research ideas are given in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Problem Formulation and Methods</title>
      <p>Direct methods usually rely on accurate radiometric information, whereas feature-based methods
rely on image features. As the WCE is continuously adapting its camera response, accurate
radiometric information is hard to obtain over image sequences. Therefore we consider feature-based
approaches here.</p>
      <p>We will assume that lens distortion and other non-linearities have been compensated for,
so that we have a pinhole model. We will also assume that typical distortions seen in pillcam
images, like specks on the lens, motion blur, etc., are taken care of through pre-processing.</p>
      <p>Generally, we assume that the image capturing process is some mapping between 3D
projective space P^3 and the 2D projective (image) plane P^2. Points in space are described in homogeneous
world coordinates as X = [λX, λY, λZ, λ]^T, and image points are in homogeneous image coordinates
x = [μx, μy, μ]^T (λ, μ ∈ R+ are some unspecified scaling factors) [5, p. 7]. For 3D points in a
point cloud, we denote the i’th point as X_i, i = 1, ⋯, N, and its image x_i. With a pinhole
camera model the relation between a point in world coordinates and image coordinates is a
mapping P : P^3 → P^2. Then, for M views (images) of a given point, X_i, in the point cloud, the
imaging process of the j’th view is given by [5, p. 154]</p>
      <p>x_ij = P_j X_i,  i = 1, ⋯, N,  j = 1, ⋯, M,  (1)</p>
      <p>where P_j is the 3 × 4 camera matrix for the j’th view, given by [5, p. 156]</p>
      <p>P_j = K [R_j | t_j].  (2)</p>
      <p>R_j is a 3 × 3 rotation matrix and t_j is a 3 × 1 translation vector, both in P^3. With WCE, the
same camera captures all views, and so the calibration or intrinsic matrix, K, is the same for all
views, given by [5, p. 156]</p>
      <p>K = [m_x f, s, x_0; 0, m_y f, y_0; 0, 0, 1],  (3)</p>
      <p>where f is the focal length, (x_0, y_0) is the principal point, s is the skew, and m_x, m_y are the
numbers of pixels per unit distance. The m-factors in K make Eq. (2) a transformation from world
coordinates to pixel coordinates. Skew is normally zero for pillcams, therefore we set s = 0 in
what follows. The other parameters can be found through a calibration procedure [5, p. 226].</p>
      <p>We assume that the first view of any image sequence is taken by the camera when located
at the world origin, i.e., P_1 = [I_3×3 | 0], with I_3×3 the 3 × 3 identity matrix. That is, camera
coordinates of the first view in the sequence are equivalent to world coordinates.</p>
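      <p>To make Eqs. (1)–(3) concrete, the following pure-Python sketch (with made-up calibration numbers, not those of any real pillcam) builds K, assembles P = K[R | t] for a camera at the world origin, and projects a homogeneous world point to pixel coordinates:</p>

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def make_K(f, mx, my, x0, y0, s=0.0):
    """Intrinsic matrix of Eq. (3); skew s = 0 as assumed for pillcams."""
    return [[mx * f, s, x0],
            [0.0, my * f, y0],
            [0.0, 0.0, 1.0]]

def make_P(K, R, t):
    """Camera matrix P = K [R | t] of Eq. (2): a 3x4 matrix."""
    Rt = [R[i] + [t[i]] for i in range(3)]
    return matmul(K, Rt)

def project(P, X):
    """Image of homogeneous world point X under P, Eq. (1),
    dehomogenized to pixel coordinates."""
    x = [sum(P[i][k] * X[k] for k in range(4)) for i in range(3)]
    return (x[0] / x[2], x[1] / x[2])

# First view at the world origin: R = I, t = 0 (hypothetical intrinsics).
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t0 = [0.0, 0.0, 0.0]
K = make_K(f=163.0, mx=1.0, my=1.0, x0=256.0, y0=256.0)
P1 = make_P(K, I3, t0)

# A world point two units in front of the camera, slightly off-axis.
X = [0.1, -0.2, 2.0, 1.0]
u, v = project(P1, X)
print(u, v)  # lands near the principal point
```

      <p>The division by the third homogeneous coordinate is what turns the projective image point into pixel coordinates.</p>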
      <sec id="sec-2-1">
        <title>2.1. Structure from Motion (SfM)</title>
        <p>
          SfM recovers both 3D structure and individual camera poses. For two, three, or four views
one can solve the SfM problem using tensors named fundamental matrix (FM), trifocal tensor
and quadrifocal tensor, respectively [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], which generate multi-linear relationships among the
coordinates of image measurements, providing closed form mathematical relations in terms of
camera matrices. For M &gt; 4 views one has to deal with the problem through bundle adjustment
(BA).
        </p>
        <sec id="sec-2-1-1">
          <title>2.1.1. Two views and fundamental matrix (FM)</title>
          <p>Take Eq. (1) with M = 2. Then any two points x_i1, x_i2, being the images of X_i in the two views,
have to be related by the epipolar constraint [5, p. 245]</p>
          <p>(x_i2)^T F x_i1 = 0, ∀i,  (4)</p>
          <p>with F, the FM, a 3 × 3 rank-2 matrix given by F = [e_2]_× P_2 (P_1)^+ [5, p. 244]. Here e_2 is
the epipole, i.e., the image of the camera center of view 1, and (P_1)^+ is the Moore-Penrose
pseudoinverse of P_1. S = [e_2]_× is a skew-symmetric matrix where S_21 = e_23, S_31 = −e_22, and
S_32 = e_21.</p>
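          <p>The matrix S = [e_2]_× can be written out explicitly. The short Python sketch below (illustrative numbers only, not from any real image pair) constructs it and checks its defining properties: [e]_× v equals the cross product e × v, and in particular [e]_× e = 0:</p>

```python
def skew(e):
    """Skew-symmetric cross-product matrix [e]_x, so that [e]_x v = e x v."""
    e1, e2, e3 = e
    return [[0.0, -e3,  e2],
            [ e3, 0.0, -e1],
            [-e2,  e1, 0.0]]

def matvec(A, v):
    return [sum(A[i][j] * v[j] for j in range(3)) for i in range(3)]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

# Illustrative epipole and test vector (made-up values).
e = [0.4, -1.3, 1.0]
v = [2.0, 0.5, -1.0]

S = skew(e)
Sv = matvec(S, v)
cv = cross(e, v)
assert all(abs(a - b) < 1e-12 for a, b in zip(Sv, cv))  # [e]_x v = e x v
assert all(abs(x) < 1e-12 for x in matvec(S, e))        # hence [e]_x e = 0
print("skew-matrix checks passed")
```

          <p>Since [e]_× e = 0, the epipole e_2 lies in the left null space of F, which is what makes F rank 2.</p>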
          <p>
            F can be numerically estimated from common features in two images. Typically, SIFT,
SURF, Eigen, or ORB features are generated in the two images and matches between them
are searched for. With n ≥ 8 such matches the normalized 8-point algorithm can estimate F [5,
p. 282]. With significant noise in the images, outliers can be problematic. These can be dealt
with by the RANSAC [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] algorithm. With F estimated, assuming that the camera center of
the first view is at the world origin, the two camera matrices are found by P_1 = [I|0] and
P_2 = [[e_2]_× F | e_2] [5, p. 256].
          </p>
          <p>With P_1, P_2 determined, one can find the 3D point X_i for the correspondence x_i1 ↔ x_i2,
satisfying the constraint (4), by a triangulation method [5, p. 311]</p>
          <p>X_i = τ(x_i1, x_i2, P_1, P_2).  (5)</p>
          <p>A common method is to use the fact that x_ij × P_j X_i = 0. For two corresponding
points, j = 1, 2, this generates four linearly independent equations contained in a matrix A. To
find X_i one solves A X_i = 0 numerically (see [5, pp. 312-313]). Typically, one minimizes the
reprojection error [5, p. 314], ε(x_i1, x_i2) = d(x_i1, x̂_i1)² + d(x_i2, x̂_i2)², subject to (4), with d(·, ·)
some distance measure.</p>
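          <p>As a minimal illustration of the triangulation step, the following Python sketch builds the rows of A from a synthetic correspondence and solves for the 3D point. For noise-free data we can fix W = 1 and solve three of the four equations exactly by Cramer's rule; a practical implementation would instead minimize over all four rows, e.g. with an SVD. The camera matrices and point are hypothetical:</p>

```python
def det3(M):
    """Determinant of a 3x3 matrix."""
    a, b, c = M[0]; d, e, f = M[1]; g, h, i = M[2]
    return a*(e*i - f*h) - b*(d*i - f*g) + c*(d*h - e*g)

def solve3(A, b):
    """Solve a 3x3 linear system by Cramer's rule."""
    D = det3(A)
    sol = []
    for j in range(3):
        Mj = [[b[i] if k == j else A[i][k] for k in range(3)] for i in range(3)]
        sol.append(det3(Mj) / D)
    return sol

def triangulate(x1, x2, P1, P2):
    """Linear (DLT-style) triangulation of one correspondence x1 <-> x2.
    Each image point (u, v) contributes the rows u*(p3.X) - p1.X = 0 and
    v*(p3.X) - p2.X = 0; with W = 1 we solve three of them exactly, which
    is fine for noise-free data (with noise, solve A X = 0 via SVD)."""
    rows = []
    for (u, v), P in ((x1, P1), (x2, P2)):
        rows.append([u * P[2][k] - P[0][k] for k in range(4)])
        rows.append([v * P[2][k] - P[1][k] for k in range(4)])
    A = [r[:3] for r in rows[:3]]
    b = [-r[3] for r in rows[:3]]
    return solve3(A, b)

# Two calibrated views: the first at the world origin, the second
# translated one unit along x (illustrative numbers only).
P1 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
P2 = [[1, 0, 0, -1], [0, 1, 0, 0], [0, 0, 1, 0]]
X_true = [0.5, 0.2, 4.0]
x1 = (X_true[0] / X_true[2], X_true[1] / X_true[2])
x2 = ((X_true[0] - 1) / X_true[2], X_true[1] / X_true[2])

X = triangulate(x1, x2, P1, P2)
print(X)  # recovers [0.5, 0.2, 4.0]
```

          <p>With noisy correspondences the four rows are inconsistent, which is why the reprojection error of Eq. (5) is minimized instead.</p>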
          <p>With F and K known, it is shown in [5, pp. 272-273] that the 3D scene can be reconstructed
up to a similarity transform, i.e., a Euclidean reconstruction with an unknown scaling factor. The
exception is the degenerate case, which can occur when the camera centers and X_i are co-linear,
or, in a practical noisy case, close to co-linear. Also, under pure rotation about the camera center,
the degenerate case F = 0 occurs. For WCE, degeneracies can occur due to both of these cases.
With K known, one can estimate the Essential Matrix (EM), E = K^T F K [5, p. 257], instead of
the FM, which is simpler to compute.</p>
          <p>
            To obtain metric reconstruction one will need additional information about the known length
of some object in the scene, which is hard to obtain in the GI-system. One effort dealing with
this issue is [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ].
          </p>
        </sec>
        <sec id="sec-2-1-2">
          <title>2.1.2. Multiple Views and Bundle Adjustment (BA)</title>
          <p>For three and four views the trifocal and quadrifocal tensors provide relations in a similar way
as the FM did for two views. However, the relations are more general. One example is the possibility
of transfer. That is, with a point correspondence between two views, the point in the third
(or fourth) will be determined. For M &gt; 4 views, the problem has to be dealt with numerically
through bundle adjustment (BA), which is a minimization problem of the form [5, p. 434]</p>
          <p>min over P̂_j, X̂_i of ∑_ij d(P̂_j X̂_i, x_ij)²,  i = 1, ⋯, N,  j = 1, ⋯, M,  (6)</p>
          <p>with d(·, ·) some distance measure, typically the Euclidean norm. That is, BA minimizes the reprojection
error over all views and 3D points. BA needs a good initial estimate, P̂_j, X̂_i, of camera poses
and 3D points, which is typically obtained by computing the FM (or trifocal tensor) sequentially
over pairs (or triplets) of neighboring images until all views in the sequence are covered [5,
p. 453].</p>
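          <p>The BA objective of Eq. (6) can be evaluated directly. The sketch below (pure Python, toy values) computes the summed squared reprojection error for a set of cameras, points, and measured image points; an actual BA implementation would minimize this quantity over the camera and point parameters, typically with Levenberg-Marquardt:</p>

```python
def project(P, X):
    """Dehomogenized image of world point X = [x, y, z, 1] under 3x4 camera P."""
    h = [sum(P[i][k] * X[k] for k in range(4)) for i in range(3)]
    return (h[0] / h[2], h[1] / h[2])

def reprojection_error(cameras, points, observations):
    """The BA objective of Eq. (6): sum over all (i, j) of the squared
    Euclidean distance between the predicted image P_j X_i and the
    measured point x_ij. BA minimizes this over cameras and points."""
    total = 0.0
    for (i, j), x_meas in observations.items():
        u, v = project(cameras[j], points[i])
        total += (u - x_meas[0]) ** 2 + (v - x_meas[1]) ** 2
    return total

# Toy setup: two views, one point, with hypothetical values.
cameras = [
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],   # P_1 = [I | 0]
    [[1, 0, 0, -1], [0, 1, 0, 0], [0, 0, 1, 0]],  # second view shifted along x
]
points = [[0.5, 0.2, 4.0, 1.0]]
obs = {(0, 0): (0.125, 0.05), (0, 1): (-0.125, 0.05)}

err_perfect = reprojection_error(cameras, points, obs)
obs[(0, 1)] = (-0.120, 0.05)  # perturb one measurement
err_perturbed = reprojection_error(cameras, points, obs)
print(err_perfect, err_perturbed)  # 0.0 and about 2.5e-05
```

          <p>In a real pipeline only the observations are fixed; the cameras and points are the unknowns, initialized as described above.</p>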
          <p>
            The main problem with BA is that it is very costly to compute for large M [5, p. 435]. This
problem is addressed by some SLAM algorithms (like ORB-SLAM) by using BA locally over
sub-sets of key frames.</p>
          <p>2.2. ORB-SLAM</p>
          <p>Performing 3D reconstruction on hundreds or even thousands of images can be necessary for
WCE video streams. Then a pure SfM approach is inconvenient, both due to computational
complexity and the difficulty of keeping track of which features are visible in a given view. For
this a SLAM approach is more convenient. As WCE video is a sequence of monocular images,
we consider an approach known to be efficient for that case, namely ORB-SLAM [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ].
          </p>
          <p>ORB-SLAM performs SfM (using FM and BA) locally among key frames, which can be seen
as structures (or objects) connected in a co-visibility graph. That is, a weighted graph where
each node is a key frame with all relevant information included (like the number of features, their
strength, and all necessary adjacency information). There are edges among key frames with
common features, where the weight corresponds to the number of features they share. The
local computation of camera poses and 3D geometry greatly reduces computational cost. A
global optimization is also performed to optimize the camera poses. ORB features
are used throughout as they are significantly faster to compute than SIFT or SURF features.</p>
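          <p>The co-visibility graph can be sketched as a small data structure: each key frame is a node, and two key frames are joined by an edge weighted by the number of features they share. The Python sketch below uses made-up feature ids:</p>

```python
def covisibility_graph(keyframes, min_shared=1):
    """Weighted co-visibility graph: nodes are key-frame ids, and an edge
    links two key frames that observe common features, weighted by the
    number of shared features. A simplified sketch of the structure used
    by ORB-SLAM (real key frames carry poses and feature descriptors)."""
    graph = {k: {} for k in keyframes}
    ids = list(keyframes)
    for a_pos in range(len(ids)):
        for b_pos in range(a_pos + 1, len(ids)):
            a, b = ids[a_pos], ids[b_pos]
            shared = len(keyframes[a] & keyframes[b])
            if shared >= min_shared:
                graph[a][b] = shared
                graph[b][a] = shared
    return graph

# Key frames mapped to the (hypothetical) ids of ORB features they observe.
keyframes = {
    0: {1, 2, 3, 4},
    1: {3, 4, 5, 6},
    2: {6, 7, 8},
}
g = covisibility_graph(keyframes)
print(g)  # {0: {1: 2}, 1: {0: 2, 2: 1}, 2: {1: 1}}
```

          <p>Local mapping then only needs to touch a key frame's neighbors in this graph, which is what keeps the BA problems small.</p>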
          <p>
            ORB-SLAM is done in three steps in addition to an initialization procedure. We provide a
brief summary here and refer to [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] for details.
          </p>
          <p>0) Initialization: One out of two methods is chosen based on the scene in question: i) a
homography if the scene is planar, or if the parallax is low; ii) a FM if the scene is not planar
and the parallax is sufficient. With F known, the EM, E = K^T F K, is estimated, and solutions
are chosen based on [5, pp. 257-260]. The choice between the two cases is done automatically
using a heuristic [10, p. 1151]. A detection of the low-parallax case is also included and will refuse
the initialization, as this leads to a bad reconstruction.</p>
          <p>1) Tracking: Localizes the pose of each frame w.r.t. the first view, which is assumed to be
at the world origin (P_1 = [I|0]), by matching ORB features. It also decides if a given frame
should be inserted as a key frame in the co-visibility graph. The poses are then optimized
using BA (6) over the P_j’s only. If tracking is lost, a place recognition module is used in a global
re-localization procedure.</p>
          <p>2) Local Mapping: Processes new key frames and performs local BA to obtain a sparse
3D reconstruction in the surroundings of the relevant pose/frame. New correspondences for
unmatched ORB features are searched in key frames directly connected in the co-visibility graph to
triangulate new 3D points. If a key frame is found to be redundant, i.e., if it does not change
enough compared to other key frames, or if it lacks high-quality point matches, it is discarded.</p>
          <p>
            3) Loop Closure: With every new key frame the algorithm searches for loops (i.e., when the
camera re-visits a previous part of the scene). When a loop is detected it is possible to estimate
drifts in the data, like drift in scale and position. This is the essential step to remove or minimize
such errors.</p>
          <p>2.3. VR-CAPS</p>
          <p>VR-CAPS is a virtual environment for WCE [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ], which is publicly available on GitHub2. The
environment is based on Unity, which is a game platform developed by Unity Technologies3. The
The environment simulates a range of organ types, capsule endoscopy designs, normal and
abnormal tissue conditions as well as many other features detailed in [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. It is also possible to
emulate non-rigid movements like peristalsis. Therefore, VR-CAPS enables testing of medical
imaging algorithms both for current and future WCE designs.
          </p>
          <p>The standard setup in VR-CAPS is a virtual colon model which is built from CT scans of a real
human colon and covered with realistic looking textures. A section of this colon is depicted in
Fig. 1(a) and an example image captured by the WCE inside this segment is depicted in Fig. 1(b).
Many pillcam models can be built, but the default is a standard-sized pill with one camera and a
spotlight with a conical beam, emulating the several point lights surrounding the lens often seen in
standard WCEs. We will use the standard setup for our experiments.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Simulation setup and Experiments</title>
      <p>We run the VR-CAPS simulator through several subsets of the colon segment shown in Fig. 1.
These subsets can be seen in Figs. 2(a), 4(a), 5(a) and serve as ground truth for the example 3D
reconstructions. We set the image size to 512 × 512, the framerate to 20 fps, the focal length f = 163,
and the principal point x_0 = y_0 = 163 (in pixel units). Further, all distortion effects are
disabled by default. However, we will enable some distortions in turn to evaluate their impact on
the reconstruction. The WCE is controlled in VR-CAPS by key buttons and mouse. A steady
movement is difficult to obtain, therefore the resulting WCE trajectory becomes irregular and
ragged, especially through sharp bends. However, this movement appears quite similar to that
of a real WCE, and will therefore test ORB-SLAM’s ability to cope with quite a realistic
movement.
2https://github.com/CapsuleEndoscope/VirtualCapsuleEndoscopy (31/10-21)
3https://unity.com/</p>
      <p>We first consider SfM on a short sequence of images to gain insight into which distortions the
algorithm is sensitive to. This knowledge will be useful in deciding suitable pre-processing for
ORB-SLAM.</p>
      <p>We evaluate the reconstruction mainly through geometric inspection, visually comparing
the resulting 3D point cloud models to the ground truth in Figs. 2(a), 4(a) and 5(a). For more
exact evaluation, a numerical comparison to ground truth is needed, which can be obtained by
computing the average distance between all reconstructed points and the ground truth. However,
this is not straightforward to obtain, as one has to exclude from the ground truth all parts of the
colon not captured by the camera over long image sequences, as well as triangulate the
reconstructed point cloud in an optimal way. This is currently work in progress.</p>
      <sec id="sec-3-1">
        <title>3.1. Structure from motion</title>
        <p>We consider two cases: 2-view and 6-view SfM. Since K is known, we estimate the EM,
E = K^T F K. SURF features are used to estimate the EM (and thereby the camera matrices)
and 3D points, whereas Eigen features are used to compute dense point clouds once the camera
matrices are known. For M = 6 views, an
initial reconstruction is made by sequentially computing the EM for pairs of consecutive frames
(as in Algorithm 18.3 in [5, p. 453]) followed by BA. All relevant computation and estimation
methods for our purposes are found in Matlab’s computer vision toolbox4.</p>
        <p>Based on experimentation on images obtained from VR-CAPS we concluded that the following
pre-processing is needed: one has to remove specks on the lens and specular reflections, as they tend
to confuse the feature detection algorithm. Motion blur causes similar problems, particularly in
conjunction with rapid rotations and panning caused by rapid movements of the WCE from
image to image. Lens distortion makes the assumption of a pinhole camera fail, and therefore leads
to very sparse and inaccurate 3D point clouds.
4https://se.mathworks.com/products/computer-vision.html (10/11-21)</p>
        <p>As the WCE has a spotlight source, lighting will
vary hugely across the image. Due to dim lighting, particularly in fields imaging deeper parts,
contrast enhancement is essential to detect stable features. Due to the variation in brightness
we applied adaptive histogram equalization.</p>
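        <p>To illustrate the idea behind this contrast-enhancement step, the sketch below implements plain (global) histogram equalization on a tiny 8-bit grayscale image in pure Python; the adaptive variant applies the same CDF-based remapping per image tile, with contrast clipping (e.g. Matlab's adapthisteq). The image values are made up:</p>

```python
def equalize_histogram(img, levels=256):
    """Global histogram equalization of an 8-bit grayscale image (list of
    rows). CLAHE, as used for the WCE frames, applies the same CDF-based
    remapping per tile with clipping; this is the global version."""
    flat = [p for row in img for p in row]
    n = len(flat)
    hist = [0] * levels
    for p in flat:
        hist[p] += 1
    # Cumulative distribution of the intensities.
    cdf, run = [], 0
    for h in hist:
        run += h
        cdf.append(run)
    cdf_min = next(c for c in cdf if c > 0)

    def remap(p):
        # Integer rescaling of the CDF onto the full intensity range.
        return (cdf[p] - cdf_min) * (levels - 1) // (n - cdf_min)

    return [[remap(p) for p in row] for row in img]

# A dim, low-contrast 2x4 toy image (values made up for illustration).
img = [[50, 50, 60, 60],
       [60, 70, 70, 80]]
out = equalize_histogram(img)
print(out)  # [[0, 0, 127, 127], [127, 212, 212, 255]]
```

        <p>The narrow band of dim values is stretched across the full 0–255 range, which is what makes weak features detectable in the darker, deeper parts of the image.</p>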
        <p>The colon segment we aim to reconstruct is shown in Fig. 2(a). The image seen by view
1 is depicted in Fig. 1(b). This view is the reference for the 3D computation. The 2-view
reconstruction is depicted in real color from the front and side in Figs. 2(b) and 2(c) respectively,
with the position of camera 1 included. The 6-view reconstruction is shown in Fig. 2(d) from the
side. The reconstruction is quite convincing, even with two images, but clearly more noisy than
the 6-view case. The 6-view case is more restrictive and accurate as it rules out some outliers.
Therefore, the point cloud may be somewhat less dense. Note in particular that the overall
“cylindrical” geometry of the colon in Fig. 2(a) is reconstructed quite well, indicating that the
algorithm eliminates perspective distortion.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. ORB-SLAM</title>
        <p>For longer segments than the one in Fig. 2(a) a large number of views may be necessary. Then
ORB-SLAM is needed.</p>
        <p>Algorithm for Densification: As ORB-SLAM is optimized for fast computation and
accurate localization, it produces a sparse point cloud only containing those 3D points of high
accuracy needed to optimize camera localization. To make a denser point cloud for the purpose
of GI inspection, we use the camera poses and co-visibility graph obtained by ORB-SLAM, then
traverse the graph computing dense SfM over sub-sets of key-frames as detailed in Algorithm 1
below.</p>
        <p>Algorithm 1. Densification of ORB-SLAM point cloud</p>
        <p>Input: i) Tracking data from ORB-SLAM, P_k = K[R_k | t_k], for all key-frames k = 1, ⋯, K. ii)
Co-visibility graph of key-frame objects.</p>
        <p>Initialization: i) Point cloud array. ii) Max number of views, M_max, used in dense 3D reconstruction.</p>
        <p>Algorithm:
for k = 1 to K:
i) Determine the number of key-frames, M, with strong co-visible features shared with
key-frame k, for frames m &gt; k.
ii) if M &gt; M_max, set M to M_max.
iii) if M = 0, set the point cloud to zero and jump to i) for the next key frame.
iv) else perform M-view SfM (as in Section 2) with dense features, given P_m, m = k, ⋯, k + M − 1,
with key frame k as reference view.
v) Rule out degeneracies: if disproportionately large values exist in the point cloud, set it to zero
and jump to i) for the next key frame.
vi) Denoise the point cloud and store it in the point cloud array.
end
vii) Concatenate all point clouds using available position data.</p>
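        <p>The densification loop above can be sketched in Python as follows, with stand-in stubs for the dense M-view SfM, the denoising, and the degeneracy test; the co-visibility weights and "clouds" in the dry run are made up for illustration:</p>

```python
def densify(keyframes, covis, M_max, run_sfm, denoise, is_degenerate,
            min_weight=5):
    """Sketch of Algorithm 1: traverse the key frames in order, run dense
    M-view SfM over each key frame and its strongly co-visible successors,
    discard degenerate clouds, denoise the rest, and concatenate.
    `covis[k]` maps co-visible key-frame ids to shared-feature weights;
    `run_sfm`, `denoise`, and `is_degenerate` stand in for the real
    components of Sections 2 and 3.1."""
    clouds = []
    for k in keyframes:
        # i) successors of k with strong co-visible features
        group = [m for m, w in sorted(covis.get(k, {}).items())
                 if m > k and w >= min_weight]
        # ii) cap the number of views
        M = min(len(group), M_max)
        # iii) nothing to reconstruct from this key frame
        if M == 0:
            continue
        # iv) dense M-view SfM with key frame k as reference view
        cloud = run_sfm([k] + group[:M])
        # v) rule out degeneracies
        if is_degenerate(cloud):
            continue
        # vi) denoise and store
        clouds.append(denoise(cloud))
    # vii) concatenate all point clouds (position alignment omitted here)
    return [p for c in clouds for p in c]

# Tiny dry run with fake components: each "cloud" is one point per view id.
covis = {0: {1: 9, 2: 5}, 1: {2: 7}, 2: {}}
result = densify(
    keyframes=[0, 1, 2], covis=covis, M_max=2,
    run_sfm=lambda views: [float(v) for v in views],
    denoise=lambda c: c,
    is_degenerate=lambda c: max(c) > 100,
)
print(result)  # [0.0, 1.0, 2.0, 1.0, 2.0]
```

        <p>In the real pipeline `run_sfm` is the dense multi-view SfM of Section 3.1 and the concatenation uses the ORB-SLAM camera positions to place each partial cloud.</p>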
        <p>Simulation Setup: In initialization step 0) (see section 2.2) we force the algorithm to choose a
FM model, as planar scenes never occur in the GI-system. Due to the WCE movement, initialization
may fail due to low parallax. If the initialization is rejected, we skip to the next frame and
re-start the algorithm until the initialization succeeds. Loop closure (Step 3) should be disabled,
as loops never occur when the WCE travels through the GI system. The repetitive geometrical
structure of the colon also tends to confuse the ORB-SLAM algorithm, misinterpreting these
structures as potential loop-closure candidates. Without loop closure one has to expect inaccuracies,
as both scale and position will drift, particularly over longer segments. In sharp bends of the
colon this will be most noticeable due to large rotations or panning of the camera. Here we
try to avoid huge scaling errors by running ORB-SLAM several times over different colon
segments. In a real scenario, one would likely make 3D models only in segments of the colon
surrounding pathologies. However, if longer segments are needed, one may concatenate several
reconstructed segments after estimating some scale factors. One may then use the fact that
the colon can be approximately described as a tube with radius being contained within certain
boundaries.</p>
        <p>ORB features are detected during SLAM. However, SURF and Eigen features are applied to
compute dense reconstruction in Algorithm 1 as they appear to produce more reliable features
for typical colon geometry and texture. We assume the same pre-processing as for SfM. The
textures of the colon walls as well as its geometry are both crucial to obtain enough features to
obtain a decent reconstruction. This leads to feature detection over a range of scales. To cover
all relevant scales, 8 pyramid levels in the feature detection are needed. Further, all frames with
significant motion blur are removed manually as they make the algorithm fail due to lack of
ORB-feature matches.</p>
        <p>Experiments: We consider three scenarios of colon segments that a 3D reconstruction
algorithm should be able to handle: 1. Nearly straight short segments. 2. Longer sections
bending slowly. 3. Shorter segments with sharp bends.</p>
        <p>A version of ORB-SLAM has been implemented by the Matlab community5. We build on
and extend this example for our purposes here.</p>
        <p>Scenario 1: The colon segment under consideration is given in Fig. 3(a) and is the same as in
the SfM case. 68 images were generated of this segment in VR-CAPS, and 49 key-frames were
chosen by the ORB-SLAM algorithm for reconstruction. Notice that the reconstruction is less
noisy and denser than what was the case for pure SfM in Fig. 2. It also appears to be a better
reconstruction which fits quite well with the colon model as shown in Fig. 3(d). This indicates
that ORB-SLAM copes with colon geometry, providing better accuracy than pure SfM.</p>
        <p>Scenario 2: The colon segment under consideration is given in Fig. 4(a). 996 images were
generated of this segment in VR-CAPS, and 445 key-frames were chosen by the ORB-SLAM
algorithm for reconstruction. The estimated camera poses, i.e., the movement of the camera
through the relevant segment, as well as the corresponding sparse point cloud is shown in
Fig. 4(b). “Optimized trajectory” refers to a global optimization over all key-frame camera poses
after ORB-SLAM. The “ragged” trajectories fit with the simulated movement obtained through
VR-CAPS. The sparse cloud seems to capture the overall shape of the colon segment. The dense
reconstruction in Figs. 4(c) and 4(d) shows a clearer outline of the reconstruction, and appears
to have quite similar shape to the relevant segment. Note in particular that the narrowing of
the colon is captured quite well. However, there is quite some noise in the cloud, particularly
around the sharpest bend as well as close to the end of the segment, which is expected due to
lack of loop closure. Overall, the result is quite promising.</p>
        <p>Scenario 3: The colon segment under consideration is given in Fig. 5(a). 1173 images were
generated of this segment in VR-CAPS, and 332 key-frames were chosen by the ORB-SLAM
algorithm for reconstruction. The movement of the camera through the relevant segment, as
well as the corresponding sparse point cloud is shown in Fig. 5(b). The sparse cloud again seems
to capture the rough outline of the colon segment, and the “ragged” trajectory is in line with the
simulated movement. The dense reconstruction in Figs. 5(c) and 5(d) shows a clearer outline of
the reconstruction, and appears to have quite similar shape to the relevant segment. However,
there is even more noise than in Scenario 2, especially after the sharp bend, which is expected
due to scale and position drift. Also, we can see that the cloud is denser on the outer side of
the bend, whereas it is very sparse, or lacking completely, on the inner side. The reason is that
5https://se.mathworks.com/help/vision/ug/monocular-visual-simultaneous-localization-and-mapping.html
(20/11-21)
the camera is mainly facing outwards while it moves through the bend. A real WCE has
a fisheye lens with a much larger viewing angle, and so one may expect this effect to be less
severe (but still present). Anyway, the results seem quite promising, especially given the fact
that the results are not yet optimized and fully (post-)processed.
        <p>As ORB-SLAM copes quite well with all three scenarios, even the sharp bends, it seems that
3D reconstruction of the human colon is indeed possible. As mentioned earlier, further
evaluation and post-processing first require a numerical comparison to ground truth.</p>
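        <p>As a purely illustrative sketch of such a ground-truth comparison (not part of this paper's pipeline; the point sets, noise level, and function name below are hypothetical), one could report nearest-neighbour distances from the reconstructed cloud to a densely sampled ground-truth surface:</p>

```python
import numpy as np

def cloud_to_truth_error(recon, truth):
    """Mean and max nearest-neighbour distance from each reconstructed
    point to the ground-truth point set (brute force, (N, 3) arrays)."""
    # pairwise squared distances, shape (N, M)
    d2 = ((recon[:, None, :] - truth[None, :, :]) ** 2).sum(-1)
    nn = np.sqrt(d2.min(axis=1))  # distance to closest ground-truth point
    return nn.mean(), nn.max()

# toy example: "ground truth" sampled on a cylinder (a crude colon stand-in),
# "reconstruction" = truth plus small Gaussian noise
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 500)
z = rng.uniform(0, 10, 500)
truth = np.stack([np.cos(theta), np.sin(theta), z], axis=1)
recon = truth + rng.normal(scale=0.01, size=truth.shape)
mean_err, max_err = cloud_to_truth_error(recon, truth)
```

For real clouds one would first align the reconstruction to the ground truth (the monocular scale is arbitrary) and use a spatial index instead of the brute-force distance matrix.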
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Summary and Conclusions</title>
      <p>In this paper we have investigated the possibility of 3D reconstruction of the human colon
from WCE images using structure from motion and ORB-SLAM. To generate data sets, we used
a virtual graphics-based environment emulating both the human colon and the WCE’s
movement through it. The experimental results in this paper indicate that 3D reconstruction of
the human colon is possible.</p>
      <p>
        Future research should aim at optimizing the 3D reconstruction process and finding suitable
post-processing methods to improve the resulting point cloud. Extensions to more realistic
scenarios include non-rigid motion as well as enabling the distortions and artefacts seen in real
WCE images, all of which can be emulated in VR-CAPS. Then, with suitable pre-processing
algorithms, based on what we have learned through these experiments, in place, a study on real
WCE videos will become possible. One should also combine the methods studied here with single-
image techniques, like the effort in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] using shape from shading, to cope with a broader range of scenarios.
Lastly, one may eliminate the drift caused by the lack of loop closure in ORB-SLAM using
additional available information. The WCE emits electromagnetic radiation received by several
on-body sensors, which can be used to track its position quite accurately [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], or compute the path length
traveled [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. This can help to correct for drift in position.
      </p>
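      <p>As a hypothetical sketch of how such external position measurements could correct scale and position drift (not the authors' method; the trajectories below are synthetic), the SLAM trajectory can be aligned to the sensed capsule positions with a least-squares similarity transform estimated by Umeyama's method:</p>

```python
import numpy as np

def umeyama(src, dst):
    """Least-squares similarity transform (scale s, rotation R, translation t)
    mapping src onto dst (Umeyama, 1991). Both arrays are (N, 3)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)             # cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1                       # guard against reflections
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / xs.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

# toy check: a SLAM path with wrong global scale and offset vs. the
# "sensor-tracked" positions it should match
rng = np.random.default_rng(1)
truth = rng.normal(size=(100, 3))
slam = 0.5 * truth + np.array([1.0, -2.0, 0.5])
s, R, t = umeyama(slam, truth)
corrected = s * slam @ R.T + t             # drift-corrected trajectory
```

A global similarity transform fixes only constant scale and offset; time-varying drift would require tying individual key-frame poses to the sensed positions, e.g. inside the bundle adjustment.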
    </sec>
    <sec id="sec-5">
      <title>5. Acknowledgments</title>
      <p>We would like to thank Anuja Vats for bringing the VR-CAPS environment to our attention.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Editorial</surname>
          </string-name>
          ,
          <article-title>Improving uptake of colorectal cancer screening</article-title>
          ,
          <source>The Lancet Gastroenterology &amp; Hepatology</source>
          <volume>2</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>A. T. C.</surname>
          </string-name>
          et al.,
          <article-title>Wireless capsule endoscopy</article-title>
          ,
          <source>Gastrointestinal Endoscopy</source>
          <volume>78</volume>
          (
          <year>2013</year>
          )
          <fpage>805</fpage>
          -
          <lpage>815</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>K. N.</surname>
          </string-name>
          et al.,
          <article-title>Three-dimensional upper gastrointestinal endoscopy: A clinical study of safety and an ex vivo study of utility in endoscopic submucosal dissection</article-title>
          ,
          <source>Gastrointestinal Endoscopy</source>
          <volume>87</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Horn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brooks</surname>
          </string-name>
          ,
          <article-title>The variational approach to shape from shading</article-title>
          ,
          <source>Computer Vision, Graphics, and Image Processing</source>
          <volume>33</volume>
          (
          <year>1986</year>
          )
          <fpage>174</fpage>
          -
          <lpage>208</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R. I.</given-names>
            <surname>Hartley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          , Multiple View Geometry in Computer Vision, second ed., Cambridge University Press, ISBN 0521540518
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Floor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Farup</surname>
          </string-name>
          ,
          <article-title>3D reconstruction of gastrointestinal regions from single images</article-title>
          ,
          <source>in: Colour and Visual Computing Symposium (CVCS)</source>
          , Gjøvik, Norway,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jensen</surname>
          </string-name>
          et al.,
          <article-title>A benchmark and evaluation of non-rigid structure from motion</article-title>
          ,
          <source>International Journal on Computer Vision</source>
          <volume>129</volume>
          (
          <year>2021</year>
          )
          <fpage>882</fpage>
          -
          <lpage>899</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sidhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tretschk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Golyanik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agudo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Theobalt</surname>
          </string-name>
          ,
          <article-title>Neural dense non-rigid structure from motion with latent space constraints</article-title>
          ,
          <source>in: European Conference on Computer Vision (ECCV)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Murray</surname>
          </string-name>
          ,
          <article-title>Parallel tracking and mapping for small AR workspaces</article-title>
          ,
          <source>in: 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>225</fpage>
          -
          <lpage>234</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mur-Artal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M. M.</given-names>
            <surname>Montiel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Tardós</surname>
          </string-name>
          ,
          <article-title>ORB-SLAM: A versatile and accurate monocular SLAM system</article-title>
          ,
          <source>IEEE Transactions on Robotics</source>
          <volume>31</volume>
          (
          <year>2015</year>
          )
          <fpage>1147</fpage>
          -
          <lpage>1163</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Incetan</surname>
          </string-name>
          et al.,
          <article-title>VR-CAPS: A virtual environment for capsule endoscopy</article-title>
          ,
          <source>Medical Image Analysis</source>
          <volume>70</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Fischler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Bolles</surname>
          </string-name>
          ,
          <article-title>Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>24</volume>
          (
          <year>1981</year>
          )
          <fpage>381</fpage>
          -
          <lpage>395</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Dimas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bianchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. K.</given-names>
            <surname>Iakovidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karargyris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ciuti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koulaouzidis</surname>
          </string-name>
          ,
          <article-title>Endoscopic single-image size measurements</article-title>
          ,
          <source>Measurement Science and Technology</source>
          <volume>31</volume>
          (
          <year>2020</year>
          )
          <fpage>9</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>Moussakhani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Flåm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Støa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Balasingham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ramstad</surname>
          </string-name>
          ,
          <article-title>On localisation accuracy inside the human abdomen region</article-title>
          ,
          <source>IET Wireless Sensor Systems</source>
          <volume>2</volume>
          (
          <year>2012</year>
          )
          <fpage>9</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bjørnevik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Floor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Balasingham</surname>
          </string-name>
          ,
          <article-title>On path length estimation for wireless capsule endoscopy</article-title>
          ,
          <source>in: 12th International Symposium on Medical Information and Communication Technology (ISMICT)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>