<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Possibility estimation of 3D scene reconstruction from multiple images</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>E A Dmitriev</string-name>
          <email>DmitrievEgor94@yandex.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>V V Myasnikov</string-name>
          <email>vmyas@geosamara.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Image Processing Systems Institute of RAS - Branch of the FSRC "Crystallography and Photonics" RAS</institution>
          ,
          <addr-line>Molodogvardejskaya street 151, Samara, Russia, 443001</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Samara National Research University</institution>
          ,
          <addr-line>Moskovskoe Shosse 34А, Samara, Russia, 443086</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>293</fpage>
      <lpage>296</lpage>
      <abstract>
        <p>This paper presents a pixel-by-pixel estimation of the possibility of 3D scene reconstruction from multiple images. The method estimates the number of conjugate pairs with convolutional neural networks for further 3D reconstruction using a classic approach. We considered neural networks that have shown good results in the semantic segmentation problem. The efficiency criterion of an algorithm is the resulting estimation accuracy. We conducted all experiments on images from the Unity 3d program. The experimental results showed the effectiveness of our approach to the 3D scene reconstruction problem.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>3D-scene reconstruction is a classic computer vision problem. Algorithms for 3D-scene reconstruction are widely used in fields such as robotics, architecture, design, Earth remote sensing and automated driving systems.</p>
      <p>
        There are several methods for solving this problem [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Binocular stereo vision is one such method [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It calculates the disparity between conjugate points on rectified stereo images. The main problem is finding the conjugate points. A possible solution is to search for key points on the stereo images, compute descriptors for them, and match points by the metric values between descriptors [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. More modern approaches use convolutional neural networks [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ].
      </p>
      <p>
        3D-scene reconstruction from multiple images is a computationally expensive problem [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The current level of technology does not make it possible to reconstruct 3D scenes in real time with good quality.
      </p>
      <p>This article proposes an algorithm for estimating, in real time, the possibility of 3D-scene reconstruction from several frames of a video sequence. This procedure helps to evaluate how many images are needed to get a good-quality 3D point cloud. The algorithm estimates the number of conjugate pairs in multiple images using a deep convolutional neural network. The article presents a neural network model with a fairly small number of weights; the model can run in real time on mobile devices with a graphics accelerator. We conducted all experiments on images from the Unity 3d program.</p>
      <p>The article is structured as follows. The second section describes the main terms. The next section describes the neural network model. The fourth section presents the results of experiments. Finally, we summarize the results and discuss future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Main terms</title>
      <p>Let I_k^s(n1, n2) be an RGB image from camera k and scene s, where (n1, n2) ∈ D, D = {(n1, n2) : n1 = 0, …, N1 − 1, n2 = 0, …, N2 − 1}, k = 0, …, K − 1, s = 0, …, S − 1; N1 and N2 are the height and width of a camera image, K is the number of cameras in a scene and S is the number of scenes. Every scene differs from the others in the types or relative positions of its objects. Let l be the index of the fixed camera; we call the image from this camera the relative image. Let R_k^s(n1, n2) be a discrete function whose values are coordinates of points in space. Every value of R_k^s(n1, n2) is projected onto the corresponding position of the image plane of I_k^s(n1, n2); R_l^s(n1, n2) is the function whose values are projected onto the relative image of scene s. To form elements of the train and test datasets we consider the following function:</p>
      <p>P_j^s(n1, n2) = 0 if R_j^s(n1, n2) ≠ R_l^s(n1, n2), and P_j^s(n1, n2) = 1 if R_j^s(n1, n2) = R_l^s(n1, n2). (1)</p>
      <p>Let X_G = {(x_i, y_i)}, i = 0, …, G − 1, be a dataset, where x_i is a tensor passed through the neural network, y_i is a label tensor and G is the size of the dataset. We form the tensor x_i by concatenating m &lt; K different images I_k^s(n1, n2) with the relative image of scene s. We choose m less than K in order to form more input and label tensors from one scene. The number of elements obtained from one scene is the binomial coefficient C(K − 1, m − 1).</p>
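      <p>As a sanity check, the number of input/label pairs obtained per scene can be computed directly; a minimal Python sketch (the values K = 8, m = 5 and 23 scenes are taken from section 4):</p>
      <preformat>
```python
from math import comb

K = 8        # cameras per scene (section 4)
m = 5        # images concatenated into one input tensor
scenes = 23  # number of scenes

# One camera is fixed as the relative image, so each element picks
# m - 1 of the remaining K - 1 cameras: C(K-1, m-1) elements per scene.
per_scene = comb(K - 1, m - 1)
total = per_scene * scenes

print(per_scene)  # 35
print(total)      # 805, the dataset size reported in section 4
```
      </preformat>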
      <p>To get the label tensor we consider the following function:</p>
      <p>A_i^s(n1, n2) = Σ_{j=0, j≠l}^{m−1} P_j^s(n1, n2), (2)</p>
      <p>where j is the camera index. The values of A_i^s(n1, n2) show how many frames in the set of m images from the input tensor (not counting the relative image) contain a projection of the point R_j^s(n1, n2). We represent all values of A_i^s(n1, n2) in one-hot encoding with m bits to get the final label tensor y_i. Our task is therefore similar to the semantic segmentation, or pixel classification, problem.</p>
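      <p>The label construction above can be illustrated with a toy sketch: each value of A (how many of the m non-relative frames contain the point) becomes a one-hot vector with m bits. The array values here are hypothetical:</p>
      <preformat>
```python
m = 5  # number of images in the input tensor

# Hypothetical 2x2 map of A values for one scene.
A = [[0, 2],
     [4, 1]]

def one_hot(value, bits):
    # bits-long vector with a single 1 at position `value`
    return [1 if i == value else 0 for i in range(bits)]

labels = [[one_hot(a, m) for a in row] for row in A]
print(labels[0][1])  # [0, 0, 1, 0, 0]
```
      </preformat>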
    </sec>
    <sec id="sec-3">
      <title>3. Model description</title>
      <p>
        We considered several fully convolutional neural networks whose output tensor has the same width and height as the input tensor [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Such networks are used in the semantic segmentation task and show good performance. Among them are U-Net [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and SegNet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. These networks have a comparable number of weights.
      </p>
      <p>
        Another network we considered is LinkNet [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This model exploits all the features of U-Net while having a smaller number of weights. A specific feature of LinkNet is its use of several encoder and decoder blocks. The original model consists of four such blocks and has 11.5 million parameters.
      </p>
      <p>In this work we use 3 encoder and 3 decoder blocks to make the network run faster, in real time. We also reduced the max pooling kernel to 2 × 2 with stride 1 instead of 3 × 3 with stride 2; this keeps the feature maps from shrinking too quickly and thus preserves more information. Our model has 3 million parameters. Figure 1 shows the proposed model, while figures 2 and 3 demonstrate the architecture of the encoder and decoder blocks respectively.</p>
      <p>
        We used cross entropy as the loss function. According to [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], with random weight initialization the cross-entropy loss function reaches a local minimum that gives higher accuracy than the mean squared error loss function.
      </p>
      <p>Let v be the output tensor of the neural network, with the same shape as the label tensor. The per-pixel loss function looks as follows:</p>
      <p>H(y(n1, n2), v(n1, n2)) = −Σ_{i=0}^{m−1} y(n1, n2, i) log v(n1, n2, i). (3)</p>
      <p>The loss on the whole dataset can be calculated as follows:</p>
      <p>Q(X_G) = −(1/G) Σ_{i=0}^{G−1} Σ_{j=0}^{N1−1} Σ_{k=0}^{N2−1} Σ_{t=0}^{m−1} y_i(j, k, t) log v_i(j, k, t). (4)</p>
      <p>
        We used Adam, an adaptive stochastic gradient descent method, for optimization [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ]. We also decreased the learning rate whenever the loss on the test set failed to improve over the previous epoch.
      </p>
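      <p>The learning-rate schedule described above can be sketched as a simple reduce-on-plateau rule; the decay factor 0.5 and the loss values are assumptions for illustration, not values from the paper:</p>
      <preformat>
```python
lr = 1e-3                  # initial learning rate (assumed)
prev_loss = float("inf")

test_losses = [0.90, 0.70, 0.71, 0.60, 0.60]  # hypothetical per-epoch losses
for loss in test_losses:
    if loss >= prev_loss:  # test loss did not improve -> decay the rate
        lr *= 0.5
    prev_loss = loss

print(lr)  # the rate was halved twice (epochs 3 and 5)
```
      </preformat>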
    </sec>
    <sec id="sec-4">
      <title>4. Results of experiments</title>
      <p>We trained and tested the model on a dataset of Unity 3d images. The number of cameras K was 8, the number of scenes was 23 and the number of RGB images m in the input tensor was 5. The number of elements in the dataset was 805. Image size was 300×300. We split the data 70/30 percent into train and test datasets respectively.</p>
      <p>The input tensor contained 15 channels in the third dimension; the first 3 channels belonged to the relative image. The relative image, a non-relative image, the label tensor and the output tensor are presented in figures 4, 5, 6 and 7. Intensity in figures 6 and 7 depends on the number of conjugate pairs in the input tensor.</p>
      <p>We used the accuracy metric to estimate results. It looks as follows:</p>
      <p>M = (1 / (N1 N2 O)) Σ_{t=0}^{O−1} Σ_{i=0}^{N1−1} Σ_{j=0}^{N2−1} [argmax_u v_t(i, j, u) = argmax_l y_t(i, j, l)], (5)</p>
      <p>where O is the number of images in the test dataset, u and l are indexes over the output and label tensors respectively, and [·] is the indicator function.</p>
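      <p>Per pixel, the accuracy metric reduces to comparing argmaxes of the output and label tensors; a toy sketch on a hypothetical 2 × 2 image with two classes:</p>
      <preformat>
```python
def argmax(vec):
    # index of the largest element
    return max(range(len(vec)), key=vec.__getitem__)

# Hypothetical network outputs and one-hot labels for a 2x2 image.
outputs = [[[0.1, 0.9], [0.8, 0.2]],
           [[0.3, 0.7], [0.6, 0.4]]]
labels  = [[[0, 1], [1, 0]],
           [[1, 0], [0, 1]]]

matches = sum(
    argmax(o) == argmax(y)
    for row_o, row_y in zip(outputs, labels)
    for o, y in zip(row_o, row_y)
)
print(matches / 4)  # fraction of correctly classified pixels
```
      </preformat>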
      <p>After training the neural network on the train dataset, the accuracy on the test dataset was 0.96.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In this paper, we presented a new approach for estimating the possibility of 3D-scene reconstruction using a convolutional neural network. Our model can estimate the number of conjugate pairs from multiple images.</p>
      <p>We conducted experiments that showed the effectiveness of our approach. The aim of future research is to propose a method for estimating the camera-to-world rotation matrix and translation vector.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>The reported study was funded by RFBR according to the research projects 18-01-00748 and 17-29-3190.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><surname>Horn</surname> <given-names>B</given-names></string-name>
          <year>1986</year>
          <source>Robot Vision</source>
          (Cambridge: MIT Press)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><surname>Choy</surname> <given-names>C B</given-names></string-name>,
          <string-name><surname>Xu</surname> <given-names>D</given-names></string-name>,
          <string-name><surname>Gwak</surname> <given-names>J Y</given-names></string-name>,
          <string-name><surname>Chen</surname> <given-names>K</given-names></string-name>
          and
          <string-name><surname>Savarese</surname> <given-names>S</given-names></string-name>
          <year>2016</year>
          <article-title>3D-R2N2: A unified approach for single and multi-view 3D object reconstruction</article-title>
          <source>Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          <volume>9912</volume>
          <fpage>628</fpage>-<lpage>644</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><surname>Savchenko</surname> <given-names>A V</given-names></string-name>
          <year>2017</year>
          <article-title>Maximum-likelihood dissimilarities in image recognition with deep neural networks</article-title>
          <source>Computer Optics</source>
          <volume>41</volume>(<issue>3</issue>)
          <fpage>422</fpage>-<lpage>430</lpage>
          DOI: 10.18287/2412-6179-2017-41-3-422-430
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><surname>Lowe</surname> <given-names>D G</given-names></string-name>
          <year>2004</year>
          <article-title>Distinctive image features from scale-invariant keypoints</article-title>
          <source>International Journal of Computer Vision</source>
          <volume>60</volume>
          <fpage>91</fpage>-<lpage>110</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><surname>Žbontar</surname> <given-names>J</given-names></string-name>
          and
          <string-name><surname>LeCun</surname> <given-names>Y</given-names></string-name>
          <year>2015</year>
          <article-title>Computing the stereo matching cost with a convolutional neural network</article-title>
          <source>Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition</source>
          <fpage>1592</fpage>-<lpage>1599</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><surname>Shelhamer</surname> <given-names>E</given-names></string-name>,
          <string-name><surname>Long</surname> <given-names>J</given-names></string-name>
          and
          <string-name><surname>Darrell</surname> <given-names>T</given-names></string-name>
          <year>2017</year>
          <article-title>Fully convolutional networks for semantic segmentation</article-title>
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>39</volume>
          <fpage>640</fpage>-<lpage>651</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><surname>Ronneberger</surname> <given-names>O</given-names></string-name>,
          <string-name><surname>Fischer</surname> <given-names>P</given-names></string-name>
          and
          <string-name><surname>Brox</surname> <given-names>T</given-names></string-name>
          <year>2015</year>
          <article-title>U-Net: Convolutional networks for biomedical image segmentation</article-title>
          <source>Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          <volume>9351</volume>
          <fpage>234</fpage>-<lpage>241</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><surname>Badrinarayanan</surname> <given-names>V</given-names></string-name>,
          <string-name><surname>Kendall</surname> <given-names>A</given-names></string-name>
          and
          <string-name><surname>Cipolla</surname> <given-names>R</given-names></string-name>
          <year>2017</year>
          <article-title>SegNet: A deep convolutional encoder-decoder architecture for image segmentation</article-title>
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>39</volume>
          <fpage>2481</fpage>-<lpage>2495</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><surname>Chaurasia</surname> <given-names>A</given-names></string-name>
          and
          <string-name><surname>Culurciello</surname> <given-names>E</given-names></string-name>
          <year>2018</year>
          <article-title>LinkNet: Exploiting encoder representations for efficient semantic segmentation</article-title>
          <source>IEEE Visual Communications and Image Processing</source>
          <fpage>1</fpage>-<lpage>4</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><surname>Golik</surname> <given-names>P</given-names></string-name>,
          <string-name><surname>Doetsch</surname> <given-names>P</given-names></string-name>
          and
          <string-name><surname>Ney</surname> <given-names>H</given-names></string-name>
          <year>2013</year>
          <article-title>Cross-entropy vs. squared error training: A theoretical and experimental comparison</article-title>
          <source>Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH</source>
          <fpage>1756</fpage>-<lpage>1760</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><surname>Kingma</surname> <given-names>D P</given-names></string-name>
          and
          <string-name><surname>Ba</surname> <given-names>J</given-names></string-name>
          <year>2014</year>
          <article-title>Adam: A method for stochastic optimization</article-title>
          <source>arXiv:1412.6980</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><surname>Nikonorov</surname> <given-names>A V</given-names></string-name>,
          <string-name><surname>Petrov</surname> <given-names>M V</given-names></string-name>,
          <string-name><surname>Bibikov</surname> <given-names>S A</given-names></string-name>,
          <string-name><surname>Kutikova</surname> <given-names>V V</given-names></string-name>,
          <string-name><surname>Morozov</surname> <given-names>A A</given-names></string-name>
          and
          <string-name><surname>Kazanskiy</surname> <given-names>N L</given-names></string-name>
          <year>2017</year>
          <article-title>Image restoration in diffractive optical systems using deep learning and deconvolution</article-title>
          <source>Computer Optics</source>
          <volume>41</volume>(<issue>6</issue>)
          <fpage>875</fpage>-<lpage>887</lpage>
          DOI: 10.18287/2412-6179-2017-41-6-875-887
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>