=Paper=
{{Paper
|id=Vol-2348/short03
|storemode=property
|title=Scene-Adaptive Optimization Scheme for Depth Sensor Networks
|pdfUrl=https://ceur-ws.org/Vol-2348/short03.pdf
|volume=Vol-2348
|authors=Johannes Wetzel,Samuel Zeitvogel,Astrid Laubenheimer,Michael Heizmann
|dblpUrl=https://dblp.org/rec/conf/cerc/WetzelZLH19
}}
==Scene-Adaptive Optimization Scheme for Depth Sensor Networks==
<pdf width="1500px">https://ceur-ws.org/Vol-2348/short03.pdf</pdf>
<pre>
Internet of Things, Networks and Security


    Scene-Adaptive Optimization Scheme for Depth
                  Sensor Networks

       Johannes Wetzel1 , Samuel Zeitvogel1 , Astrid Laubenheimer1 , and Michael
                                      Heizmann2
        1
          Intelligent Systems Research Group (ISRG), Karlsruhe University of Applied
                                   Sciences, Karlsruhe, Germany
       {johannes.wetzel,samuel.zeitvogel,astrid.laubenheimer}@hs-karlsruhe.de
        2
           Institute of Industrial Information Technology (IIIT), Karlsruhe Institute of
                             Technology (KIT), Karlsruhe, Germany
                                    michael.heizmann@kit.edu


             Abstract. In this work a scheme for scene-adaptive depth sensor net-
             work optimization is presented. We propose to fuse the knowledge in-
             ferred by the sensor network into a common world model while at the
             same time exploiting this knowledge to improve the perception and post
             processing algorithms themselves. Moreover, we show how our optimiza-
             tion scheme can be applied to improve the use cases of disparity estima-
             tion as well as people detection with multiple depth sensors.

             Keywords: depth sensor networks · context aware · knowledge based
             optimization · scene-adaptive · optimization


   1        Introduction

   Low cost commodity depth sensors are an emerging technology and are applied
   to a broad field of applications such as people detection and tracking, 3D recon-
   struction or emergency detection in an ambient assisted living context. However,
   depth sensor networks as well as modern vision algorithms have many parameters
   and require fine-tuned, scene-specific configurations to achieve optimal perfor-
   mance. Due to strongly varying scenes and changing conditions at run time it
   is very challenging to fine-tune those parameters manually in real world appli-
   cations. To overcome the problem of scene-specific manual (re)configuration of
   depth sensor networks, we propose a scene-adaptive scheme which exploits the
   scene knowledge to improve perception and post processing vision algorithms.
   Our objective is not only to tune the given parameters but also to improve the
   vision algorithms, such as stereo block matching, detection or tracking by ex-
   plicit exploitation of the scene knowledge, e.g. by building scene-specific object
   models. Therefore, we fuse the knowledge inferred from the sensor network into a
   common world model, representing our current context knowledge. This knowl-
   edge is then fed back to optimize sensor parameters and algorithms to improve
   the performance of a sensor network at run time.


                                                111
                                                                          Internet of Things, Networks and Security

2         J.Wetzel et al.

2   Related work
The configuration of video sensor networks in the context of video surveillance
has been widely studied in the literature. In [13] a general overview of the dif-
ferent aspects of sensor network reconfiguration is given. Rinner et al. [12] fo-
cus on the aspect of configuration of smart camera networks in the context of
video surveillance. They review the configuration for a specific analysis task and
evaluate different configuration methods. In [8] a flexible uncertainty model is
presented to reconfigure the sensor network with the objective to optimize the
detection performance. Fischer et al. [4] give an overview of intelligent surveil-
lance systems, analyzing the information flow between sensors, world model and
inference algorithms. In [14] an overview to visual sensor networks is given.
However, prior work focuses on monocular camera networks and employs pa-
rameter reconfiguration. In contrast, our work deals with depth sensor networks
and proposes a scheme for explicit exploitation of the given scene knowledge.
This includes conventional parameter reconfiguration methods as well as meth-
ods that construct and use sophisticated world models to improve the integrated
algorithms of sensor networks at run time.

3   Scene-adaptive sensor network optimization
In this section we present a scheme for scene-adaptive sensor network optimiza-
tion. The general information flow in a depth sensor network is depicted in
Fig. 1 and separated into five different abstraction layers. The sensing layer


     sensing        sensor data     local 2D and
                   postprocessing    3D analysis


                    sensor data     local 2D and         data and know-        global data        world
     sensing
                   postprocessing    3D analysis          ledge fusion          analysis          model
                         …
        …


                                        …


     sensing        sensor data     local 2D and
                                                                                              scene and situa-
                   postprocessing    3D analysis
                                                                                                tion analysis


Fig. 1. Information flow in a depth sensor network with scene-adapative optimization
strategy.


contains low-level methods related to the raw sensor measurement such as syn-
chronization, calibration and image acquisition. In the sensor data post pro-
cessing layer depth estimation algorithms (e.g. stereo block matching), filtering
and low-level feature extractors are included. Local data analysis covers high
level vision algorithms which take the RGB-D data as input such as segmenta-
tion, recognition, object detection, local 3D object and scene reconstruction or


                                                   112
Internet of Things, Networks and Security

                   Scene-Adaptive Optimization Scheme for Depth Sensor Networks     3

   tracking of objects. Based on the results of the data and knowledge fusion,
   the global data analysis layer includes methods which make use of the fused
   information of multiple sensors of the network. Examples are 3D scene recon-
   struction, 3D object localization and global tracking. The sensor network infers
   information about the scene across abstraction levels. Over time, information is
   fused into a common world model which represents the current scene knowledge.
   While a world model can be used to do e.g. scene and situation analysis, we use
   it to optimize the parameters of each individual sensor online and support the
   data analysis methods e.g. by building scene-specific object models gradually.

   3.1     Knowledge representation
   The employed knowledge representation within the world model has to be ex-
   pressive to solve the high-level task of the sensor network and the optimization
   of the sensor network itself. The fusion layer might provide sensor data as well as
   locally derived high-level knowledge and the world model therefore might need
   to cover low-level data up to high level information. Taking these aspects into
   account, several existing approaches for knowledge representations are qualified
   to serve as world model. For most tasks and networks, a world model consisting
   of geometric and semantic scene descriptions will be suitable. Geometric scene
   knowledge thereby encompasses information about the objects contained in the
   scene and their properties. This includes the object class (e.g. humans, furnature,
   floor plan), the object location and orientation in a global world coordinate sys-
   tem, dynamic properties e.g. a motion model, shape, material. Examples for such
   a world model are object oriented world models [2, 5]. In order to enhance the
   quality of the world model, a knowledge base consisting of preprocessed informa-
   tion or prior knowledge can be used. This includes morphable shape models [3]
   for different object classes as well as common recognition, detection and segmen-
   tation models [18] which are applied on image and 3D data, e.g. RGB-D data,
   point clouds, voxels or triangulated surfaces [1]. In terms of semantic knowledge
   Fuzzy Metric Temporal Logic and Situation Graph Trees [11] or ontologies [10]
   can be incorporated. The semantic description might be data driven, e.g. Hartz
   and Neumann [6] use a scene interpretation system [7] and learn ontological
   concept descriptions from data.

   3.2     Optimization possibilities
   Depth sensor networks involve multiple algorithms which leads to a large amount
   of parameters. In this section we give an overview of parameters and methods
   which are suitable for automatic scene-adaptive sensor optimization. We assume
   that a suitable knowledge base (see section 3.1) exists and focus on algorithm
   and parameter optimization. Following our layered scheme, we categorize the
   optimization targets into three major categories, see Fig. 2. Sensing parame-
   ters have a direct impact on the measurement quality. Parts of this category
   have already been addressed. Auto exposure is state-of-the-art for decades in
   consumer cameras, but sophisticated scene models [17] can improve the result


                                              113
                                                                       Internet of Things, Networks and Security

4         J.Wetzel et al.


           sensing                     sensor data                                local 2D and
                                      postprocessing                               3D analysis


          sensor               stereo              ToF depth           object                     object
       configuration          matching             estimation         detection                  tracking


      • exposure            • minimum           • multifrequency   • discriminativ         • gate size
      • sensitivity           disparity           face               classifier            • assignment
      • resolution          • maximum             unwrapping       • generativ               distance
      • frame rate            disparity         •…                   object model          • dynamics
      • sensor pos.         • …                                    • region of               model
      • emitter strength                                             interest              • flow field
      • synchronization                                            •…                      •…
      • calibration
      •…


Fig. 2. Non-exhaustive taxonomy of building blocks within a depth sensor network
which are suitable for scene-specific optimization.


e.g. by taking only the pixel intensities near regions of interest into account.
Sensor data post processing methods vary highly between different depth
sensing technologies. The depth estimation of a stereo sensor can be improved
by setting the minimum and maximum observable disparity based on geomet-
ric scene knowledge. In section 4.1 a approach for the task of scene-adaptive
disparity estimation is presented with an exemplary knowledge representation.
Many scene-adapative local data analysis methods have already been pub-
lished. Yang et al. [16] learn global appearance and motion models to improve
multiple target tracking. Masksai et al. [9] propose a context-aware optimiza-
tion strategy for multi object tracking. They learn the most likely trajectory
patterns with respect to a given scene layout to reduce incorrect assignments
between detections and tracks. In 4.2 we show how the task of people detection
can be optimized in a scene-specific fashion.

4     Application
In this section we show the applicability of our scheme on two exemplary use
cases.

4.1    3D model based disparity estimation
Our knowledge representation contains sensor knowledge in the form of a camera
model and existing camera calibration parameters π, scene geometry using a
ground plane assumption P (h) ⊂ R3 and a 3D morphable human surface model
parameterized by β. Scene semantics are represented as segmentations of a single
human sh and the ground plane sg in the image. Let Dπ (u) be a depth image
computed using the estimated disparity values u from the image pair (I1 , I2 ).
Classical stereo algorithms estimate the disparity values u minimizing a cost
function
                    E(u) = Ephotometric (u; I1 , I2 ) + Ereg (u) ,           (1)


                                                        114
Internet of Things, Networks and Security

                   Scene-Adaptive Optimization Scheme for Depth Sensor Networks               5

   where Ephotometric is the photometric error penalizing intensity deviation in the
   local neighborhood given u and Ereg regularizes the problem penalizing unlikely
   disparity values based on simple scene assumptions. We propose to employ a
   scene-adaptive optimization scheme reformulating (1) with
              Eadaptive (u) = Ephotometric (u; I1 , I2 ) + Emodel (u[sh ], u[sg ]; β, h) ,   (2)
   where Emodel uses our provided scene representation to measure the deviation
   from the estimated depth at the segmented pixel locations u[sh ] and u[sg ] to
   the explicit geometric scene representation consisting of the ground plane at
   height h and the human shape model parameterized by β. Scene-adaptive dis-
   parity estimation is then performed by estimating û = arg minu Eadaptive (u).
   Eq.(2) can be extended in various ways, which proves the generality of the pro-
   posed approach by e.g. introducing a human motion model to enforce temporal
   consistency constraints.

   4.2     People detection with multiple depth sensors
   The sensors have a top view on the scene and a significant overlap to each
   other. Additionally, we assume that the sensors are intrinsically and extrinsically
   calibrated in advance and that the common ground plane is known. We model
   the presence of a person on the ground floor as a discrete grid of Bernoulli
   random variables X = (x1 , .., xn ), xi ∈ {0, 1} where each xi maps to one specific
   ground plane grid location g i ∈ R2 . Our goal is to infer the likelihood of a scene
   configuration X given current depth observations O = (O1 , . . . , OC ) from C
   depth sensors.
               Qn Applying Bayes’ theorem and assuming that the prior factorizes
   as p(X) = i=1 p(xi ) we get the posterior distribution
                                                          Y
                                                          n
                                      p(X|O) ∝ p(O|X)           p(xi ).                      (3)
                                                          i=1

   For this application we assume that the likelihood p(O|X) is given (see [15]
   for details on the construction of the likelihood) and only focus on the scene-
   adaptive choice of the prior p(X). We start with an uninformative prior to make
   the detection of people at every location equally likely. In many real world scenes
   this is a crude assumption due to obstacles or preferred walking tracks which
   can be present in the scene. Thus, we propose to accumulate the detections
   over time to get the relative frequencies H = (h1 , . . . , hn ) of the presence of
   people for every ground plane grid location g i and fuse those information into
   the world model. This scene-specific knowledge can be used in the feedback step
   to continuously update the prior beliefs p(xi ) accordingly to H on regular time
   intervals.

   5      Conclusion
   In the present work we have proposed a scheme for scene-adaptive optimization
   of depth sensor networks. We have given an analysis of relevant knowledge rep-
   resentations and categorized identified optimization targets. Moreover, we have


                                                   115
                                                            Internet of Things, Networks and Security

6       J.Wetzel et al.

exemplarily applied our scheme on the use cases of disparity estimation as well
as people detection with multiple depth sensors. Future work will include the
investigation of more use cases as well as proof of concept implementations.


References
 1. Ahmed, E., Saint, A., Shabayek, A.E.R., Cherenkova, K., Das, R., Gusev, G.,
    Aouada, D., Ottersten, B.: Deep learning advances on different 3d data represen-
    tations: A survey. arXiv preprint arXiv:1808.01462 (2018)
 2. Bauer, A., Emter, T., Vagts, H., Beyerer, J.: Object-oriented world model for
    surveillance systems. In: Future Security: 4th Security Research Conference. pp.
    339–345. Fraunhofer Verlag (2009)
 3. Blanz, V., Vetter, T.: A morphable model for the synthesis of 3d faces. In: Proceed-
    ings of the 26th annual conference on computer graphics and interactive techniques.
    pp. 187–194. ACM Press/Addison-Wesley Publishing Co. (1999)
 4. Fischer, Y., Beyerer, J.: A top-down-view on intelligent surveillance systems. Proc.
    of the 7th International Conference on Systems (c), 43–48 (2012)
 5. GheŢa, I., Heizmann, M., Belkin, A., Beyerer, J.: World modeling for autonomous
    systems. In: Dillmann, R., Beyerer, J., Hanebeck, U.D., Schultz, T. (eds.) KI 2010:
    Advances in Artificial Intelligence. pp. 176–183. Springer Berlin Heidelberg (2010)
 6. Hartz, J., Neumann, B.: Learning a knowledge base of ontological concepts for
    high-level scene interpretation. In: ICMLA. pp. 436–443. IEEE (2007)
 7. Hotz, L., Neumann, B.: Scene interpretation as a configuration task (2005)
 8. Kyrkou, C., Christoforou, E., Timotheou, S., Theocharides, T., Panayiotou, C.,
    Polycarpou, M.: Optimizing the detection performance of smart camera networks
    through a probabilistic image-based model. IEEE Transactions on Circuits and
    Systems for Video Technology 8215(c) (2017)
 9. Maksai, A., Wang, X., Fleuret, F., Fua, P.: Non-markovian globally consistent
    multi-object tracking. In: IEEE ICCV. pp. 2544–2554 (2017)
10. Marino, K., Salakhutdinov, R., Gupta, A.: The more you know: Using knowledge
    graphs for image classification. arXiv preprint arXiv:1612.04844 (2016)
11. Münch, D., IJsselmuiden, J., Arens, M., Stiefelhagen, R.: High-level situation recog-
    nition using fuzzy metric temporal logic, case studies in surveillance and smart
    environments. In: ICCV Workshops. pp. 882–889. IEEE (2011)
12. Rinner, B., Dieber, B., Esterle, L., Lewis, P.R., Yao, X.: Resource-aware configu-
    ration in smart camera networks. IEEE CVPR (1), 58–65 (2012)
13. Sanmiguel, J.C., Micheloni, C., Shoop, K., Foresti, G.L., Cavallaro, A.: Self-
    reconfigurable smart camera networks. IEEE Computer 47(5), 67–73 (2014)
14. Soro, S., Heinzelman, W.: A survey of visual sensor networks. Advances in Multi-
    media (2009)
15. Wetzel, J., Zeitvogel, S., Laubenheimer, A., Heizmann, M.: Towards global peo-
    ple detection and tracking using multiple depth sensors. IEEE ISETC, Timisoara
    (2018)
16. Yang, B., Nevatia, R.: An online learned crf model for multi-target tracking. IEEE
    CVPR pp. 2034–2041 (2012)
17. Yang, H., Wang, B., Vesdapunt, N., Guo, M., Kang, S.B.: Personalized attention-
    aware exposure control using reinforcement learning 14(8), 1–17 (2018)
18. Zhao, Z.Q., Zheng, P., Xu, S.t., Wu, X.: Object detection with deep learning: A
    review. arXiv preprint arXiv:1807.05511 (2018)


                                            116

</pre>