Scene-Adaptive Optimization Scheme for Depth Sensor Networks

Johannes Wetzel (johannes.wetzel@hs-karlsruhe.de), Intelligent Systems Research Group (ISRG), Karlsruhe University of Applied Sciences, Karlsruhe, Germany

Samuel Zeitvogel (samuel.zeitvogel@hs-karlsruhe.de), Intelligent Systems Research Group (ISRG), Karlsruhe University of Applied Sciences, Karlsruhe, Germany

Astrid Laubenheimer (astrid.laubenheimer@hs-karlsruhe.de), Intelligent Systems Research Group (ISRG), Karlsruhe University of Applied Sciences, Karlsruhe, Germany

Michael Heizmann (michael.heizmann@kit.edu), Institute of Industrial Information Technology (IIIT), Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany

Keywords: depth sensor networks, context aware, knowledge based optimization, scene-adaptive optimization

In this work, a scheme for scene-adaptive depth sensor network optimization is presented. We propose to fuse the knowledge inferred by the sensor network into a common world model while at the same time exploiting this knowledge to improve the perception and post-processing algorithms themselves. Moreover, we show how our optimization scheme can be applied to two use cases: disparity estimation and people detection with multiple depth sensors.

Introduction

Low-cost commodity depth sensors are an emerging technology and are used in a broad range of applications such as people detection and tracking, 3D reconstruction or emergency detection in an ambient assisted living context. However, depth sensor networks as well as modern vision algorithms have many parameters and require fine-tuned, scene-specific configurations to achieve optimal performance. Because scenes vary strongly and conditions change at run time, it is very challenging to fine-tune those parameters manually in real-world applications. To overcome the problem of scene-specific manual (re)configuration of depth sensor networks, we propose a scene-adaptive scheme which exploits the scene knowledge to improve perception and post-processing vision algorithms. Our objective is not only to tune the given parameters but also to improve the vision algorithms, such as stereo block matching, detection or tracking, by explicit exploitation of the scene knowledge, e.g. by building scene-specific object models. To this end, we fuse the knowledge inferred from the sensor network into a common world model, representing our current context knowledge. This knowledge is then fed back to optimize sensor parameters and algorithms in order to improve the performance of the sensor network at run time.

Related work

The configuration of video sensor networks in the context of video surveillance has been widely studied in the literature. In [13] a general overview of the different aspects of sensor network reconfiguration is given. Rinner et al. [12] focus on the configuration of smart camera networks in the context of video surveillance. They review the configuration for a specific analysis task and evaluate different configuration methods. In [8] a flexible uncertainty model is presented to reconfigure the sensor network with the objective of optimizing the detection performance. Fischer et al. [4] give an overview of intelligent surveillance systems, analyzing the information flow between sensors, world model and inference algorithms. In [14] an overview of visual sensor networks is given. However, prior work focuses on monocular camera networks and employs parameter reconfiguration. In contrast, our work deals with depth sensor networks and proposes a scheme for explicit exploitation of the given scene knowledge. This includes conventional parameter reconfiguration methods as well as methods that construct and use sophisticated world models to improve the integrated algorithms of sensor networks at run time.

Scene-adaptive sensor network optimization

In this section we present a scheme for scene-adaptive sensor network optimization. The general information flow in a depth sensor network is depicted in Fig. 1. While a world model can be used for tasks such as scene and situation analysis, we use it to optimize the parameters of each individual sensor online and to support the data analysis methods, e.g. by gradually building scene-specific object models.

Knowledge representation

The knowledge representation employed within the world model has to be expressive enough to support both the high-level task of the sensor network and the optimization of the sensor network itself. The fusion layer might provide sensor data as well as locally derived high-level knowledge, so the world model might need to cover everything from low-level data up to high-level information. Taking these aspects into account, several existing approaches to knowledge representation are suitable to serve as a world model. For most tasks and networks, a world model consisting of geometric and semantic scene descriptions will be suitable. Geometric scene knowledge encompasses information about the objects contained in the scene and their properties. This includes the object class (e.g. humans, furniture, floor plan), the object location and orientation in a global world coordinate system, dynamic properties (e.g. a motion model), shape and material. Examples of such a world model are object-oriented world models [2,5]. In order to enhance the quality of the world model, a knowledge base consisting of preprocessed information or prior knowledge can be used. This includes morphable shape models [3] for different object classes as well as common recognition, detection and segmentation models [18] which are applied to image and 3D data, e.g. RGB-D data, point clouds, voxels or triangulated surfaces [1]. In terms of semantic knowledge, Fuzzy Metric Temporal Logic and Situation Graph Trees [11] or ontologies [10] can be incorporated. The semantic description might be data driven, e.g. Hartz and Neumann [6] use a scene interpretation system [7] and learn ontological concept descriptions from data.
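To make the geometric part of such a world model more concrete, the following is a minimal sketch of an object-oriented representation. The class names `SceneObject` and `WorldModel`, the field names and the constant-velocity motion model are illustrative assumptions and are not taken from [2,5].

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SceneObject:
    """One entry of a hypothetical object-oriented world model."""
    object_id: int
    object_class: str                      # e.g. "human", "furniture"
    position: Tuple[float, float, float]   # metres, global world frame
    orientation_deg: float = 0.0           # heading about the vertical axis
    velocity: Tuple[float, float, float] = (0.0, 0.0, 0.0)    # m/s
    attributes: Dict[str, str] = field(default_factory=dict)  # e.g. material

    def predict_position(self, dt: float) -> Tuple[float, ...]:
        """Extrapolate the position dt seconds ahead (constant velocity)."""
        return tuple(p + v * dt for p, v in zip(self.position, self.velocity))


@dataclass
class WorldModel:
    """Container for the fused object states, indexed by object id."""
    objects: Dict[int, SceneObject] = field(default_factory=dict)

    def upsert(self, obj: SceneObject) -> None:
        """Insert a new object or overwrite its stored state."""
        self.objects[obj.object_id] = obj

    def by_class(self, object_class: str) -> List[SceneObject]:
        """Return all stored objects of the given class."""
        return [o for o in self.objects.values()
                if o.object_class == object_class]


world = WorldModel()
world.upsert(SceneObject(1, "human", (1.0, 2.0, 0.0), velocity=(0.5, 0.0, 0.0)))
world.upsert(SceneObject(2, "furniture", (3.0, 1.0, 0.0),
                         attributes={"material": "wood"}))
predicted = world.by_class("human")[0].predict_position(2.0)
```

A semantic layer (e.g. an ontology) would sit on top of such a structure; it is deliberately left out of this sketch.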

Optimization possibilities

Depth sensor networks involve multiple algorithms, which leads to a large number of parameters. In this section we give an overview of parameters and methods which are suitable for automatic scene-adaptive sensor optimization. We assume that a suitable knowledge base (see section 3.1) exists and focus on algorithm and parameter optimization. Following our layered scheme, we categorize the optimization targets into three major categories, see Fig. 2. Sensing parameters have a direct impact on the measurement quality. Parts of this category have already been addressed. Auto exposure has been state of the art in consumer cameras for decades, but sophisticated scene models [17] can improve the result, e.g. by taking only the pixel intensities near regions of interest into account.

[Fig. 2 entries: exposure, sensitivity, resolution, frame rate, sensor pos., emitter strength, synchronization, calibration, …; gate size, assignment distance, dynamics model, flow field, …; discriminative classifier, generative object model, region of interest, …; multifrequency phase unwrapping, …; minimum disparity, maximum disparity, …]
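The idea of metering only near regions of interest can be sketched as follows. This is a simplified illustration and not the method of [17]: it assumes that image intensity is roughly proportional to exposure time, and the function names, the target intensity of 128 and the exposure limits are hypothetical choices.

```python
def roi_mean_intensity(image, roi):
    """Mean intensity inside roi = (row0, col0, row1, col1), end-exclusive."""
    r0, c0, r1, c1 = roi
    values = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    return sum(values) / len(values)


def update_exposure(exposure_ms, image, roi, target=128.0,
                    min_ms=0.1, max_ms=33.0):
    """Scale the exposure so that the ROI mean moves toward `target`.

    Assumes intensity is roughly proportional to exposure time.
    """
    mean = max(roi_mean_intensity(image, roi), 1.0)  # avoid division by zero
    proposed = exposure_ms * target / mean
    return min(max(proposed, min_ms), max_ms)        # clamp to sensor limits


# 4x4 toy frame: bright background, dark region of interest in the centre.
frame = [
    [240, 240, 240, 240],
    [240,  64,  64, 240],
    [240,  64,  64, 240],
    [240, 240, 240, 240],
]
new_exposure = update_exposure(10.0, frame, roi=(1, 1, 3, 3))
full_frame_exposure = update_exposure(10.0, frame, roi=(0, 0, 4, 4))
```

In this toy frame, metering on the dark region of interest raises the exposure, whereas metering on the full frame would lower it because of the bright background. In a sensor network the region of interest would come from the world model, e.g. the projected location of a tracked person.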

Sensor data post-processing methods vary highly between different depth sensing technologies. The depth estimation of a stereo sensor can be improved by setting the minimum and maximum observable disparity based on geometric scene knowledge. In section 4.1 an approach for the task of scene-adaptive disparity estimation is presented with an exemplary knowledge representation. Many scene-adaptive local data analysis methods have already been published. Yang et al. [16] learn global appearance and motion models to improve multiple target tracking. Maksai et al. [9] propose a context-aware optimization strategy for multi-object tracking. They learn the most likely trajectory patterns with respect to a given scene layout to reduce incorrect assignments between detections and tracks. In section 4.2 we show how the task of people detection can be optimized in a scene-specific fashion.
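As a minimal sketch of how geometric scene knowledge could bound the disparity search range, assume a rectified stereo pair and the standard pinhole relation d = f · B / Z between disparity d (pixels), focal length f (pixels), baseline B (metres) and depth Z (metres). The function name `disparity_range`, the safety margin and the example geometry (a downward-looking sensor mounted 4 m above the floor, objects up to 2 m tall) are illustrative assumptions.

```python
import math


def disparity_range(focal_px, baseline_m, z_near_m, z_far_m, margin_px=1):
    """Return (min_disparity, max_disparity) in whole pixels.

    The nearest expected surface yields the largest disparity and the
    farthest expected surface yields the smallest one.
    """
    if not 0.0 < z_near_m < z_far_m:
        raise ValueError("expected 0 < z_near_m < z_far_m")
    d_max = focal_px * baseline_m / z_near_m
    d_min = focal_px * baseline_m / z_far_m
    lower = max(0, math.floor(d_min) - margin_px)
    upper = math.ceil(d_max) + margin_px
    return lower, upper


# Hypothetical scene knowledge: floor at 4 m, tallest object 2 m high,
# so the depth of any visible surface lies between 2 m and 4 m.
mount_height_m = 4.0
max_object_height_m = 2.0
d_lo, d_hi = disparity_range(
    focal_px=640.0,
    baseline_m=0.125,
    z_near_m=mount_height_m - max_object_height_m,
    z_far_m=mount_height_m,
)
```

The resulting bounds could then be passed to whichever stereo matcher is in use; a narrower search range reduces both computation and the number of implausible matches.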

Application

In this section we show the applicability of our scheme to two exemplary use cases.

3D model based disparity estimation

Our knowledge representation contains sensor knowledge in the form of a camera model and existing camera calibration parameters $\pi$, scene geometry using a ground plane assumption $P(h) \subset \mathbb{R}^3$ and a 3D morphable human surface model parameterized by $\beta$. Scene semantics are represented as segmentations of a single human $s_h$ and the ground plane $s_g$ in the image. Let $D_\pi(u)$ be a depth image computed using the estimated disparity values $u$ from the image pair $(I_1, I_2)$. Classical stereo algorithms estimate the disparity values $u$ minimizing a cost function

$$E(u) = E_{\text{photometric}}(u; I_1, I_2) + E_{\text{reg}}(u), \tag{1}$$

where $E_{\text{photometric}}$ is the photometric error penalizing intensity deviation in the local neighborhood given $u$ and $E_{\text{reg}}$ regularizes the problem penalizing unlikely disparity values based on simple scene assumptions. We propose to employ a scene-adaptive optimization scheme reformulating (1) with

$$E_{\text{adaptive}}(u) = E_{\text{photometric}}(u; I_1, I_2) + E_{\text{model}}(u[s_h], u[s_g]; \beta, h), \tag{2}$$

where $E_{\text{model}}$ uses our provided scene representation to measure the deviation from the estimated depth at the segmented pixel locations $u[s_h]$ and $u[s_g]$ to the explicit geometric scene representation consisting of the ground plane at height $h$ and the human shape model parameterized by $\beta$. Scene-adaptive disparity estimation is then performed by estimating $\hat{u} = \arg\min_u E_{\text{adaptive}}(u)$.
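The effect of the model term can be illustrated with a strongly simplified, pixel-wise sketch. It assumes that the photometric costs are already available as a small cost volume, that the scene model (fixed β and h) has been rendered into an expected disparity per segmented pixel, and that E_model is a weighted squared deviation. The function name, the quadratic penalty and the weight are illustrative assumptions and not the formulation used in the paper.

```python
def adaptive_disparity(photometric_cost, model_disparity, weight=0.5):
    """Per-pixel arg min of E_photometric + E_model.

    photometric_cost[p][d]: matching cost of pixel p at disparity d.
    model_disparity[p]: disparity predicted by the scene model for pixel p,
        or None if the pixel belongs to neither segment.
    """
    result = []
    for costs, d_model in zip(photometric_cost, model_disparity):
        def energy(d):
            # Quadratic model term, only active on segmented pixels.
            e_model = 0.0 if d_model is None else weight * (d - d_model) ** 2
            return costs[d] + e_model
        result.append(min(range(len(costs)), key=energy))
    return result


costs = [
    [5.0, 1.0, 4.0, 1.0],  # ambiguous: two equally good photometric matches
    [3.0, 0.0, 3.0, 3.0],  # unambiguous pixel outside both segments
    [2.0, 2.0, 2.0, 2.0],  # textureless: photometric term is uninformative
]
model = [3, None, 2]       # expected disparities from the scene model

u_hat = adaptive_disparity(costs, model)
u_photometric_only = adaptive_disparity(costs, [None, None, None])
```

In this toy example the model term resolves the ambiguous and the textureless pixel, while the pixel without model support keeps its purely photometric estimate.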

Eq. (2) can be extended in various ways, e.g. by introducing a human motion model to enforce temporal consistency constraints, which illustrates the generality of the proposed approach.

People detection with multiple depth sensors

The sensors have a top view of the scene and overlap significantly with each other. Additionally, we assume that the sensors are intrinsically and extrinsically calibrated in advance and that the common ground plane is known. We model the presence of a person on the ground floor as a discrete grid of Bernoulli random variables $X = (x_1, \ldots, x_n)$, $x_i \in \{0, 1\}$, where each $x_i$ maps to one specific ground plane grid location $g_i \in \mathbb{R}^2$.

For this application we assume that the likelihood $p(O \mid X)$ is given (see [15] for details on its construction) and focus only on the scene-adaptive choice of the prior $p(X)$. We start with an uninformative prior that makes the detection of people equally likely at every location. In many real-world scenes this is a crude assumption because obstacles or preferred walking tracks can be present in the scene. Thus, we propose to accumulate the detections over time to obtain the relative frequencies $H = (h_1, \ldots, h_n)$ of the presence of people for every ground plane grid location $g_i$ and to fuse this information into the world model. In the feedback step, this scene-specific knowledge can be used to update the prior beliefs $p(x_i)$ according to $H$ at regular time intervals.
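The accumulation and feedback step described above can be sketched as follows. The class name `OccupancyPrior`, the binary per-frame detection grid and the additive (Laplace) smoothing with a pseudo count are assumptions made for this illustration. The smoothing keeps the prior away from 0 and 1 and yields a uniform value of 0.5 before any detections have been observed, which is one possible choice of uninformative prior.

```python
class OccupancyPrior:
    """Accumulates per-cell detections and derives a prior p(x_i = 1)."""

    def __init__(self, n_cells, pseudo_count=1.0):
        self.hits = [0] * n_cells   # frames in which a cell held a detection
        self.n_frames = 0
        self.pseudo_count = pseudo_count

    def accumulate(self, detections):
        """detections: one 0/1 entry per ground plane grid cell."""
        if len(detections) != len(self.hits):
            raise ValueError("detection grid has the wrong size")
        for i, detected in enumerate(detections):
            self.hits[i] += 1 if detected else 0
        self.n_frames += 1

    def relative_frequencies(self):
        """H = (h_1, ..., h_n); all zeros before the first frame."""
        if self.n_frames == 0:
            return [0.0] * len(self.hits)
        return [h / self.n_frames for h in self.hits]

    def prior(self):
        """Smoothed prior p(x_i = 1) for every grid cell."""
        a = self.pseudo_count
        return [(h + a) / (self.n_frames + 2 * a) for h in self.hits]


occupancy = OccupancyPrior(n_cells=3)
initial_prior = occupancy.prior()

# Eight toy frames: cell 0 is a busy walking track, cell 1 is blocked
# by an obstacle, cell 2 is visited occasionally.
frames = [[1, 0, 0]] * 4 + [[1, 0, 1]] * 2 + [[0, 0, 0]] * 2
for frame in frames:
    occupancy.accumulate(frame)

H = occupancy.relative_frequencies()
updated_prior = occupancy.prior()
```

In a running system the call to `prior()` would be triggered at regular time intervals and its result written back to the detector, while the accumulated counts stay in the world model.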

Conclusion

In the present work we have proposed a scheme for scene-adaptive optimization of depth sensor networks. We have given an analysis of relevant knowledge representations and categorized identified optimization targets. Moreover, we have exemplarily applied our scheme to the use cases of disparity estimation as well as people detection with multiple depth sensors. Future work will include the investigation of more use cases as well as proof-of-concept implementations.

Internet of Things, Networks and Security

Fig. 1. Information flow in a depth sensor network with scene-adaptive optimization strategy.

Fig. 2. Non-exhaustive taxonomy of building blocks within a depth sensor network which are suitable for scene-specific optimization.

Our goal is to infer the likelihood of a scene configuration $X$ given current depth observations $O = (O_1, \ldots, O_C)$ from $C$ depth sensors. Applying Bayes' theorem and assuming that the prior factorizes as $p(X) = \prod_{i=1}^{n} p(x_i)$ we get the posterior distribution $p(X \mid O) \propto p(O \mid X) \prod_{i=1}^{n} p(x_i)$.

... and separated into five different abstraction layers. The sensing layer [...]

[Fig. 1 block labels: sensing; sensor data post-processing; local 2D and 3D analysis; data and knowledge fusion; global data analysis; world model; scene and situation analysis]

References

[1] E. Ahmed, A. Saint, A. E. R. Shabayek, K. Cherenkova, R. Das, G. Gusev, D. Aouada, B. Ottersten: Deep learning advances on different 3D data representations: A survey. arXiv preprint arXiv:1808.01462 (2018)

[2] A. Bauer, T. Emter, H. Vagts, J. Beyerer: Object-oriented world model for surveillance systems. In: Future Security: 4th Security Research Conference. Fraunhofer Verlag (2009)

[3] V. Blanz, T. Vetter: A morphable model for the synthesis of 3D faces. In: Proceedings of the 26th annual conference on computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co. (1999)

[4] Y. Fischer, J. Beyerer: A top-down-view on intelligent surveillance systems. In: Proc. of the 7th International Conference on Systems (2012)

[5] I. Gheţa, M. Heizmann, A. Belkin, J. Beyerer: World modeling for autonomous systems. In: R. Dillmann, J. Beyerer, U. D. Hanebeck, T. Schultz (eds.) Advances in Artificial Intelligence. Springer, Berlin Heidelberg (2010)

[6] J. Hartz, B. Neumann: Learning a knowledge base of ontological concepts for high-level scene interpretation. In: ICMLA. IEEE (2007)

[7] L. Hotz, B. Neumann: Scene interpretation as a configuration task (2005)

[8] C. Kyrkou, E. Christoforou, S. Timotheou, T. Theocharides, C. Panayiotou, M. Polycarpou: Optimizing the detection performance of smart camera networks through a probabilistic image-based model. IEEE Transactions on Circuits and Systems for Video Technology 8215(c) (2017)

[9] A. Maksai, X. Wang, F. Fleuret, P. Fua: Non-Markovian globally consistent multi-object tracking. In: IEEE ICCV (2017)

[10] K. Marino, R. Salakhutdinov, A. Gupta: The more you know: Using knowledge graphs for image classification. arXiv preprint arXiv:1612.04844 (2016)

[11] D. Münch, J. Ijsselmuiden, M. Arens, R. Stiefelhagen: High-level situation recognition using fuzzy metric temporal logic, case studies in surveillance and smart environments. In: ICCV Workshops. IEEE (2011)

[12] B. Rinner, B. Dieber, L. Esterle, P. R. Lewis, X. Yao: Resource-aware configuration in smart camera networks. In: IEEE CVPR 1 (2012)

[13] J. C. SanMiguel, C. Micheloni, K. Shoop, G. L. Foresti, A. Cavallaro: Self-reconfigurable smart camera networks. IEEE Computer 47(5) (2014)

[14] S. Soro, W. Heinzelman: A survey of visual sensor networks. Advances in Multimedia (2009)

[15] J. Wetzel, S. Zeitvogel, A. Laubenheimer, M. Heizmann: Towards global people detection and tracking using multiple depth sensors. In: IEEE ISETC (2018)

[16] B. Yang, R. Nevatia: An online learned CRF model for multi-target tracking. In: IEEE CVPR (2012)

[17] H. Yang, B. Wang, N. Vesdapunt, M. Guo, S. B. Kang: Personalized attention-aware exposure control using reinforcement learning. 14 (2018)

[18] Z. Q. Zhao, P. Zheng, S. T. Xu, X. Wu: Object detection with deep learning: A review. arXiv preprint arXiv:1807.05511 (2018)