S-AVE: Semantic Active Vision Exploration and Mapping of Indoor Environments for Mobile Robots

José V. Jaramillo¹, Roberto Capobianco²[0000-0002-2219-215X], Francesco Riccio²[0000-0002-9112-8143], and Daniele Nardi²[0000-0001-6606-200X]

Sapienza University of Rome, Department of Computer, Control and Management Engineering "Antonio Ruberti"
¹ josevicentejaramillo@gmail.com
² {lastname}@diag.uniroma1.it

Abstract. Semantic mapping is fundamental to enable cognition and high-level planning in robotics. It is a difficult task due to the need to generalize across different scenarios and sensory data types. Hence, most techniques do not obtain a rich and accurate semantic map of the environment and of the objects therein. To tackle this issue, we present a novel approach that exploits active vision and drives environment exploration with the aim of improving the quality of the semantic map.

Keywords: Semantic mapping, Map exploration, Mobile robots.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Simultaneous exploration and map building are fundamental skills for mobile robots. However, building a comprehensive map of the environment – spanning from raw sensory observations to high-level semantic concepts [12] – is an extremely difficult task [13]. In the literature, autonomous map building processes rely upon map exploration techniques that provide robots with an effective strategy to visit unknown portions of the environment. In this context, proposed approaches [14] maximize exploration while minimizing the time spent in building the map. However, standard solutions to this problem are limited to a geometric reconstruction of the environment and, most importantly, they formalize the exploration strategy by only considering geometric and topological landmarks of the environment [14]. Conversely, semantic mapping [6, 9] enhances geometric, metric and topological knowledge about the environment by means of semantic concepts, thus enabling improved robot cognition. In this context, classic exploration techniques are still generally used, resulting in inaccurate and incomplete semantic maps.

To tackle this issue, and to improve robot capabilities in exhaustively exploring the environment at the semantic level, we introduce S-AVE (Semantic-Based Active Vision Exploration), a new map exploration technique. We refer to map exploration as the exploration of an environment with the goal of building an internal representation of it, including semantic knowledge. Standard map exploration techniques, such as frontier-based exploration [4, 7], do not focus on object reconstruction, and the strategy that the robot executes aims at maximizing the amount of visited portions of the environment while disregarding the objects therein. Such approaches, moreover, do not exploit semantic labels to influence the map exploration strategy and often result in incomplete and poorly detailed maps, which limit robots' autonomy and abilities. S-AVE, instead, explicitly uses detected objects to drive exploration by means of active vision [2, 3, 1], and it additionally combines semantic information with state-of-the-art map exploration techniques [8] to improve the 3D representation of the environment. We validate S-AVE in five different indoor environments in the Gazebo simulator against a frontier-based algorithm and human-driven robots.
Summarizing, S-AVE is a novel technique that introduces a new paradigm in map exploration. S-AVE explicitly reasons about the objects in the environment and their semantic labels to improve its 3D reconstruction. The contributions to the state of the art are twofold: (1) it represents a novel technique for object-centered map exploration; and (2) it generates better semantic maps to support and improve robot reasoning and task execution. Moreover, it is important to highlight that this work contributes to the community by providing a ready-to-use simulation environment for map exploration and a first baseline for future benchmarking. We provide five simulated environments as free-to-download packages which can be easily used to integrate new exploration strategies and to benchmark semantic mapping approaches (https://github.com/JoseJaramillo/lucrezio_simulation_environments). We additionally release the source code of S-AVE (https://github.com/JoseJaramillo/lucrezio_semantic_explorer), as well as the data collected during our user study (https://github.com/JoseJaramillo/S-AvE_ExperimentalEval).

2 Semantic Active-Vision Exploration

Most of the approaches to active vision focus on evaluating a set of candidate poses and ranking them by expected information gain [11]. To reduce the search space of the theoretically infinite number of poses, [10] introduced the use of tessellation for discretizing the search space. In this case, a candidate pose is generated at each vertex of a tessellated sphere, directed towards its center. This method has been widely adopted in the literature [3, 5]. Since these studies have been designed to model similarly sized objects located at the center of a platform, the radius and center of the tessellated sphere can be kept constant, as the object is placed on a platform of limited size. In order to use the same algorithms in an open environment, we generate the candidate poses with a tessellated sphere (or circle, depending on the agent's reachability) for each encountered object.

Hence, the first phase of S-AVE consists in extracting, for each object in the current view, its size and centroid. We rely on an RGB-D camera and an ideal object detector to acquire an object-segmented point cloud. Then, the portion of the point cloud corresponding to the closest object is used to compute the size of its bounding box and its centroid. To compute the candidate poses, S-AVE generates a sphere, in the robot Cartesian space, that delimits the object of interest. The radius R' of such a sphere is computed as R' = sqrt(X'^2 + Y'^2), where X' = X_max + offset + clearance and Y' = Y_max + offset + clearance. Fig. 1(a) provides a visual illustration of this equation. X_max and Y_max are the coordinates of the maximum 2D point of the bounding box (green), offset is used to allow generating unknown volumes outside of the bounding box limits, and clearance is a parameter used so that the generated next-best-view candidates are separated from the object (blue). However, due to a mismatch between the actual centroid and the size of the object, such an estimate can easily prove inaccurate.

Fig. 1. S-AVE adjusts the estimated radius of objects in accordance with an object ontology: (a) top view of an object with its next-best-view candidate poses; (b) resulting object radius.
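To make the initial, purely geometric estimate described above concrete, the following is a minimal sketch assuming a 2D tessellated circle around the object and bounding-box coordinates expressed relative to the point-cloud centroid; the function names, parameter values and the specific form of the radius formula are illustrative reconstructions, not taken from the released S-AVE code.

```python
import numpy as np

def initial_radius(x_max, y_max, offset=0.3, clearance=0.5):
    """Purely geometric radius estimate R' from the visible bounding box.

    x_max, y_max: coordinates of the maximum 2D point of the detected
    bounding box, relative to the point-cloud centroid. offset enlarges the
    unknown volume beyond the box; clearance keeps candidate poses away
    from the object itself.
    """
    x_p = x_max + offset + clearance
    y_p = y_max + offset + clearance
    return np.hypot(x_p, y_p)

def candidate_poses(centroid, radius, n_vertices=12):
    """Candidate next-best-view poses on a tessellated circle around the
    object: each pose lies on a vertex and faces the estimated centroid."""
    poses = []
    for k in range(n_vertices):
        angle = 2.0 * np.pi * k / n_vertices
        position = centroid + radius * np.array([np.cos(angle), np.sin(angle)])
        heading = angle + np.pi  # look back towards the centroid
        poses.append((position, heading))
    return poses

# Example: a detected object whose bounding box extends 0.4 m x 0.6 m
# from its centroid, observed at (2.0, 1.0) in the robot frame.
R_prime = initial_radius(0.4, 0.6)
poses = candidate_poses(np.array([2.0, 1.0]), R_prime)
```

Since the estimate relies only on the currently visible portion of the object, the resulting circle can be misplaced or too small, which motivates the semantic adjustment described next.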
To overcome this limitation, we exploit semantic information in the process of generating candidate poses, and specifically for adjusting the radius and the center of the sphere. The main idea consists in first extracting the missing object parameters (i.e., bounding box size and centroid of the object) from the current view, and then searching a knowledge base of object categories for the object's typical size. Then, S-AVE adjusts the object parameters in accordance with a basic decision tree: if the observed object dimensions are smaller than expected, the parameters are tuned accordingly; otherwise, the detected size and estimated centroid are preserved. The typical size of objects can be determined using several sources; in our case, we simply compute the mean size among those provided by multiple online furniture stores.

It is important to notice that, in case the observed height of the object is much less than its typical height, the robot can compute the next view in a position from which the field of view can fully cover the expected height. To this end, we compute the minimum distance to cover the object with the camera's field of view as Dmin = max(Dmin_top, Dmin_bottom), where Dmin_top = (H - H')/tan(β + θ) and Dmin_bottom = H'/tan(β + θ). The subscripts top and bottom refer to the minimum distance needed to cover the top and bottom of the object with the camera's field of view, H is the typical height of the object, H' is the camera height, θ is the camera pitch, and β = α/2, where α is the camera's vertical viewing angle. Hence, we can recover the radius R as R = d + max(X, Y)/2, where d = min(Dmin, Dmax) and X, Y are the planar object sizes found in the database of typical object sizes. For big objects or tilted cameras, the computed minimum distance can be extremely large or even infinite; hence, a maximum distance Dmax is set according to the environment.

Then, to approximate the extracted centroid of the point cloud to the true centroid of the object, we compute the adjusted center along the direction formed by the camera center and the point cloud's centroid, as shown in Fig. 1(b). The position of the center δ is computed as δ = σ + l [cos(α), sin(α)]^T, where α is the angle formed from the camera center σ to the point cloud's centroid ε, computed as α = atan2(ε.Y - σ.Y, ε.X - σ.X). Here, l is the length from the camera center σ to the object center δ, computed as l = R - R' + d', where d' is the distance from the camera center to the point cloud's centroid, and R' is the radius previously computed. A minimal sketch of this adjustment is given at the end of this section.

Finally, in order to get the goal pose, we evaluate the candidate poses by an active vision method. In this work, we use the unobserved voxel volumetric information introduced in [3], which returns the set of candidate poses ranked by information gain. Hence, by evaluating the poses in rank order, the next-best-view is the first reachable pose by the robot.
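As referenced above, here is a minimal sketch of the distance and center adjustment, assuming a planar setting and a small hand-written table of typical object sizes; all names (TYPICAL_SIZES, min_view_distance, adjusted_center) and the example numbers are illustrative and not part of the released S-AVE implementation.

```python
import math

# Hypothetical knowledge base of typical object sizes (X, Y, H) in metres,
# e.g. averaged from online furniture catalogues.
TYPICAL_SIZES = {"chair": (0.5, 0.5, 0.9), "fridge": (0.7, 0.7, 1.8)}

def min_view_distance(H, H_cam, pitch, v_fov, D_max):
    """Minimum distance at which the camera's vertical field of view covers
    the expected object height H (camera at height H_cam, pitch `pitch`)."""
    beta = v_fov / 2.0
    d_top = (H - H_cam) / math.tan(beta + pitch)
    d_bottom = H_cam / math.tan(beta + pitch)
    d_min = max(d_top, d_bottom)
    return min(d_min, D_max)  # cap for big objects or strongly tilted cameras

def adjusted_center(sigma, epsilon, R, R_prime, d_prime):
    """Shift the point-cloud centroid `epsilon` along the camera-to-centroid
    direction so that it better approximates the true object center."""
    alpha = math.atan2(epsilon[1] - sigma[1], epsilon[0] - sigma[0])
    l = R - R_prime + d_prime
    return (sigma[0] + l * math.cos(alpha), sigma[1] + l * math.sin(alpha))

# Example: a partially observed "fridge" seen by a camera 1.0 m above the
# ground, pitched 10 degrees, with a 45-degree vertical field of view.
X, Y, H = TYPICAL_SIZES["fridge"]
d = min_view_distance(H, H_cam=1.0, pitch=math.radians(10),
                      v_fov=math.radians(45), D_max=4.0)
R = d + max(X, Y) / 2.0                      # adjusted sphere radius
delta = adjusted_center(sigma=(0.0, 0.0), epsilon=(1.5, 0.2),
                        R=R, R_prime=0.9, d_prime=1.51)
```

The adjusted radius R and center delta then replace the purely geometric estimate when generating and ranking the candidate next-best-view poses.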
3 Experimental Evaluation

S-AVE is evaluated in five different scenarios, each of which is a replica of a real-world environment – Fig. 2(f) compares a real environment (top) and its simulated world (bottom). Moreover, in order to validate the performance improvement of our approach, we include in the experimental evaluation two baseline methods: a frontier-based exploration (FE) technique [4], and the results of a user study (Human) in which we let humans drive the robot to explore the environment. In particular, FE aims at showing the improvement of our solution with respect to standard map exploration techniques, in which no semantic knowledge is exploited, while Human is used to define an interval on the spectrum of possible strategies. In fact, as observed during the user study, humans rely on spatial semantic knowledge, perform differently from standard FE methods, and disregard metric information. We also include in the evaluation a stripped-down version of S-AVE which only considers the detected object size to drive the exploration algorithm; we refer to this variant as AVE.

To assess S-AVE performance we rely on the object reconstruction index (ORI), which is computed against the ground truth of the environments, containing the true number of objects as well as the objects' shapes and dimensions as voxels. We compute the ORI index as ORI = (1/N) Σ_{n=1}^{N} pv(n)/V_n, where n denotes the n-th object, N is the total number of objects, V_n is the total number of voxels representing n, and pv(·) is a function that returns the number of perceived voxels of n. A minimal sketch of this computation is given at the end of this section.

Each environment has a different structure, topology and a varying number of objects: the five environments contain 7, 12, 17, 18 and again 18 objects, respectively. All objects are common indoor objects such as chairs, desks, cabinets, tables, fridges and similar furniture. To evaluate our solution, in each of these environments we deploy a simulated differential-drive wheeled robot equipped with a depth camera and a laser range sensor. The robot is configured to execute each of the competing algorithms (one at a time) and to store the evaluation metrics.

The plots in Fig. 2 report the results in reconstructing all the objects of each environment. As can be noticed, S-AVE achieves a better performance in all of them. It is worth mentioning that the last scenario resembles a common apartment composed of three areas with a less predictable structure and characterized by different object categories. It is worth noticing that, in this scenario, AVE is not able to reach its usual performance and is comparable with the FE approach. Conversely, humans perform better and are comparable with S-AVE in reconstructing the objects. Intuitively, the two top-scoring approaches are the ones that explicitly model object semantics and estimate object dimensions to navigate the environment. This suggests that in richer scenarios, only relying on object appearance is not enough to exhaustively explore the environment. Moreover, by looking at the profile of the ORI index, S-AVE shows a better performance with respect to all baselines.

Fig. 2. Object reconstruction index (ORI). Panels (a)-(e) correspond to env 0 to env 4 and report the results obtained in reconstructing the objects in each environment by each of the compared approaches: user-study results (Human, red); frontier-based exploration (FE, orange); active-vision exploration (AVE, green); and semantic active-vision exploration (S-AVE, blue). Panel (f) shows the comparison between a real-world environment and its virtual replica.
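As referenced above, the following is a minimal sketch of the ORI computation, assuming each object is represented as a set of discrete voxel coordinates; the data layout and names are illustrative rather than taken from the released code.

```python
from typing import Dict, Set, Tuple

Voxel = Tuple[int, int, int]

def object_reconstruction_index(ground_truth: Dict[str, Set[Voxel]],
                                perceived: Dict[str, Set[Voxel]]) -> float:
    """ORI = (1/N) * sum_n pv(n) / V_n, where V_n is the number of
    ground-truth voxels of object n and pv(n) the number of those voxels
    actually perceived during exploration."""
    N = len(ground_truth)
    total = 0.0
    for obj_id, gt_voxels in ground_truth.items():
        seen = perceived.get(obj_id, set())
        total += len(seen & gt_voxels) / len(gt_voxels)
    return total / N

# Example with two toy objects discretized on a coarse grid.
gt = {"chair_0": {(0, 0, 0), (0, 0, 1), (0, 1, 0)},
      "table_0": {(3, 3, 0), (3, 4, 0)}}
obs = {"chair_0": {(0, 0, 0), (0, 0, 1)},             # 2/3 reconstructed
       "table_0": {(3, 3, 0), (3, 4, 0), (9, 9, 9)}}  # fully reconstructed
print(object_reconstruction_index(gt, obs))           # 0.833...
```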
4 Conclusion

Our paper introduces a novel approach that integrates active vision and autonomous map exploration, enabling a robot to generate more accurate and complete semantic maps. To alleviate the computational demand of active vision methods, we aim at enhancing S-AVE with generative networks that can be used to infer object dimensions after their classification.

References

1. Bajcsy, R., Aloimonos, Y., Tsotsos, J.K.: Revisiting active perception. Autonomous Robots 42(2), 177–196 (2018). https://doi.org/10.1007/s10514-017-9615-3
2. Blake, A., Yuille, A. (eds.): Active Vision. MIT Press, Cambridge, MA, USA (1993)
3. Delmerico, J., Isler, S., Sabzevari, R., Scaramuzza, D.: A comparison of volumetric information gain metrics for active 3D object reconstruction. Autonomous Robots (2017). https://doi.org/10.1007/s10514-017-9634-0
4. Faria, M., Maza, I., Viguria, A.: Applying frontier cells based exploration and Lazy Theta* path planning over single grid-based world representation for autonomous inspection of large 3D structures with a UAS. Journal of Intelligent & Robotic Systems 93(1-2), 113–133 (2019)
5. de Figueiredo, R.P., Bernardino, A., Santos-Victor, J., Araújo, H.: On the advantages of foveal mechanisms for active stereo systems in visual search tasks. Autonomous Robots 42(2), 459–476 (2018)
6. Gemignani, G., Capobianco, R., Bastianelli, E., Bloisi, D.D., Iocchi, L., Nardi, D.: Living with robots. Robotics and Autonomous Systems 78(C), 1–16 (2016). https://doi.org/10.1016/j.robot.2015.11.001
7. Hidaka, K., Kameyama, N.: Hybrid sensor-based and frontier-based exploration algorithm for autonomous transport vehicle map generation. In: 2018 IEEE 14th International Conference on Automation Science and Engineering (CASE), pp. 994–999. IEEE (2018)
8. Nardi, F.: High-Level Environment Representations for Mobile Robots. Ph.D. thesis, Sapienza University of Rome (2019)
9. Nüchter, A., Hertzberg, J.: Towards semantic maps for mobile robots. Robotics and Autonomous Systems 56(11), 915–926 (2008)
10. Panerai, F.M., Capurro, C., Sandini, G.: Space-variant vision for an active camera mount. In: Visual Information Processing IV, vol. 2488, pp. 284–297. International Society for Optics and Photonics (1995)
11. Pito, R.: A solution to the next best view problem for automated surface acquisition. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(10), 1016–1030 (1999). https://doi.org/10.1109/34.799908
12. Pronobis, A., Riccio, F., Rao, R.P.N.: Deep Spatial Affordance Hierarchy: Spatial knowledge representation for planning in large-scale environments. In: RSS 2017 Workshop on Spatial-Semantic Representations in Robotics. Boston, MA, USA (2017)
13. Taketomi, T., Uchiyama, H., Ikeda, S.: Visual SLAM algorithms: a survey from 2010 to 2016. IPSJ Transactions on Computer Vision and Applications 9(1), 16 (2017)
14. Yamauchi, B.: A frontier-based approach for autonomous exploration. In: CIRA, vol. 97, p. 146 (1997)