Workshop on Multimodal Semantics for Robotic Systems (MuSRobS), IEEE/RSJ International Conference on Intelligent Robots and Systems 2015

Perceptive Parallel Processes Coordinating Geometry and Texture

Marco A. Gutierrez1, Rafael E. Banchs2 and Luis F. D'Haro2

Abstract— Finding and classifying specific objects is a key part of most of the tasks autonomous systems face. Being able to reach objects and find their exact location is very important for successfully achieving higher level robotic behaviors. To perform full object detection and recognition tasks in a wide environment, several perception approaches need to be brought together to achieve good performance. In this paper we present a dual parallel system for object finding in wide environments. Our system has two main parts: a texture based approach for wide scenes, composed of a multimodal deep learning neural network and a syntactic distribution based parser, and a specific geometry based process that uses three dimensional data and geometric constraints to look for specific objects and their position within a whole scene. Both processes run in parallel and complement each other to fulfill an object search and locate task. The major contribution of this paper is the combination of texture and geometry based solutions running in parallel and sharing information in real time, allowing a generic solution able to find almost any object present in a wide environment. To validate our system we test it with real environment data injected into a simulated environment. We test 25 tasks in a household environment, obtaining a 92% overall success rate and finally delivering the correct position of the object.

Fig. 1. Our system combines the best of 2D and 3D data information through two coordinated parallel processes.

*This work was supported by the A*STAR Research Attachment Programme.
1 Marco A. Gutiérrez is with the Robotics and Artificial Vision Laboratory (RoboLab), University of Extremadura, Spain. marcog@unex.es
2 Rafael E. Banchs and Luis F. D'Haro are with the HLT dept., I2R, A*STAR, Singapore. rembanchs@i2r.a-star.edu.sg, luisdhe@i2r.a-star.edu.sg

I. INTRODUCTION

A significant amount of work has been done in scene understanding from 2D images since the beginnings of computer vision research, achieving significant results. Hand-designed features such as SIFT [1], ORB [2] or HOG [3] underpin many of these successful object recognition approaches. They basically capture low-level texture information and have difficulty effectively capturing mid-level cues (like edge intersections) or high-level representations (like different object parts). Recent developments in deep learning based solutions have shown how hierarchies of features can be learned in an unsupervised manner directly from data.
Solutions based on learned features have brought significant improvements in object recognition and detection, some of them reaching success rates of around 90% on different benchmark training/testing sets (e.g. the Pascal VOC Challenge [4]). Recently, even full, semantically well structured image descriptions have been generated by the latest multimodal neural language models [5]. Still, when using 2D based scene understanding, a lot of valuable information about the shape and geometric layout of objects is not considered. Adding geometric information to these solutions could generally improve their results as well as enrich the information they deliver as an output.

On the other hand, 3D model based approaches make it easy to reason about a variety of properties such as volumes, 3D distances and local convexities. Solutions focusing on object shapes and geometric characteristics have also received intense computer vision research attention, especially due to the recent range of inexpensive and fast RGB-D sensors available on the market. 3D features such as FPFH [6] or NARF [7] are examples of robust features that describe the local geometry around points in 3D point cloud datasets. However, 3D solutions have some drawbacks when dealing with heavily cluttered scenes or very general views of the environment.

Although good solutions exist for both image and point cloud based approaches, when it comes to solving tasks in real environments a more generic approach to the problem is needed. Systems that use both 2D image based solutions and 3D geometry aware processes can provide a more general purpose robotics architecture with more reliable and richer information. Our approach combines the rich information obtained from recent multimodal neural network object classification techniques on general 2D image scenes with a 3D geometry, distance and shape aware process (figure 1). This allows us to compensate the drawbacks of each approach with the strengths of the other.

For the evaluation of our model we used a hybrid simulation-real scenario. A simulator tool was used to manage the robot movements around the environment while sensor data was injected into the system from real scenario captures. This allowed us to test our approach with real environment data, since all perception information used as an input for our application comes from real sensors. As a result we obtain quite promising results on the object finding tasks tested.

The remainder of the paper is organized as follows: in section II we provide an overview of some related works. Section III gives a detailed general description of the perception system. Sections IV and V explain more specific details regarding each of the two main processes, the texture aware process and the geometry aware one respectively. Finally, we evaluate the system with an experiment in section VI and give some conclusions and future lines of work in section VII.

Fig. 2. Overview of the architecture of the perception system.

II. RELATED WORKS

There is a wide range of research in the area of scene understanding and object recognition from 2D and point cloud data. With the increased popularity of RGB-D sensors, which bring an easy way to access RGB and depth data at the same time, several researchers have tried combining the two sources of information.

Sensor fusion approaches are the most common ones; they take both sources of data and combine them into one system to improve performance. For instance, in [8] they associate groups of pixels with 3D points into multimodal regions that they call regionlets, and then measure the structure of each regionlet using bottom-up cues from image and range features. This way they are able to determine the scene structure, separating it into meaningful parts and discarding the background clutter. Although they do not rely on any rigid assumptions about the scene like we do (we consider objects to be placed on tables), their output provides a basic structure discovery over a scene with detection of the main objects, while our solution solves a specific object search and locate task in a wider environment.

The machine learning based approaches take features from both depth and color data sources and combine them into one multimodal space in which to perform later searches for a given input. Koppula et al. [9] perform a labeling task on an over-segmented 3D RGBDSLAM sensed scenario. They build a graphical model capturing 2D image information (local color, texture, gradients of interest, etc.) as well as local shape and geometry, and geometrical context (where objects most commonly lie relative to each other). This model uses approximate inference and is trained using a maximum-margin learning approach. They show the benefits of using image and shape together against separate solutions. Also, Lai et al. [10] present an RGB-D Object Dataset and evaluate some object recognition and detection techniques on it. They combine 2D SIFT descriptors with efficient match kernel (EMK) features computed over spin images on a randomly subsampled set of 3D points. These features are then used for the evaluation of three classifiers: a linear support vector machine (LinSVM), a gaussian kernel support vector machine (kSVM) and a random forest (RF). The main difference with these works is that they restrict the search to a certain scene, while our solution provides a framework to find and locate an object in an entire household environment.
The work in [11] combines high-resolution 3D laser scans with 2D images to improve object detection. Their solution relies on a sliding window approach over a combination of visual and depth channels, using those patches to train a classifier. It solves the same problem as the one presented here, although they do not perform any optimization in terms of the path to reach the object, most probably leading to a slower solution for an object search and location task like the one explained here.

Also, in [12] binary logistic classifiers are used on 2D and 3D features. The 2D features are small patches selected from images in a training set. They then compute 3D features from distance-from-robot estimation, surface variation and orientation, and object dimensions. These features are learned by the classifier over a two-split decision for each object class. The difference with our solution is that they learn multimodal models per object, while here the RGB and point cloud data are used by two different processes and the outcomes are combined in a final solution.

III. THE PERCEPTION SYSTEM

As shown in figure 2, the system's architecture has a control manager for decisions and mediation between the two parallel perceptive processes. This manager takes care of the information shared between both processes and delivers notifications according to it.

The texture aware perceptive process (shown in green in figure 2) exploits 2D image information. It runs a multimodal neural language model as described in [13], along with a syntactic frequency distribution based parser to process and evaluate the neural network output. The second one, the geometry aware perceptive process, exploits the geometric features of the environment. It takes care of two main tasks: looking for tables in the point cloud data, and segmenting tabletop setups, recognizing the object and estimating its position through a shape and position aware histogram based feature matching system.

Figure 3 describes the states and processes of the system for a given object search and locate task. Green tasks are performed by the texture aware perceptive process while the ones in red belong to the geometry aware perceptive process. A restriction the system assumes is that objects are placed on tables. A simple task is given to the robot in the form "Look for the OBJECT", and the object name is extracted and passed to the texture aware perceptive process for the "look for places to visit" step. A list of generic images for each available place is stored in our database and evaluated by this perceptive process. A frequency-of-appearance histogram of possible objects is built for each place. Places to visit are then ordered according to the highest appearance of the target object's label in this output. Places with no appearances of the object are left to visit last and ordered randomly.

Once the list of places to visit is ready, the robot visits them in order. When the first place is reached, both processes start to work in parallel on the required object. The texture aware perceptive process provides a frequency distribution of objects in images taken from the current place, while the geometry aware one starts looking for tables in the scene point cloud data. If the object is found in a scene image and a table has been detected, the robot starts moving towards the table. Once a table is reached and the texture aware perceptive process keeps validating the appearance of the object in the scene, a tabletop segmentation process is started by the geometry aware perceptive process in order to segment, recognize and locate the object.

If no object seems to be present when the tabletop segmentation is performed, the robot continues with the next table, or with the next place in the list if no more tables are available in the current place. We only conclude that we cannot find an object once all places have been visited and no object has been found.

Fig. 3. Flow of the different states and processes of the system.
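The flow in figure 3 can be summarised as a place-ordering step followed by a visit loop. The following is a minimal Python sketch of that control flow; every helper and data structure in it (place_histograms, robot, detect_object_in_images, find_tables, segment_tabletop) is a hypothetical placeholder for the corresponding components, not the actual interfaces of our implementation.

```python
import random

def order_places(target, places, place_histograms):
    # Rank places by how often the target label appears in the
    # object-frequency histogram built from that place's generic images;
    # places where it never appears go last, in random order.
    seen = [p for p in places if place_histograms[p].get(target, 0) > 0]
    unseen = [p for p in places if place_histograms[p].get(target, 0) == 0]
    seen.sort(key=lambda p: place_histograms[p][target], reverse=True)
    random.shuffle(unseen)
    return seen + unseen

def search_and_locate(target, places, place_histograms, robot):
    for place in order_places(target, places, place_histograms):
        robot.go_to(place)
        # In the real system both perceptive processes run in parallel;
        # they are called sequentially here only for readability.
        if not detect_object_in_images(target, robot.rgb_images()):
            continue                      # object not seen here, try next place
        for table in find_tables(robot.point_cloud()):
            robot.go_to(table)
            for obj in segment_tabletop(robot.point_cloud()):
                if obj.label == target:
                    return obj.pose       # success: deliver estimated position
    return None                           # every place visited, nothing found
```

The ordering function corresponds to the "look for places to visit" state, and the inner loops to the table approach and tabletop segmentation states of figure 3.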
IV. TEXTURE AWARE PERCEPTIVE PROCESS

This texture based perceptive process is intended to get quick scene labeling from wide overviews of the environment. It contains a previously trained multimodal neural model that outputs image descriptions. Then, taking into account the top nearest descriptions in the model, a parser extracts the object candidates and builds a frequency distribution histogram of the appearances of these object class names. This frequency distribution histogram helps obtain an output that is more robust against false positives, as the objects that are actually present in the scene tend to keep appearing with higher frequency over time in the sentences, while the false positives usually have a much lower frequency.

A. Multimodal neural model

As previously mentioned, the multimodal neural model follows the structure in [13]. This is a neural model pipeline that learns multimodal representations of images and text. The pipeline uses a long short-term memory (LSTM) [14] recurrent neural network for encoding sentences. We use a convolutional network architecture provided by the Toronto Convnet [15] to extract 4096 dimensional image features for the neural model. These image features are then projected into the embedding space of the LSTM hidden states. A pairwise ranking loss is minimized in order to learn to rank images and their descriptions. For decoding, the structure-content neural language model (SC-NLM) disentangles the structure of a sentence from its content, conditioned on distributed representations produced by the encoder. Finally, the output is generated by sampling the top image descriptions from the SC-NLM.
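As an illustration of this kind of ranking objective, the numpy sketch below computes a margin-based pairwise ranking loss over a toy mini-batch of matched image and sentence embeddings; the margin value, batch size and embedding dimension are arbitrary choices for the example and are not the settings used in [13].

```python
import numpy as np

def pairwise_ranking_loss(img_emb, sen_emb, margin=0.2):
    # img_emb, sen_emb: (batch, dim) L2-normalised embeddings in the joint
    # space; row i of img_emb is the image matching sentence i of sen_emb.
    scores = img_emb @ sen_emb.T            # cosine similarities, (batch, batch)
    positives = np.diag(scores)             # similarity of the correct pairs
    # Hinge cost for ranking a wrong sentence above the correct one (cost_s)
    # and a wrong image above the correct one (cost_i).
    cost_s = np.maximum(0.0, margin - positives[:, None] + scores)
    cost_i = np.maximum(0.0, margin - positives[None, :] + scores)
    np.fill_diagonal(cost_s, 0.0)
    np.fill_diagonal(cost_i, 0.0)
    return cost_s.sum() + cost_i.sum()

# Toy usage with random, normalised embeddings (3 pairs, 4 dimensions).
rng = np.random.default_rng(0)
img = rng.normal(size=(3, 4)); img /= np.linalg.norm(img, axis=1, keepdims=True)
sen = rng.normal(size=(3, 4)); sen /= np.linalg.norm(sen, axis=1, keepdims=True)
print(pairwise_ranking_loss(img, sen))
```

Minimizing a loss of this form pushes each image closer to its own description than to any other description in the batch (and vice versa), which is what later allows candidate captions to be ranked for a new image.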
B. Syntactic frequency distribution parser

After the system obtains the top generated scene descriptions, it extracts potential object classes from them using a syntactic parser. Using the Natural Language Toolkit [16] we syntactically analyze the sentences to extract object candidates that could be present in the image. A frequency distribution histogram is computed over these object candidates. This histogram is then used to evaluate the belief that an object is present in a scene, allowing us to compare different scenes according to the probability of finding an object there and therefore to discriminate possible false positives.
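A minimal sketch of this extraction step with NLTK is shown below; the example captions are invented, and keeping every noun as an object candidate is a simplification of the parsing actually performed (recent NLTK versions also use a perceptron tagger by default rather than the maximum-entropy Treebank tagger mentioned in section VI).

```python
import nltk
from nltk import FreqDist

# The tokenizer and POS tagger models may need to be downloaded once:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

captions = [
    "a cup and a laptop sitting on a wooden table",
    "a laptop computer on a desk next to a cup",
    "a bottle on a table in a kitchen",
]

candidates = []
for sentence in captions:
    tokens = nltk.word_tokenize(sentence)
    for word, tag in nltk.pos_tag(tokens):
        if tag.startswith("NN"):          # keep nouns as object candidates
            candidates.append(word.lower())

freq = FreqDist(candidates)
print(freq.most_common(5))                # e.g. [('table', 2), ('cup', 2), ...]
```

Accumulating the distribution over several frames, as the system does, makes the labels of objects that are really present dominate the histogram while spurious words stay in the tail.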
V. GEOMETRY AWARE PERCEPTIVE PROCESS

This process exploits the geometry present in the environment to extract a wide variety of information. For our approach we have restricted the task of finding objects to objects placed on top of tables. Therefore this process performs two main tasks: one is looking for tables in broad scenes, and the other consists of a tabletop segmentation with shape based object recognition and pose estimation.

Fig. 4. Tabletop segmentation and object recognition pipeline using point cloud data.

A. Looking for tables

We describe tables as planes that are parallel to the floor and found at a height between 40 and 110 centimeters. We use RANdom SAmple Consensus (RANSAC) [17] for plane model fitting in the scene point cloud data, after first downsampling it to 1 cm. Using this algorithm we recursively look for planes matching the previously mentioned constraints and label them as tables.
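The table search can be illustrated with the Open3D library; our implementation uses other components, so this is only a functionally similar sketch. The 2 cm RANSAC inlier threshold, the minimum cloud size, the Open3D 0.10+ API and the assumption that the z axis points up are choices made for the example, while the 1 cm downsampling and the 40-110 cm height band come from the text.

```python
import numpy as np
import open3d as o3d

def find_table_planes(pcd, max_planes=5):
    # Recursively fit planes with RANSAC and keep the horizontal ones whose
    # height falls in the 40-110 cm band (assuming the z axis points up).
    cloud = pcd.voxel_down_sample(voxel_size=0.01)       # 1 cm downsampling
    tables = []
    for _ in range(max_planes):
        if len(cloud.points) < 500:                      # arbitrary stop criterion
            break
        plane, inliers = cloud.segment_plane(distance_threshold=0.02,
                                             ransac_n=3,
                                             num_iterations=1000)
        a, b, c, d = plane                               # a*x + b*y + c*z + d = 0
        normal = np.array([a, b, c])
        normal /= np.linalg.norm(normal)
        height = -d / c if abs(c) > 1e-6 else None       # z of a horizontal plane
        if abs(normal[2]) > 0.95 and height is not None and 0.40 < height < 1.10:
            tables.append(cloud.select_by_index(inliers))
        cloud = cloud.select_by_index(inliers, invert=True)  # drop plane, keep looking
    return tables

# Example: tables = find_table_planes(o3d.io.read_point_cloud("scene.pcd"))
```

Each accepted plane is removed from the cloud before the next RANSAC run, which is what makes the search recursive over the remaining points.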
B. Object recognition and pose estimation

The tabletop segmentation is used when a table is approached, in order to recognize the objects on top of it as well as to estimate their final position.

In the first part, shown in figure 4.b, the RANSAC algorithm provides us with the plane equation and the points that match that equation. Since RANSAC uses a threshold to deal with sensor noise, points matching the model are not on a perfect plane but within a certain range, so we first project these points onto the plane equation to obtain a perfect plane point cloud. Then we obtain the convex hull of this plane point cloud and build a bounding box on top of it up to a certain height. Points within the bounding box are then considered to correspond to objects sitting on top of the table. A Euclidean clustering extraction is then performed to segment the point clouds of the object candidates.

As the next step (figure 4.c), we compute the Viewpoint Feature Histograms (VFH) [18] of these point clouds and look for the nearest match in our database. For this database we have previously computed VFHs of single views of objects. These VFHs are stored and retrieved through fast approximate K-Nearest Neighbors (KNN) searches using kd-trees [19]. The construction of the tree and the search for the nearest neighbors place an equal weight on each histogram bin of the VFH and spin image features.

Finally, the system checks whether any of the labels of the recognized objects corresponds to the one we are looking for (see figure 4.d) and reports a success or not.
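The nearest-neighbour lookup can be sketched with an exact kd-tree from scipy, standing in for the approximate FLANN-style search of [19]. The database file names, their layout (one 308-bin VFH signature per stored object view, the usual VFH size in PCL) and the majority vote over the k nearest views are assumptions made only for this example.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical database layout: one row per stored object view (a VFH
# signature) plus the label of the object that view belongs to.
db_vfh = np.load("vfh_database.npy")      # shape (n_views, 308), assumed file
db_labels = np.load("vfh_labels.npy")     # shape (n_views,),     assumed file

tree = cKDTree(db_vfh)                    # equal weight on every histogram bin

def recognize(cluster_vfh, k=5):
    # Majority vote among the k nearest stored views of the cluster's VFH.
    dists, idx = tree.query(cluster_vfh, k=k)
    labels, counts = np.unique(db_labels[idx], return_counts=True)
    return labels[np.argmax(counts)], dists[0]

# label, best_distance = recognize(vfh_of_segmented_cluster)
```

A large best-match distance could additionally be used to reject clusters that are not similar enough to anything stored in the database.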
VI. EXPERIMENT

We perform several experiments sending the robot to retrieve different objects in a wide household environment. For the experiment a hybrid simulator-real data environment has been used. We used the simulator for the robot movements between places, while sensor data was acquired with real RGB and RGB-D cameras (e.g. the tabletop shown in figure 5) and matched to the specific locations on the virtual plane. When the robot needs to move around, the simulator takes care of it; once certain positions in the map are reached, the previously obtained real data is injected and used as input for the algorithms. The robot always starts at the entrance of the apartment and from there follows the most optimal path to find the object, delivering its estimated position as the final result.

Fig. 5. Example of one of the tabletop setups used in the experiment.

A. System setup

The LSTM encoder and SC-NLM decoder of the multimodal neural model have been trained using a combination of the Flickr30k dataset [20] and the Microsoft COCO dataset [21]. The 4096 dimensional image features for the multimodal neural model training are extracted using the Toronto Convnet with their provided models. The frequency histogram is built using the NLTK toolbox on the top 5 generated sentences over at least 5 frames, to achieve robustness on the objects observed. This NLTK tagging and syntactic analysis is performed using the Treebank Part of Speech Tagger (maximum entropy) they have available. For the representation of the rooms in the house, 5 generic images of different parts of a house are used for each of the places in the house: entrance, room, kitchen, living room, bathroom, patio and bedroom. These images have been selected so that they contain the usual set of items present in those rooms. For the point cloud analysis, a kd-tree stores 3729 VFHs from different views of 75 different objects. The whole system is developed using the RoboComp robotics framework [22] and the simulation is performed in a virtual scenario using the RoboComp simulator tool. See figure 6 for an overview of the simulation environment.

TABLE I
SUCCESS RATES ON THE DIFFERENT PARTS OF THE ALGORITHM (COUNTS ARE OVER 5 RUNS PER OBJECT)

Object        | 1. Places to visit ordering | 2. False negatives | 3. False positives | 4. Success rate
Cereal box    | 5                           | 0                  | 0                  | 100%
Cup           | 5                           | 0                  | 1                  | 80%
Bottle        | 5                           | 0                  | 1                  | 80%
Laptop        | 5                           | 0                  | 0                  | 100%
Monitor       | 5                           | 0                  | 0                  | 100%
Overall rate  | 100%                        | 0%                 | 8%                 | 92%

Fig. 6. An overview of the simulation household environment. The rooms are labeled as follows: 1.- Entrance, 2.- Living room, 3.- Patio, 4.- Bathroom, 5.- Hallway, 6.- Kitchen, 7.- Bedroom. Circled in yellow is the robot at its starting point.

B. Results on the experiments

We run 5 different tasks 5 times each and collect the results in table I. First we measure whether the ordering of places to visit after the "Look for places to visit" step in our system was optimal (see figure 3 for details). This turned out to work perfectly for all of our test cases, basically because some of the description pictures of the places contained those items and the texture aware perceptive process was able to detect them. It is important for this step to select a good range of images representing the different places to visit (see figure 6), especially images that clearly show the objects one can usually find in those places.

Then we count the false negative occurrences, that is, when the search finishes and no object was found. In our testing this never happened and an object was always found. However, we obtained two false positives, when the system mistook a cup for a bottle and when a bottle was mistaken for a bottle of glue. Those mistakes are basically due to the similarity in the shapes of these objects. We could avoid this in the future by reinforcing this step with other object features, especially since the objects to be found were actually present on the table being segmented at the time. The final success rate in obtaining the proper location of the object and its pose estimation is quite high, which is promising for further real applications of the system.
VII. CONCLUSIONS AND FUTURE WORK

We presented a hybrid perception system that combines 2D data based solutions and approaches using point clouds, running in parallel and sharing information in real time in order to achieve an object finding task. The system is able to successfully predict a route through the places with higher probability of containing the target object. We obtained a high rate of success in our experiments, with only two false positives among all our test cases.

An interesting future work would be to perform further testing with a wider range of objects. This could help find weak points of the system that we might not have found yet and that would be worth strengthening with more interaction between the processes. In the same line, and although the sensor data used in the testing was taken from real sensors, integrating the solution with a real robot could bring a more accurate overview of how the system performs in real environments.

The false positives obtained during the experiments are mainly due to a bad performance of the geometry aware perceptive process. Since similarity in the shape of different objects confuses the VFH search, exploiting texture based features in this last step could most probably benefit the final output of the whole system. Also, since we are using a Euclidean clustering extraction method for the objects on top of the table, our system cannot deal with heavily cluttered scenes or objects touching each other. Adding alternatives to the segmentation process could help cover a more varied range of scenarios. It would also be desirable to avoid the assumption that objects are always on tables, so we should look into new ways of scene segmentation to improve this step.

Finally, adding a learning process to the system would be an interesting enhancement. Both parallel processes could complement each other, correcting each other's mistakes and providing the fixed mistake as a new source of learning, leading to improvements in the following overall system performance.

REFERENCES

[1] D.G. Lowe. Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1150–1157, 1999.
[2] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, ICCV '11, pages 2564–2571, Washington, DC, USA, 2011. IEEE Computer Society.
[3] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Cordelia Schmid, Stefano Soatto, and Carlo Tomasi, editors, International Conference on Computer Vision & Pattern Recognition, volume 2, pages 886–893, INRIA Rhône-Alpes, Montbonnot, June 2005.
[4] Mark Everingham, S.M. Ali Eslami, Luc Van Gool, Christopher K.I. Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.
[5] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. CoRR, abs/1411.4555, 2014.
[6] R.B. Rusu, N. Blodow, and M. Beetz. Fast point feature histograms (FPFH) for 3D registration. In Robotics and Automation, 2009. ICRA '09. IEEE International Conference on, pages 3212–3217, May 2009.
[7] Bastian Steder, Radu Bogdan Rusu, Kurt Konolige, and Wolfram Burgard. NARF: 3D range image features for object recognition. In Workshop on Defining and Solving Realistic Perception Problems in Personal Robotics, IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2010.
[8] Alvaro Collet, Siddhartha S. Srinivasa, and Martial Hebert. Structure discovery in multi-modal data: A region-based approach. In ICRA, pages 5695–5702. IEEE, 2011.
[9] Hema S. Koppula, Abhishek Anand, Thorsten Joachims, and Ashutosh Saxena. Semantic labeling of 3D point clouds for indoor scenes. In Advances in Neural Information Processing Systems, pages 244–252, 2011.
[10] Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox. A large-scale hierarchical multi-view RGB-D object dataset. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 1817–1824. IEEE, 2011.
[11] M. Quigley, Siddharth Batra, S. Gould, E. Klingbeil, Quoc Le, Ashley Wellman, and A.Y. Ng. High-accuracy 3D sensing for mobile manipulation: Improving object detection and door opening. In Robotics and Automation, 2009. ICRA '09. IEEE International Conference on, pages 2816–2822, May 2009.
[12] Stephen Gould, Paul Baumstarck, Morgan Quigley, Andrew Y. Ng, and Daphne Koller. Integrating visual and range data for robotic object detection. In ECCV Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications (M2SFA2), 2008.
[13] Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. CoRR, abs/1411.2539, 2014.
[14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, November 1997.
[15] Toronto University. Convolutional Neural Nets. https://torontodeeplearning.github.io/convnet/, 2015. [Online; accessed 04-March-2015].
[16] Steven Bird. NLTK: The Natural Language Toolkit. In Proceedings of the COLING/ACL on Interactive Presentation Sessions, COLING-ACL '06, pages 69–72, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics.
[17] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, June 1981.
[18] R.B. Rusu, G. Bradski, R. Thibaux, and J. Hsu. Fast 3D recognition and pose using the viewpoint feature histogram. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 2155–2162, Oct 2010.
[19] Marius Muja and David G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In VISAPP International Conference on Computer Vision Theory and Applications, pages 331–340, 2009.
[20] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
[21] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014.
[22] Marco A. Gutiérrez, A. Romero-Garcés, P. Bustos, and J. Martínez. Progress in RoboComp. Journal of Physical Agents, 7(1), 2013.