=Paper=
{{Paper
|id=Vol-1782/paper_6
|storemode=property
|title=Next Best View Planning for Object Recognition in Mobile Robotics
|pdfUrl=https://ceur-ws.org/Vol-1782/paper_6.pdf
|volume=Vol-1782
|authors=Christopher McGreavy,Lars Kunze,Nick Hawes
|dblpUrl=https://dblp.org/rec/conf/plansig/McGreavyKH16
}}
==Next Best View Planning for Object Recognition in Mobile Robotics==
Christopher McGreavy, Lars Kunze and Nick Hawes
Intelligent Robotics Lab, School of Computer Science, University of Birmingham, United Kingdom
{cam586|l.kunze|n.a.hawes}@cs.bham.ac.uk

Abstract

Recognising objects in everyday human environments is a challenging task for autonomous mobile robots. However, actively planning the views from which an object might be perceived can significantly improve the overall task performance. In this paper we have designed, developed, and evaluated an approach for next best view planning. Our view planning approach is based on online aspect graphs and selects the next best view after having identified an initial object candidate. The approach has two steps. First, we analyse the visibility of the object candidate from a set of candidate views that are reachable by a robot. Secondly, we analyse the visibility of object features by projecting the model of the most likely object into the scene. Experimental results on a mobile robot platform show that our approach is (I) effective at finding a next view that leads to recognition of an object in 82.5% of cases, (II) able to account for visual occlusions in 85% of the trials, and (III) able to disambiguate between objects that share a similar set of features. Hence, overall, we believe that the proposed approach can provide a general methodology that is applicable to a range of tasks beyond object recognition such as inspection, reconstruction, and task outcome classification.

1 Introduction

Autonomous mobile robots that operate in real-world environments are often required to find and retrieve task-related objects to accomplish their tasks. However, perceiving and recognising objects in such environments poses several challenges. Firstly, object locations are continuously changing. That is, a mobile robot cannot simply rely on a fixed set of views from which it can observe an object. The robot needs to plan in a continuous space where to stand and where to look. That is, the robot has to select a view from the uncountably infinite set of possible views which allows it to observe the sought object. Matters are further complicated by dynamic obstacles which might hinder the robot from taking a particular view. Or, when taking a view, relevant features of an object might be occluded by other objects and/or by the object itself (self-occlusion). Finally, other conditions such as lighting and/or sensor noise can influence the performance of object recognition tasks.

In previous work, we have enabled robots to find objects in particular rooms (Kunze et al. 2012) and in relation to other objects (Kunze, Doreswamy, and Hawes 2014). These approaches guide robots to locations from which they can potentially observe an object. In this work, however, we propose a complementary planning approach that selects the next best view after having identified a potential object candidate. Such local, incremental view planning is crucial for two reasons: (1) it allows robots to disambiguate between objects which share a similar set of features, and (2) it improves the overall performance of object recognition tasks as objects are observed and recognized from multiple views.

Our local view planning approach addresses all of the above mentioned challenges. It is based on using a realistic sensor model of an RGB-D camera to generate online aspect graphs (a set of poses around an object which describe object visibility at that point) and takes the kinematic constraints of a mobile robot platform into account. Hence, by changing the sensor and/or the kinematic model our approach can easily be transferred to robot platforms of different types. We further consider two environmental constraints when generating robot and camera poses: (1) dynamic obstacles which might hinder a robot from taking particular views, and (2) occlusions of object features which might be hidden by other objects (or by the object itself). Finally, we consider learned object models in the planning process to predict the location and visibility of features of object candidates. An overview of the next best view planning approach is given in Figure 1.

Figure 1: Next best view planning: conceptual overview. The approach has two steps: (A) an environmental analysis and (B) a model analysis. The analysis of the environment reasons about the visibility of an object and is carried out based on a set of navigable poses. Poses in suitable areas are then carried on into the model analysis (B), in which the visibility of object features is evaluated. Conceptually, the resulting areas from (A) and (B) are combined to find the next best view (C).

Experimental results show that our next best view planning approach, which takes multiple views of an object, allows us to improve the performance of recognition tasks when compared to single-view object recognition. Further, it enables robots to differentiate between objects that share a similar set of features.

To this end, this work contributes the following:
1. a method for analysing potential views using online aspect graphs by taking both into account: (1) occlusions, and (2) the visibility of features based on learned object models;
2. a next best view planner that selects a view based on the method above and an executive that accounts for dynamic obstacles during execution;
3. a set of experiments which demonstrates (a) how robots can disambiguate objects which share similar feature sets, and (b) how the performance of object recognition can be improved by taking multiple views.

The remainder of the paper is structured as follows. We first discuss related work in Section 2, followed by a conceptual overview of our approach in Section 3 and a detailed description of the implementation in Section 4. In Section 5, we present and discuss experimental results, before we conclude with a summary in Section 7.
2 Related Work

Hutchinson et al. (Hutchinson and Kak 1989) used aspect graphs to determine the pose in which most unseen object features were present when digitally reconstructing an object. By storing a geometric model of an object they were able to determine which features of the object could be seen from various poses around it. The sensor would then move to the pose in which the most features were available. However, the geometric analysis accounted for minute edges and ridges which are not necessarily visible to the camera. Aspect graphs were computed offline and used as a lookup table for a mobile sensor. This work does not take into account camera limitations and thus may provide a view which is theoretically optimal, but practically unobtainable. To combat this, the work presented in this paper seeks to model the sensor used in the task. Aspect graphs are also built online to account for the accessibility of the environment.

The approach used by Stampfer et al. (Stampfer, Lutz, and Schlegel 2012) bears most resemblance to the current work. By taking candidates from the initial scene, they selected next best view locations that maximised the probability of recognising objects in the scene from a local object database, which is analogous to this project. But instead of planning next best view locations based on geometric analysis, they used photometric methods to locate specific visual features. A camera mounted on a manipulator arm is used to sequentially move to these feature-rich locations. This, however, relies on the object having colour/contrast differences, bar codes or text, which may not be true of all objects. Photometric analysis has its merits, but geometric feature analysis has also been shown to be effective (Vázquez et al. 2001; Roberts and Marshall 1998). This approach also requires a robot with a manipulator to move around the object and does not account for visual occlusions that may hamper the line of sight of the camera. Although reliable, it requires several sequential views before recognition. In this project we aim to minimise the number of views taken.

Early work on next view planning was based on detecting which parts of an object were causing self-occlusion (Bajcsy 1988; Connelly 1985; Maver and Bajcsy 1993). These methods were effective at obtaining high levels of coverage of an object for digital reconstruction, but large numbers of new views were needed in order to achieve this. Methods in the current work are inspired by these concepts for detecting occlusions in the environment so as to avoid them in the next view.

Okamoto et al. (Okamoto, Milanova, and Bueker 1998) used a model-free approach to move to a precomputed best pose for recognition. A stream of video-like images was taken en route to this location and a recognition was produced based on this stream. This method was able to disambiguate similar-looking objects. However, this solution made no consideration for environmental obstacles and had no backup if the optimal pose was unavailable. In our work, the next view planner plans to avoid environmental obstacles and does not require expensive visual processing methods, but is still able to differentiate between similar objects.

Callari and Ferrie (Callari and Ferrie 2001) sought to use contextual information from a robot's surroundings to identify an object. This is a useful metric, as contextual information has been shown to be useful in directing object search (Hanheide et al. 2010; Kunze et al. 2014). However, for this solution to work, a great deal of prior information is required to influence identification, and it is unlikely to cope well in a novel environment. The current work does take in prior information, but this is limited to a snapshot of the current environment which is used to detect visual occlusions.

Wixson (Wixson 1994) proposed moving at set intervals around a target to achieve high surface coverage of the object within a fixed number of movements with little computational cost. Though this naive approach offers very low computing costs with potentially high information gain, the cost of movement could be huge if recognition did not occur in the first couple of movements. The present work seeks to use current information to limit the number of movements required to identify an object.

Vasquez-Gomez et al. and Stampfer et al. (Vasquez-Gomez, Sucar, and Murrieta-Cid 2014; Stampfer, Lutz, and Schlegel 2012) considered some of the restricting effects of an uncertain environment when digitally reconstructing an object with a camera attached to a multi-joint manipulator on a mobile base. The main contribution of this work to the current study is that it considered not only the placement of the sensor but also that of the mobile base, which was always planned to be placed in open space. These considerations are extended in this project to account for agent placement and visual occlusions.
3 Next Best View Planning

This section provides a conceptual overview of the proposed view planning approach. Implementation details can be found in Section 4.

Figure 2 shows the next best view planner's place in the perception-action cycle of a robot. A hypothesis about an object's identity and its estimated pose are the inputs to the planner. As output, the planner provides the next best view, from which it determines the robot has the best chance of identifying the object.

Figure 2: Illustrates how the next best view planning in this paper fits within the perception-action cycle. Blue boxes represent action/perception stages. Green boxes represent planning stages.

The following briefly describes the view planning process after a candidate identity is received: (I) potential viewing locations are checked for dynamic obstacles on the local cost map. (II) Views that are reachable by the robot are subjected to an environmental analysis to determine if any visual occlusions block the view of the candidate object. (III) Views which survive environmental analysis undergo model analysis to determine the amount of surface area visible from each view point.

Before describing the individual components of the view planning approach in detail, we motivate when next best view planning is initiated and why.

The Need for Local View Planning

In the context of object search tasks, a robot might seek objects in certain rooms (Kunze et al. 2012) or in proximity to other objects (Kunze, Doreswamy, and Hawes 2014). However, objects cannot always be recognized with high confidence from a first view. To verify the identity of an object the robot may have to take an additional view. We have identified the following situations in which one or more additional views would be beneficial:

1. When the recognition service provides a low confidence estimate of an object's identity. In this situation, another view is required to confirm or deny this hypothesis. By moving the camera to a location where more of the object is visible there is a higher chance of obtaining a high confidence identification.
2. When an identification is returned with high confidence, but not high enough to meet other task requirements; another view could lead to a higher confidence. This may be useful for identifying high priority items in a service environment, such as looking for the correct medicine.
3. When more than one candidate identity is returned for the same object. This can occur when the visible features match more than one modelled object. Further views of the object can lead to disambiguation.

When any of these conditions are met, next view planning is initiated (a minimal decision sketch is given below). The components of the approach are described in the following sections.
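The three trigger situations above can be expressed as a simple predicate over the hypotheses returned by the recognition service. The following is a minimal sketch, not code from the paper; the `ObjectHypothesis` fields and the two thresholds are illustrative assumptions.

```python
# Sketch of when next best view planning might be triggered (Section 3).
# Threshold values and the ObjectHypothesis type are assumptions for
# illustration, not values taken from the paper.
from dataclasses import dataclass
from typing import List

@dataclass
class ObjectHypothesis:
    label: str
    confidence: float  # recognition confidence in [0, 1]

def needs_next_view(hypotheses: List[ObjectHypothesis],
                    verify_threshold: float = 0.5,
                    task_threshold: float = 0.8) -> bool:
    """Return True if any of the three situations described above applies."""
    if not hypotheses:
        return False                                   # nothing to verify yet
    best = max(h.confidence for h in hypotheses)
    low_confidence = best < verify_threshold           # situation 1
    below_task_requirement = best < task_threshold     # situation 2
    ambiguous = len(hypotheses) > 1                    # situation 3
    return low_confidence or below_task_requirement or ambiguous
```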
Perception

Streams of images received from the robot's camera are processed to detect candidate objects. Bottom-up perception algorithms are used, making estimates of object identities based on the information available in the scene and no contextual information. Segmented sections of the scene are compared with a model database; any matches between the two are returned with a confidence measure as estimated object identities.

Online Aspect Graph Building

An object cannot be seen in its entirety from one viewpoint, and different viewpoints present different features to a sensor. Aspect graphs are a method of simulating which features may be visible at different viewpoints around an object. A typical aspect graph for next view planning consists of a sphere of poses around a model, and geometric analysis determines which features are visible from each point. In past work, aspect graphs have been computed offline (Cyr and Kimia 2004; Maver and Bajcsy 1993; Hutchinson and Kak 1989), which produces a set of coordinates for a sensor and an associated value denoting the number of visible features at each pose. This is used as a lookup table and requires knowledge of the object being viewed and its pose.

This project instead builds aspect graphs online, in two stages. By moving online, we can account for real-time information about environmental obstructions and their effect on potential new poses. Aspect graphs are generated for both environmental analysis and model analysis, which will be discussed next. The shape of the aspect graphs in this project is governed by the robot's degrees of freedom. The robot used in this paper had 3 DOF (base movement and a pan/tilt unit), so graph nodes were arranged in a disc surrounding the candidate object (a minimal sketch of such a disc of poses is given below).
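As an illustration of how such a disc of aspect-graph nodes could be laid out for a 3-DOF platform, the following sketch places a ring of candidate view poses around the estimated object location, each oriented towards it. The `ViewNode` type, node count and radius are assumptions made here for clarity, not parameters reported in the paper.

```python
# Illustrative layout of aspect-graph nodes for a 3-DOF robot (base x, y
# plus pan/tilt): a disc of poses around the candidate object, each facing it.
import math
from dataclasses import dataclass

@dataclass
class ViewNode:
    x: float
    y: float
    yaw: float        # heading of the base/camera towards the object
    score: float = 0.0  # filled in later by environmental/model analysis

def build_aspect_disc(obj_x: float, obj_y: float,
                      radius: float = 1.0, n_nodes: int = 16):
    """Return a ring of candidate view poses oriented towards the object."""
    nodes = []
    for i in range(n_nodes):
        angle = 2.0 * math.pi * i / n_nodes
        x = obj_x + radius * math.cos(angle)
        y = obj_y + radius * math.sin(angle)
        yaw = math.atan2(obj_y - y, obj_x - x)   # face the object
        nodes.append(ViewNode(x, y, yaw))
    return nodes
```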
Environmental Analysis

After receiving a candidate identity we need to decide which available pose offers the next best view. The first step is to determine whether any part of the environment lies between the sensor and the candidate object, thus creating a visual occlusion. To achieve this, a snapshot of the current environment is taken and converted into a volumetric representation. From various points around the candidate object the sensor (oriented towards the object) is modelled. Within this model, any part of the sensor's field of view which does not reach the estimated position of the object is discarded. This leaves a circle of poses around the object, each containing its respective remaining field of view which allows unobstructed line-of-sight to the object. These remaining parts of the field of view are then carried forward to model analysis.

Model Analysis

Surviving sections of the modelled camera at each pose do not necessarily represent the visibility of the object at that location. In order to establish which view provides the most information about the object, we use the model of the object from the object database along with the surviving sections of the modelled cameras at each pose from the previous step. The object database contains a manipulable model of the target object. This model is rotated to match the estimated pose of the candidate object. From here we again simulate each camera pose from the previous step, using only the surviving fields of view. Each modelled camera at each pose is oriented towards the object model; the proportion of the field of view of each camera which is filled by the object is then saved and represents the visibility of the object from each pose around it.

Next Best View Selection

After environmental and model analysis, each pose is matched with a score which represents its visual coverage of the candidate object. The pose with the highest score is determined to be the next best view; this is sent to the robot's navigation component. Once at the new location the robot will either accept or reject the initial identity estimate, or begin to determine another next best view if the first did not lead to recognition.
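The selection step itself reduces to choosing the surviving pose with the highest visibility score. The sketch below reuses the hypothetical `ViewNode` type from the earlier sketch and assumes the scores have already been filled in by environmental and model analysis.

```python
# Minimal sketch of the next best view selection step: pick the surviving
# pose with the highest visible-coverage score, if any poses remain.
def select_next_best_view(scored_nodes):
    """Return the ViewNode with the highest score, or None if none survive."""
    if not scored_nodes:
        return None
    return max(scored_nodes, key=lambda node: node.score)
```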
4 Implementation

In this section we describe how the view planning approach is integrated with the data structures and algorithms of the robot's perception and control components. Figure 3 provides an overview of the different components and explains the view planning process step by step.

Figure 3: The next best view planning process step-by-step. A: Receive initial view. B: Recognition service processes point cloud. C: Candidate object and pose identified. D: Collision detection around object location. E: Environmental analysis for occlusions (candidate object in pink, occlusions in white and blue). F: Model analysis with remaining rays. G: Move to best view pose. H: New point cloud sent to recognition service and object recognised. I: Point cloud from second view is sent to recognition service. J: Visualisation of recognition service correctly and fully recognising the target object.

Object Recognition

In this work, we build on a state-of-the-art object modelling and object recognition framework (Aldoma et al. 2013; Prankl et al. 2015). Our implementation is based on ROS (http://www.ros.org/), a message-based operating system used for robot control. For object instance recognition, we use an adapted version of a ROS-based recognition service of the above mentioned framework. The service takes an RGB point cloud as input and returns one of the following:

(1) A list of candidate object hypotheses. In case an object's identity cannot be verified, a list of hypotheses is returned. The hypotheses include the object's potential identity and a pose estimate in the form of a 4 × 4 transformation matrix which aligns the object to the point of view of the camera.

(2) A verified object identity. If an object is identified with high confidence, the object's identity, pose and the confidence level are returned.

In Figure 3, the input to the object recognition service, the service itself, and a visualisation of the output are depicted in Steps A, B, and C respectively. In this case, the output is a hypothesis for a candidate object identity (here: a book). The object recognition service is used again after moving to the next best view (see Steps H, I, and J); this time, the outcome is a verified identification of the object.
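The two kinds of service response could be handled as sketched below. The `RecognitionResult` type and its fields are hypothetical stand-ins for the ROS service response of the Aldoma/Prankl framework, which is not reproduced here.

```python
# Sketch of handling the two recognition-service outcomes described above.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class RecognitionResult:
    verified: bool
    identity: Optional[str] = None
    confidence: float = 0.0
    # candidate hypotheses as (label, confidence, 4x4 object-to-camera transform)
    candidates: List[Tuple[str, float, list]] = field(default_factory=list)

def handle_recognition(result: RecognitionResult):
    if result.verified:
        # (2) verified identity: accept it and end the search
        return ("identified", result.identity, result.confidence)
    if result.candidates:
        # (1) candidate hypotheses: plan a next best view for the most
        # confident candidate, using its estimated pose transform
        label, conf, transform = max(result.candidates, key=lambda c: c[1])
        return ("plan_next_view", label, transform)
    return ("no_candidate", None, None)
```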
View Generation

After a candidate identity is received, a series of poses is generated around it. These poses are generated in two uniformly distributed rings of robot poses around the estimated location of the object, each oriented directly towards the location of the object.

The accessibility of each of these poses is assessed by collision checking the views using the local cost map. Any views in which the robot would collide with environmental obstructions are discarded. This is seen in Figure 3 (Step D), where views that make contact with the supporting surface of the object are eliminated. The remaining views are carried forward to be assessed for visual occlusions.
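A simplified version of this collision-checking filter is sketched below. The `CostMap` class is a minimal stand-in for the ROS local costmap, not its real API, and the footprint sampling is an assumption made for illustration.

```python
# Sketch of filtering generated view poses against a local cost map:
# a pose is kept only if the cells at and around the base footprint are free.
class CostMap:
    def __init__(self, occupied_cells, resolution=0.05):
        self.occupied = set(occupied_cells)   # set of (ix, iy) grid indices
        self.resolution = resolution

    def is_free(self, x, y):
        ix, iy = int(x / self.resolution), int(y / self.resolution)
        return (ix, iy) not in self.occupied

def filter_reachable(nodes, cost_map, footprint_radius=0.3):
    """Discard poses whose footprint overlaps obstacles in the cost map."""
    reachable = []
    offsets = [(0, 0), (footprint_radius, 0), (-footprint_radius, 0),
               (0, footprint_radius), (0, -footprint_radius)]
    for node in nodes:
        if all(cost_map.is_free(node.x + dx, node.y + dy) for dx, dy in offsets):
            reachable.append(node)
    return reachable
```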
Environmental Analysis

After a set of accessible views has been generated and tested, environmental analysis is performed on the initial point cloud to assess whether any regions of the environment might occlude the view of the object. In order to do this, the current point cloud is converted into an octree representation using OctoMap (Hornung et al. 2013) (Figure 3, Step E), and an RGB-D camera is then modelled at every view. The sections of each modelled camera that do not make contact with the bounding box of the candidate object are eliminated from further analysis. The sections of each camera model that allow an unobstructed view of the candidate object are then carried forward to the next stage; all other sections are discarded.

Model Analysis

Figure 3 (Step F) shows the model analysis step. By completing the previous two steps, the number of possible view locations and camera fields of view has been reduced. Each surviving part of the modelled camera is simulated and directed towards a model of the candidate object. The remaining sections of each camera are simulated using ray-casting; this computes a line from the origin of the camera out in the direction of the field of view. If the line makes contact with the model, it is considered that the part of the object with which it made contact would be visible to the camera from that viewpoint. After the camera is modelled at each view, we are left with a measure of the amount of the object visible at each location. The view which enables the highest visibility is then considered the next best view.
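The two ray-casting stages can be illustrated with the self-contained sketch below: rays from a candidate camera pose are stepped through a voxelised snapshot of the scene; rays blocked by the environment are discarded, and the remainder are tested against the aligned object model. A plain voxel set stands in for the OctoMap octree used in the paper, and the field of view, ray count and resolution are illustrative values only.

```python
# Simplified environmental + model analysis by ray-casting over voxel sets.
import math

def voxel(p, res=0.05):
    return tuple(int(c / res) for c in p)

def cast_ray(origin, direction, max_range, scene_voxels, object_voxels, res=0.05):
    """Step along a ray; return 'object', 'occluded' or 'miss'."""
    steps = int(max_range / res)
    for s in range(1, steps + 1):
        p = tuple(o + d * s * res for o, d in zip(origin, direction))
        v = voxel(p, res)
        if v in object_voxels:
            return "object"      # ray reaches the aligned object model
        if v in scene_voxels:
            return "occluded"    # ray blocked by the environment snapshot
    return "miss"

def visibility_score(cam_pose, scene_voxels, object_voxels,
                     fov_deg=58.0, n_rays=50, max_range=3.0):
    """Fraction of field-of-view rays that reach the object model."""
    x, y, z, yaw = cam_pose
    half_fov = math.radians(fov_deg) / 2.0
    hits = 0
    for i in range(n_rays):
        a = yaw - half_fov + (2.0 * half_fov) * i / (n_rays - 1)
        direction = (math.cos(a), math.sin(a), 0.0)   # horizontal fan of rays
        if cast_ray((x, y, z), direction, max_range,
                    scene_voxels, object_voxels) == "object":
            hits += 1
    return hits / n_rays
```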
Multi-object Disambiguation

Note that if the recognition service returns more than one candidate hypothesis, aspect graph building is performed for every object, each with its own pose transformation. After this, each is analysed to find the best compromise view: one which gives the best chance of recognising each model. To do this, after calculating the sum of visible surface area for each view, the difference between the per-object scores is subtracted. This ensures that high surface-area visibility of one object from a pose does not dominate over a low score for the same pose on another object.
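One possible reading of that scoring rule, for the two-hypothesis case, is sketched below: the per-object visibility scores for a view are summed and their absolute difference is subtracted, so a view that sees a lot of one model but little of the other is penalised. This is an interpretation of the description above, not the paper's exact formula.

```python
# Sketch of a compromise-view score for two ambiguous candidate objects.
def compromise_score(vis_a: float, vis_b: float) -> float:
    # high combined visibility, penalised by imbalance between the two models
    return (vis_a + vis_b) - abs(vis_a - vis_b)

def best_compromise_view(scored_views):
    """scored_views: list of (view, vis_a, vis_b) tuples; return the best view."""
    view, _, _ = max(scored_views, key=lambda t: compromise_score(t[1], t[2]))
    return view
```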
View Execution: Navigation & Recognition

When the next best view has been found, the pose associated with it is sent to the robot's navigation packages, which manoeuvre the robot to that view (Figure 3, Step G). In this case, the pose consists of a movement of the robot base and an angular movement of the pan/tilt unit to centre the camera on the object's location. Once the goal is reached, input to the recognition service resumes (Figure 3, Step I); the response of the recognition service can denote different things:

1. Verification: if a high confidence estimate of the object's identity is returned, the next view is considered successful and the object considered identified.
2. No candidate: if the recognition service returns no high confidence identity after the next view, it should either be considered successful in dispelling a false hypothesis from the first view (true negative) or unsuccessful, as it has lowered the amount of information available to the recognition service to deny further identification.

Depending on the result of the movement, the object search will end, another view is taken, or the candidate will be discarded and the search continued (see the executive sketch below). This gives a detailed overview of the process this next view selection undergoes in order to produce a next best view pose. The components of this system will now be assessed and the results examined.
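The overall executive loop could look roughly as follows. The callables are placeholders for the components sketched earlier (recognition, planning, navigation), and the two-view limit mirrors the experimental protocol reported in Section 5; none of this is the paper's actual executive code.

```python
# Sketch of the view-execution loop: recognise, and if needed plan a next
# best view, move, and recognise again, up to a fixed number of views.
def run_next_view_search(recognise, plan_next_view, move_to, max_views=2):
    result = recognise()                      # RecognitionResult-like object
    for _ in range(max_views):
        if result.verified:
            return ("identified", result.identity)    # verification
        if not result.candidates:
            return ("candidate_discarded", None)      # no candidate left
        goal = plan_next_view(result)                 # next best view pose
        move_to(goal)
        result = recognise()
    if result.verified:
        return ("identified", result.identity)
    return ("not_recognised", None)
```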
5 Experimentation and Evaluation

Experiments were conducted to test the capabilities of this next best view planner. All experiments were carried out on a Scitos G5 robot equipped with a pan/tilt unit. In each experiment the robot was located in an open area with few obstacles. The centre of this area contained a tall, thin plinth, the plateau of which was just below the robot's camera height. Test objects and obstacles were placed on this supporting table and the robot was able to move around to reach a new view. Each experiment used its own specific set of target objects. Details of the set-up of individual experiments are given below.

Experiment 1: Non-obstructed Next View Selection

Set-up: The primary function of this planner is to take in hypotheses about potential objects and select the best pose to enable their accurate confirmation or rejection. This function was tested during this experiment. No obstacles or occlusions were used apart from the object's supporting plane. 40 trials were carried out in total, 20 each on two different objects: a large book and a standard mug. In all cases the robot was initially positioned with a view of the object which gave a low chance of recognition. Success in these trials was defined by the ability of the recognition service to make a high confidence identification of the target object after the next best view planner selected a new pose and the robot had moved. In each trial up to two next best view locations were permitted.

Results: Table 1 shows the results of this experiment. The table shows a successful verification rate of 82.5%, meaning that in most cases the planner was able to take an uncertain hypothesis and verify it by moving to a new location. In 10% of trials, two next view poses were taken, but no recognition was achieved.

Table 1: Results of trials when presented with a single object and no obstruction (Experiment 1)
  Result                  # Trials   Percentage
  Moved, recognised       33         82.5%
  Moved, no recognition   4          10.0%
  No initial candidate    3          7.5%
  Total                   40         100.0%

Discussion: Failure to provide a verified hypothesis in 10% of cases can be attributed to the pose estimation provided by the recognition service. If the view of the candidate in the initial view is of very low quality, pose estimation can be inaccurate; this leads to poorly aligned candidate poses and thus inaccurate movement. In some cases this was compensated for: of the 33 successful trials, 7 required two views before high confidence recognition was achieved. The initial inaccurate pose estimation led to poor next view selection; however, the new view presented better quality information about the object, to a point where the pose estimation improved and the subsequent view resulted in recognition. Taking pose estimates from low confidence identity hypotheses inherently carries the risk of inaccurate pose estimation, so this is to be expected. Circled poses in Figure 4 show the final positions of the 10% of trials in which no recognition could be made after two views, which can be attributed to a succession of inaccurate pose estimations. The red pose arrows in Figure 4 tend to cluster in areas with a view of a large surface area of the object. For the book this was a view from which both the spine and cover were visible, and for the mug one where the handle and body were visible. This demonstrates that, given a reasonably accurate initial pose estimation of the object, the next best view planner is able to locate the view which presents one of the largest faces of the object to the camera, and that doing this leads to reliable recognition. This shows that aspect graphs can be computed online to yield high recognition accuracy and do not require excessive analysis.

Figure 4: Graphical representation of selected next best view poses for a book and mug in Experiment 1. Poses surrounded by black rings did not lead to recognition.

Experiment 2: Confirmation Views

Set-up: After making a high confidence identification of an object, it may be useful to take another view from a different position, as two high confidence identifications provide more certainty than one. This experiment followed the same procedure as Experiment 1, except for the starting position of the robot, which was positioned to allow a high confidence identification. Success in this experiment was measured firstly on whether the next view led to another high confidence recognition and secondly on how much identification confidence increased from the first pose to the next.

Results: Table 2 shows the descriptive statistics for the second experiment. Moving from one verified viewpoint to another resulted in an increase in confidence in 60% of cases, with an average change in confidence of +3.65%. The size of the increase/decrease fluctuations was quite large, but the average change in confidence shows an upward trend over these 20 trials.

Table 2: Descriptive statistics of the results of the confirmation views experiment (Experiment 2)
  Measure                   Result
  Attempts                  20
  Found new confirmation    20
  Increased                 12 (60%)
  Decreased                 8 (40%)
  Average change            +3.65%
  Standard deviation        6.14%
  Largest decrease          -9.67%
  Largest increase          +15.02%
  Mean initial confidence   62.35%
  Mean final confidence     64.53%

Discussion: Increases in confidence were largest when the start position did not align with the largest face of the object; next view selection was then able to find the largest face in the subsequent view. On the contrary, in situations where confidence decreased, the initial view coincided with the largest face of the object, so the subsequent view moved away from this largest face. However, instances of reduced confidence are not necessarily failures, as they still provided two high confidence identifications of the object. This block of experiments shows the planner is able to select a view to verify an object hypothesis with a good level of reliability.
Experiment 3: Visually Obstructed Views

Set-up: Environmental obstacles potentially act as visual occlusions when selecting a view. Next view planning must be able to recognise these in one view and assess their impact on the next. Without this ability the next view planner could select a view that would theoretically lead to a recognition confidence of 100%, but this view may be blocked by another object, so the actual recognition confidence is closer to 0%. To test this functionality the robot was presented with a scene which contained a target object and a potential occlusion blocking all or large parts of the object. Over 20 trials the robot was provided with an initial view of the object which allowed low confidence recognition. Each trial was deemed a success if the object was recognised with a high level of confidence after the first movement.

Results: In 17 of the 20 cases (85%), the planner was able to select a pose which both avoided the occlusion and led to recognition. In the remaining trials, the selected pose allowed a view of the target object but the view was incomplete, being partially occluded by the obstacle.

Discussion: In most cases the planner was able to account for an environmental occlusion and choose a best view pose that avoided it. In the remainder of cases the next view pose led to a partially obscured view of the object. This is due to occlusion modelling being based on the information gained from a single frame during the initial view. If the potentially occluding object is itself occluded by the target object, then the environmental analysis is detrimentally affected by partial observability of the obstacle.

Experiment 4: Ambiguous Objects

Set-up: A new view of an object can be taken to differentiate between objects that share similar features, which was the basis of this experiment. When presented with a view of an object that could belong to a target object or an unrelated object, it is best to disambiguate this view to decide whether the target object has been found. To test this, two custom objects were used. Figure 5 shows that the two modelled objects share a face from which they are almost indistinguishable, but are structurally different from other angles. In this experiment the robot was placed with a view of the common face of these two objects and was expected to decide on a pose which increases the difference between the number of clusters recognised for each object. In all experiments the cuboid object (Figure 5) was the target object. The robot was required to select a new pose to increase the strength of the correct hypothesis and decrease that of the incorrect one in 20 trials. The number of visible features that the recognition service matched to each candidate identity is a measure of the strength of that hypothesis.

Figure 5: Objects which share common features. They appear identical from a front view, but are distinguishable from other angles (Experiment 4).

Results: Figure 6 shows that when presented with a scene in which one of the two objects is present, the next best view planner can strengthen the hypothesis of the correct object and weaken that of the incorrect identity. Of the clusters recognised in the initial view, an average of 367.4 belonged to the correct object and 210.65 to the incorrect object. After one movement based on the next best view selection, the average number of available clusters for the correct object rose to 456.4 and fell to 152.8 for the incorrect object.

Figure 6: Results for Experiment 4. Percentage of available clusters for two ambiguous objects before and after movement.

Discussion: This shows that selecting one new view can increase the differentiation between two ambiguous objects and lead to a reliable identification. This suggests much simpler and less computationally expensive methods of hypothesis differentiation than in Okamoto (Okamoto, Milanova, and Bueker 1998); by taking only one view rather than a constant stream, the process is also much simpler.

6 Operation Time

The time for making one movement, from receiving a candidate identity and pose estimation to arriving at the next location, is 1 minute 44 seconds. For two cycles the completion time jumps to 4 minutes 58 seconds. This is due to the large amount of data needed to compute each camera model at each pose. This is clearly a number that needs to be reduced and can be a subject for future work.

General Discussion

Experimental results show this is a strong next view algorithm for object recognition that can work reliably in cluttered, unpredictable environments. In order to improve this solution further, some areas can be enhanced to make it more robust and generalisable. Potential next view locations are currently set at a fixed distance from the candidate object; this can be a hindrance in certain topological layouts. Rather than testing which of a fixed set of locations are in free space and therefore available, adaptable next view locations should instead be generated exclusively in free space, with environmental and model analysis then taking place from there. In adopting a greedy approach, this work selected only poses with the highest visible portion of the object; future work should focus on including a cost function that forms a utility trading off movement distance against the amount of the model which is visible.

7 Summary & Conclusion

A summary of the contributions of this project is shown below; the results from the previous section are presented in support of these. In summary, the aims of this study were to provide:
1. a method for analysing potential views using online aspect graphs by taking both into account: (1) occlusions, and (2) the visibility of features based on learned object models;
2. a next best view planner that selects a view based on the method above and an executive that accounts for dynamic obstacles during execution;
3. a set of experiments which demonstrates (a) how robots can disambiguate objects which share a similar set of features, and (b) how the performance of object recognition can be improved by taking multiple views.

Results of Experiment 1 show that the online aspect graph analysis is able to verify candidates put forward by the recognition service with an accuracy of 82.5%; however, this also showed that the next best view planner is highly dependent on the accuracy of the pose estimation provided to it. Experimentation also showed that dynamic collision detection was able to eliminate unavailable poses, removing around 17 of 38 poses on average during every trial. We also showed that in 85% of cases the planner was able to avoid visual occlusions in the environment, but this was heavily dependent on the visibility of the obstruction during the initial view. This was further confirmed when two identical starting poses in Experiments 1 and 3 arrived at different final poses, as the best view when no occlusions are present was unavailable when clutter was introduced. Finally, we showed that the planner was able to decrease ambiguity between objects that have identical faces.

To achieve these aims we used online aspect graph building and octree-based visual occlusion detection. These were new ways of approaching next best view planning and showed that online aspect graph analysis for view planning is possible and, unlike offline examples (Hutchinson and Kak 1989), can account for full or partial occlusions in the environment and thus avoid them when planning the next best view. Also, online aspect graph building allows models to be added during autonomous patrol and be immediately available for recognition, whereas offline building would require a period of down-time. By decreasing the ambiguity between two identical-looking objects, we showed that expensive image streaming methods (Okamoto, Milanova, and Bueker 1998) are not necessary and that a more intelligent approach than fixed angle movements (Wixson 1994) is possible, using no more than two views, with an identification rate of 82.5%.

The work presented in this paper was successful in its aims. From online aspect graph building and collision detection to camera modelling and near real-time occlusion analysis, the way this planner was designed allows it to be plugged into any robot using any model-based recognition system, meaning this planner is available for a variety of robots that conduct object search in cluttered environments.

Acknowledgement

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement No 600623, STRANDS.
References

Aldoma, A.; Tombari, F.; Prankl, J.; Richtsfeld, A.; Di Stefano, L.; and Vincze, M. 2013. Multimodal cue integration through hypotheses verification for RGB-D object recognition and 6DOF pose estimation. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, 2104–2111. IEEE.

Bajcsy, R. 1988. Active perception. Proceedings of the IEEE 76(8):966–1005.

Callari, F. G., and Ferrie, F. P. 2001. Active object recognition: Looking for differences. International Journal of Computer Vision 43(3):189–204.

Connelly. 1985. The determination of next best view. IEEE, 432–435.

Cyr, C. M., and Kimia, B. B. 2004. A similarity-based aspect-graph approach to 3D object recognition. International Journal of Computer Vision 57(1):5–22.

Hanheide, M.; Hawes, N.; Wyatt, J.; Göbelbecker, M.; Brenner, M.; Sjöö, K.; Aydemir, A.; Jensfelt, P.; Zender, H.; and Kruijff, G.-J. 2010. A framework for goal generation and management. In Proceedings of the AAAI Workshop on Goal-Directed Autonomy.

Hornung, A.; Wurm, K. M.; Bennewitz, M.; Stachniss, C.; and Burgard, W. 2013. OctoMap: An efficient probabilistic 3D mapping framework based on octrees. Autonomous Robots 34(3):189–206.

Hutchinson, S. A., and Kak, A. C. 1989. Planning sensing strategies in a robot work cell with multi-sensor capabilities. IEEE Transactions on Robotics and Automation 5(6):765–783.

Kunze, L.; Beetz, M.; Saito, M.; Azuma, H.; Okada, K.; and Inaba, M. 2012. Searching objects in large-scale indoor environments: A decision-theoretic approach. In 2012 IEEE International Conference on Robotics and Automation (ICRA), 4385–4390. IEEE.

Kunze, L.; Burbridge, C.; Alberti, M.; Thippur, A.; Folkesson, J.; Jensfelt, P.; and Hawes, N. 2014. Combining top-down spatial reasoning and bottom-up object class recognition for scene understanding. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2910–2915. IEEE.

Kunze, L.; Doreswamy, K. K.; and Hawes, N. 2014. Using qualitative spatial relations for indirect object search. In 2014 IEEE International Conference on Robotics and Automation (ICRA), 163–168. IEEE.

Maver, J., and Bajcsy, R. 1993. Occlusions as a guide for planning the next view. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(5):417–433.

Okamoto, J.; Milanova, M.; and Bueker, U. 1998. Active perception system for recognition of 3D objects in image sequences. In Advanced Motion Control, 1998 (AMC '98-Coimbra), 5th International Workshop on, 700–705.

Prankl, J.; Aldoma, A.; Svejda, A.; and Vincze, M. 2015. RGB-D object modelling for object recognition and tracking. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, 96–103. IEEE.

Roberts, D., and Marshall, A. D. 1998. Viewpoint selection for complete surface coverage of three dimensional objects. In BMVC, 1–11.

Stampfer, D.; Lutz, M.; and Schlegel, C. 2012. Information driven sensor placement for robust active object recognition based on multiple views. In 2012 IEEE International Conference on Technologies for Practical Robot Applications (TePRA), 133–138. IEEE.

Vasquez-Gomez, J. I.; Sucar, L. E.; and Murrieta-Cid, R. 2014. View planning for 3D object reconstruction with a mobile manipulator robot. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 4227–4233.

Vázquez, P.-P.; Feixas, M.; Sbert, M.; and Heidrich, W. 2001. Viewpoint selection using viewpoint entropy. In VMV, volume 1, 273–280.

Wixson, L. 1994. Viewpoint selection for visual search. In Computer Vision and Pattern Recognition (CVPR '94), 1994 IEEE Computer Society Conference on, 800–805.