            Next Best View Planning for Object Recognition in Mobile Robotics
                                 Christopher McGreavy, Lars Kunze and Nick Hawes
                                                  Intelligent Robotics Lab
                                                School of Computer Science
                                                 University of Birmingham
                                                      United Kingdom
                                      {cam586|l.kunze|n.a.hawes}@cs.bham.ac.uk


                            Abstract

   Recognising objects in everyday human environments is a challenging task for autonomous mobile robots. However, actively planning the views from which an object might be perceived can significantly improve the overall task performance. In this paper we have designed, developed, and evaluated an approach for next best view planning. Our view planning approach is based on online aspect graphs and selects the next best view after having identified an initial object candidate. The approach has two steps. First, we analyse the visibility of the object candidate from a set of candidate views that are reachable by a robot. Secondly, we analyse the visibility of object features by projecting the model of the most likely object into the scene. Experimental results on a mobile robot platform show that our approach is (I) effective at finding a next view that leads to recognition of an object in 82.5% of cases, (II) able to account for visual occlusions in 85% of the trials, and (III) able to disambiguate between objects that share a similar set of features. Hence, overall, we believe that the proposed approach can provide a general methodology that is applicable to a range of tasks beyond object recognition such as inspection, reconstruction, and task outcome classification.

                       1    Introduction
Autonomous mobile robots that operate in real-world environments are often required to find and retrieve task-related objects to accomplish their tasks. However, perceiving and recognising objects in such environments poses several challenges. Firstly, object locations are continuously changing. That is, a mobile robot cannot simply rely on a fixed set of views from which it can observe an object. The robot needs to plan in a continuous space where to stand and where to look. That is, the robot has to select a view from the uncountably infinite set of possible views which allows it to observe the sought object. Matters are further complicated by dynamic obstacles which might hinder the robot from taking a particular view. Or, when taking a view, relevant features of an object might be occluded by other objects and/or by the object itself (self-occlusion). Finally, other conditions such as lighting and/or sensor noise can influence the performance of object recognition tasks.
   In previous work, we have enabled robots to find objects in particular rooms (Kunze et al. 2012) and in relation to other objects (Kunze, Doreswamy, and Hawes 2014). These approaches guide robots to locations from which they can potentially observe an object. In this work, however, we propose a complementary planning approach that selects the next best view after having identified a potential object candidate. Such local, incremental view planning is crucial for two reasons: (1) it allows robots to disambiguate between objects which share a similar set of features, and (2) it improves the overall performance of object recognition tasks as objects are observed and recognized from multiple views.
   Our local view planning approach addresses all of the above-mentioned challenges. It is based on using a realistic sensor model of an RGB-D camera to generate online aspect graphs (a set of poses around an object which describe object visibility at each point) and takes the kinematic constraints of a mobile robot platform into account. Hence, by changing the sensor and/or the kinematic model our approach can easily be transferred to robot platforms of different types. We further consider two environmental constraints when generating robot and camera poses: (1) dynamic obstacles which might hinder a robot from taking particular views, and (2) occlusions of object features which might be hidden by other objects (or by the object itself). Finally, we consider learned object models in the planning process to predict the location and visibility of features of object candidates. An overview of the next best view planning approach is given in Figure 1.
   Experimental results show that our next best view planning approach, which takes multiple views of an object, allows us to improve the performance of recognition tasks when compared to single-view object recognition. Further, it enables robots to differentiate between objects that share a similar set of features.
   To this end, this work contributes the following:
1. a method for analysing potential views using online aspect graphs that takes into account both (1) occlusions and (2) the visibility of features based on learned object models;
2. a next best view planner that selects a view based on the method above, and an executive that accounts for dynamic obstacles during execution;
3. a set of experiments which demonstrates (a) how robots can disambiguate objects which share similar feature sets, and (b) how the performance of object recognition can be improved by taking multiple views.
   The remainder of the paper is structured as follows. We first discuss related work in Section 2, followed by a conceptual overview of our approach in Section 3 and a detailed description of the implementation in Section 4. In Section 5, we present and discuss experimental results, before we conclude with a summary in Section 7.

Figure 1: Next best view planning: Conceptual overview. The approach has two steps: (A) an environmental analysis and (B) a model analysis. The analysis of the environment reasons about the visibility of an object and is carried out based on a set of navigable poses. Poses in suitable areas are then carried on into the model analysis (B), in which the visibility of object features is evaluated. Conceptually, the resulting areas from (A) and (B) are combined to find the next best view (C).

                    2    Related Work
Hutchinson et al. (Hutchinson and Kak 1989) used aspect graphs to determine the pose in which most unseen object features were present when digitally reconstructing an object. By storing a geometric model of an object they were able to determine which features of the object could be seen from various poses around it. The sensor would then move to the pose in which the most features were available. However, the geometric analysis accounted for minute edges and ridges which are not necessarily visible to the camera. Aspect graphs were computed offline and used as a lookup table for a mobile sensor. This work does not take into account camera limitations and thus may provide a view which is theoretically optimal, but practically unobtainable. To combat this, the work presented in this paper seeks to model the sensor used in the task. Aspect graphs will also be built online to account for accessibility of the environment.
   The approach used by Stampfer et al. (Stampfer, Lutz, and Schlegel 2012) bears most resemblance to the current work. By taking candidates from the initial scene, they selected next best view locations that maximised the probability of recognising objects in the scene from a local object database, which is analogous to this project. But instead of planning next best view locations based on geometric analysis, they used photometric methods to locate specific visual features. A camera mounted on a manipulator arm is used to sequentially move to these feature-rich locations. This, however, relies on the object having colour/contrast differences, bar codes or text, which may not be true of all objects. Photometric analysis has its merits, but geometric feature analysis has also been shown to be effective (Vázquez et al. 2001; Roberts and Marshall 1998). Their approach also requires a robot with a manipulator to move around the object and does not account for visual occlusions that may hamper the line of sight of the camera. Although reliable, this approach requires several sequential views before recognition. In this project we aim to minimise the number of views taken.
   Early work into next view planning was based on detecting which parts of an object were causing self-occlusion (Bajcsy 1988; Connelly 1985; Maver and Bajcsy 1993). These methods were effective at obtaining high levels of coverage of an object for digital reconstruction, but a high number of new views was needed in order to achieve this. Methods in the current work are inspired by these concepts for detecting occlusions in the environment so as to avoid them in the next view.
   Okamoto (Okamoto, Milanova, and Bueker 1998) used a model-free approach to move to a precomputed best pose for recognition. A stream of video-like images was taken en route to this location and a recognition was produced based on this stream. This method was able to disambiguate similar-looking objects. However, this solution made no consideration for environmental obstacles and had no backup if the optimal pose was unavailable. In our work, the next view planner will plan to avoid environmental obstacles and will not require the use of expensive visual processing methods, but will still be able to differentiate between similar objects.
   Callari (Callari and Ferrie 2001) sought to use contextual information from a robot's surroundings to identify an object. This is a useful metric, as contextual information has been shown to be useful in directing object search (Hanheide et al. 2010; Kunze et al. 2014). However, in order for this solution to work, a great deal of prior information is required to influence identification, and it is unlikely to cope well in a novel environment. The current work does take in prior information, but this is limited to a snapshot of the current environment which is used to detect visual occlusions.
   Wixson (Wixson 1994) proposed moving at set intervals around a target for high surface coverage of the object within a fixed number of movements with little computational cost. Though this naive approach offers very low computing costs with potentially high information gain, the cost of movement would potentially be huge if recognition did not occur in the first couple of movements. The present work seeks to use current information to limit the number of movements required to identify an object.
   Vasquez-Gomez and Stampfer (Vasquez-Gomez, Sucar, and Murrieta-Cid 2014; Stampfer, Lutz, and Schlegel 2012) considered some of the restricting effects of an uncertain environment when digitally reconstructing an object with a mobile robot whose camera is attached to a multi-joint manipulator on a mobile base. The main contribution of this work to the current study is that it considered not only the placement of the sensor but also that of the mobile base, which was always planned to be placed in open space. These considerations are extended in this project to account for agent placement and visual occlusions.

             3    Next Best View Planning
This section provides a conceptual overview of the proposed view planning approach. Implementation details can be found in Section 4.

Figure 2: Illustrates how the next best view planning in this paper fits within the perception-action cycle. Blue boxes represent action/perception stages. Green boxes represent planning stages.

   Figure 2 shows the next best view planner's place in the perception-action cycle of a robot. A hypothesis about an object's identity and its estimated pose are the inputs into the planner. As output, the planner provides the next best view, from which it determines the robot has the best chance of identifying the object.
   The following briefly describes the view planning process after a candidate identity is received: (I) potential viewing locations are checked for dynamic obstacles on the local cost map. (II) Views that are reachable by the robot are subjected to an environmental analysis to determine whether any visual occlusions block the view of the candidate object. (III) Views which survive environmental analysis undergo model analysis to determine the amount of object surface area visible from each view point.
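For orientation, this three-stage filtering can be summarised by the following sketch. It is an illustration of the process in steps (I)-(III), not the actual implementation; the predicates and the scoring function are supplied by the components described in the remainder of the paper.

```python
# Illustrative sketch of the three-stage filtering in steps (I)-(III).
# is_blocked, is_occluded, and visibility are callables provided by the
# collision checking, environmental analysis, and model analysis components.
def plan_next_best_view(candidate_views, is_blocked, is_occluded, visibility):
    reachable  = [v for v in candidate_views if not is_blocked(v)]   # (I) cost map check
    unoccluded = [v for v in reachable if not is_occluded(v)]        # (II) environmental analysis
    scored     = [(visibility(v), v) for v in unoccluded]            # (III) model analysis
    return max(scored, key=lambda sv: sv[0])[1] if scored else None
```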
   Before describing the individual components of the view planning approach in detail, we motivate when next best view planning is initiated and why.

The Need for Local View Planning
In the context of object search tasks, a robot might seek objects in certain rooms (Kunze et al. 2012) or in proximity to other objects (Kunze, Doreswamy, and Hawes 2014). However, objects cannot always be recognized with high confidence from a first view. To verify the identity of an object the robot may have to take an additional view. We have identified the following situations in which one or more additional views would be beneficial:
1. When the recognition service provides a low confidence estimate of an object's identity. In this situation, another view would be required to confirm or deny this hypothesis. By moving the camera to a location where more of the object is visible there is a higher chance of obtaining a high confidence identification.
2. When an identification is returned with high confidence, but not high enough to meet other task requirements; another view could lead to a higher confidence. This may be useful when identifying high priority items in a service environment, such as looking for the correct medicine.
3. In the event of more than one candidate identity being returned for the same object. This can occur when the visible features match more than one modelled object. Further views of the object can lead to disambiguation.
   When any of these conditions are met, next view planning is initiated; a simple trigger predicate along these lines is sketched below. The components of the approach are described in the following sections.
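The following is a minimal, hypothetical way of expressing the three triggering conditions above; the confidence thresholds and hypothesis fields are assumptions made for the sketch, not values taken from the system.

```python
# Hypothetical trigger for local view planning; thresholds and field names
# are illustrative assumptions, not parameters of the actual system.
def needs_next_view(hypotheses, verify_threshold=0.8, task_threshold=0.95):
    if not hypotheses:
        return False                      # nothing detected, nothing to verify
    best = max(h.confidence for h in hypotheses)
    if best < verify_threshold:
        return True                       # condition 1: low-confidence estimate
    if best < task_threshold:
        return True                       # condition 2: task demands more certainty
    return len(hypotheses) > 1            # condition 3: ambiguous candidates
```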
Perception
Streams of images received from the robot's camera are processed to detect candidate objects. Bottom-up perception algorithms are used, making estimates of object identities based on the information available in the scene and no contextual information. Segmented sections of the scene are compared with a model database; any matches between the two are returned, with a confidence measure, as estimated object identities.

Online Aspect Graph Building
An object cannot be seen in its entirety from one viewpoint, and different viewpoints present different features to a sensor. Aspect graphs are a method of simulating which features may be visible at different viewpoints around an object. A typical aspect graph for next view planning consists of a sphere of poses around a model, and geometric analysis determines which features are visible from each point. In past work, aspect graphs have been computed offline (Cyr and Kimia 2004; Maver and Bajcsy 1993; Hutchinson and Kak 1989), which produces a set of coordinates for a sensor and an associated value denoting the number of visible features at each pose. This is used as a lookup table and requires knowledge of the object being viewed and its pose.
   This project will instead build aspect graphs online, in two stages. By moving online, we can account for real-time information about environmental obstructions and their effect on potential new poses. Aspect graphs are generated for both Environmental Analysis and Model Analysis, which will be discussed next.
The shape of the aspect graphs in this project is governed by the robot's degrees of freedom. The robot used in this paper has 3 DOF (base movement and a pan/tilt unit), so graph nodes are arranged in a disc surrounding the candidate object.

Environmental Analysis  After receiving a candidate identity we need to decide which available pose offers the next best view. The first step is to determine whether any part of the environment lies between the sensor and the candidate object, thus creating a visual occlusion. To achieve this, a snapshot of the current environment is taken and converted into a volumetric representation. From various points around the candidate object the sensor (oriented towards the object) is modelled. Within this model, any part of the sensor's field of view which does not reach the estimated position of the object is discarded. This leaves a circle of poses around the object, each containing its respective remaining field of view which allows unobstructed line-of-sight to the object. These remaining parts of the field of view are then carried forward to model analysis.

Model Analysis  Surviving sections of the modelled camera at each pose do not necessarily represent the visibility of the object at that location. In order to establish which view provides the most information about the object, we use the model of the object from the object database along with the surviving sections of the modelled cameras at each pose from the previous step.
   The object database contains a manipulable model of the target object. This model is rotated to match the estimated pose of the candidate object. From here we again simulate each camera pose from the previous step, using only the surviving fields of view. Each modelled camera at each pose is oriented towards the object model; the proportion of the field of view of each camera which is filled by the object is then saved and represents the visibility of the object from each pose around it.

Next Best View Selection
After environmental and model analysis, each pose is matched with a score which represents its visual coverage of the candidate object. The pose with the highest score is determined to be the next best view; this is sent to the robot's navigation component. Once at the new location the robot will either accept or reject the initial identity estimate, or begin to determine another next best view if the first did not lead to recognition.

                  4    Implementation
In this section we describe how the view planning approach is integrated with the data structures and algorithms of the robot's perception and control components. Figure 3 provides an overview of the different components and explains the view planning process step-by-step.

Object Recognition
In this work, we build on a state-of-the-art object modelling and object recognition framework (Aldoma et al. 2013; Prankl et al. 2015). Our implementation is based on ROS (http://www.ros.org/), a message-based operating system used for robot control. For object instance recognition, we use an adapted version of a ROS-based recognition service of the above-mentioned framework. The service takes an RGB point cloud as input and returns one of the following:
(1) a list of candidate object hypotheses: in case an object's identity cannot be verified, a list of hypotheses is returned. The hypotheses include the object's potential identity and a pose estimate in the form of a 4 x 4 transformation matrix which aligns the object to the point of view of the camera.
(2) a verified object identity: if an object is identified with high confidence, the object's identity, pose, and the confidence level are returned.
   In Figure 3, the input to the object recognition service, the service itself, and a visualisation of the output are depicted in Steps A, B, and C respectively. In this case, the output is a hypothesis for a candidate object identity (here: a book). The object recognition service is used again after moving to the next best view (see Steps H, I, and J); this time, the outcome is a verified identification of the object.
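As a reading aid, the information carried by such a hypothesis can be pictured roughly as follows. The field names are ours and only approximate the actual ROS message used by the recognition framework.

```python
from dataclasses import dataclass
import numpy as np

# Rough shape of a recognition result; field names are illustrative and do
# not mirror the actual ROS message of the recognition framework.
@dataclass
class ObjectHypothesis:
    label: str              # candidate identity from the model database
    confidence: float       # recognition confidence
    T_cam_obj: np.ndarray   # 4x4 transform aligning the object model to the
                            # point of view of the camera

    def position(self):
        """Estimated object position in the camera frame (translation part)."""
        return self.T_cam_obj[:3, 3]
```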
View Generation
After a candidate identity is received, a series of poses is generated around it. These poses are generated in two uniformly distributed rings of robot poses around the estimated location of the object, each oriented directly towards the location of the object.
   The accessibility of each of these poses is assessed by collision-checking the views using the local cost map. Any views in which the robot would collide with environmental obstructions are discarded. This is seen in Figure 3 (Step D), where views that make contact with the supporting surface of the object are eliminated. The remaining views are carried forward to be assessed for visual occlusions.
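A minimal sketch of this view generation step is given below. The radii and ring size are illustrative values rather than the system's parameters, although two rings of 19 poses would be consistent with the roughly 38 candidate poses mentioned in Section 7.

```python
import math

# Sketch of view generation: two uniformly distributed rings of robot poses
# around the estimated object position, each pose facing the object.
# Radii and poses_per_ring are illustrative, not the system's parameters.
def generate_views(obj_x, obj_y, radii=(1.0, 1.5), poses_per_ring=19):
    views = []
    for r in radii:
        for k in range(poses_per_ring):
            theta = 2.0 * math.pi * k / poses_per_ring
            x = obj_x + r * math.cos(theta)
            y = obj_y + r * math.sin(theta)
            yaw = math.atan2(obj_y - y, obj_x - x)   # orient towards the object
            views.append((x, y, yaw))
    return views
```

Poses whose footprint overlaps an obstacle in the local cost map would then be discarded before the occlusion analysis below.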

Environmental Analysis
After a set of accessible views has been generated and tested, environmental analysis is performed on the initial point cloud to assess whether any regions of the environment might occlude the view of the object. In order to do this, the current point cloud is converted into an octree representation (Figure 3, Step E) using Octomap (Hornung et al. 2013), and an RGB-D camera is then modelled at every view. The sections of each modelled camera that do not make contact with the bounding box of the candidate object are eliminated from further analysis. The sections of each camera model that allow an unobstructed view of the candidate object are then carried forward to the next stage; all other sections are discarded.
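The occlusion test can be pictured with the following simplified stand-in for the Octomap-based check: rays of the modelled camera are marched towards the object, and a ray is kept only if it reaches the object before hitting an occupied voxel. The voxel set, voxel size, and step length are assumptions of the sketch.

```python
import numpy as np

# Simplified stand-in for the Octomap-based occlusion test: step along a ray
# from the camera origin towards a point on the object and check whether an
# occupied voxel (built from the current point cloud, with the object's own
# voxels excluded) is hit first. Parameters are illustrative.
def ray_unobstructed(origin, target, occupied_voxels, voxel_size=0.05, step=0.02):
    origin = np.asarray(origin, dtype=float)
    target = np.asarray(target, dtype=float)
    direction = target - origin
    length = float(np.linalg.norm(direction))
    direction = direction / length
    t = step
    while t < length:
        p = origin + t * direction
        voxel = tuple(int(c) for c in np.floor(p / voxel_size))
        if voxel in occupied_voxels:
            return False       # this part of the field of view is occluded
        t += step
    return True                # clear line of sight up to the target point
```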
Figure 3: The next best view planning process step-by-step: A: Receive initial view. B: Recognition service processes point cloud. C: Candidate object and pose identified. D: Collision detection around object location. E: Environmental analysis for occlusions (candidate object in pink, occlusions in white and blue). F: Model analysis with remaining rays. G: Move to best view pose. H: New point cloud sent to recognition service and object recognised. I: Point cloud from second view is sent to recognition service. J: Visualisation of recognition service correctly and fully recognising the target object.

Model Analysis
Figure 3 (Step F) shows the model analysis step. By completing the previous two steps, the number of possible view locations and camera fields of view has been reduced.


Each surviving part of the modelled camera is simulated and directed towards a model of the candidate object. The remaining sections of each camera are simulated using ray-casting; this computes a line from the origin of the camera out in the direction of the field of view. If the line makes contact with the model, it is considered that the part of the object with which it made contact would be visible to the camera from that viewpoint. After the camera is modelled at each view, we are left with a measure of the amount of the object visible at each location. The view which enables the highest visibility is then considered the next best view.
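A compact sketch of this scoring and selection step is shown below; ray_hits_model stands in for the ray/model intersection test and is supplied by the caller, so the code is an illustration rather than the actual implementation.

```python
# Sketch of model analysis scoring: the fraction of a view's surviving rays
# that hit the aligned object model, followed by selection of the best view.
# ray_hits_model is supplied by the caller (e.g. a ray/mesh intersection test).
def visibility_score(surviving_rays, ray_hits_model):
    if not surviving_rays:
        return 0.0
    hits = sum(1 for ray in surviving_rays if ray_hits_model(ray))
    return hits / len(surviving_rays)

def select_next_best_view(scores_by_view):
    # scores_by_view maps a view pose to its visibility score; the pose with
    # the highest score is taken as the next best view.
    return max(scores_by_view, key=scores_by_view.get)
```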
Multi-object Disambiguation  Note that if the recognition service returns more than one candidate hypothesis, aspect graph building is performed for every object, each with its own pose transformation. After this, each is analysed to find the best compromise view, one which gives the best chance of recognition for each model. To achieve this, after calculating the sum of visible surface area for each view, the difference between the candidates' visibilities is subtracted from that sum. This ensures that a high surface-area visibility for one candidate at a pose does not dominate over a low score for another candidate at the same pose.
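For the two-candidate case this rule can be written as follows; this is our reading of the description above, not code taken from the system.

```python
# Compromise score for a view given the visible surface area of two candidate
# models at that view: reward seeing both, penalise views that strongly favour
# one candidate (sum of visibilities minus their absolute difference).
def compromise_score(vis_a, vis_b):
    return (vis_a + vis_b) - abs(vis_a - vis_b)

# Example: vis_a = 0.6, vis_b = 0.5 gives 1.0, whereas vis_a = 0.9, vis_b = 0.2
# gives only 0.4, so the more balanced view is preferred.
```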
View Execution: Navigation & Recognition
When the next best view has been found, the pose associated with it is sent to the robot's navigation packages, which manoeuvre the robot to that view (Figure 3, Step G). In this case, the pose consists of movement by the base of the robot and angular movement by the pan/tilt unit to centre the camera on the object's location. Once the goal is reached, input to the recognition service resumes (Figure 3, Step I); the response of the recognition service can denote different things:
1. Verification: If a high confidence estimate of the object's identity is returned, the next view is considered successful and the object considered identified.
2. No candidate: If the recognition service returns no high confidence identity after the next view, it should either be considered successful in dispelling a false hypothesis from the first view (true negative) or unsuccessful, as it has lowered the amount of information available to the recognition service to deny further identification.
   Depending on the result of the movement, the object search will end, another view is taken, or the candidate will be discarded and the search continued. This completes the overview of the process this next view selection undergoes in order to produce a next best view pose. The components of this system will now be assessed and the results examined.

             5    Experimentation and Evaluation
Experiments were conducted to test the capabilities of this next best view planner. All experiments were carried out on a Scitos G5 robot equipped with a pan/tilt unit. In each experiment the robot was located in an open area with few obstacles. The centre of this area contained a tall, thin plinth, the plateau of which was just below the robot's camera height. Test objects and obstacles were placed on this supporting table and the robot was able to move around to reach a new view.
Each experiment used its own specific set of target objects. Details of the set-up of individual experiments are given below.

Experiment 1: Non-obstructed Next View Selection
Set-up  The primary function of this planner is to take in hypotheses about potential objects and select the best pose to enable their accurate confirmation or rejection. This function was tested in this experiment. No obstacles or occlusions were used apart from the object's supporting plane. 40 trials were carried out in total, 20 each on two different objects: a large book and a standard mug. In all cases the robot was initially positioned with a view of the object which gave a low chance of recognition. Success in these trials was defined by the ability of the recognition service to make a high confidence identification of the target object after the next best view planner selected a new pose and the robot had moved. In each trial up to two next best view locations were permitted.

Table 1: Results of trials when presented with a single object and no obstruction (Experiment 1)
   Result                       # Trials    Percentage %
   Moved. Recognised                 33            82.5%
   Moved. No Recognition              4            10.0%
   No Initial Candidate               3             7.5%
   Total                             40           100.0%

Results  Table 1 shows the results of this experiment. The table shows a successful verification rate of 82.5%, meaning that in most cases the planner was able to take an uncertain hypothesis and verify it by moving to a new location. In 10% of trials, two next view poses were taken, but no recognition was achieved.

Discussion  Failure to provide a verified hypothesis in 10% of cases can be attributed to the pose estimation provided by the recognition service. If the view of the candidate in the initial view is of very low quality, pose estimation can be inaccurate; this leads to poorly aligned candidate poses and thus inaccurate movement. In some cases this could be compensated for: of the 33 successful trials, 7 required two views before high confidence recognition was achieved. The initial inaccurate pose estimation led to poor next view selection; however, the new view presented better quality information about the object to a point where the pose estimation improved and the subsequent view resulted in recognition. The nature of taking pose estimates from low confidence identity hypotheses inherently holds the risk of inaccurate pose estimation, so this is to be expected.

Figure 4: Graphical representation of selected next best view poses for a book and mug in experiment 1. Poses surrounded by black rings did not lead to recognition.

   Circled poses in Figure 4 show the final positions of the 10% of trials in which no recognition could be made after two views, which can be attributed to a succession of inaccurate pose estimations. The red pose arrows in Figure 4 tend to cluster in areas with a view of a large surface area of the object. For the book this was a view from which both the spine and cover were visible, and for the mug a view where the handle and body were visible. This demonstrates that, given a reasonably accurate initial pose estimation of the object, the next best view planner is able to locate the view which presents one of the largest faces of the object to the camera, and that doing so leads to reliable recognition. This shows that aspect graphs can be computed online to yield high recognition accuracy and do not require excessive analysis.

Experiment 2: Confirmation Views
Set-up  After making a high confidence identification of an object, it may be useful to take another view from a different position, as two high confidence identifications provide more certainty than one. This experiment followed the same procedure as experiment 1, except for the starting position of the robot, which was positioned to allow a high confidence identification. Success in this experiment was measured firstly on whether the next view led to another high confidence recognition and secondly on how much identification confidence increased from the first pose to the next.

Results  Table 2 shows the descriptive statistics for the second experiment. Moving from one verified viewpoint to another resulted in an increase in confidence in 60% of cases, with an average change in confidence of +3.65%. The size of the increase/decrease fluctuations was quite large, but the average change in confidence shows an upward trend over these 20 trials.

Discussion  Increases in confidence were largest when the start position did not align with the largest face of the object; next view selection was then able to find the largest face in the subsequent view. On the contrary, in situations where confidence decreased, the initial view coincided with the largest face of the object, so the subsequent view moved away from this largest face. However, instances of reduced confidence are not necessarily failures, as they still provided two high confidence identifications of the object. This block of experiments shows the planner is able to select a view to verify an object hypothesis with a good level of reliability.
Table 2: Descriptive statistics of the results of the confirma-
tion views experiment (Experiment 2)
           Measure                           Result
           Attempts                             20
           Found New Confirmation               20
           Increased                     12 (60%)
           Decreased                      8 (40%)
           Average Change                 + 3.65%
           Standard Deviation               6.14%
           Largest Decrease                -9.67%
           Largest Increase              +15.02%
           Mean Initial Confidence         62.35%
           Mean Final Confidence           64.53%


Experiment 3: Visually Obstructed Views
Set-up  Environmental obstacles potentially act as visual occlusions when selecting a view. Next view planning must be able to recognise these in one view and assess their impact on the next. Without this ability the next view planner can select a view that would theoretically lead to a recognition confidence of 100%, but this view may be blocked by another object, so that the actual recognition confidence is closer to 0%. To test this functionality the robot was presented with a scene which contained a target object and a potential occlusion blocking all or large parts of the object. Over 20 trials the robot was provided with an initial view of the object which allowed low confidence recognition. Each trial was deemed a success if the object was recognised with a high level of confidence after the first movement.

Results  Results showed that in 17 of the 20 cases (85%), the planner was able to select a pose which both avoided the occlusion and led to recognition. In the remaining trials, the selected pose allowed a view of the target object but the view was incomplete, being partially occluded by the obstacle.

Discussion  In most cases the planner was able to account for an environmental occlusion and choose a best view pose that avoided it. In the remainder of cases the next view pose led to a partially obscured view of the object. This is due to occlusion modelling being based on the information gained from a single frame during the initial view. If the potentially occluding object is itself occluded by the target object, then the environmental analysis is detrimentally affected by partial observability of the obstacle.

Figure 5: Objects which share common features. They appear identical from a front view, but are distinguishable from other angles (Experiment 4).

Experiment 4: Ambiguous Objects
Set-up  A new view of an object can be taken to differentiate between objects that share similar features, which was the basis of this experiment. When presented with a view of an object that could belong to a target object or an unrelated object, it would be best to disambiguate this view to decide whether the target object has been found. To test this, two custom objects were used. Figure 5 shows that the two modelled objects share a face from which they are almost indistinguishable, but are structurally different from other angles. In this experiment the robot was placed with a view of the common face of these two objects and was expected to decide on a pose which increases the difference between the number of clusters recognised from each object. In all experiments the cuboid object (Figure 5) was the target object. The robot was required to select a new pose that increases the strength of the one correct hypothesis and decreases that of the incorrect one, over 20 trials. The number of visible features that the recognition service matched to each candidate identity is the measure of the strength of that hypothesis.

Results  Figure 6 shows that when presented with a scene in which one of the two objects is present, the next best view planner can strengthen the hypothesis of the correct object and weaken that of the incorrect identity. Of the clusters recognised in the initial view, an average of 367.4 belonged to the correct object and 210.65 to the incorrect object. After one movement based on the next best view selection, the average number of available clusters for the correct object rose to 456.4, while that for the incorrect object fell to 152.8.

Discussion  This shows that selecting one new view can increase the differentiation between two ambiguous objects and lead to a reliable identification. This suggests a much simpler and less computationally expensive method of hypothesis differentiation than that of Okamoto (Okamoto, Milanova, and Bueker 1998): by taking only one view rather than a constant stream of images, the process is also much simpler.

                  6    Operation Time
The time for making one movement, from receiving a candidate identity and pose estimation to arriving at the next location, is 1 minute 44 seconds. For two cycles the completion time jumps to 4 minutes 58 seconds.
This is due to the large amount of data needed to compute each camera model at each pose. This is clearly a number that needs to be reduced and can be a subject for future work.

Figure 6: Results for experiment 4. Percentage of available clusters for two ambiguous objects before and after movement.

General Discussion
Experimental results show this is a strong next view algorithm for object recognition that can work reliably in cluttered, unpredictable environments.
   In order to improve this solution further, some areas can be enhanced to make it more robust and generalisable. Potential next view locations are currently set at a fixed distance from the candidate object; this can be a hindrance in certain topological layouts. Rather than testing which of a fixed set of locations are in free space and therefore available, adaptable next view locations could instead be generated exclusively in free space, with environmental and model analysis then taking place from there.
   In adopting a greedy approach, this work selected only poses with the highest visible portion of the object; future work should focus on including a cost function to form a utility between movement distance and the amount of the model which is visible.
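One simple form such a utility could take is sketched below; the weighting factor is a free parameter and is not something evaluated in this paper.

```python
# Hypothetical movement-aware utility for future work: trade off the visible
# portion of the model against the distance the robot must travel.
def view_utility(visibility, travel_distance, distance_weight=0.1):
    return visibility - distance_weight * travel_distance
```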
             7    Summary & Conclusion
A summary of the contributions of this project is given below; the results shown in the previous section are presented in support of these. In summary, the aims of this study were to provide:
1. a method for analysing potential views using online aspect graphs that takes into account both (1) occlusions and (2) the visibility of features based on learned object models;
2. a next best view planner that selects a view based on the method above, and an executive that accounts for dynamic obstacles during execution;
3. a set of experiments which demonstrates (a) how robots can disambiguate objects which share a similar set of features, and (b) how the performance of object recognition can be improved by taking multiple views.
   Results of experiment 1 show that the online aspect graph analysis is able to verify candidates put forward by the recognition service with an accuracy of 82.5%; however, this also showed that the next best view planner is highly dependent on the accuracy of the pose estimation provided to it. Experimentation also showed that dynamic collision detection was able to eliminate unavailable poses, removing around 17 of 38 poses on average during every trial. We can also show that in 85% of cases the planner was able to avoid visual occlusions in the environment, but this was heavily dependent on the visibility of the obstruction during the initial view. This was further confirmed when two identical starting poses in experiments 1 and 3 arrived at different final poses, as the best view when no occlusions are present was unavailable when clutter was introduced. Finally, we showed that the planner was able to decrease ambiguity between objects that have identical faces.
   To achieve these aims we used online aspect graph building and octree-based visual occlusion detection. These were new ways of approaching next best view planning and showed that online aspect graph analysis for view planning is possible and, unlike offline examples (Hutchinson and Kak 1989), can account for full or partial occlusions in the environment and thus avoid them when planning the next best view. Also, online aspect graph building allows models to be added during autonomous patrol and to be immediately available for recognition, whereas offline building would require a period of down-time. By decreasing the ambiguity between two identical-looking objects, we showed that expensive image streaming methods (Okamoto, Milanova, and Bueker 1998) are not necessary, and that a more intelligent approach than fixed-angle movements (Wixson 1994) is possible, using no more than two views, with an identification rate of 82.5%.
   The work presented in this paper was successful in its aims. From online aspect graph building and collision detection to camera modelling and near real-time occlusion analysis, the way this planner was designed allows it to be plugged into any robot using any model-based recognition system, meaning this planner is available for a variety of robots that conduct object search in cluttered environments.

                   Acknowledgement
The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement No 600623, STRANDS.

                      References
Aldoma, A.; Tombari, F.; Prankl, J.; Richtsfeld, A.; Di Stefano, L.; and Vincze, M. 2013. Multimodal cue integration through hypotheses verification for RGB-D object recognition and 6DOF pose estimation. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, 2104-2111. IEEE.
Bajcsy, R. 1988. Active perception. Proceedings of the IEEE 76(8):966-1005.
Callari, F. G., and Ferrie, F. P. 2001. Active object recognition: Looking for differences. International Journal of Computer Vision 43(3):189-204.
Connelly. 1985. The determination of next best view. IEEE, 432-435.
Cyr, C. M., and Kimia, B. B. 2004. A similarity-based aspect-graph approach to 3D object recognition. International Journal of Computer Vision 57(1):5-22.
Hanheide, M.; Hawes, N.; Wyatt, J.; Göbelbecker, M.; Brenner, M.; Sjöö, K.; Aydemir, A.; Jensfelt, P.; Zender, H.; and Kruijff, G.-J. 2010. A framework for goal generation and management. In Proceedings of the AAAI Workshop on Goal-Directed Autonomy.
Hornung, A.; Wurm, K. M.; Bennewitz, M.; Stachniss, C.; and Burgard, W. 2013. OctoMap: An efficient probabilistic 3D mapping framework based on octrees. Autonomous Robots 34(3):189-206.
Hutchinson, S. A., and Kak, A. C. 1989. Planning sensing strategies in a robot work cell with multi-sensor capabilities. IEEE Transactions on Robotics and Automation 5(6):765-783.
Kunze, L.; Beetz, M.; Saito, M.; Azuma, H.; Okada, K.; and Inaba, M. 2012. Searching objects in large-scale indoor environments: A decision-theoretic approach. In 2012 IEEE International Conference on Robotics and Automation (ICRA), 4385-4390. IEEE.
Kunze, L.; Burbridge, C.; Alberti, M.; Thippur, A.; Folkesson, J.; Jensfelt, P.; and Hawes, N. 2014. Combining top-down spatial reasoning and bottom-up object class recognition for scene understanding. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2910-2915. IEEE.
Kunze, L.; Doreswamy, K. K.; and Hawes, N. 2014. Using qualitative spatial relations for indirect object search. In 2014 IEEE International Conference on Robotics and Automation (ICRA), 163-168. IEEE.
Maver, J., and Bajcsy, R. 1993. Occlusions as a guide for planning the next view. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(5):417-433.
Okamoto, J.; Milanova, M.; and Bueker, U. 1998. Active perception system for recognition of 3D objects in image sequences. In Advanced Motion Control, 1998 (AMC '98-Coimbra), 5th International Workshop on, 700-705.
Prankl, J.; Aldoma, A.; Svejda, A.; and Vincze, M. 2015. RGB-D object modelling for object recognition and tracking. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, 96-103. IEEE.
Roberts, D., and Marshall, A. D. 1998. Viewpoint selection for complete surface coverage of three dimensional objects. In BMVC, 1-11.
Stampfer, D.; Lutz, M.; and Schlegel, C. 2012. Information driven sensor placement for robust active object recognition based on multiple views. In 2012 IEEE International Conference on Technologies for Practical Robot Applications (TePRA), 133-138. IEEE.
Vasquez-Gomez, J. I.; Sucar, L. E.; and Murrieta-Cid, R. 2014. View planning for 3D object reconstruction with a mobile manipulator robot. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 4227-4233.
Vázquez, P.-P.; Feixas, M.; Sbert, M.; and Heidrich, W. 2001. Viewpoint selection using viewpoint entropy. In VMV, volume 1, 273-280.
Wixson, L. 1994. Viewpoint selection for visual search. In Computer Vision and Pattern Recognition, 1994 (CVPR '94), IEEE Computer Society Conference on, 800-805.