Back-Projective Priming: Toward Efficient 3d Model-based Object Recognition via Preemptive Top-down Constraints

Ryan Dellana
Department of Computer Science
East Carolina University
dellanar04@students.ecu.edu

Abstract

This paper describes a novel framework for context-based object recognition and pose estimation. High-level geometric constraints are used to optimize fitting of a 3d model to a 2d image through a process termed "back-projective priming". A practical problem in robotics, electrical outlet discovery, is used for testing. The robot, experimental setup, and ongoing/future work are described.

Introduction

Robust object recognition is one of the central goals of the discipline of Computer Vision. Yet object recognition is an AI-complete problem, its solution ultimately depending on solving the broader problems of general perception, independent of any particular modality. To this end, much can be learned from work in the areas that comprise Cognitive Science. However, most computer vision research is conducted solely within the domain of Computer Science and seeks to produce targeted solutions to specific problems. Consistent with the software engineering best practice of creating modules with high cohesion and low coupling, these solutions focus on the intrinsic features of objects and rarely take advantage of context.

Taking general inspiration from Gestalt psychology, neuroanatomical findings, and the success of hierarchical/deep machine learning approaches, I seek to explore object recognition frameworks that utilize context for improved efficiency and accuracy. Mobile robotics provides an excellent test-bed for such frameworks given the abundance of diverse contextual information to draw from. "Electrical outlet discovery" was chosen as a specific, well-studied problem within mobile robotics to make it easier to benchmark the performance of my system. An experimental setup consisting of a mobile robot with a "plug arm" and a collection of interchangeable prop-walls and outlets was constructed for validation. Development and testing of the system are currently in progress.

Problem Scenario

A mobile robot in an unknown building must recharge itself by locating an electrical outlet and recovering the pose of said outlet with sufficient accuracy to guide its arm to plug in. Sensors consist of joint/wheel encoders and a single on-board camera. There are no laser scanners or other depth sensors.

What the robot knows a priori is essentially a subset of North American building code. This provides it with a general set of constraints without a detailed map of the building or expectations about outlet visual characteristics such as specific configuration, shape, or color. The lack of depth sensors requires the robot to actively perceive the 3d structure of the world from the stream of 2d images produced by the single camera, a perceptual task known to be easy for a human teleoperating the robot.

Related Work

Several notable electrical-outlet-seeking robots have been developed since 2000. Most recently, the systems of (Meeussen et al. 2010) and (Eruhimov et al. 2011) were developed at Willow Garage using the PR2 robotics platform. (Meeussen et al. 2010) uses stereo vision to identify outlet candidates on a texture-less wall, followed by perspective rectification of candidates using the wall pose obtained from a laser range-finder (lidar). To identify outlets, template matching is used on the candidates in the rectified image. Pose estimation is accomplished by using color tracking to find the centers of four orange sockets and then applying PnP solve.
(Eruhimov et al. 2011) is notable for the sub-millimeter accuracy of its pose recovery, but it also requires the wall pose obtained from a lidar, in addition to sufficient contrast between the outlet holes and socket. To get within the general region of an outlet, both systems utilize a map of the building pre-annotated with approximate positions of outlets. The use of a map and depth sensors means that neither of these systems addresses the problem scenario of the previous section.

The systems described in (Torres-Jara 2002) and (Bustamante and Gu 2007) both wander around without a map and so perform actual outlet discovery as opposed to mere pose recovery. (Torres-Jara 2002) uses a Viola-Jones cascade detector trained on a manually-labeled dataset of 846 positive and 1400 negative instances. (Bustamante and Gu 2007) scans along walls with the aid of a lidar used in conjunction with a zoom camera to maintain a consistent field-of-view, which enables it to use a single fixed-size socket template for pattern matching. Both systems are thwarted by perspective distortion of more than 30 degrees relative to the frontal view, as well as by partial occlusion and deviation from the training set/template. It should also be noted that neither system integrates outlet discovery with pose recovery, instead achieving the latter with separate custom-tailored algorithms. The use of a depth sensor and specialized camera places (Bustamante and Gu 2007) outside of our problem scenario. All things considered, the problem solved by (Torres-Jara 2002) is the most similar to our own.

Experimental Setup

The robot (Fig. 1) consists of a differential drive platform for mobility, an elevator to adjust the height of the plug, and a pivoting arm to control the pitch angle of the plug. When eventually completed, the arm will include a gripper assembly and provide general pick-and-place capabilities; this is why it features the otherwise unnecessary pitch control. Its senses include monocular vision and basic proprioception provided by a collection of encoders, limit switches, and a potentiometer. The plug is mounted directly to the end of the arm, in view of the single arm-mounted camera.

Figure 1: The Robot ("Marvin")

Marvin was constructed from used power-wheelchair parts plus various odds-and-ends obtained from local hardware stores. Its main electrical components include two 12V, 31Ah SLA gel cell batteries wired in parallel, 1 Deltran 12V/5A smart charger, 3 Dimension Engineering dual-channel Sabertooth motor drivers, 1 Arduino Uno, 1 Arduino Mega, 3 Fairchild photo-reflectors, 6 Maxbotix EZ0 ultrasonic range-finders, and 1 Freescale 3-axis accelerometer. Note that the range-finders are for safety only and do not supply any depth information to the vision system. The camera is a 720p Microsoft LifeCam Cinema with adjustable focal length (kept fixed at 980mm).

The on-board computer is a System 76 laptop running Ubuntu 12.04 with 8GB of DDR3 and a 2.5GHz i7 CPU with 8 logical cores. The Robot Operating System (ROS) framework is used for concurrency and inter-process communication. The system is self-contained, with all processes running on the laptop. ROS tools/packages used include the OpenCV computer vision library, the TF coordinate frame transform library (Foote 2013), and RViz for 3d visualization. TF serves a central role in keeping track of the robot's kinematic chain as well as the poses of external objects. The kinematic model of the robot was built using the Unified Robot Description Format (URDF). ROS's robot_state_publisher package automatically translates changes in joint position into changes in the URDF kinematic tree in TF (Fig. 2). This enables us to query TF for useful information such as the pose of the camera relative to any other feature of the 3d world that happens to be bound to a TF coordinate frame.

Figure 2: URDF model of Marvin rendered in RViz

A relative pose between two TF frames is referred to as a transform. Transforms in TF have a translational and a rotational component. The translational component is represented as X=forward, Y=left, Z=up. Instead of TF's native representation of rotation as quaternions, I use Euler angles with yaw/pitch/roll about ZYX respectively.
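As a minimal illustrative sketch (not the system's actual code) of such a TF query in Python under ROS, assuming placeholder frame names 'camera' and 'outlet_0':

    import rospy
    import tf
    from tf.transformations import euler_from_quaternion

    rospy.init_node('tf_query_example')
    listener = tf.TransformListener()
    listener.waitForTransform('camera', 'outlet_0',
                              rospy.Time(0), rospy.Duration(4.0))

    # Translation is (x, y, z) with X=forward, Y=left, Z=up;
    # rotation is returned as a quaternion (x, y, z, w).
    (trans, quat) = listener.lookupTransform('camera', 'outlet_0',
                                             rospy.Time(0))

    # Convert to yaw/pitch/roll about Z, Y, X respectively
    # ('rzyx' = rotating/intrinsic ZYX axis sequence).
    yaw, pitch, roll = euler_from_quaternion(quat, axes='rzyx')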
To represent the 3d position of a detected external feature such as an electrical outlet, a TF frame for the outlet is spawned relative to the camera frame (Fig. 2). TF can then be queried for the pose of said outlet relative to other important frames such as that of the differential drive base, or the plug. We can also use TF to spawn "hypothetical frames" and subsequently get their poses relative to the camera for use in back-projecting hypotheses (more about that later).

In order to capture a reasonable amount of variability in attributes such as wall texture, outlet appearance, and lighting, a prop-wall was constructed with interchangeable parts (top of Fig. 1). Ground truth for outlet pose in each image frame will be calculated by tracking a set of colored markers placed at specific spots on the prop-wall. Preprocessing will remove the markers from the image so the robot can't use them to cheat during test runs.

Drawing from Cognitive Science

There is strong experimental evidence that people recognize objects with greater speed and accuracy when the objects occur within the expected context (Auckland et al. 2007). This is also supported by introspection. When one attempts to actively locate an outlet, the mind's eye is flooded with associations, including visual/spatial memories of past detections, but also things only indirectly related to outlets such as structural components of a typical building, plugs, appliances, extension cords, and maybe forks. This can be taken as subjective evidence of priming, not just for outlets, but for contextually related items. Note also the search pattern one uses, which consists of first locating a wall and then scanning across the section of it about a foot above the floor where outlets typically occur. These suggest an active, top-down, context-driven mode of perception.

The notion of top-down processing is certainly not new, being especially prominent in the unified "whole is greater than the sum of its parts" view of perception put forth in Gestalt psychology. Take, for example, illusory contours such as those in Kanizsa's Triangle. Since most would consider line detection to necessarily precede triangle detection, illusory contours suggest that higher-level pattern recognizers exert top-down influence on lower-level modules. (Murray et al. 2002) believe this effect is due to feedback modulation of areas V2 and V1 from "higher-tier lateral-occipital areas, where illusory contour sensitivity first occurs." Indeed, there is a growing body of evidence and general consensus among neuroscientists for the importance of top-down feedback connections in the human visual system (Gilbert and Li 2013). There have even been machine learning algorithms directly inspired by the wiring diagram of the cerebral cortex, for instance Jeff Hawkins' Hierarchical Temporal Memory (Hawkins and Blakeslee 2007).

While deep neural networks and other non-symbolic hierarchical learning systems show great promise (Cadieu et al. 2014), the downside is that it isn't explicitly obvious what features or rules they use. This black-box effect makes it difficult to integrate them with other systems, presenting a barrier to synergy. However, it's relatively easy to go the other direction, taking an existing symbolic system and augmenting it with non-symbolic machine learning. For this reason, I choose to first see how far I can get with an explicit constraint-based approach to modeling context.

Back-Projective Priming

It can be useful to view a building as a hierarchy of 3d structural features. At the top of the hierarchy is the building as a whole, which can be decomposed into the floor, ceiling, and walls. Walls, in turn, may contain other features such as doors, windows, baseboards, light switches, phone jacks, and electrical outlets, which themselves can be broken down further.

Some features, such as the outlet cover, are easily described by a static 3d model. Others, like walls, have some invariant attributes (e.g., planar, rectangular, spanning floor to ceiling) but do not have a fixed 3d structure, instead being defined by a set of structural constraints yielding the space of possible 3d configurations of a wall. Perhaps this implies the concept of "wall" should be discarded in favor of a set of features that can each be assigned a fixed 3d model. Solving these subtle ontological problems is relegated to future work. For now we will look at an easily defined subset of building features to demonstrate the general concept.

The important point is that, given knowledge about the relative locations of some of these map features, we can constrain the space of possibilities in the search for other features. Take, for instance, the constraint that any wall should have a pitch angle exactly 90 degrees greater than that of the floor, and that an outlet, in turn, will have exactly the same rotational vector as the wall it's in. Since a wall will always have a fixed roll value of 0, both the pitch and roll values of any potential outlet are known a priori. The z coordinate of the outlet is expected to be 12 inches above the floor plane, so that, overall, there are only three variable components of pose for any outlet: x, y, and yaw. If, however, we've already found a wall, then, relative to the wall's coordinate frame, the outlet can only vary in terms of the y component of translation. Finding a wall dramatically shrinks the space of possible outlet poses, and, contrariwise, finding an outlet automatically indicates the presence/pose of a wall.
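To make the constraint bookkeeping concrete, here is a hedged Python sketch (names, units, and sampling ranges are illustrative assumptions, not the system's code) of how the outlet pose space collapses as constraints bind:

    import itertools

    OUTLET_Z = 0.3048  # 12 inches above the floor plane, in meters
    PITCH = 0.0        # known a priori from the wall constraint
                       # (numeric value depends on frame convention)
    ROLL = 0.0         # fixed: walls always have roll 0

    def outlet_candidates(xs, ys, yaws):
        # With no wall known, only x, y, and yaw remain unbound.
        for x, y, yaw in itertools.product(xs, ys, yaws):
            yield (x, y, OUTLET_Z, yaw, PITCH, ROLL)

    def outlet_candidates_on_wall(ys):
        # Once a wall is found, the outlet pose expressed in the
        # wall's frame varies only in the y component of translation.
        for y in ys:
            yield (0.0, y, OUTLET_Z, 0.0, 0.0, 0.0)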
When searching for objects in isolation, each new object added to the database reduces speed and accuracy in the search for any one object. But when the constraints between objects are also modeled, a larger object database actually increases speed and accuracy.

Modeling the 3d structural constraints within a building is one problem, while establishing correspondences between a given 3d configuration and a 2d image is another. Back-projective priming is a technique that works at the interface between these two problems. Given a 3d model for a target object, plus constraints on its pose space, we can generate a representative sample of its possible poses and back-project them to 2d. The rendered back-projections are then run through a set of 2d image feature extraction algorithms. The resulting collection of 2d model-pose-feature correspondences forms an "expectations map". The process of building an expectations map is referred to as "priming." The same set of feature extraction algorithms is then run on the input image, and the expectations map is used to guide model fitting. Pose estimation of the detections is refined through additional back-projective iterations with finer granularity.

Implementation

For the outlet-detection problem, a very minimalistic 3d model of the outlet cover is used: a simple rectangle consisting of four points and four edges. Initially, no wall poses are known, which produces a space of possible outlet poses based on different combinations of the unbound variables x, y, and yaw. A sufficiently large sample of this pose space is required to capture the variation in 2d features produced by different combinations of position, orientation, and scale of the outlet. The requisite sample size is very large, easily requiring hundreds of back-projection operations, a computational cost outweighing the gains in model-fitting efficiency. We could, of course, do this computation only once and cache the result. However, if any of the geometric constraints were to change, such as camera z, an entirely new set of back-projections would need to be computed.

In order to avoid the large overhead of this "strong" priming, we can select a subset of the pose space that captures variability in perspective while neglecting scale and position. Five translational vectors (Fig. 3) are selected to adequately sample the effects of translation on perspective. At each of these translations, 7 values of yaw are sampled, producing a manageable total of 35 back-projections. One caveat is that, in order for matching to work, the 2d features used must be scale-invariant. For polyhedra, oriented edges work well, given that their midpoint and orientation components are scale-invariant, yet scale can still be recovered from their length component.

Figure 3: Back-Projections Produced for "Weak" Priming

The 35 poses are used to build an expectations map (exp-map) as follows:

    For each of the 35 poses:
        Render pose.
        Extract features from render.
        Store model render points and features into a
            model-pose-feature-binder object.
        Add binder object to exp-map:
            For each feature in binder:
                Add feature to exp-map feature-index.
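A minimal Python sketch of this build loop follows; render_pose and extract_features stand in for the back-projection and feature-extraction steps, and the class names are assumptions about interfaces rather than the system's actual API:

    class Binder(object):
        # Ties one sampled pose to its rendered model points and 2d features.
        def __init__(self, pose, model_points_2d, features):
            self.pose = pose
            self.model_points_2d = model_points_2d
            self.features = features

    class ExpectationsMap(object):
        def __init__(self):
            self.binders = []
            self.feature_index = []  # flat list; a k-d tree could replace it

        def add(self, binder):
            self.binders.append(binder)
            for feature in binder.features:
                # Each index entry points back at its binder, so a feature
                # match can recover the full model-pose-feature correspondence.
                self.feature_index.append((feature, binder))

    def build_exp_map(poses, render_pose, extract_features):
        exp_map = ExpectationsMap()
        for pose in poses:  # e.g. the 35 weak-priming poses
            render, model_points_2d = render_pose(pose)
            features = extract_features(render)  # must be scale-invariant
            exp_map.add(Binder(pose, model_points_2d, features))
        return exp_map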
Once the expectations map has been built, we attempt to find the target model in the input image as follows:

    For each feature extracted from input image:
        Get k best matches from exp-map feature-index.
        For each match above a certain confidence threshold:
            Retrieve model-pose-feature-binder that matching
                feature belongs to.
            Create a copy of the binder, denoted b2.
            Scale b2's model points and feature points to match
                the scale of the matching input image feature.
            Translate b2's points so the pair of matching
                features overlap (Fig. 4).
            After translating the binder, calculate the
                feature-space distance between the other
                feature-points of the binder and their nearest
                neighbor in the input image. These distances are
                aggregated to produce a composite score
                determining the overall strength of the
                hypothesis.
            If the hypothesis score is above the required
                confidence threshold, then apply PnP solve to
                the transformed 2d model points to recover the
                3d pose of b2, and add it to the hypothesis
                collection.
    For each hypothesis returned:
        Validate pose based on geometric constraints (Fig. 5).
        Discard hypothesis if it deviates by more than tolerance.

Figure 4: Model Fitting

All remaining hypotheses are considered detections. For any detection, a process of iterative refinement can be applied to improve pose-estimation accuracy.
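As a hedged sketch of the core hypothesis test — aligning a binder copy to one matched feature, scoring it, and recovering 3d pose with OpenCV's PnP solver — assuming oriented-edge features that carry midpoint, orientation, and length components (the data layout and scoring are illustrative assumptions):

    import numpy as np
    import cv2

    def align_binder(binder, img_feat, matched_feat):
        # Scale and translate copies of the binder's points so the
        # matched model feature overlaps the input-image feature.
        s = img_feat.length / matched_feat.length
        offset = (np.asarray(img_feat.midpoint)
                  - s * np.asarray(matched_feat.midpoint))
        points_2d = s * np.asarray(binder.model_points_2d) + offset
        midpoints = [s * np.asarray(f.midpoint) + offset
                     for f in binder.features]
        return points_2d, midpoints

    def hypothesis_score(aligned_midpoints, image_feats):
        # Aggregate nearest-neighbor feature distances into one
        # composite score; lower means a stronger hypothesis.
        dists = [min(np.linalg.norm(m - np.asarray(f.midpoint))
                     for f in image_feats)
                 for m in aligned_midpoints]
        return sum(dists) / len(dists)

    def recover_pose(model_points_3d, points_2d, camera_matrix):
        # The four corners of the outlet-cover rectangle suffice
        # for solvePnP, which returns rotation and translation
        # vectors of the model in the camera frame.
        ok, rvec, tvec = cv2.solvePnP(
            np.asarray(model_points_3d, dtype=np.float32),
            np.asarray(points_2d, dtype=np.float32),
            camera_matrix, None)
        return (rvec, tvec) if ok else None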
Preliminary Results

The robot is mechanically and electrically complete and has successfully plugged itself in under teleoperation. Outlet pose recovery (Fig. 5) and geometric constraint post-validation with TF have been demonstrated using a color-coded outlet.

Figure 5: Tracking and Pose Constraint Post-validation

Future Work

- Complete implementation and testing of "weak" priming/model-fitting.
- Make use of constraint programming in generating pose-space samples.
- Explore the feasibility of "strong" priming.
- Find a faster alternative to TF for generating and calculating the relative poses of hypothetical frames.
- Add OpenGL integration allowing use of more detailed CAD models for back-projection rendering.
- Model light switches, phone jacks, walls, and other context.

Acknowledgements

I'd like to thank my thesis advisor Dr. Ronnie Smith for his encouragement and guidance. Financial support for this work has been provided by East Carolina University and the Department of Computer Science.

References

Auckland, M. E., Cave, K. R., & Donnelly, N. (2007). Nontarget objects can influence perceptual processes during object recognition. Psychonomic Bulletin & Review, 14(2), 332-337.

Bustamante, L., & Gu, J. (2007, April). Localization of electrical outlet for a mobile robot using visual servoing. In Electrical and Computer Engineering, 2007. CCECE 2007. Canadian Conference on (pp. 1211-1214). IEEE.

Cadieu, C. F., Hong, H., Yamins, D. L., Pinto, N., Ardila, D., Solomon, E. A., ... & DiCarlo, J. J. (2014). Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Computational Biology, 10(12), e1003963.

Eruhimov, V., & Meeussen, W. (2011, September). Outlet detection and pose estimation for robot continuous operation. In Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on (pp. 2941-2946). IEEE.

Foote, T. (2013, April). tf: The transform library. In Technologies for Practical Robot Applications (TePRA), 2013 IEEE International Conference on (pp. 1-6). IEEE.

Gilbert, C. D., & Li, W. (2013). Top-down influences on visual processing. Nature Reviews Neuroscience, 14(5), 350-363.

Hawkins, J., & Blakeslee, S. (2007). On Intelligence. Macmillan.

Meeussen, W., Wise, M., Glaser, S., Chitta, S., McGann, C., Mihelich, P., ... & Berger, E. (2010, May). Autonomous door opening and plugging in with a personal robot. In Robotics and Automation (ICRA), 2010 IEEE International Conference on (pp. 729-736). IEEE.

Murray, M. M., Wylie, G. R., Higgins, B. A., Javitt, D. C., Schroeder, C. E., & Foxe, J. J. (2002). The spatiotemporal dynamics of illusory contour processing: combined high-density electrical mapping, source analysis, and functional magnetic resonance imaging. The Journal of Neuroscience, 22(12), 5055-5073.

Torres-Jara, E. R. (2002). A self-feeding robot (Doctoral dissertation, Massachusetts Institute of Technology).