Back-Projective Priming: Toward Efficient 3d Model-based Object Recognition via Preemptive Top-down Constraints

Ryan Dellana
Department of Computer Science
East Carolina University
dellanar04@students.ecu.edu

Abstract

This paper describes a novel framework for context-based object recognition and pose estimation. High-level geometric constraints are used to optimize fitting of a 3d model to a 2d image through a process termed "back-projective priming". A practical problem in robotics, electrical outlet discovery, is used for testing. The robot, experimental setup, and ongoing/future work are described.

Introduction

Robust object recognition is one of the central goals of the discipline of Computer Vision. Yet object recognition is an AI-complete problem, its solution ultimately depending on solving the broader problems of general perception, independent of any particular modality. To this end, much can be learned from work in the areas that comprise Cognitive Science. However, most computer vision research is conducted solely within the domain of Computer Science and seeks to produce targeted solutions to specific problems. Consistent with the software engineering best practice of creating modules with high cohesion and low coupling, these solutions focus on the intrinsic features of objects and rarely take advantage of context.

Taking general inspiration from Gestalt psychology, neuroanatomical findings, and the success of hierarchical/deep machine learning approaches, I seek to explore object recognition frameworks that utilize context for improved efficiency and accuracy. Mobile robotics provides an excellent test-bed for such frameworks given the abundance of diverse contextual information to draw from. "Electrical outlet discovery" was chosen as a specific, well-studied problem within mobile robotics to make it easier to benchmark the performance of my system. An experimental setup consisting of a mobile robot with a "plug arm" and a collection of interchangeable prop-walls and outlets was constructed for validation. Development and testing of the system are currently in progress.

Problem Scenario

A mobile robot in an unknown building must recharge itself by locating an electrical outlet and recovering the pose of said outlet with sufficient accuracy to guide its arm to plug in. Sensors consist of joint/wheel encoders and a single on-board camera. There are no laser scanners or other depth sensors.

What the robot knows a priori is essentially a subset of North American building code. This provides it with a general set of constraints without a detailed map of the building or expectations about outlet visual characteristics such as specific configuration, shape, or color. The lack of depth sensors requires the robot to actively perceive the 3d structure of the world from the stream of 2d images produced by the single camera, a perceptual task known to be easy for a human teleoperating the robot.

Related Work

Several notable electrical-outlet-seeking robots have been developed since 2000. Most recently, the systems of (Meeussen et al. 2010) and (Eruhimov et al. 2011) were developed at Willow Garage using the PR2 robotics platform. (Meeussen et al. 2010) uses stereo vision to identify outlet candidates on a texture-less wall, followed by perspective rectification of candidates using the wall pose obtained from a laser range-finder (lidar). To identify outlets, template matching is used on the candidates in the rectified image. Pose estimation is accomplished by using color tracking to find the centers of four orange sockets and then applying PnP solve.
(Eruhimov et al. 2011) is notable for the sub-millimeter accuracy of its pose recovery, but it also requires the wall pose obtained from a lidar, in addition to sufficient contrast between the outlet holes and socket. To get within the general region of an outlet, both systems utilize a map of the building pre-annotated with approximate positions of outlets. The use of a map and depth sensors means that neither of these systems addresses the problem scenario of the previous section.

The systems described in (Torres-Jara 2002) and (Bustamante and Gu 2007) both wander around without a map and so perform actual outlet discovery as opposed to mere pose recovery. (Torres-Jara 2002) uses a Viola-Jones cascade detector trained on a manually-labeled dataset of 846 positive and 1400 negative instances. (Bustamante and Gu 2007) scans along walls with the aid of a lidar used in conjunction with a zoom camera to maintain a consistent field-of-view, which enables it to use a single fixed-size socket template for pattern matching. Both systems are thwarted by perspective distortion of more than 30 degrees relative to the frontal view, as well as by partial occlusion and deviation from the training set/template. It should also be noted that neither system integrates outlet discovery with pose recovery, instead achieving the latter with separate custom-tailored algorithms. The use of a depth sensor and specialized camera places (Bustamante and Gu 2007) outside of our problem scenario. All things considered, the problem solved by (Torres-Jara 2002) is the most similar to our own.

Experimental Setup

The robot (Fig. 1) consists of a differential drive platform for mobility, an elevator to adjust the height of the plug, and a pivoting arm to control the pitch angle of the plug. When eventually completed, the arm will include a gripper assembly and provide general pick-and-place capabilities; this is why it features the otherwise unnecessary pitch control. Its senses include monocular vision and basic proprioception provided by a collection of encoders, limit switches, and a potentiometer. The plug is mounted directly to the end of the arm, in view of the single arm-mounted camera.

Figure 1: The Robot ("Marvin")

Marvin was constructed from used power-wheelchair parts plus various odds-and-ends obtained from local hardware stores. Its main electrical components include two 12V, 31Ah SLA gel cell batteries wired in parallel, 1 Deltran 12V/5A smart charger, 3 Dimension Engineering dual-channel Sabertooth motor drivers, 1 Arduino Uno, 1 Arduino Mega, 3 Fairchild photo-reflectors, 6 Maxbotix EZ0 ultrasonic range-finders, and 1 Freescale 3-axis accelerometer. Note that the range-finders are for safety only and do not supply any depth information to the vision system. The camera is a 720p Microsoft LifeCam Cinema with adjustable focal length (kept fixed at 980mm).

The on-board computer is a System 76 laptop running Ubuntu 12.04 with 8GB of DDR3 and a 2.5GHz i7 CPU with 8 logical cores. The Robot Operating System (ROS) framework is used for concurrency and inter-process communication. The system is self-contained, with all processes running on the laptop. ROS tools/packages used include the OpenCV computer vision library, the TF coordinate frame transform library (Foote 2013), and RViz for 3d visualization. TF serves a central role in keeping track of the robot's kinematic chain as well as the poses of external objects. The kinematic model of the robot was built using the Unified Robot Description Format (URDF). ROS's robot_state_publisher package automatically translates changes in joint position into changes in the URDF kinematic tree in TF (Fig. 2). This enables us to query TF for useful information such as the pose of the camera relative to any other feature of the 3d world that happens to be bound to a TF coordinate frame.

Figure 2: URDF model of Marvin rendered in RViz

A relative pose between two TF frames is referred to as a transform. Transforms in TF have a translational and a rotational component. The translational component is represented as X=forward, Y=left, Z=up. Instead of TF's native representation of rotation as quaternions, I use Euler angles with yaw/pitch/roll about ZYX respectively.
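As a minimal illustrative sketch (not the system's actual code) of such a TF query in Python under ROS, assuming placeholder frame names 'camera' and 'outlet_0':

    import rospy
    import tf
    from tf.transformations import euler_from_quaternion

    rospy.init_node('tf_query_example')
    listener = tf.TransformListener()
    listener.waitForTransform('camera', 'outlet_0',
                              rospy.Time(0), rospy.Duration(4.0))

    # Translation is (x, y, z) with X=forward, Y=left, Z=up;
    # rotation is returned as a quaternion (x, y, z, w).
    (trans, quat) = listener.lookupTransform('camera', 'outlet_0',
                                             rospy.Time(0))

    # Convert to yaw/pitch/roll about Z, Y, X respectively
    # ('rzyx' = rotating/intrinsic ZYX axis sequence).
    yaw, pitch, roll = euler_from_quaternion(quat, axes='rzyx')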
To represent the 3d position of a detected external feature such as an electrical outlet, a TF frame for the outlet is spawned relative to the camera frame (Fig. 2). TF can then be queried for the pose of said outlet relative to other important frames such as that of the differential drive base, or the plug. We can also use TF to spawn "hypothetical frames" and subsequently get their poses relative to the camera for use in back-projecting hypotheses (more about that later).

In order to capture a reasonable amount of variability in attributes such as wall texture, outlet appearance, and lighting, a prop-wall was constructed with interchangeable parts (top of Fig. 1). Ground truth for outlet pose in each image frame will be calculated by tracking a set of colored markers placed at specific spots on the prop-wall. Preprocessing will remove the markers from the image so the robot can't use them to cheat during test runs.

Drawing from Cognitive Science

There is strong experimental evidence that people recognize objects with greater speed and accuracy when the objects occur within the expected context (Auckland et al. 2007). This is also supported by introspection. When one attempts to actively locate an outlet, the mind's eye is flooded with associations, including visual/spatial memories of past detections, but also things only indirectly related to outlets such as structural components of a typical building, plugs, appliances, extension cords, and maybe forks. This can be taken as subjective evidence of priming, not just for outlets, but for contextually related items. Note also the search pattern one uses, which consists of first locating a wall and then scanning across the section of it about a foot above the floor where outlets typically occur. These suggest an active, top-down, context-driven mode of perception.

The notion of top-down processing is certainly not new, being especially prominent in the unified "whole is greater than the sum of its parts" view of perception put forth in Gestalt psychology. Take, for example, illusory contours such as those in Kanizsa's Triangle. Since most would consider line detection to necessarily precede triangle detection, illusory contours suggest that higher-level pattern recognizers exert top-down influence on lower-level modules. (Murray et al. 2002) believe this effect is due to feedback modulation of areas V2 and V1 from "higher-tier lateral-occipital areas, where illusory contour sensitivity first occurs." Indeed, there is a growing body of evidence and general consensus among neuroscientists for the importance of top-down feedback connections in the human visual system (Gilbert and Li 2013). There have even been machine learning algorithms directly inspired by the wiring diagram of the cerebral cortex, for instance Jeff Hawkins' Hierarchical Temporal Memory (Hawkins and Blakeslee 2007).

While deep neural networks and other non-symbolic hierarchical learning systems show great promise (Cadieu et al. 2014), the downside is that it isn't explicitly obvious what features or rules they use. This black-box effect makes it difficult to integrate them with other systems, presenting a barrier to synergy. However, it's relatively easy to go the other direction, taking an existing symbolic system and augmenting it with non-symbolic machine learning. For this reason, I choose to first see how far I can get with an explicit constraint-based approach to modeling context.

Back-Projective Priming

It can be useful to view a building as a hierarchy of 3d structural features. At the top of the hierarchy is the building as a whole, which can be decomposed into the floor, ceiling, and walls. Walls, in turn, may contain other features such as doors, windows, baseboards, light switches, phone jacks, and electrical outlets, which themselves can be broken down further.

Some features, such as the outlet cover, are easily described by a static 3d model. Others, like walls, have some invariant attributes (e.g., planar, rectangular, spanning floor to ceiling) but do not have a fixed 3d structure, instead being defined by a set of structural constraints yielding the space of possible 3d configurations of a wall. Perhaps this implies the concept of "wall" should be discarded in favor of a set of features that can each be assigned a fixed 3d model. Solving these subtle ontological problems is relegated to future work. For now we will look at an easily defined subset of building features to demonstrate the general concept.

The important point is that, given knowledge about the relative locations of some of these map features, we can constrain the space of possibilities in the search for other features. Take, for instance, the constraint that any wall should have a pitch angle exactly 90 degrees greater than that of the floor, and that an outlet, in turn, will have exactly the same rotational vector as the wall it's in. Since a wall will always have a fixed roll value of 0, both the pitch and roll values of any potential outlet are known a priori. The z coordinate of the outlet is expected to be 12 inches above the floor plane, so that, overall, there are only three variable components of pose for any outlet: x, y, and yaw. If, however, we've already found a wall, then, relative to the wall's coordinate frame, the outlet can only vary in terms of the y component of translation. Finding a wall dramatically shrinks the space of possible outlet poses, and, contrariwise, finding an outlet automatically indicates the presence/pose of a wall.
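To make the constraint bookkeeping concrete, here is a hedged Python sketch (names, units, and sampling ranges are illustrative assumptions, not the system's code) of how the outlet pose space collapses as constraints bind:

    import itertools

    OUTLET_Z = 0.3048  # 12 inches above the floor plane, in meters
    PITCH = 0.0        # known a priori from the wall constraint
                       # (numeric value depends on frame convention)
    ROLL = 0.0         # fixed: walls always have roll 0

    def outlet_candidates(xs, ys, yaws):
        # With no wall known, only x, y, and yaw remain unbound.
        for x, y, yaw in itertools.product(xs, ys, yaws):
            yield (x, y, OUTLET_Z, yaw, PITCH, ROLL)

    def outlet_candidates_on_wall(ys):
        # Once a wall is found, the outlet pose expressed in the
        # wall's frame varies only in the y component of translation.
        for y in ys:
            yield (0.0, y, OUTLET_Z, 0.0, 0.0, 0.0)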
When searching for objects in isolation, each new object added to the database reduces speed and accuracy in the search for any one object. But when the constraints between objects are also modeled, a larger object database actually increases speed and accuracy.

Modeling the 3d structural constraints within a building is one problem, while establishing correspondences between a given 3d configuration and a 2d image is another. Back-projective priming is a technique that works at the interface between these two problems. Given a 3d model for a target object, plus constraints on its pose space, we can generate a representative sample of its possible poses and back-project them to 2d. The rendered back-projections are then run through a set of 2d image feature extraction algorithms. The resulting collection of 2d model-pose-feature correspondences forms an "expectations map". The process of building an expectations map is referred to as "priming." The same set of feature extraction algorithms is then run on the input image, and the expectations map is used to guide model fitting. Pose estimation of the detections is refined through additional back-projective iterations with finer granularity.

Implementation

For the outlet-detection problem, a very minimalistic 3d model of the outlet cover is used: a simple rectangle consisting of four points and four edges. Initially, no wall poses are known, which produces a space of possible outlet poses based on different combinations of the unbound variables x, y, and yaw. A sufficiently large sample of this pose space is required to capture the variation in 2d features produced by different combinations of position, orientation, and scale of the outlet. The requisite sample size is very large, easily requiring hundreds of back-projection operations, a computational cost outweighing the gains in model-fitting efficiency. We could, of course, do this computation only once and cache the result. However, if any of the geometric constraints were to change, such as camera z, an entirely new set of back-projections would need to be computed.

In order to avoid the large overhead of this "strong" priming, we can select a subset of the pose space that captures variability in perspective while neglecting scale and position. Five translational vectors (Fig. 3) are selected to adequately sample the effects of translation on perspective. At each of these translations, 7 values of yaw are sampled, producing a manageable total of 35 back-projections. One caveat is that, in order for matching to work, the 2d features used must be scale-invariant. For polyhedra, oriented edges work well, given that their midpoint and orientation components are scale-invariant, yet scale can still be recovered from their length component.

Figure 3: Back-Projections Produced for "Weak" Priming

The 35 poses are used to build an expectations map (exp-map) as follows:

    For each of the 35 poses:
        Render pose.
        Extract features from render.
        Store model render points and features into a
            model-pose-feature-binder object.
        Add binder object to exp-map:
            For each feature in binder:
                Add feature to exp-map feature-index.
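A minimal Python sketch of this build loop follows; render_pose and extract_features stand in for the back-projection and feature-extraction steps, and the class names are assumptions about interfaces rather than the system's actual API:

    class Binder(object):
        # Ties one sampled pose to its rendered model points and 2d features.
        def __init__(self, pose, model_points_2d, features):
            self.pose = pose
            self.model_points_2d = model_points_2d
            self.features = features

    class ExpectationsMap(object):
        def __init__(self):
            self.binders = []
            self.feature_index = []  # flat list; a k-d tree could replace it

        def add(self, binder):
            self.binders.append(binder)
            for feature in binder.features:
                # Each index entry points back at its binder, so a feature
                # match can recover the full model-pose-feature correspondence.
                self.feature_index.append((feature, binder))

    def build_exp_map(poses, render_pose, extract_features):
        exp_map = ExpectationsMap()
        for pose in poses:  # e.g. the 35 weak-priming poses
            render, model_points_2d = render_pose(pose)
            features = extract_features(render)  # must be scale-invariant
            exp_map.add(Binder(pose, model_points_2d, features))
        return exp_map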
Once the expectations map has been built, we attempt to find the target model in the input image as follows:

    For each feature extracted from input image:
        Get k best matches from exp-map feature-index.
        For each match above a certain confidence threshold:
            Retrieve model-pose-feature-binder that matching
                feature belongs to.
            Create a copy of the binder, denoted b2.
            Scale b2's model points and feature points to match
                the scale of the matching input image feature.
            Translate b2's points so the pair of matching
                features overlap (Fig. 4).
            After translating the binder, calculate the
                feature-space distance between the other
                feature-points of the binder and their nearest
                neighbor in the input image. These distances are
                aggregated to produce a composite score
                determining the overall strength of the
                hypothesis.
            If the hypothesis score is above the required
                confidence threshold, then apply PnP solve to
                the transformed 2d model points to recover the
                3d pose of b2, and add it to the hypothesis
                collection.
    For each hypothesis returned:
        Validate pose based on geometric constraints (Fig. 5).
        Discard hypothesis if it deviates by more than tolerance.

Figure 4: Model Fitting

All remaining hypotheses are considered detections. For any detection, a process of iterative refinement can be applied to improve pose-estimation accuracy.
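As a hedged sketch of the core hypothesis test — aligning a binder copy to one matched feature, scoring it, and recovering 3d pose with OpenCV's PnP solver — assuming oriented-edge features that carry midpoint, orientation, and length components (the data layout and scoring are illustrative assumptions):

    import numpy as np
    import cv2

    def align_binder(binder, img_feat, matched_feat):
        # Scale and translate copies of the binder's points so the
        # matched model feature overlaps the input-image feature.
        s = img_feat.length / matched_feat.length
        offset = (np.asarray(img_feat.midpoint)
                  - s * np.asarray(matched_feat.midpoint))
        points_2d = s * np.asarray(binder.model_points_2d) + offset
        midpoints = [s * np.asarray(f.midpoint) + offset
                     for f in binder.features]
        return points_2d, midpoints

    def hypothesis_score(aligned_midpoints, image_feats):
        # Aggregate nearest-neighbor feature distances into one
        # composite score; lower means a stronger hypothesis.
        dists = [min(np.linalg.norm(m - np.asarray(f.midpoint))
                     for f in image_feats)
                 for m in aligned_midpoints]
        return sum(dists) / len(dists)

    def recover_pose(model_points_3d, points_2d, camera_matrix):
        # The four corners of the outlet-cover rectangle suffice
        # for solvePnP, which returns rotation and translation
        # vectors of the model in the camera frame.
        ok, rvec, tvec = cv2.solvePnP(
            np.asarray(model_points_3d, dtype=np.float32),
            np.asarray(points_2d, dtype=np.float32),
            camera_matrix, None)
        return (rvec, tvec) if ok else None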
Preliminary Results

The robot is mechanically and electrically complete and has successfully plugged itself in under teleoperation. Outlet pose recovery (Fig. 5) and geometric constraint post-validation with TF have been demonstrated using a color-coded outlet.

Figure 5: Tracking and Pose Constraint Post-validation

Future Work

- Complete implementation and testing of "weak" priming/model-fitting.
- Make use of constraint programming in generating pose-space samples.
- Explore the feasibility of "strong" priming.
- Find a faster alternative to TF for generating and calculating the relative poses of hypothetical frames.
- Add OpenGL integration allowing use of more detailed CAD models for back-projection rendering.
- Model light switches, phone jacks, walls, and other context.

Acknowledgements

I'd like to thank my thesis advisor Dr. Ronnie Smith for his encouragement and guidance. Financial support for this work has been provided by East Carolina University and the Department of Computer Science.

References

Auckland, M. E., Cave, K. R., & Donnelly, N. (2007). Nontarget objects can influence perceptual processes during object recognition. Psychonomic Bulletin & Review, 14(2), 332-337.

Bustamante, L., & Gu, J. (2007, April). Localization of electrical outlet for a mobile robot using visual servoing. In Electrical and Computer Engineering, 2007. CCECE 2007. Canadian Conference on (pp. 1211-1214). IEEE.

Cadieu, C. F., Hong, H., Yamins, D. L., Pinto, N., Ardila, D., Solomon, E. A., ... & DiCarlo, J. J. (2014). Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Computational Biology, 10(12), e1003963.

Eruhimov, V., & Meeussen, W. (2011, September). Outlet detection and pose estimation for robot continuous operation. In Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on (pp. 2941-2946). IEEE.

Foote, T. (2013, April). tf: The transform library. In Technologies for Practical Robot Applications (TePRA), 2013 IEEE International Conference on (pp. 1-6). IEEE.

Gilbert, C. D., & Li, W. (2013). Top-down influences on visual processing. Nature Reviews Neuroscience, 14(5), 350-363.

Hawkins, J., & Blakeslee, S. (2007). On Intelligence. Macmillan.

Meeussen, W., Wise, M., Glaser, S., Chitta, S., McGann, C., Mihelich, P., ... & Berger, E. (2010, May). Autonomous door opening and plugging in with a personal robot. In Robotics and Automation (ICRA), 2010 IEEE International Conference on (pp. 729-736). IEEE.

Murray, M. M., Wylie, G. R., Higgins, B. A., Javitt, D. C., Schroeder, C. E., & Foxe, J. J. (2002). The spatiotemporal dynamics of illusory contour processing: combined high-density electrical mapping, source analysis, and functional magnetic resonance imaging. The Journal of Neuroscience, 22(12), 5055-5073.

Torres-Jara, E. R. (2002). A self-feeding robot (Doctoral dissertation, Massachusetts Institute of Technology).