=Paper=
{{Paper
|id=None
|storemode=property
|title=Building a Real-Time System for High-Level Perception
|pdfUrl=https://ceur-ws.org/Vol-818/paper3.pdf
|volume=Vol-818
}}
==Building a Real-Time System for High-Level Perception==
Building a Real-Time System for High-Level Person Recognition

Denver Dash, Long Quoc Tran
Intel Embedded Systems Science and Technology Center; Georgia Institute of Technology; Carnegie Mellon University, Pittsburgh, PA 15213
denver.h.dash@intel.com

===Abstract===

We describe a system that uses graphical models to perform real-time high-level perception. Our system uses Markov Logic Networks (MLNs) to relate entities in images via first-order logical sentences in order to perform real-time semi-supervised person recognition. The system combines a collection of "commodity-level" vision algorithms, such as the Viola-Jones face detector, histogram matching and even low-level pixel comparisons, together with logical relationships such as mutual exclusivity and entity confusion, plus a small number of labeled examples, into a Markov random field which can be solved to provide labels for the faces in the images. We describe the methodology for constructing the logical relations used by the system, and the (many) pitfalls we encountered despite the small number of relations used. We also discuss several future approaches to achieve interactive speeds for such a system, including bounding the size of the graph using temporal weighting of instances, approximating the structure of the graphical model, parallelizing graphical model inference, and low-level hardware acceleration.

===Introduction===

In this paper, we describe a real-time system (Figure 1) for recognition using a small labeled dataset plus first-order logic relations. The system assumes a constrained environment, i.e., one in which the same people generally recur and in which the instances that we want to classify are localized in time.

In recent years there has been an explosion of work on exploiting in-frame context for entity classification in images (Torralba, 2003; Kumar and Hebert, 2005; Torralba et al., 2005; Heitz and Koller, 2008; Heitz et al., 2008; Gupta and Davis, 2008; Rabinovich and Belongie, 2009; Gould et al., 2009). This work typically involves finding some useful relations for the specific domain at hand, e.g., "the sky is usually above the ground" (Gupta and Davis, 2008), building a customized conditional random field model over the entities in a frame, and jointly classifying each entity in an image given the observed pixel values. Despite these successes, at present few if any practical real-time systems exist that attempt to do high-level reasoning by integrating context at a high level. In this paper, we discuss our attempts at building such a system using Markov Logic Networks (MLNs) and by constructing a database of logical relationships that are useful for relating the entities to be identified.

This paper also makes the point that MLNs provide a uniform, intuitive and modular interface for performing high-level perception. More importantly, we show that MLNs can provide a new, more global sense of context that allows them to jointly classify an entire dataset of images (entities), using meaningful relations between these entities, in a manner similar to the collective classification of citation entries done by Singla and Domingos (2006). The image representation provides a wealth of relations that can be brought to bear on the problem, such as mutual exclusivity of multiple faces in an image, temporal and spatial stratification, and personal traits that may relate people to various objects or distinctive clothes. We thus expect that this application is even more suited to the use of a powerful tool like MLNs than the case of citation matching.

This use of MLNs for collective classification resembles graph-based semi-supervised learning (SSL) approaches (cf. Fergus et al., 2009), which relate entities across a corpus via a distance or similarity measure.
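To make this modular interface concrete, the relations developed later in this paper can be written compactly in Alchemy-style Markov logic syntax. The sketch below is schematic: the learned weights are omitted, and the exact predicate spellings in the deployed knowledge base may differ.

```
// Query predicate: Label(b, o) -- blob b has label o.
// All other predicates are evidence set by the vision front-end.

// Label propagation via face and clothing similarity (Eqs. 2-3)
SimilarFace(b1, b2) ^ Label(b1, o) => Label(b2, o)
SimilarTorso(b1, b2) ^ Label(b1, o) => Label(b2, o)

// Mutual exclusivity: a person occurs at most once per image (Eq. 4)
SameImage(b1, b2) => !Label(b1, o1) v !Label(b2, o2) v (o1 != o2) v (o1 == Unknown)

// Spatial and temporal label priors (one learned weight per + combination)
InTile(b, +tile) => Label(b, +o)
TimeOfDay(b, +time) => Label(b, +o)

// Observation model: labels produced by an external recognizer (Eq. 5)
ObservedLabel(b, +observedLabel) => Label(b, +o)
```

Each line is an independent soft constraint; adding or removing a relation is a one-line change, which is the modularity argument made above.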
However, compared to SSL approaches, MLNs provide a much richer way of connecting labeled and unlabeled instances, allowing one to combine multiple similarity metrics at the same time as well as to incorporate arbitrary logical relationships. In fact, we argue that MLNs can provide an approximate generalization of some of the standard SSL approaches, by discretizing a distance or similarity measure and incorporating it into the MLN. In addition, one can continue to exploit other relations that would not fit well within the SSL framework, such as contextual information that relates entities within frames. We show empirically that this approach yields favorable results for face recognition in images from three datasets collected by us, and that the use of the additional logical relations, which would be difficult in standard SSL, is crucial for the best classification accuracy.

A batch version of this system, along with empirical results, was described by Chechetka et al. (2010); here we focus more on the implementation details and the knowledge management of the system.

Figure 1: System Overview

===MLN Background===

Markov logic (Richardson and Domingos, 2006) is a probabilistic generalization of finite first-order logic. A Markov logic network (MLN) consists of a set of weighted first-order clauses. Given a set of constants, an MLN defines a Markov network with one binary variable for every ground atom and one potential for every possible grounding of every first-order clause. The joint probability distribution over the ground atom variables is defined as

    P(x) = \frac{1}{Z} \exp\left( \sum_f \sum_i w_f f(x_i) \right),    (1)

where f is an indicator function corresponding to a first-order clause (1 if that clause is true and 0 otherwise), w_f is the weight of that clause, and x_i is the set of ground atom variables in a particular grounding of that clause. The inner summation in (1) is over all possible groundings. Therefore, for every grounding of every first-order clause, the higher the weight for that clause, the more favored are the assignments to x where that grounding is true.

Two fundamental problems in Markov logic that apply to our application are weight learning, i.e., learning optimal weights for the known set of first-order clauses given the knowledge base of known ground atoms, and inference, i.e., finding the most likely assignment to the unknown ground atoms given the knowledge base. Even though both problems are intractable in general, well-performing approximate algorithms are available. For weight learning, we used preconditioned conjugate gradient with MC-SAT sampling as implemented in the Alchemy package (Kok et al., 2009). For inference, we used a high-performance implementation of residual belief propagation (Gonzalez et al., 2009) along with a lazy instantiation of the MLN structure, as recommended by Poon et al. (2008).

===Model Description===

In the existing literature, many very different types of features have been shown to be useful for face recognition (and object recognition more generally). In particular, SSL approaches (Fergus et al., 2009) exploit similarity in object appearances across different images to propagate label information from labeled to unlabeled blobs, and between unlabeled blobs. In a supervised setting, typically a low-dimensional representation of blob appearances is extracted (e.g., Turk and Pentland, 1991) and a standard technique such as a support vector machine (Vapnik, 1995) is then applied. Besides the blob appearance information, it has been shown that taking into account the context in which the blob appears, such as the blob's location within the frame or the labels of other objects in the scene, is crucial for accurate object recognition. In this section, we show that all of the above sources of information can be combined efficiently using a Markov logic network. Our approach thus combines the advantages of the diverse existing approaches to improve face recognition accuracy. In the MLN described below, we will use the query predicate Label(b, o), which is true if and only if blob b has label o. The evidence predicates will be introduced gradually, as they are needed for the MLN rules.

We assume that face detection has already been performed by some standard approach, such as that of Viola and Jones (2001). The input to our system thus consists of a set of images and, for each image, a set of bounding boxes for the detected faces, some of which are labeled with people's names. The goal is to assign labels to the remaining unlabeled face blobs.

====Label propagation: semi-supervised component====

A key idea of the SSL approaches is to classify all the objects of the test set simultaneously, by rewarding the cases of similar-looking objects having the same label (equivalently, penalizing label mismatches for similar-looking objects). Let x_i and x_j be the appearances of blobs b_i and b_j, respectively, and denote by \|x_i - x_j\| the distance between x_i and x_j. We define the evidence predicate SimilarFace(b_i, b_j), which is true if and only if \|x_i - x_j\| < \Delta_f, where \Delta_f is a threshold. The rule to favor matching labels for similar faces is then simply

    SimilarFace(b_i, b_j) ∧ Label(b_i, o) ⇒ Label(b_j, o).    (2)

We selected the threshold \Delta_f so as to get precision 0.9 on the training data:

    \frac{\sum_{i,j,o} I(SimilarFace(b_i, b_j) = true) \, I(Label(b_i, o) = Label(b_j, o))}{\sum_{i,j} I(SimilarFace(b_i, b_j) = true)} = 0.9,

where I(·) is the indicator function. For simplicity of implementation, we used 16-bin color histograms as the representations x_i, together with the χ² distance

    \|x_i - x_j\|_{\chi^2} \equiv \sum_{k=1}^{\#bins} \frac{(x_i(k) - x_j(k))^2}{x_i(k) + x_j(k)}.

Naturally, any other choice of representation and distance can be used instead.

Observe that similar face appearance is not the only possible clue that two image fragments actually depict the same person. For example, similar clothing appearance is another useful channel of information, as was demonstrated by Sivic et al. (2006).
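The χ² histogram distance, the thresholded SimilarFace predicate, and the precision-driven choice of Δ_f can be sketched as follows. This is a minimal illustration: the function names, the greedy threshold search, and the toy histograms are ours, not the paper's implementation.

```python
from itertools import combinations
import numpy as np

def chi2_distance(x_i, x_j):
    """Chi-squared distance between two color histograms (16 bins in the paper)."""
    x_i, x_j = np.asarray(x_i, float), np.asarray(x_j, float)
    denom = x_i + x_j
    mask = denom > 0                      # skip bins that are empty in both histograms
    return float(np.sum((x_i[mask] - x_j[mask]) ** 2 / denom[mask]))

def similar_face(x_i, x_j, delta_f):
    """Evidence predicate SimilarFace(b_i, b_j): true iff ||x_i - x_j|| < delta_f."""
    return chi2_distance(x_i, x_j) < delta_f

def pick_threshold(hists, labels, target_precision=0.9):
    """Greedy choice of delta_f: the largest pairwise distance at which the
    fraction of 'similar' training pairs sharing a label stays >= the target."""
    pairs = sorted((chi2_distance(hists[i], hists[j]), labels[i] == labels[j])
                   for i, j in combinations(range(len(hists)), 2))
    best, same, total = 0.0, 0, 0
    for dist, match in pairs:
        total += 1
        same += int(match)
        if same / total >= target_precision:
            best = dist               # a threshold just above `dist` keeps precision
    return best
```

Any other representation or distance drops in unchanged, which is exactly the flexibility the text claims for the predicate-based formulation.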
In our approach, information about clothing appearance similarity is used in the same way as face similarity: for every face blob b_i, we define the corresponding torso blob t_i to be a rectangle right under b_i; the scale of the rectangle is determined by the size of b_i. Let y_i be the appearance representation of t_i. We define the evidence predicate SimilarTorso(b_i, b_j), which is true if and only if \|y_i - y_j\| < \Delta_t, and introduce the corresponding label-smoothing rule

    SimilarTorso(b_i, b_j) ∧ Label(b_i, o) ⇒ Label(b_j, o)    (3)

into the MLN. One can see that we have two versions of essentially the same rule, exploiting different channels of information for label propagation. Even though it is possible in principle to achieve the same effect in standard graph Laplacian-based SSL approaches (Fergus et al., 2009), one would need costly cross-validation to find a good way to combine the two separate distance metrics into one (or, alternatively, to find the relative importance of the torso and face distance metrics). In contrast, standard algorithms for MLN weight learning provide our approach with the relative importance of the two rules automatically.

'''More fine-grained label smoothing.''' One advantage of the graph Laplacian-based unsupervised methods over our approach is that the former naturally support real-valued blob similarities, while our approach requires thresholding. However, our approach can also be adapted to handle varying degrees of similarity: instead of a single similarity threshold, one can use multiple different similarity thresholds and introduce corresponding similarity predicates. For example, suppose we want to use two different thresholds, \Delta_f^{(1)} < \Delta_f^{(2)}, for face blob similarity. Then we would introduce two similarity predicates: SimilarFace^{(1)}(b_i, b_j), which is true if and only if \|x_i - x_j\| < \Delta_f^{(1)}, and analogously SimilarFace^{(2)}(b_i, b_j) for \Delta_f^{(2)}. For highly similar blobs, those with \|x_i - x_j\| < \Delta_f^{(1)}, both versions of the formula in Eq. 2, for SimilarFace^{(1)} and for SimilarFace^{(2)}, will have a true left-hand side, providing a higher reward for matching the labels. On the other hand, for weakly similar blobs, those with \Delta_f^{(1)} < \|x_i - x_j\| < \Delta_f^{(2)}, only the version of Eq. 2 corresponding to SimilarFace^{(2)} will have a true left-hand side, providing a weaker reward for matching labels.

Figure 2: Example security images for datasets 1-3 (top to bottom). The top image shows an example of torso extraction (faces in this set have been blurred at the subjects' request). The middle image shows a view of a kitchen area where a coffee machine (red) is in the middle of the frame, while the refrigerator (green) is on the right; thus coffee drinkers might be more likely to appear in the middle.

====Exploiting single-image context====

In addition to the appearance of the blob of interest itself and the labels of similar blobs in other images, powerful contextual cues often exist in the image containing the blob. In the broader context of object recognition, spatial context (e.g., sky is usually in the top part of an image), co-occurrence (computer keyboards tend to occur together with monitors) and broad scene context (fridges usually occur in kitchen scenes) have all been shown to enable dramatic improvements in recognition accuracy. Here, we describe the MLN rules used by our system to take single-image context into account.

'''A person only occurs once in an image.''' In the absence of mirrors, for every person at most one occurrence of their face is possible in a single image. Therefore, if two faces are present in the same image, they necessarily have to either have different labels or both be labeled as unknown.
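Concretely, this per-image constraint on a candidate labeling can be sketched as a simple check. The helper below is our own illustration; in the system itself the constraint is enforced softly by an MLN rule rather than by hard filtering.

```python
def labels_consistent(labels_in_image, unknown="Unknown"):
    """Mutual exclusivity for one image: no two face blobs may carry the
    same label, unless that shared label is the special 'Unknown' value."""
    seen = set()
    for label in labels_in_image:
        if label != unknown:
            if label in seen:       # a named person appearing twice is forbidden
                return False
            seen.add(label)
    return True
```

A weighted MLN rule rewards labelings that pass this check instead of rejecting the others outright, so conflicting evidence can still override it.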
Hence, we introduce an evidence predicate SameImage(b_i, b_j), which is true if and only if b_i and b_j are in the same image, together with the following MLN rule:

    SameImage(b_i, b_j) ⇒ !Label(b_i, o_1) ∨ !Label(b_j, o_2) ∨ (o_1 != o_2) ∨ (o_1 == Unknown)    (4)

'''Face location.''' For multiple images taken with the same camera pose, such as images from a security camera, different people will often tend to occupy different parts of the frame. For example, in the middle image of Fig. 2 the refrigerator is in the right part of the frame, and the coffee machine is in the middle. Therefore, the faces of coffee drinkers may be more likely to appear in the middle of the frame, while those preferring soft drinks may spend more time in the right part. In addition, false-positive face detections (which are given the label "junk") will appear randomly, whereas actual faces appear in more constrained locations. Using a spatial prior in such settings will benefit the recognition accuracy. In our approach, we subdivide every image into 9 tiles of the same size, arranged in a 3×3 grid, and introduce an evidence predicate InTile(b, tile) and an MLN rule capturing the spatial prior:

    InTile(b, +tile) ⇒ Label(b, +o)

Notice that we use the Alchemy convention +tile and +o, meaning that a separate formula weight will be learned for every combination of tile and label, yielding different priors over the face labels for different regions of the image.

'''Time of day.''' Similarly to face location, a time-dependent label prior is also useful when processing images from security cameras: "early birds" will be more likely to occur in images taken earlier in the day, and vice versa. We subdivide the duration of the day into 3 intervals: morning (before 11AM), noon (11AM to 2PM) and evening (after 2PM), introduce an evidence predicate TimeOfDay(b, time), and add the corresponding MLN rule:

    TimeOfDay(b, +time) ⇒ Label(b, +o)

Again, to obtain a time-dependent label prior, we force the system to learn a separate weight for every combination of time interval and face label.

====Plugging in existing face recognizers====

The relations and predicates described so far use only simple representations and similarity metrics. However, there is a large amount of existing literature and expert knowledge dealing with the design of representations, distance metrics and integrated face recognition systems that improve accuracy significantly over simpler baselines in a supervised setting. If such a recognition system is available, it is desirable to be able to leverage its results in our framework instead of completely discarding the existing system and replacing it with the MLN model. Fortunately, it is easy to combine any existing face recognition system with our approach by using the face labels produced by the existing system as observations in our model. Formally, we use an evidence predicate ObservedLabel(b, observedLabel), which is true if and only if the external face recognition system assigned observedLabel as the label for blob b. The MLN rule

    ObservedLabel(b, +observedLabel) ⇒ Label(b, +o)    (5)

then provides the observation model. Observe that several different external classifiers can be used as observations simultaneously, by mapping the labels produced by different classifiers to disjoint sets of atoms. For example, if there are two different classifiers, clf_1 and clf_2, and both label blob b_1 as John, then one would set two ground predicates to true: ObservedLabel(b_1, John_clf1) and ObservedLabel(b_1, John_clf2). Again, as in the case of multiple measures of blob similarity, MLN weight learning automatically determines the relative importance and reliability of the two classifiers by assigning corresponding weights to the groundings of the observation model.

We used a boosted cascade of Haar features, as given by Viola and Jones (2001), for face detection, and the face recognizer of Kveton et al. (2010) as the observations for the MLN rule in Eq. 5. This classifier is based on calculating the L2 distance in pixel space for down-sampled (92 × 92 resolution) and normalized images. This method was shown by Sim et al. (2000) to be generally superior to the more common method based on PCA for face classification in single images.

For evaluating torso similarity for the SimilarTorso evidence predicate, simple torso occlusion handling was performed by assuming that larger faces were in the foreground. Thus, larger-faced torsos were assumed to lie in front of smaller-faced torsos, and the resulting torso bounding boxes did not intersect (see Fig. 2 for an example).

One can see that extracting the relations introduced in this section requires little preprocessing, and it is possible to come up with similar common-sense relations to improve accuracy for settings other than security camera image sequences.

===Results===

Quantitative results for a batch version of this model were presented by Chechetka et al. (2010). Here, for some added context, we present some of the qualitative lessons learned from those experiments.

'''Exploiting additional information channels dramatically improves accuracy.''' Our approach reduces classification error by a factor of 1.35 to 5.2 compared to the baseline of Kveton et al. (2010). Such an improvement confirms the long-standing observation that using context, such as the time of day, is crucial for achieving high recognition accuracy. It also shows that the framework of Markov logic is an efficient way to combine multiple sources of information, both within a single image and across multiple types of relations between different images, for the goal of face recognition.

'''No single relation accounts for the majority of the improvement.''' Over all the datasets, the most extreme single-relation accuracy improvement over the baseline of Kveton et al. (2010) (the InTile predicate and the corresponding location prior) is less than 40% of the total performance improvement of the full model over the baseline. Therefore, the multiple relations of our full model are not redundant; they represent information channels that complement each other. It is the interaction of multiple relations that enables the significant accuracy improvements.

'''Relation importance is not uniform across datasets.''' The effect of the same relation can be dramatically different for different datasets, depending on those datasets' properties. Only label propagation via the SimilarTorso relation provides a consistently significant performance improvement; the effect of the other relations is much more varied. The varying degree of relation importance across datasets makes it important for a face recognition approach to be easily adjustable to emphasize the important relations and ignore the unimportant ones. Fortunately, the Markov logic framework makes such adjustability extremely easy, on two levels. First, learning the weights of the formulas automatically assigns large weights to important formulas and close-to-zero weights to irrelevant ones. Second, any relation or formula can easily be taken out of the model or put back in, enabling a search for the optimal set of relations using cross-validation.

===Building a Real-time system===

In this section, we explore how the ideas in this paper can be extended into a real-time system. There are two broad objectives that need to be addressed:

1. Updating the model as new instances come in (online learning).
2. Performing graphical model inference at interactive speeds (online inference).

To perform these two tasks simultaneously, we chose an asynchronous architecture (Figure 3) in which learning and inference are performed in separate processes. This provides a natural parallelism for the whole system. Even with this parallelized approach, both the learning and the inference components required special enhancements to enable real-time operation.

Figure 3: Real-time system architecture

Online learning was necessary whenever a new labeled instance was observed by the system; this could happen whenever incorporating new instances caused the graph structure to change.

Online inference was triggered whenever a new unlabeled instance was observed. In this case, the structure of the graph was altered: new nodes, corresponding to new instantiations of all propositions involving the newly observed faces, were added to the network. At this point, since the structure of the network has changed, the beliefs of the network are necessarily invalidated. An exact algorithm would therefore run belief propagation over the entire graph after each such event. In our system, we avoided this with the following heuristic: we maintained the current beliefs of the network (as of the last iteration), and we pushed the beliefs of the new nodes onto the top of the priority queue in the residual BP calculation. This had the effect of focusing the next round of computation on the new nodes until convergence was reached.

===Related Work===

There now exists quite a lot of work on incorporating relations into image classification. Rabinovich and Belongie (2009) provide a good overall review of this work, contrasting "scene-based" and "object-based" context. The former methods are represented by (Torralba, 2003; Kumar and Hebert, 2005; Heitz and Koller, 2008; Heitz et al., 2008), which all attempt to understand the scene ("the gist") before trying to recognize objects. Gould et al. (2009) and Torralba et al. (2005) use MRFs to do joint segmentation and object recognition by exploiting physical relations between entities. Gupta and Davis (2008) use prepositions present in annotated images to help determine the relative positions of objects in images. For example, if an image is annotated with "car on the street", one might infer that a car is above a street in the image. Many of these efforts have an aim different from our work: namely, they attempt to do object-class detection, i.e., detect all the objects of some given classes in an image, whereas in our face recognition application we are doing object-instance recognition: given the presence of objects of a given type, find specific labels for those objects. On the other hand, these methods have in common with us the intent to exploit physical relations between objects and abstract relations between a set of objects and the gist of a scene to improve their results. The difference between their application of this principle and ours is that they all attempt to relate entities within a single image, whereas we use cross-image relationships and, by using the framework of Markov Logic, we have a unified, automated mechanism to add arbitrary relations and automatically generate the CRF.

Fergus et al. (2009) and Kveton et al. (2010) present approximations to the graph Laplacian-based semi-supervised learning solution for classifying images. These methods in general have the advantage over our method that they allow continuous similarity measures rather than our discretized version, and they can be solved efficiently. However, these approaches are typically restricted to similarity-based classification, whereas we can incorporate much more general relations, such as our mutual exclusivity. Furthermore, our approach can easily incorporate any of these classifiers (as we do in this paper by taking the classifier of Kveton et al. (2010)) and use them as core face recognizers in an object model. Finally, our approach can approximate these approaches (albeit much less efficiently) by using a discretized version of a similarity measure, as we do using face and torso histograms in this work.

===Conclusions===

Our contributions in this paper are as follows. First, we present a real-time perception system that incorporates Markov Logic for multilabel classification in images. Whereas much existing research has shown the benefits of exploiting local and global in-frame context, those efforts have all involved custom-made graphical models and are therefore less accessible as a general modeling tool for specific domains. Second, we show that Markov Logic can also provide a powerful new type of context for collective classification across frames, especially when the database is expected to contain many repeated shots of the same entity in different circumstances. We have argued that this type of context generalizes graph-based SSL approaches, and adds much to these approaches in the expressibility of the relations across frames that can guide the collective classification of entities. Thus, we show that Markov Logic can provide a beneficial unification of two quite dissimilar cutting-edge techniques for entity classification in images. Finally, for the specific case of person identification, we have shown empirically that relations such as clothing preferences, mutual exclusivity, spatial and temporal stratification, as well as multiple similarity channels, can dramatically improve face recognition over the state of the art. Although much work remains to be done, we have presented some of the specific modeling issues involved with this system, as well as some of the obstacles to making the system operate at interactive speeds.

Copyright 2010, Intel Corporation. All rights reserved.

===References===

A. Chechetka, D. Dash, and M. Philipose. Relational learning for collective classification of entities in images. In Workshop on Statistical Relational AI, in conjunction with the Twenty-Fourth Conference on Artificial Intelligence (AAAI-10), Atlanta, Georgia, 2010.

R. Fergus, Y. Weiss, and A. Torralba. Semi-supervised learning in gigantic image collections. In NIPS, 2009.

J. Gonzalez, Y. Low, and C. Guestrin. Residual splash for optimally parallelizing belief propagation. In AISTATS, 2009.

S. Gould, T. Gao, and D. Koller. Region-based segmentation and object detection. In NIPS, 2009.

A. Gupta and L. S. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV, 2008.

G. Heitz and D. Koller. Learning spatial context: Using stuff to find things. In ECCV, 2008.

G. Heitz, S. Gould, A. Saxena, and D. Koller. Cascaded classification models: Combining models for holistic scene understanding. In NIPS, 2008.

S. Kok, M. Sumner, M. Richardson, P. Singla, H. Poon, D. Lowd, and P. Domingos. The Alchemy system for statistical relational AI. Technical report, Department of Computer Science and Engineering, University of Washington, Seattle, WA, 2009. URL http://alchemy.cs.washington.edu/.

S. Kumar and M. Hebert. A hierarchical field framework for unified context-based classification. In ICCV, 2005.

B. Kveton, M. Valko, A. Rahimi, and L. Huang. Semi-supervised learning with max-margin graph cuts. In AISTATS (to appear), 2010.

H. Poon, P. Domingos, and M. Sumner. A general method for reducing the complexity of relational inference and its application to MCMC. In AAAI. AAAI Press, 2008.

A. Rabinovich and S. Belongie. Scenes vs. objects: A comparative study of two approaches to context based recognition. In International Workshop on Visual Scene Understanding (ViSU), Miami, FL, 2009.

M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1-2):107-136, Feb 2006.

T. Sim, R. Sukthankar, M. Mullin, and S. Baluja. Memory-based face recognition for visitor identification. In Proceedings of the International Conference on Automatic Face and Gesture Recognition, 2000.

P. Singla and P. Domingos. Entity resolution with Markov logic. In ICDM, 2006.

J. Sivic, C. L. Zitnick, and R. Szeliski. Finding people in repeated shots of the same scene. In Proceedings of the British Machine Vision Conference, 2006.

A. Torralba. Contextual priming for object detection. International Journal of Computer Vision, 53(2):169-191, July 2003.

A. Torralba, K. P. Murphy, and W. T. Freeman. Contextual models for object detection using boosted random fields. In NIPS, 2005.

M. A. Turk and A. P. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71-86, 1991.

V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, NY, USA, 1995.

P. Viola and M. Jones. Robust real-time object detection. International Journal of Computer Vision, 2001.