=Paper=
{{Paper
|id=None
|storemode=property
|title=Building a Real-Time System for High-Level Perception
|pdfUrl=https://ceur-ws.org/Vol-818/paper3.pdf
|volume=Vol-818
}}
==Building a Real-Time System for High-Level Perception==
Building a Real-Time System for High-Level Person Recognition

Denver Dash, Long Quoc Tran
Intel Embedded Systems Science and Technology Center; Georgia Institute of Technology; Carnegie Mellon University, Pittsburgh, PA 15213
denver.h.dash@intel.com

===Abstract===

We describe a system that uses graphical models to perform real-time high-level perception. Our system uses Markov Logic Networks (MLNs) to relate entities in images via first-order logical sentences in order to perform real-time semi-supervised person recognition. The system combines a collection of "commodity-level" vision algorithms, such as the Viola-Jones face detector, histogram matching and even low-level pixel comparisons, together with logical relationships such as mutual exclusivity and entity confusion, plus a small number of labeled examples, into a Markov random field which can be solved to provide labels for the faces in the images. We describe the methodology for constructing the logical relations used by the system, and the (many) pitfalls we encountered despite the small number of relations used. We also discuss several future approaches to achieve interactive speeds for such a system, including bounding the size of the graph using temporal weighting of instances, approximating the structure of the graphical model, parallelizing graphical model inference, and low-level hardware acceleration.

===Introduction===

In this paper, we describe a real-time system (Figure 1) for recognition using a small labeled dataset plus first-order logic relations. The system assumes a constrained environment, i.e., one in which the same people generally recur and in which the instances that we want to classify are localized in time.

In recent years there has been an explosion of work on exploiting in-frame context for entity classification in images (Torralba, 2003; Kumar and Hebert, 2005; Torralba et al., 2005; Heitz and Koller, 2008; Heitz et al., 2008; Gupta and Davis, 2008; Rabinovich and Belongie, 2009; Gould et al., 2009). This work typically involves finding some useful relations for the specific domain at hand, e.g., "the sky is usually above the ground" (Gupta and Davis, 2008), building a customized conditional random field model over the entities in a frame, and jointly classifying each entity in an image given the observed pixel values. Despite these successes, at present few if any practical real-time systems exist that attempt to do high-level reasoning by integrating context at a high level. In this paper, we discuss our attempts at building such a system using Markov Logic Networks (MLNs) and by constructing a database of logical relationships that are useful for relating the entities to be identified.

This paper also makes the point that MLNs provide a uniform, intuitive and modular interface for performing high-level perception. More importantly, we show that MLNs can provide a new, more global sense of context that allows them to jointly classify an entire dataset of images (entities), using meaningful relations between these entities, in a manner similar to the collective classification of citation entries done by Singla and Domingos (2006). The image representation provides a wealth of relations that can be brought to bear on the problem, such as mutual exclusivity of multiple faces in an image, temporal and spatial stratification, and personal traits that may relate people to various objects or distinctive clothes. We thus expect that this application is even more suited to the use of a powerful tool like MLNs than the case of citation matching.

This use of MLNs for collective classification resembles graph-based semi-supervised learning (SSL) approaches (cf. Fergus et al., 2009), which relate entities across a corpus via a distance or similarity measure.
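To make this modular interface concrete, the relations developed later in this paper can be written compactly in Alchemy-style Markov logic syntax. The sketch below is schematic: the learned weights are omitted, and the exact predicate spellings in the deployed knowledge base may differ.

```
// Query predicate: Label(b, o) -- blob b has label o.
// All other predicates are evidence set by the vision front-end.

// Label propagation via face and clothing similarity (Eqs. 2-3)
SimilarFace(b1, b2) ^ Label(b1, o) => Label(b2, o)
SimilarTorso(b1, b2) ^ Label(b1, o) => Label(b2, o)

// Mutual exclusivity: a person occurs at most once per image (Eq. 4)
SameImage(b1, b2) => !Label(b1, o1) v !Label(b2, o2) v (o1 != o2) v (o1 == Unknown)

// Spatial and temporal label priors (one learned weight per + combination)
InTile(b, +tile) => Label(b, +o)
TimeOfDay(b, +time) => Label(b, +o)

// Observation model: labels produced by an external recognizer (Eq. 5)
ObservedLabel(b, +observedLabel) => Label(b, +o)
```

Each line is an independent soft constraint; adding or removing a relation is a one-line change, which is the modularity argument made above.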
However, compared to SSL approaches, MLNs provide a much richer way of connecting labeled and unlabeled instances, allowing one to combine multiple similarity metrics at the same time as well as to incorporate arbitrary logical relationships. In fact, we argue that MLNs can provide an approximate generalization of some of the standard SSL approaches, by discretizing a distance or similarity measure and incorporating it into the MLN. In addition, one can continue to exploit other relations that would not fit well within the SSL framework, such as contextual information that relates entities within frames. We show empirically that this approach yields favorable results for face recognition in images from three datasets collected by us, and that the use of the additional logical relations, which would be difficult in standard SSL, is crucial for the best classification accuracy.

A batch version of this system, along with empirical results, was described by Chechetka et al. (2010); here we focus more on the implementation details and the knowledge management of the system.

Figure 1: System Overview

===MLN Background===

Markov logic (Richardson and Domingos, 2006) is a probabilistic generalization of finite first-order logic. A Markov logic network (MLN) consists of a set of weighted first-order clauses. Given a set of constants, an MLN defines a Markov network with one binary variable for every ground atom and one potential for every possible grounding of every first-order clause. The joint probability distribution over the ground atom variables is defined as

    P(x) = \frac{1}{Z} \exp\left( \sum_f \sum_i w_f f(x_i) \right),    (1)

where f is an indicator function corresponding to a first-order clause (1 if that clause is true and 0 otherwise), w_f is the weight of that clause, and x_i is the set of ground atom variables in a particular grounding of that clause. The inner summation in (1) is over all possible groundings. Therefore, for every grounding of every first-order clause, the higher the weight for that clause, the more favored are the assignments to x where that grounding is true.

Two fundamental problems in Markov logic that apply to our application are weight learning, i.e., learning optimal weights for the known set of first-order clauses given the knowledge base of known ground atoms, and inference, i.e., finding the most likely assignment to the unknown ground atoms given the knowledge base. Even though both problems are intractable in general, well-performing approximate algorithms are available. For weight learning, we used preconditioned conjugate gradient with MC-SAT sampling as implemented in the Alchemy package (Kok et al., 2009). For inference, we used a high-performance implementation of residual belief propagation (Gonzalez et al., 2009) along with a lazy instantiation of the MLN structure, as recommended by Poon et al. (2008).

===Model Description===

In the existing literature, many very different types of features have been shown to be useful for face recognition (and object recognition more generally). In particular, SSL approaches (Fergus et al., 2009) exploit similarity in object appearances across different images to propagate label information from labeled to unlabeled blobs, and between unlabeled blobs. In a supervised setting, typically a low-dimensional representation of blob appearances is extracted (e.g., Turk and Pentland, 1991) and a standard technique such as a support vector machine (Vapnik, 1995) is then applied. Besides the blob appearance information, it has been shown that taking into account the context in which the blob appears, such as the blob's location within the frame or the labels of other objects in the scene, is crucial for accurate object recognition. In this section, we show that all of the above sources of information can be combined efficiently using a Markov logic network. Our approach thus combines the advantages of the diverse existing approaches to improve face recognition accuracy. In the MLN described below, we will use the query predicate Label(b, o), which is true if and only if blob b has label o. The evidence predicates will be introduced gradually, as they are needed for the MLN rules.

We assume that face detection has already been performed by some standard approach, such as that of Viola and Jones (2001). The input to our system thus consists of a set of images and, for each image, a set of bounding boxes for the detected faces, some of which are labeled with people's names. The goal is to assign labels to the remaining unlabeled face blobs.

====Label propagation: semi-supervised component====

A key idea of the SSL approaches is to classify all the objects of the test set simultaneously, by rewarding the cases of similar-looking objects having the same label (equivalently, penalizing label mismatches for similar-looking objects). Let x_i and x_j be the appearances of blobs b_i and b_j, respectively, and denote by \|x_i - x_j\| the distance between x_i and x_j. We define the evidence predicate SimilarFace(b_i, b_j), which is true if and only if \|x_i - x_j\| < \Delta_f, where \Delta_f is a threshold. The rule to favor matching labels for similar faces is then simply

    SimilarFace(b_i, b_j) ∧ Label(b_i, o) ⇒ Label(b_j, o).    (2)

We selected the threshold \Delta_f so as to get precision 0.9 on the training data:

    \frac{\sum_{i,j,o} I(SimilarFace(b_i, b_j) = true) \, I(Label(b_i, o) = Label(b_j, o))}{\sum_{i,j} I(SimilarFace(b_i, b_j) = true)} = 0.9,

where I(·) is the indicator function. For simplicity of implementation, we used 16-bin color histograms as the representations x_i, together with the χ² distance

    \|x_i - x_j\|_{\chi^2} \equiv \sum_{k=1}^{\#bins} \frac{(x_i(k) - x_j(k))^2}{x_i(k) + x_j(k)}.

Naturally, any other choice of representation and distance can be used instead.

Observe that similar face appearance is not the only possible clue that two image fragments actually depict the same person. For example, similar clothing appearance is another useful channel of information, as was demonstrated by Sivic et al. (2006).
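The χ² histogram distance, the thresholded SimilarFace predicate, and the precision-driven choice of Δ_f can be sketched as follows. This is a minimal illustration: the function names, the greedy threshold search, and the toy histograms are ours, not the paper's implementation.

```python
from itertools import combinations
import numpy as np

def chi2_distance(x_i, x_j):
    """Chi-squared distance between two color histograms (16 bins in the paper)."""
    x_i, x_j = np.asarray(x_i, float), np.asarray(x_j, float)
    denom = x_i + x_j
    mask = denom > 0                      # skip bins that are empty in both histograms
    return float(np.sum((x_i[mask] - x_j[mask]) ** 2 / denom[mask]))

def similar_face(x_i, x_j, delta_f):
    """Evidence predicate SimilarFace(b_i, b_j): true iff ||x_i - x_j|| < delta_f."""
    return chi2_distance(x_i, x_j) < delta_f

def pick_threshold(hists, labels, target_precision=0.9):
    """Greedy choice of delta_f: the largest pairwise distance at which the
    fraction of 'similar' training pairs sharing a label stays >= the target."""
    pairs = sorted((chi2_distance(hists[i], hists[j]), labels[i] == labels[j])
                   for i, j in combinations(range(len(hists)), 2))
    best, same, total = 0.0, 0, 0
    for dist, match in pairs:
        total += 1
        same += int(match)
        if same / total >= target_precision:
            best = dist               # a threshold just above `dist` keeps precision
    return best
```

Any other representation or distance drops in unchanged, which is exactly the flexibility the text claims for the predicate-based formulation.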
In our approach, information about clothing appearance similarity is used in the same way as face similarity: for every face blob b_i, we define the corresponding torso blob t_i to be a rectangle right under b_i; the scale of the rectangle is determined by the size of b_i. Let y_i be the appearance representation of t_i. We define the evidence predicate SimilarTorso(b_i, b_j), which is true if and only if \|y_i - y_j\| < \Delta_t, and introduce the corresponding label-smoothing rule

    SimilarTorso(b_i, b_j) ∧ Label(b_i, o) ⇒ Label(b_j, o)    (3)

into the MLN. One can see that we have two versions of essentially the same rule, exploiting different channels of information for label propagation. Even though it is possible in principle to achieve the same effect in standard graph Laplacian-based SSL approaches (Fergus et al., 2009), one would need costly cross-validation to find a good way to combine the two separate distance metrics into one (or, alternatively, to find the relative importance of the torso and face distance metrics). In contrast, standard algorithms for MLN weight learning provide our approach with the relative importance of the two rules automatically.

'''More fine-grained label smoothing.''' One advantage of the graph Laplacian-based unsupervised methods over our approach is that the former naturally support real-valued blob similarities, while our approach requires thresholding. However, our approach can also be adapted to handle varying degrees of similarity: instead of a single similarity threshold, one can use multiple different similarity thresholds and introduce corresponding similarity predicates. For example, suppose we want to use two different thresholds, \Delta_f^{(1)} < \Delta_f^{(2)}, for face blob similarity. Then we would introduce two similarity predicates: SimilarFace^{(1)}(b_i, b_j), which is true if and only if \|x_i - x_j\| < \Delta_f^{(1)}, and analogously SimilarFace^{(2)}(b_i, b_j) for \Delta_f^{(2)}. For highly similar blobs, those with \|x_i - x_j\| < \Delta_f^{(1)}, both versions of the formula in Eq. 2, for SimilarFace^{(1)} and for SimilarFace^{(2)}, will have a true left-hand side, providing a higher reward for matching the labels. On the other hand, for weakly similar blobs, those with \Delta_f^{(1)} < \|x_i - x_j\| < \Delta_f^{(2)}, only the version of Eq. 2 corresponding to SimilarFace^{(2)} will have a true left-hand side, providing a weaker reward for matching labels.

Figure 2: Example security images for datasets 1-3 (top to bottom). The top image shows an example of torso extraction (faces in this set have been blurred at the subjects' request). The middle image shows a view of a kitchen area where a coffee machine (red) is in the middle of the frame, while the refrigerator (green) is on the right; thus coffee drinkers might be more likely to appear in the middle.

====Exploiting single-image context====

In addition to the appearance of the blob of interest itself and the labels of similar blobs in other images, powerful contextual cues often exist in the image containing the blob. In the broader context of object recognition, spatial context (e.g., sky is usually in the top part of an image), co-occurrence (computer keyboards tend to occur together with monitors) and broad scene context (fridges usually occur in kitchen scenes) have all been shown to enable dramatic improvements in recognition accuracy. Here, we describe the MLN rules used by our system to take single-image context into account.

'''A person only occurs once in an image.''' In the absence of mirrors, for every person at most one occurrence of their face is possible in a single image. Therefore, if two faces are present in the same image, they necessarily have to either have different labels or both be labeled as unknown.
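Concretely, this per-image constraint on a candidate labeling can be sketched as a simple check. The helper below is our own illustration; in the system itself the constraint is enforced softly by an MLN rule rather than by hard filtering.

```python
def labels_consistent(labels_in_image, unknown="Unknown"):
    """Mutual exclusivity for one image: no two face blobs may carry the
    same label, unless that shared label is the special 'Unknown' value."""
    seen = set()
    for label in labels_in_image:
        if label != unknown:
            if label in seen:       # a named person appearing twice is forbidden
                return False
            seen.add(label)
    return True
```

A weighted MLN rule rewards labelings that pass this check instead of rejecting the others outright, so conflicting evidence can still override it.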
Hence, we introduce an evidence predicate SameImage(b_i, b_j), which is true if and only if b_i and b_j are in the same image, together with the following MLN rule:

    SameImage(b_i, b_j) ⇒ !Label(b_i, o_1) ∨ !Label(b_j, o_2) ∨ (o_1 != o_2) ∨ (o_1 == Unknown)    (4)

'''Face location.''' For multiple images taken with the same camera pose, such as images from a security camera, different people will often tend to occupy different parts of the frame. For example, in the middle image of Fig. 2 the refrigerator is in the right part of the frame, and the coffee machine is in the middle. Therefore, the faces of coffee drinkers may be more likely to appear in the middle of the frame, while those preferring soft drinks may spend more time in the right part. In addition, false-positive face detections (which are given the label "junk") will appear randomly, whereas actual faces appear in more constrained locations. Using a spatial prior in such settings will benefit the recognition accuracy. In our approach, we subdivide every image into 9 tiles of the same size, arranged in a 3×3 grid, and introduce an evidence predicate InTile(b, tile) and an MLN rule capturing the spatial prior:

    InTile(b, +tile) ⇒ Label(b, +o)

Notice that we use the Alchemy convention +tile and +o, meaning that a separate formula weight will be learned for every combination of tile and label, yielding different priors over the face labels for different regions of the image.

'''Time of day.''' Similarly to face location, a time-dependent label prior is also useful when processing images from security cameras: "early birds" will be more likely to occur in images taken earlier in the day, and vice versa. We subdivide the duration of the day into 3 intervals: morning (before 11AM), noon (11AM to 2PM) and evening (after 2PM), introduce an evidence predicate TimeOfDay(b, time), and add the corresponding MLN rule:

    TimeOfDay(b, +time) ⇒ Label(b, +o)

Again, to obtain a time-dependent label prior, we force the system to learn a separate weight for every combination of time interval and face label.

====Plugging in existing face recognizers====

The relations and predicates described so far use only simple representations and similarity metrics. However, there is a large amount of existing literature and expert knowledge dealing with the design of representations, distance metrics and integrated face recognition systems that improve accuracy significantly over simpler baselines in a supervised setting. If such a recognition system is available, it is desirable to be able to leverage its results in our framework instead of completely discarding the existing system and replacing it with the MLN model. Fortunately, it is easy to combine any existing face recognition system with our approach by using the face labels produced by the existing system as observations in our model. Formally, we use an evidence predicate ObservedLabel(b, observedLabel), which is true if and only if the external face recognition system assigned observedLabel as the label for blob b. The MLN rule

    ObservedLabel(b, +observedLabel) ⇒ Label(b, +o)    (5)

then provides the observation model. Observe that several different external classifiers can be used as observations simultaneously, by mapping the labels produced by different classifiers to disjoint sets of atoms. For example, if there are two different classifiers, clf_1 and clf_2, and both label blob b_1 as John, then one would set two ground predicates to true: ObservedLabel(b_1, John_clf1) and ObservedLabel(b_1, John_clf2). Again, as in the case of multiple measures of blob similarity, MLN weight learning automatically determines the relative importance and reliability of the two classifiers by assigning corresponding weights to the groundings of the observation model.

We used a boosted cascade of Haar features, as given by Viola and Jones (2001), for face detection, and the face recognizer of Kveton et al. (2010) as the observations for the MLN rule in Eq. 5. This classifier is based on calculating the L2 distance in pixel space for down-sampled (92 × 92 resolution) and normalized images. This method was shown by Sim et al. (2000) to be generally superior to the more common method based on PCA for face classification in single images.

For evaluating torso similarity for the SimilarTorso evidence predicate, simple torso occlusion handling was performed by assuming that larger faces were in the foreground. Thus, larger-faced torsos were assumed to lie in front of smaller-faced torsos, and the resulting torso bounding boxes did not intersect (see Fig. 2 for an example).

One can see that extracting the relations introduced in this section requires little preprocessing, and it is possible to come up with similar common-sense relations to improve accuracy for settings other than security camera image sequences.

===Results===

Quantitative results for a batch version of this model were presented by Chechetka et al. (2010). Here, for some added context, we present some of the qualitative lessons learned from those experiments.

'''Exploiting additional information channels dramatically improves accuracy.''' Our approach reduces classification error by a factor of 1.35 to 5.2 compared to the baseline of Kveton et al. (2010). Such an improvement confirms the long-standing observation that using context, such as the time of day, is crucial for achieving high recognition accuracy. It also shows that the framework of Markov logic is an efficient way to combine multiple sources of information, both within a single image and across multiple types of relations between different images, for the goal of face recognition.

'''No single relation accounts for the majority of the improvement.''' Over all the datasets, the most extreme single-relation accuracy improvement over the baseline of Kveton et al. (2010) (the InTile predicate and the corresponding location prior) is less than 40% of the total performance improvement of the full model over the baseline. Therefore, the multiple relations of our full model are not redundant; they represent information channels that complement each other. It is the interaction of multiple relations that enables the significant accuracy improvements.

'''Relation importance is not uniform across datasets.''' The effect of the same relation can be dramatically different for different datasets, depending on those datasets' properties. Only label propagation via the SimilarTorso relation provides a consistently significant performance improvement; the effect of the other relations is much more varied. The varying degree of relation importance across datasets makes it important for a face recognition approach to be easily adjustable to emphasize the important relations and ignore the unimportant ones. Fortunately, the Markov logic framework makes such adjustability extremely easy, on two levels. First, learning the weights of the formulas automatically assigns large weights to important formulas and close-to-zero weights to irrelevant ones. Second, any relation or formula can easily be taken out of the model or put back in, enabling a search for the optimal set of relations using cross-validation.

===Building a Real-time system===

In this section, we explore how the ideas in this paper can be extended into a real-time system. There are two broad objectives that need to be addressed:

1. Updating the model as new instances come in (online learning).
2. Performing graphical model inference at interactive speeds (online inference).

To perform these two tasks simultaneously, we chose an asynchronous architecture (Figure 3) in which learning and inference are performed in separate processes. This provides a natural parallelism for the whole system. Even with this parallelized approach, both the learning and the inference components required special enhancements to enable real-time operation.

Figure 3: Real-time system architecture

Online learning was necessary whenever a new labeled instance was observed by the system; this could happen whenever incorporating new instances caused the graph structure to change.

Online inference was triggered whenever a new unlabeled instance was observed. In this case, the structure of the graph was altered: new nodes, corresponding to new instantiations of all propositions involving the newly observed faces, were added to the network. At this point, since the structure of the network has changed, the beliefs of the network are necessarily invalidated. An exact algorithm would therefore run belief propagation over the entire graph after each such event. In our system, we avoided this with the following heuristic: we maintained the current beliefs of the network (as of the last iteration), and we pushed the beliefs of the new nodes onto the top of the priority queue in the residual BP calculation. This had the effect of focusing the next round of computation on the new nodes until convergence was reached.

===Related Work===

There now exists quite a lot of work on incorporating relations into image classification. Rabinovich and Belongie (2009) provide a good overall review of this work, contrasting "scene-based" and "object-based" context. The former methods are represented by (Torralba, 2003; Kumar and Hebert, 2005; Heitz and Koller, 2008; Heitz et al., 2008), which all attempt to understand the scene ("the gist") before trying to recognize objects. Gould et al. (2009) and Torralba et al. (2005) use MRFs to do joint segmentation and object recognition by exploiting physical relations between entities. Gupta and Davis (2008) use prepositions present in annotated images to help determine the relative positions of objects in images. For example, if an image is annotated with "car on the street", one might infer that a car is above a street in the image. Many of these efforts have an aim different from our work: namely, they attempt to do object-class detection, i.e., detect all the objects of some given classes in an image, whereas in our face recognition application we are doing object-instance recognition: given the presence of objects of a given type, find specific labels for those objects. On the other hand, these methods have in common with us the intent to exploit physical relations between objects and abstract relations between a set of objects and the gist of a scene to improve their results. The difference between their application of this principle and ours is that they all attempt to relate entities within a single image, whereas we use cross-image relationships and, by using the framework of Markov Logic, we have a unified, automated mechanism to add arbitrary relations and automatically generate the CRF.

Fergus et al. (2009) and Kveton et al. (2010) present approximations to the graph Laplacian-based semi-supervised learning solution for classifying images. These methods in general have the advantage over our method that they allow continuous similarity measures rather than our discretized version, and they can be solved efficiently. However, these approaches are typically restricted to similarity-based classification, whereas we can incorporate much more general relations, such as our mutual exclusivity. Furthermore, our approach can easily incorporate any of these classifiers (as we do in this paper by taking the classifier of Kveton et al. (2010)) and use them as core face recognizers in an object model. Finally, our approach can approximate these approaches (albeit much less efficiently) by using a discretized version of a similarity measure, as we do using face and torso histograms in this work.

===Conclusions===

Our contributions in this paper are as follows. First, we present a real-time perception system that incorporates Markov Logic for multilabel classification in images. Whereas much existing research has shown the benefits of exploiting local and global in-frame context, those efforts have all involved custom-made graphical models and are therefore less accessible as a general modeling tool for specific domains. Second, we show that Markov Logic can also provide a powerful new type of context for collective classification across frames, especially when the database is expected to contain many repeated shots of the same entity in different circumstances. We have argued that this type of context generalizes graph-based SSL approaches, and adds much to these approaches in the expressibility of the relations across frames that can guide the collective classification of entities. Thus, we show that Markov Logic can provide a beneficial unification of two quite dissimilar cutting-edge techniques for entity classification in images. Finally, for the specific case of person identification, we have shown empirically that relations such as clothing preferences, mutual exclusivity, spatial and temporal stratification, as well as multiple similarity channels, can dramatically improve face recognition over the state of the art. Although much work remains to be done, we have presented some of the specific modeling issues involved with this system, as well as some of the obstacles to making the system operate at interactive speeds.

Copyright 2010, Intel Corporation. All rights reserved.

===References===

A. Chechetka, D. Dash, and M. Philipose. Relational learning for collective classification of entities in images. In Workshop on Statistical Relational AI, in conjunction with the Twenty-Fourth Conference on Artificial Intelligence (AAAI-10), Atlanta, Georgia, 2010.

R. Fergus, Y. Weiss, and A. Torralba. Semi-supervised learning in gigantic image collections. In NIPS, 2009.

J. Gonzalez, Y. Low, and C. Guestrin. Residual splash for optimally parallelizing belief propagation. In AISTATS, 2009.

S. Gould, T. Gao, and D. Koller. Region-based segmentation and object detection. In NIPS, 2009.

A. Gupta and L. S. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV, 2008.

G. Heitz and D. Koller. Learning spatial context: Using stuff to find things. In ECCV, 2008.

G. Heitz, S. Gould, A. Saxena, and D. Koller. Cascaded classification models: Combining models for holistic scene understanding. In NIPS, 2008.

S. Kok, M. Sumner, M. Richardson, P. Singla, H. Poon, D. Lowd, and P. Domingos. The Alchemy system for statistical relational AI. Technical report, Department of Computer Science and Engineering, University of Washington, Seattle, WA, 2009. URL http://alchemy.cs.washington.edu/.

S. Kumar and M. Hebert. A hierarchical field framework for unified context-based classification. In ICCV, 2005.

B. Kveton, M. Valko, A. Rahimi, and L. Huang. Semi-supervised learning with max-margin graph cuts. In AISTATS (to appear), 2010.

H. Poon, P. Domingos, and M. Sumner. A general method for reducing the complexity of relational inference and its application to MCMC. In AAAI. AAAI Press, 2008.

A. Rabinovich and S. Belongie. Scenes vs. objects: A comparative study of two approaches to context based recognition. In International Workshop on Visual Scene Understanding (ViSU), Miami, FL, 2009.

M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1-2):107-136, Feb 2006.

T. Sim, R. Sukthankar, M. Mullin, and S. Baluja. Memory-based face recognition for visitor identification. In Proceedings of the International Conference on Automatic Face and Gesture Recognition, 2000.

P. Singla and P. Domingos. Entity resolution with Markov logic. In ICDM, 2006.

J. Sivic, C. L. Zitnick, and R. Szeliski. Finding people in repeated shots of the same scene. In Proceedings of the British Machine Vision Conference, 2006.

A. Torralba. Contextual priming for object detection. International Journal of Computer Vision, 53(2):169-191, July 2003.

A. Torralba, K. P. Murphy, and W. T. Freeman. Contextual models for object detection using boosted random fields. In NIPS, 2005.

M. A. Turk and A. P. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71-86, 1991.

V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, NY, USA, 1995.

P. Viola and M. Jones. Robust real-time object detection. International Journal of Computer Vision, 2001.