ZERO – Detect objects without training examples by knowing their parts

Gertjan J. Burghouts and Fieke Hillerström
TNO Intelligent Imaging, Oude Waalsdorperweg 63, 2597 AK, The Hague, The Netherlands

Abstract
Current object recognition techniques are based on deep learning and require substantial training samples to achieve good performance. However, there are many applications in which no (or only a few) training images of the targets are available, while the targets are well known to domain experts. Zero-shot learning addresses use cases with no training examples. However, current zero-shot learning techniques mostly tackle cases based on simple attributes and offer no solution for rare, compositional objects such as a new product or a new home-made weapon. In this paper we propose ZERO: a zero-shot learning method which learns to recognize objects by their parts. Knowledge about the object composition is combined with state-of-the-art few-shot detection models, which detect the parts. ZERO is tested on the example use case of bicycle recognition, for which it outperforms few-shot object detection techniques. The recognition is extended to detection by localizing the object using knowledge about its composition; these results are studied qualitatively.

Keywords
Zero-shot learning, Knowledge, Object recognition, Object localization

1. Introduction

In many computer vision applications, there are no images of the objects of interest. For instance, a new product that has not yet been assembled or photographed, or a new variant of a home-made weapon. A lack of training images makes it harder to learn to recognize such objects. Standard deep learning offers no solution here, as these models require many labeled images [1].

In zero-shot learning (ZSL) [2], the goal is to learn a new object by leveraging knowledge about that object. The most common approach is to capture knowledge about objects by representing their attributes [3]. A new object is modelled as a new combination of known attributes. The state of the art is to learn the relation between attributes (e.g., furry) and appearance [4]. A new object can be predicted if its attributes correspond to the observed appearance. To learn the implicit relations between attributes and appearance, many objects in many different combinations of attributes are needed (e.g., many animals with attributes [5]).

The attribute-based approach does not work for new objects whose attributes are not common to many other objects. For instance, a home-made RC car is composed of wheels, a camera, a battery, some wires, and a small computer (Figure 1, left). Likewise, a home-made explosive (Figure 1, right; an improvised explosive device, IED) is composed of a mobile phone, tape, wires and a bottle. Not many other objects are composed of these specific parts. Extracting the attributes of these parts and using them for learning is complex, since each attribute is only representative of a part of the object. Hence, the implicit relation between attributes and appearance cannot be learned, as there is not sufficient training data. For those objects, the attribute-based approach does not fit.
Figure 1: Examples of new, compositional objects for which attribute-based models do not work.

As a novel approach, we leverage the compositionality of new objects, by modelling them explicitly as a combination of reusable parts. For composed new objects, the parts are typically everyday objects. For instance, for the IED, the parts are a phone, tape, wires and a bottle, which are all very common. For everyday objects, it is easy to find images and to annotate the region that contains the relevant part. A part model can be learned from these images and annotations: a standard object detector is trained to localize the parts. The new object is modelled by combining the detected parts.

The proposed method is named ZERO. ZERO consists of four steps. Firstly, expert knowledge about the object is captured in terms of its parts: the object parts and the relations between them, i.e., spatial arrangement and relative sizes. Secondly, the parts are learned and detected (localized) in images. Thirdly, the object is learned by combining the parts and their appearance (visual features). Fourthly, the object is localized in the image by assessing the spatial arrangement of the parts and their sizes relative to the object. ZERO is outlined in Figure 2.

Figure 2: Outline of ZERO and its steps to recognize a new (unseen) object.

The advantages of ZERO are:
a. Recognition of new objects when no training samples nor attribute annotations are available.
b. Taking knowledge of a mid-level abstraction into account, instead of low-level attributes.
c. Using compositional knowledge about the location of properties, which is less feasible for fuzzy attributes.
d. Easier specification of the expert's knowledge, as the parts are mid-level and clear, contrary to fuzzy attributes.
e. Predictions that are explainable to the user: the parts and their composition can be expressed more easily to a human than a plain confidence.

Since parts carry a higher level of abstraction than attributes, they encode more information, which makes the added knowledge more valuable. Different parts have different properties, and when they are combined in a composition, these properties are encoded at a location. Attribute-based approaches, in contrast, encode an attribute for the whole object. The expert specifies the object composition in terms of parts, which matches our common way of reasoning; the intermediate translation into attributes is removed. The new object and its parts are localized with a confidence for the object and per part. This makes it easier for the user to understand why the algorithm predicted that the object is in the current image. We will show examples of such predictions in the experimental results.

Section 2 discusses related work. Section 3 describes the proposed ZERO method. Section 4 details the experimental results and findings. Section 5 concludes the paper.

2. Related work

Zero-shot learning based on attributes, e.g., [4], leverages big datasets with many object classes and many attributes in various combinations, e.g., AWA2 [2], CUB-200 [5] and SUN [6].
AWA2 has 40 object classes for training with 85 attributes [2]. CUB-200 has 200 object classes and 312 attributes [5]. SUN has 717 object classes with 102 attributes [6]. These datasets contain many thousands of training images in which object classes share attributes, which enables models to learn the relations between attributes and appearance. For many types of new objects, such as the IED (Figure 1), such datasets of shared attributes are not available. In this paper, we are interested in such composed objects.

There are significant differences between this paper and attribute-based ZSL, which are summarized in Table 1. In attribute-based ZSL, there are many annotations of other objects that share similar attributes, whereas in our setup there are none. In attribute-based ZSL, the object classes of the abovementioned datasets are closed-world. For instance, the problem is about animals only, which limits the learned models to recognizing only new animals and not other objects. In this paper, we aim to recognize a broad set of new objects. The expert knowledge used in attribute-based ZSL only covers the attribute combinations that constitute the object. ZERO uses additional knowledge about the spatial arrangement of parts for localizing the object in the image. In attribute-based ZSL, the importance of each attribute is learnable, because there are many combinations of attributes, objects and appearance to learn from. In contrast, in this paper there are no other annotated objects, which requires a different approach. Finally, attribute-based ZSL involves knowledge about attribute composition only, whereas in this paper we leverage more knowledge about the object, i.e., its parts, the part composition and the spatial arrangement of parts.

Table 1
Differences between attribute-based ZSL and ZERO.

                                    Attribute-based ZSL        ZERO (this paper)
Annotated other objects             Many                       Zero
Closed world                        Yes                        No, objects can be from any category
Importance of elements learnable    Yes                        No, lack of labeled data
Expert knowledge                    Only about attribute       About part composition and spatial
                                    composition                arrangement of parts

Our approach is to model a new (unseen) object explicitly as a combination of (reusable) parts. The parts are typically very common, so there is sufficient data to learn part models. Supervised modelling of an object by its parts has been investigated earlier [7], by combining a holistic object and body parts, with the objective of handling large deformations and occlusions of parts. The key of this approach is to automatically decouple the holistic object or body parts from the model when they are hard to detect. The model learns the relative positions and scales of the parts from many training instances of the object. In contrast, in this paper, we aim to model objects by their parts, but without training samples of the actual object. To that end, we rely on knowledge about the object, specifically the parts and the relations between them. We leverage this knowledge in a learning scheme: in the absence of training images of the new object, object-part training samples are synthesized to learn the model for the new object.

Instead of only recognizing images, we also aim for object localization. The combination of object recognition and localization is known as zero-shot detection (ZSD), which aims to detect and localize instances of unseen object classes. There are roughly three types of methodologies for ZSD.
The first type of methods uses a region proposal network on top of a feature extraction backbone [8, 9, 10]. In [11] these proposals are improved by explicitly taking the background into account while learning. The features of the proposed regions are used to determine the object classes, using neural layers on top of the features. The region proposals are used to localize the objects. Often a bounding box regression model is trained to fine-tune the locations of the region proposals. The region proposals are trained on common data, or defined as default anchor boxes. These methods are beneficial when no visual samples of the input are available. However, in our case we do have visual samples of the parts and can take knowledge and features of these parts into account.

The second type of methods [12] synthesizes training images for the unseen classes based on semantic information, for example using a GAN. These synthesized images are used to train a state-of-the-art object detector network. In a way this is comparable to our method, since we also synthesize part combinations using knowledge. However, we use these directly for object localization, instead of introducing the additional effort of image generation and detector training.

The third type of methods uses attention-based models to determine the location of the object in the image [13]. These attention maps learn to differentiate objects from the background, using learned attention weights. Comparable to the region-proposal methods, this is beneficial when no part detections are available. We exploit the ability to recognize commonly known parts of the objects.

In summary, none of these methods takes explicit knowledge about object parts and their configuration into account for object localization.

3. ZERO

The proposed method, ZERO, consists of four steps, starting with knowledge about the new object, up to localizing the new object in a test image. These four steps are detailed in the next subsections.

3.1. Knowledge

Knowledge about the new (unseen) object is captured in terms of its parts. This involves the object parts and the relations between them, i.e., spatial arrangement and relative sizes. An example is a bicycle, which is defined by its parts: wheels, saddle, chainwheel and handlebar. The arrangement is defined by parts that are not allowed to overlap; only the wheel and the chainwheel are allowed to overlap. The expected relative sizes of parts are defined by a minimum and a maximum ratio with respect to a reference part. This knowledge is summarized in Table 2.

Table 2
Knowledge about the object at hand (bicycle) and its parts.

Object part   Disallowed overlap of parts    Minimal area ratio   Maximum area ratio
Wheel         Wheel, saddle, handlebar       Reference part       Reference part
Wheel         Wheel, saddle, handlebar       0.5                  2
Saddle        Wheel, handlebar, chainwheel   1.5                  7
Chainwheel    Saddle, handlebar              1.5                  7
Handlebar     Wheel, saddle, chainwheel      1                    4

3.2. Object parts

Given the object definition, its parts are learned. Generally, it is possible to obtain annotations for object parts, as most parts are everyday items and it is easy to find or collect images of them. The annotations are bounding boxes, each with a label of the part. A part model can be learned by leveraging modern object detectors, pretrained on the broad MS-COCO dataset and fine-tuned on the annotations at hand.
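To make this step concrete, the following is a minimal sketch of such fine-tuning, assuming torchvision's implementation of the RetinaNet detector used in this work; the framework choice, the hypothetical part_loader and the hyperparameters are illustrative assumptions, not specifics from the paper.

```python
# Sketch: fine-tune a COCO-pretrained RetinaNet to detect object parts.
# Assumes torchvision; part annotations are bounding boxes with part labels.
import torch
import torchvision
from torchvision.models.detection.retinanet import RetinaNetClassificationHead

PART_CLASSES = ["wheel", "saddle", "chainwheel", "handlebar"]  # from Table 2
num_classes = len(PART_CLASSES) + 1  # +1 for the background class

# Detector pretrained on MS-COCO (older torchvision API; newer versions
# use the weights= argument instead of pretrained=True).
model = torchvision.models.detection.retinanet_resnet50_fpn(pretrained=True)

# Replace the classification head so it predicts the part classes.
in_channels = model.backbone.out_channels
num_anchors = model.head.classification_head.num_anchors
model.head.classification_head = RetinaNetClassificationHead(
    in_channels, num_anchors, num_classes)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
model.train()
for images, targets in part_loader:  # hypothetical DataLoader of part boxes
    # targets: list of dicts with 'boxes' (N x 4) and 'labels' (N,)
    losses = model(images, targets)  # returns a dict of losses in train mode
    loss = sum(losses.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```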
We selected RetinaNet [14] for this purpose, as it has proven robust for many types of images and for small objects. The latter is important, as parts are generally small. For annotations of parts, we use the dataset from [7]. After learning, a model is acquired that is able to detect (localize) the object's parts in test images.

3.3. Recognition

The object is learned by combining the parts and their appearance. We aim to learn which specific part-based features are discriminative of the full object. The parts and their features are combined in a graph representation, such that all features are available when learning the object model. To that end, a graph is composed in which each node represents one part and has a fixed position in the representation. The node contains the features of that part. For the features of a part, we extract its region from the image and run it through a standard convolutional neural network, a ResNet-50, of which the embedding before the final layer is used as a feature vector [15]. On top of the graph, a classifier is learned. We experiment with various classifiers. The goal is that the classifier learns which features of which parts are most discriminative. Our contribution is in how the graph is learned, i.e., classifying the combined node features to assess whether the current image contains the new object or not.

The challenge is how to train the graph model with no training images of the object at hand. This is done by synthesizing training samples. A training sample for the object is obtained by leveraging the part definition: for each part, a randomly selected instance of that part class is plugged into the corresponding node. In this way, a huge number of synthesized training samples can be obtained (in the experiments we use 10K), and many variations of part combinations are presented to the model during learning. The rationale is that this should lead to good generalization.

3.4. Localization

The object is localized in the image by assessing the spatial arrangement of the parts and their sizes relative to the object. Our localization method tries to answer the question ‘Given that the image contains the object of interest, which combination of parts represents that object best?’ and assumes that the convex hull of the selected parts yields the location of the object. The selection of object parts is based on the predefined knowledge: the object composition, the allowed overlap of parts and the allowed ratios of part areas (see Table 2).

The localization starts with a preprocessing step in which the number of parts is reduced. From all the detected parts in the image, per part class the 20% with the highest detection confidence are selected (see Figure 3, first step). The selection is done per part class in order to retain sufficient part options for every class. The value of 20% was chosen to retain sufficient variety of bounding boxes for the object construction, while increasing speed and reducing noise as much as possible. This value was validated on a small set of bicycle images.

Figure 3: The object localization method starts with preprocessing to obtain a subset of parts. This subset is checked for allowed combinations, using matrices. Post-processing is applied to remove possibly incorrect parts.

After this preprocessing step, all possible combinations of parts are checked against the knowledge to see whether they are allowed.
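A simplified, brute-force sketch of such a check is given below; the detection format, part roles and the ratio convention (reference area divided by part area) are illustrative assumptions, and the paper's actual implementation uses the N-dimensional matrices described next.

```python
# Sketch: test candidate part combinations against the Table 2 knowledge.
from itertools import product

# Part pairs that may NOT overlap (stored in sorted order); only the
# wheel and the chainwheel are allowed to overlap.
DISALLOWED_OVERLAP = {
    ("saddle", "wheel"), ("handlebar", "wheel"), ("wheel", "wheel"),
    ("handlebar", "saddle"), ("chainwheel", "saddle"),
    ("chainwheel", "handlebar"),
}
# Allowed area ratios relative to the reference part (a wheel).
AREA_RATIO = {"wheel": (0.5, 2), "saddle": (1.5, 7),
              "chainwheel": (1.5, 7), "handlebar": (1, 4)}

def area(box):
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)

def overlaps(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2

def allowed(combo):
    """combo: dict role -> (part_class, box), incl. a 'reference_wheel'."""
    items = list(combo.values())
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            (ci, bi), (cj, bj) = items[i], items[j]
            if tuple(sorted((ci, cj))) in DISALLOWED_OVERLAP and overlaps(bi, bj):
                return False
    ref = area(combo["reference_wheel"][1])
    for role, (cls, box) in combo.items():
        if role == "reference_wheel":
            continue
        lo, hi = AREA_RATIO[cls]  # assumed convention: ref area / part area
        if not (lo <= ref / area(box) <= hi):
            return False
    return True

# candidates: dict role -> list of (part_class, box) after the 20% filter.
valid = []
for combo_vals in product(*candidates.values()):
    combo = dict(zip(candidates.keys(), combo_vals))
    if allowed(combo):
        valid.append(combo)
```

Under these assumptions, enumerating the candidate combinations and filtering them with allowed() mirrors the combination check; the matrix formulation described next performs the same test in a more flexible way.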
In the implementation, this check is done by constructing N-dimensional matrices of allowed combinations, where N is the number of parts that form the object, as given by the object definition. Implementing the localization with matrices enables multi-variable input (allowing additional knowledge sources) and multi-hypothesis output (allowing multiple likely answers to be returned), which makes the method very flexible in use. The N-dimensional matrices are combined using a logical AND into one final matrix of allowed part combinations (see Figure 4). From this combined matrix, the parts representing the object are selected based on two scores: the median of the confidences, and the median of the confidences when post-processing would be applied.

Figure 4: Conceptual visualization of the matrices that capture which combinations are allowed. Possible part combinations are checked against the knowledge. These sub-matrices are combined into one final matrix from which part combinations are chosen. For visualization purposes only 3D matrices are shown; in reality the matrices are N-dimensional, with N the number of parts that constitute the object.

After the part combination is selected, post-processing is applied. To take the possibility of missing or occluded parts into account, parts with a detection confidence lower than 0.6 times the median of all confidence values are considered probable false detections and are removed. When multiple parts of the same class are in the object definition (two wheels, for example), parts of this class are removed when their detection confidence is lower than 0.6 times the median of all confidence values of that class.

4. Experiments

The experiments are performed on the PASCAL VOC dataset [16] for recognition and on downloaded bicycle images for localization. We selected the bicycle as the object of interest to validate ZERO. To learn the object parts, the annotations from [7] are used.

4.1. Recognition

We compare ZERO's recognition to various baselines. ZERO uses the part models and combines them in a graph. We compare the graph-based approach to simply summing the confidence values of the respective parts. Both are variants with zero examples of the object. We also compare to techniques that require a few examples of the object. To that end, we use the same model [14] as used for the parts, but now for the object. We include these baselines for reference only, because our goal with ZERO is to target the case of zero examples of the object; these baselines cannot deal with that case. For ZERO, we have explored two classifiers on the graph, by concatenating the node features: an SVM (with a radial basis function kernel) and a deep learning (DL) variant (a fully-connected layer with a softmax output).

The ROC curves are shown in Figure 5. Most interestingly, ZERO (see the curves for 0 examples) outperforms the baselines that do need several training examples. ZERO also outperforms the naive part combination that sums the confidence values. Note that ZERO's SVM variant performs better than the DL variant, possibly because the DL variant is harder to train and optimize (more hyperparameters). For most practical applications, it is essential to have low false positive rates. Therefore, we are especially interested in the left-most part of the ROC curves. In the legend, we report the area under the curve (AUC) at a false positive rate of 5% (0.05).
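For reference, such a truncated AUC can be computed as in the minimal sketch below, assuming scikit-learn and normalization by the maximal attainable area; the paper does not specify its exact computation.

```python
# Sketch: area under the ROC curve up to a false positive rate of 0.05,
# normalized so that a perfect detector scores 1.0 (an assumption).
import numpy as np
from sklearn.metrics import roc_curve

def auc_at_fpr(y_true, scores, max_fpr=0.05):
    fpr, tpr, _ = roc_curve(y_true, scores)
    # Interpolate the TPR at max_fpr and truncate the curve there.
    tpr_at_max = np.interp(max_fpr, fpr, tpr)
    keep = fpr <= max_fpr
    fpr_t = np.append(fpr[keep], max_fpr)
    tpr_t = np.append(tpr[keep], tpr_at_max)
    return np.trapz(tpr_t, fpr_t) / max_fpr  # normalize by maximal area
```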
This performance measure is highest for ZERO with the SVM classifier (0.70), outperforming few-shot techniques that required 10 examples (0.64), while ZERO used 0 examples of the object.

Figure 5: ROC curves (true positive rate vs. false positive rate) of ZERO (zero examples) vs. baselines (few examples).

Four examples of ZERO's predictions are shown in Figure 6. In the upper left, a positive with a very high confidence (correct). In the lower left, a negative with a very low confidence (correct). In the upper right, a negative with a moderate confidence (ideally lower). In the lower right, a positive with a very low confidence, because of the large occlusion (the bicycle is marginally visible in the back, behind the desk). Obviously, it is hard to recognize a new object if it is largely occluded.

Figure 6: Example predictions by ZERO.

4.2. Generalization

We explore how well ZERO generalizes to new, deviating variants of the object of interest. Our hypothesis is that the training procedure, based on many variations of part combinations, leads to good generalization. We manually selected a set of 25 deviating objects from the internet as our objects of interest. The background of other objects is the same as in the previous experiment. Figure 7 shows the ROC curves for ZERO and the baselines when tested against the deviating objects. ZERO generalizes well to new, deviating variants of the object of interest. Generalization is essential for zero-shot recognition, as not all variants will be known beforehand, and still we want to be able to recognize them well.

Figure 7: ROC curves (true positive rate vs. false positive rate) on deviating variants of the object of interest.

Two examples of deviating objects are shown in Figure 8. ZERO is confident that these test images contain the new, unseen object of interest.

Figure 8: Example predictions by ZERO on deviations of the object of interest.

We conclude that the hard cases are not the deviating objects (there is good generalization), but the cases in which the object is largely occluded (as in Figure 6).

4.3. Localization

We evaluated our localization method qualitatively by showing the good (reasonably good), the bad (understandable mistakes) and the ugly (utterly wrong) localization results on a test set of bicycle images (see Figure 9). The test set contains images downloaded from the internet with different compositions: bicycles seen from the side or from more difficult angles, sometimes partly occluded. The added value of the different knowledge sources is inspected by comparing the localization results when no other knowledge than the object composition is used with the results when knowledge about part overlap and areas is used, as shown in Figure 10.

Figure 9: Localization results of different qualitative performance. Upper: the good; reasonably good localization results. Middle: the bad; understandably wrong predictions. Bottom: the ugly; utterly wrong predictions. Red – wheel, blue – chainwheel, yellow – handlebar, green – saddle, white – the whole bike, obtained by taking the convex hull of the parts.

Figure 10: Localization results for two test images, when no other knowledge than the object composition is used (top left), when only area knowledge is used (top right), when only knowledge about the overlap is used (bottom left) and when both additional knowledge sources are used (bottom right).
5. Discussion

ZERO can be extended to other objects by defining them in terms of their respective parts, collecting images of the parts, annotating them, and applying the method described in this paper. Better discrimination between similar but different objects can be achieved by including hard negatives. They can be taken into account explicitly by the data generator in the training procedure, by hard-negative mining, or by weighting them in the training optimization. If objects are better described by their properties than by their parts, attribute-based approaches are more appropriate.

Currently, ZERO's localization method is limited to one object per image. This could be extended to multiple objects per image by anchor boxes (e.g., [14]), for which the object presence is evaluated. This generates multiple hypotheses of where the new object may be located in the image. All hypotheses are then validated one by one by applying ZERO's recognition. Each hypothesis results in a confidence, after which the maximum confidence and the associated localization can be determined.

More expert knowledge about localization is available, for instance, spatial information on how the parts relate to each other. This positional encoding is expected to add important cues for the part selection. Another improvement of the world knowledge would be a co-learning setting in which the knowledge can be updated during the deployment phase, since it is difficult to select exactly the right knowledge beforehand.

Note that the parts were extracted from the PASCAL VOC part dataset. As such, the parts are cut out from images of largely visible objects. Hence, the parts are not truly isolated, as a small bit of context is visible (e.g., a small part of the bicycle frame that the wheel is connected to) and the parts may contain some bicycle-specific features. This is in contrast to the envisioned application, where no images of the object are available and the parts and the ZERO model are to be learned from images of truly isolated parts, without object-specific context and with more general part features. This will be addressed in our near-future research.

It would be interesting to explore the benefits of ZERO's part-based technique for robustness against adversarial attacks. In adversarial attacks, the pixels of an image are weakly adjusted to force another prediction from the deep learning model. When using our part-based model, multiple predictions have to be misled in order to change the prediction for the whole image.

In ZERO's part-based recognition method, constructing additional training samples with a new type of part is relatively easy. Therefore our method would allow for fine-grained identification, using knowledge of important recognition cues, possibly combined with attributes, to answer queries like ‘Find the person with the pink bag’. We would like to explore these types of use cases in future work.

6. Conclusion

In this paper we have proposed a zero-shot object detection method based on known parts and world knowledge. Since no images are available for our actual zero-shot use case, we tested our method on bicycles and their parts. Our localization method allows for multi-variable input and multi-hypothesis output. For object recognition, we outperform few-shot baselines that require labeled training data. The localization results show the potential of the method, and the multi-variable input allows for updating and extending the used world knowledge.
Acknowledgements

We would like to thank the RVO research program at TNO for financial support.

References

[1] H. Touvron, A. Vedaldi, M. Douze, H. Jegou. Fixing the train-test resolution discrepancy. NeurIPS 2019.
[2] Y. Xian, C. H. Lampert, B. Schiele, Z. Akata. Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. IEEE TPAMI 2018.
[3] Y. Liu, J. Guo, D. Cai, X. He. Attribute Attention for Semantic Disambiguation in Zero-Shot Learning. IEEE ICCV 2019.
[4] V. Khare, D. Mahajan, H. Bharadhwaj, V. Verma, P. Rai. A Generative Framework for Zero-Shot Learning with Adversarial Domain Adaptation. IEEE WACV 2020.
[5] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, P. Perona. Caltech-UCSD Birds 200. 2010.
[6] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. IEEE CVPR 2010.
[7] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, A. Yuille. Detect What You Can: Detecting and Representing Objects using Holistic Models and Body Parts. IEEE CVPR 2014.
[8] S. Rahman, S. H. Khan, F. Porikli. Zero-shot object detection: Joint recognition and localization of novel concepts. International Journal of Computer Vision, 128(12):2979-2999, 2020.
[9] S. Rahman, S. Khan, F. Porikli. Zero-shot object detection: Learning to simultaneously recognize and localize novel concepts. ACCV 2018.
[10] C. Yan, et al. Semantics-Preserving Graph Propagation for Zero-Shot Object Detection. IEEE Transactions on Image Processing, 29:8163-8176, 2020.
[11] A. Bansal, et al. Zero-shot object detection. ECCV 2018.
[12] N. Hayat, et al. Synthesizing the Unseen for Zero-shot Object Detection. arXiv preprint arXiv:2010.09425, 2020.
[13] Y. Zhu, et al. Semantic-guided multi-attention localization for zero-shot learning. NeurIPS 2019.
[14] T.Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár. Focal loss for dense object detection. IEEE ICCV 2017.
[15] K. He, X. Zhang, S. Ren, J. Sun. Deep residual learning for image recognition. IEEE CVPR 2016.
[16] M. Everingham, S. Eslami, L. Van Gool, C. Williams, J. Winn, A. Zisserman. The PASCAL Visual Object Classes Challenge: A Retrospective. International Journal of Computer Vision, 111(1):98-136, 2015.