A Context Aware Deep Learning Architecture for Object Detection

Kevin Bardool, Tinne Tuytelaars, José Oramas
ESAT-PSI, KU Leuven, Belgium

August 2019

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

A notable feature of our visual sensory system is its ability to exploit contextual cues present in a scene, enhancing our perception and understanding of the image. Exploring methods of incorporating and learning such information using deep neural network (DNN) architectures is a relatively young field of research, with room for more studies. In this work we propose a new architecture aimed at learning contextual relationships and improving the precision of existing DNN-based object detectors. An off-the-shelf detector is modified to extract contextual cues present in scenes. We implement a DNN-based architecture aimed at learning this information. A synthetic image generator is implemented that generates random images while enforcing a set of simple, predetermined contextual relationships. Finally, a series of experiments is carried out to evaluate the effectiveness of our design by measuring the improvement in average precision.

2. Background

There have been various attempts to categorize sources of contextual information [2, 5, 7]. Biederman groups relationships between an object and its surroundings into five classes: interposition, support, probability, position, and size [2]. The three latter relationships are of interest to us: probability, the likelihood of an object appearing in a particular scene; position, the expectation that when certain objects are present, they normally occupy predictable positions; and size, the expectation that an object has a predictable size relative to other objects and the general scene [7].

In works aimed at exploiting contextual information in DNN-based object detectors, two main approaches stand out. One uses contextual information as a feedback method that guides the generation of initial object proposals [12, 3]. A second approach involves the extraction and use of contextual information after proposal selection, during the scoring stage [13, 9, 1].

Whereas in these approaches the use of contextual information is intertwined with the object detection architecture, we take a different approach: the separation of appearance detection and contextual reasoning. The object detector remains responsible for generating appearance-based information as well as detection and localization. Additionally, it is used to construct contextual feature descriptors that are passed on to a secondary model responsible for learning contextual relationships. While in some works a secondary model was used as a source of contextual information flowing into the object detector, our design reverses this flow of information: contextual information from the object detector is passed on to a secondary model trained to learn contextual relationships. At inference time, the secondary model is used to re-evaluate the object detector's proposals.

3. Methodology

Our architectural pipeline consists of two stages (Figure 1). The first stage is an off-the-shelf object detector. For this, the Mask R-CNN model [8] was selected. A new network layer was implemented in the object detector to generate per-class contextual feature maps. These heatmaps are constructed using the confidence scores and bounding boxes produced by the Mask R-CNN object detection and localization heads.

The secondary model is trained to learn semantic relationships using the contextual feature maps generated by the primary object detector. For this stage, a DNN model based on the Fully Convolutional Network (FCN) architecture [11] was implemented.
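Returning to the first stage: the construction of the per-class heatmaps is not detailed beyond the use of confidence scores and bounding boxes. The following is a minimal sketch of one plausible implementation, in which each detection deposits a confidence-weighted 2D Gaussian centered on its box; the Gaussian shape, the map resolution, and the name build_contextual_heatmaps are illustrative assumptions, not the paper's actual code.

    import numpy as np

    def build_contextual_heatmaps(detections, num_classes, map_size=(128, 128)):
        # detections: list of (class_id, score, (x1, y1, x2, y2)) tuples,
        # with box coordinates already expressed in heatmap space.
        h, w = map_size
        heatmaps = np.zeros((num_classes, h, w), dtype=np.float32)
        ys, xs = np.mgrid[0:h, 0:w]
        for class_id, score, (x1, y1, x2, y2) in detections:
            # Center a Gaussian on the box centroid, with a spread tied to
            # box size (an assumption: the paper only states that scores
            # and boxes are used to build the maps).
            cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
            sx = max((x2 - x1) / 4.0, 1.0)
            sy = max((y2 - y1) / 4.0, 1.0)
            g = np.exp(-((xs - cx) ** 2 / (2 * sx ** 2)
                         + (ys - cy) ** 2 / (2 * sy ** 2)))
            heatmaps[class_id] += score * g  # confidence-weighted contribution
        return heatmaps

Each class channel thus accumulates evidence for where, and with what confidence, instances of that class were detected, which is the information the secondary model consumes.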
The output of this secondary model is also a series of contextual feature maps, representing its confidence in the original detections based on the contextual relationships it has learned.

A scoring layer, added to both models, is implemented to produce a 'contextual score' for each proposed detection. Scores are calculated using the contextual feature maps generated by the object detector and the contextual model. They are used for comparison and AP calculations, and allow us to measure the impact of the contextual learner on object detection performance.

Figure 1: Our proposed architectural pipeline

For training and evaluation, two types of datasets were considered. First, a dataset of synthetically generated images containing a series of consistently enforced contextual relationships. Such a dataset allows us to train the pipeline on relatively simple content, introducing contextual cues in a controlled manner. In addition, a subset of the COCO dataset [10] was selected as a real-world dataset.

Training was conducted using various choices of loss functions and scoring algorithms. Eventually a binary cross-entropy loss was selected, and contextual scoring was performed using a localized summation over a tight region surrounding each object proposal's centroid, defined by its predicted bounding box.
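As a concrete illustration of this scoring rule, the sketch below sums a contextual feature map over the region around the proposal's centroid, bounded by its predicted box. Treating the full predicted box as the "tight region" is an assumption, as is the name contextual_score; the paper does not specify the window size.

    import numpy as np

    def contextual_score(class_heatmap, box):
        # class_heatmap: 2D contextual feature map for the proposal's class.
        # box: (x1, y1, x2, y2) predicted bounding box in heatmap coordinates.
        h, w = class_heatmap.shape
        x1, y1, x2, y2 = (int(round(v)) for v in box)
        x1, y1 = max(x1, 0), max(y1, 0)
        x2, y2 = min(x2, w), min(y2, h)
        # Localized summation over the region surrounding the proposal's
        # centroid, bounded by the predicted box.
        return float(class_heatmap[y1:y2, x1:x2].sum())

A higher score indicates that the surrounding contextual evidence supports the proposal; computing it on the maps of both stages is what allows the before/after AP comparison described above.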
4. Experiments

Our experiments have mainly focused on measuring the capacity of our design to learn the various contextual relationships enforced by the synthetic image generator. The contextual scores are used to compute the average precision (AP) and mean average precision (mAP) as defined in the evaluation protocols of the Pascal VOC and COCO challenges [6, 4].

Inference on a set of 500 images demonstrates a context-based mAP improvement of approximately 1.3 points. However, the Mask R-CNN based softmax score outperforms our context-score APs (Figure 2).

Figure 2: Detection results on test toy dataset.

We also measure the model's capacity to detect the expected spatial context of objects. A controlled set of hypotheses that includes object proposals positioned outside their expected spatial location is generated and passed to the contextual reasoning model. The success of the contextual model in rejecting such false positives indicates that it is able to learn spatial constraints (Figure 3).

Figure 3: AP comparison on images with out-of-context proposals. Dark green/red bars represent the contextual learner's AP.

Another set of experiments was conducted to test the model's capability to recognize spatial relations enforced between objects belonging to different classes. Here we were able to confirm that the contextual model does recognize such spatial relationships within a limited range (Figure 4).

Figure 4: Maximal score vs. relative positioning of car and person objects. Green box indicates the region where correct relative positioning results in maximal scores.

Experiments measuring the model's capability to learn semantic co-occurrence relationships between classes (i.e., objects of different classes that appear together in scenes) determined that our model is unable to learn such relationships.

In the synthetic images, a sense of depth is created by scaling object sizes relative to a horizon line present in the image. Our contextual model was able to learn the relationship between relative size and vertical location for different classes, favoring larger objects when positioned lower in the image (i.e., closer to the observer). A toy version of this size-depth rule is sketched below.
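The paper only states that sizes are scaled relative to a horizon line; the following minimal sketch assumes a linear scaling between the horizon and the bottom of the image, with illustrative names throughout.

    def sample_object_size(y_pos, horizon_y, base_size, img_height=480):
        # Objects placed further below the horizon line are closer to the
        # observer and therefore drawn larger. Linear scaling is an
        # assumption; the generator's actual rule is not specified.
        if y_pos <= horizon_y:
            return 0  # at or above the horizon: no object is placed (assumption)
        depth_factor = (y_pos - horizon_y) / float(img_height - horizon_y)
        return max(1, int(base_size * depth_factor))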
5. Conclusions

We propose a two-stage architecture that extracts and learns contextual relationships using feature maps that encode such information. The results of our experiments show that the context-based model is able to learn intra- and inter-class spatial relationships. Additionally, it is able to learn the relation between the size of an object and its depth in the scene. However, we have not seen robustness towards learning the co-occurrence of semantically related objects. Continuing experiments on the more challenging COCO dataset and investigating methods to induce learning of semantic co-occurrence relationships are open avenues for future work.

References

[1] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-Outside Net: Detecting objects in context with skip pooling and recurrent neural networks, 2015.

[2] I. Biederman, R. J. Mezzanotte, and J. C. Rabinowitz. Scene perception: Detecting and judging objects undergoing relational violations. Cognitive Psychology, 14(2):143–177, 1982.

[3] X. Chen and A. Gupta. Spatial memory for context reasoning in object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4106–4116, 2017.

[4] COCO Consortium. COCO: Common objects in context - detection evaluation. URL: https://cocodataset.org/detection-eval, last accessed on 2019-03-23.

[5] S. Divvala, D. Hoiem, J. Hays, A. Efros, and M. Hebert. An empirical study of context in object detection. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1271–1278, 2009.

[6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

[7] C. Galleguillos and S. Belongie. Context based object categorization: A critical survey. Computer Vision and Image Understanding, 114(6):712–722, 2010.

[8] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.

[9] J. Li, Y. Wei, X. Liang, J. Dong, T. Xu, J. Feng, and S. Yan. Attentive contexts for object detection. IEEE Transactions on Multimedia, 19(5):944–954, 2017.

[10] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. Lecture Notes in Computer Science, 8693:740–755, 2014.

[11] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640–651, 2016.

[12] A. Shrivastava and A. Gupta. Contextual priming and feedback for Faster R-CNN. Lecture Notes in Computer Science, 9905:330–348, 2016.

[13] X. Zeng, W. Ouyang, J. Yan, H. Li, T. Xiao, K. Wang, Y. Liu, Y. Zhou, B. Yang, Z. Wang, H. Zhou, and X. Wang. Crafting GBD-Net for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.