A Context Aware Deep Learning Architecture for Object Detection

Kevin Bardool, Tinne Tuytelaars, José Oramas
ESAT-PSI, KU Leuven, Belgium

August 2019

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

A notable feature of our visual sensory system is its ability to exploit contextual cues present in a scene, enhancing our perception and understanding of the image. Exploring methods of incorporating and learning such information using deep neural network (DNN) architectures is a relatively young field of research, with room for more studies. In this work we propose a new architecture aimed at learning contextual relationships and improving the precision of existing DNN-based object detectors. An off-the-shelf detector is modified to extract contextual cues present in scenes. We implement a DNN-based architecture aimed at learning this information. A synthetic image generator is implemented that generates random images while enforcing a set of simple, predetermined contextual relationships. Finally, a series of experiments is carried out to evaluate the effectiveness of our design by measuring the improvement in average precision.

2. Background

There have been various attempts to categorize sources of contextual information [2, 5, 7]. Biederman groups relationships between an object and its surroundings into five classes: interposition, support, probability, position, and size [2]. The three latter relationships are of interest to us: probability, the likelihood of an object appearing in a particular scene; position, the expectation that when certain objects are present, they normally occupy predictable positions; and size, the expectation that an object has a predictable size relative to other objects and the general scene [7].

In works aimed at exploiting contextual information in DNN-based object detectors, two main approaches stand out. One uses contextual information as a feedback method that guides the generation of initial object proposals [12, 3]. A second approach involves the extraction and use of contextual information after proposal selection, during the scoring stage [13, 9, 1].

Whereas in these approaches the use of contextual information is intertwined with the object detection architecture, we take a different approach: the separation of appearance detection and contextual reasoning. The object detector remains responsible for generating appearance-based information as well as detection and localization. Additionally, it is used to construct contextual feature descriptors that are passed on to a secondary model responsible for learning contextual relationships. While in some works a secondary model was used as a source of contextual information flowing into the object detector, our design reverses this flow of information: contextual information from the object detector is passed on to a secondary model trained to learn contextual relationships. At inference time, the secondary model is used to re-evaluate the object detector's proposals.

3. Methodology

Our architectural pipeline consists of two stages (Figure 1). The first stage is an off-the-shelf object detector. For this, the Mask R-CNN model [8] was selected. A new network layer was implemented in the object detector to generate per-class contextual feature maps. These heatmaps are constructed using the confidence scores and bounding boxes produced by the Mask R-CNN object detection and localization heads.

The secondary model is trained to learn semantic relationships using the contextual feature maps generated by the primary object detector. For this stage, a DNN model based on the Fully Convolutional Network (FCN) architecture [11] was implemented.
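Returning to the first stage: the construction of the per-class heatmaps is not detailed beyond the use of confidence scores and bounding boxes. The following is a minimal sketch of one plausible implementation, in which each detection deposits a confidence-weighted 2D Gaussian centered on its box; the Gaussian shape, the map resolution, and the name build_contextual_heatmaps are illustrative assumptions, not the paper's actual code.

    import numpy as np

    def build_contextual_heatmaps(detections, num_classes, map_size=(128, 128)):
        # detections: list of (class_id, score, (x1, y1, x2, y2)) tuples,
        # with box coordinates already expressed in heatmap space.
        h, w = map_size
        heatmaps = np.zeros((num_classes, h, w), dtype=np.float32)
        ys, xs = np.mgrid[0:h, 0:w]
        for class_id, score, (x1, y1, x2, y2) in detections:
            # Center a Gaussian on the box centroid, with a spread tied to
            # box size (an assumption: the paper only states that scores
            # and boxes are used to build the maps).
            cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
            sx = max((x2 - x1) / 4.0, 1.0)
            sy = max((y2 - y1) / 4.0, 1.0)
            g = np.exp(-((xs - cx) ** 2 / (2 * sx ** 2)
                         + (ys - cy) ** 2 / (2 * sy ** 2)))
            heatmaps[class_id] += score * g  # confidence-weighted contribution
        return heatmaps

Each class channel thus accumulates evidence for where, and with what confidence, instances of that class were detected, which is the information the secondary model consumes.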
The output of this secondary model is also a series of contextual feature maps, representing its confidence in the original detections based on the contextual relationships it has learned.

A scoring layer, added to both models, is implemented to produce a 'contextual score' for each proposed detection. Scores are calculated using the contextual feature maps generated by the object detector and the contextual model. They are used for comparison and AP calculations, and allow us to measure the impact of the contextual learner on object detection performance.

Figure 1: Our proposed architectural pipeline

For training and evaluation, two types of datasets were considered. First, a dataset of synthetically generated images containing a series of consistently enforced contextual relationships. Such a dataset allows us to train the pipeline on relatively simple content, introducing contextual cues in a controlled manner. In addition, a subset of the COCO dataset [10] was selected as a real-world dataset.

Training was conducted using various choices of loss functions and scoring algorithms. Eventually a binary cross-entropy loss was selected, and contextual scoring was performed using a localized summation over a tight region surrounding each object proposal's centroid, defined by its predicted bounding box.
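As a concrete illustration of this scoring rule, the sketch below sums a contextual feature map over the region around the proposal's centroid, bounded by its predicted box. Treating the full predicted box as the "tight region" is an assumption, as is the name contextual_score; the paper does not specify the window size.

    import numpy as np

    def contextual_score(class_heatmap, box):
        # class_heatmap: 2D contextual feature map for the proposal's class.
        # box: (x1, y1, x2, y2) predicted bounding box in heatmap coordinates.
        h, w = class_heatmap.shape
        x1, y1, x2, y2 = (int(round(v)) for v in box)
        x1, y1 = max(x1, 0), max(y1, 0)
        x2, y2 = min(x2, w), min(y2, h)
        # Localized summation over the region surrounding the proposal's
        # centroid, bounded by the predicted box.
        return float(class_heatmap[y1:y2, x1:x2].sum())

A higher score indicates that the surrounding contextual evidence supports the proposal; computing it on the maps of both stages is what allows the before/after AP comparison described above.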
4. Experiments

Our experiments have mainly focused on measuring the capacity of our design to learn the various contextual relationships enforced by the synthetic image generator. The contextual scores are used to compute the average precision (AP) and mean average precision (mAP) as defined in the evaluation protocols of the Pascal VOC and COCO challenges [6, 4].

Inference on a set of 500 images demonstrates a context-based mAP improvement of approximately 1.3 points. However, the Mask R-CNN based softmax score outperforms our context-score APs (Figure 2).

Figure 2: Detection results on test toy dataset.

We also measure the model's capacity to detect the expected spatial context of objects. A controlled set of hypotheses that includes object proposals positioned outside their expected spatial location is generated and passed to the contextual reasoning model. The success of the contextual model in rejecting such false positives indicates that it is able to learn spatial constraints (Figure 3).

Figure 3: AP comparison on images with out-of-context proposals. Dark green/red bars represent the contextual learner's AP.

Another set of experiments was conducted to test the model's capability to recognize spatial relations enforced between objects belonging to different classes. Here we were able to confirm that the contextual model does recognize such spatial relationships within a limited range (Figure 4).

Figure 4: Maximal score vs. relative positioning of car and person objects. Green box indicates the region where correct relative positioning results in maximal scores.

Experiments measuring the model's capability to learn semantic co-occurrence relationships between classes (i.e., objects of different classes that appear together in scenes) determined that our model is unable to learn such relationships.

In the synthetic images, a sense of depth is created by scaling object sizes relative to a horizon line present in the image. Our contextual model was able to learn the relationship between relative size and vertical location for different classes, favoring larger objects when positioned lower in the image (i.e., closer to the observer). A toy version of this size-depth rule is sketched below.
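The paper only states that sizes are scaled relative to a horizon line; the following minimal sketch assumes a linear scaling between the horizon and the bottom of the image, with illustrative names throughout.

    def sample_object_size(y_pos, horizon_y, base_size, img_height=480):
        # Objects placed further below the horizon line are closer to the
        # observer and therefore drawn larger. Linear scaling is an
        # assumption; the generator's actual rule is not specified.
        if y_pos <= horizon_y:
            return 0  # at or above the horizon: no object is placed (assumption)
        depth_factor = (y_pos - horizon_y) / float(img_height - horizon_y)
        return max(1, int(base_size * depth_factor))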
5. Conclusions

We propose a two-stage architecture that extracts and learns contextual relationships using feature maps that encode such information. The results of our experiments show that the context-based model is able to learn intra- and inter-class spatial relationships. Additionally, it is able to learn the relation between the size of an object and its depth in the scene. However, we have not seen robustness towards learning the co-occurrence of semantically related objects. Continuing experiments on the more challenging COCO dataset and investigating methods to induce learning of semantic co-occurrence relationships are open avenues for future work.

References

[1] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-Outside Net: Detecting objects in context with skip pooling and recurrent neural networks, 2015.

[2] I. Biederman, R. J. Mezzanotte, and J. C. Rabinowitz. Scene perception: Detecting and judging objects undergoing relational violations. Cognitive Psychology, 14(2):143–177, 1982.

[3] X. Chen and A. Gupta. Spatial memory for context reasoning in object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4106–4116, 2017.

[4] COCO Consortium. COCO: Common objects in context - detection evaluation. URL: https://cocodataset.org/detection-eval, last accessed on 2019-03-23.

[5] S. Divvala, D. Hoiem, J. Hays, A. Efros, and M. Hebert. An empirical study of context in object detection. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1271–1278, 2009.

[6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

[7] C. Galleguillos and S. Belongie. Context based object categorization: A critical survey. Computer Vision and Image Understanding, 114(6):712–722, 2010.

[8] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.

[9] J. Li, Y. Wei, X. Liang, J. Dong, T. Xu, J. Feng, and S. Yan. Attentive contexts for object detection. IEEE Transactions on Multimedia, 19(5):944–954, 2017.

[10] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. Lecture Notes in Computer Science, 8693:740–755, 2014.

[11] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640–651, 2016.

[12] A. Shrivastava and A. Gupta. Contextual priming and feedback for Faster R-CNN. Lecture Notes in Computer Science, 9905:330–348, 2016.

[13] X. Zeng, W. Ouyang, J. Yan, H. Li, T. Xiao, K. Wang, Y. Liu, Y. Zhou, B. Yang, Z. Wang, H. Zhou, and X. Wang. Crafting GBD-Net for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.