=Paper=
{{Paper
|id=Vol-2962/paper49
|storemode=property
|title=Segmenting out Generic Objects in Monocular Videos
|pdfUrl=https://ceur-ws.org/Vol-2962/paper49.pdf
|volume=Vol-2962
|authors=Jan Hůla,David Adamczyk,David Mojžíšek,Vojtech Molek
|dblpUrl=https://dblp.org/rec/conf/itat/HulaAMM21
}}
==Segmenting out Generic Objects in Monocular Videos ==
Jan Hůla, David Adamczyk, David Mojžíšek, and Vojtech Molek
CE IT4I - IRAFM, University of Ostrava, 30. dubna 22, 701 03 Ostrava, Czech Republic
{jan.hula, vojtech.molek, david.adamczyk, david.mojzisek}@osu.cz

Figure 1: A schema depicting the improvement of clustering based on features from an autoencoder trained on objects segmented by our approach versus one trained on the original images.

Abstract: We present an approach for generic object detection and segmentation in monocular videos. In this task, we want to segment objects from a background with no prior knowledge about the possible classes of objects which we may encounter. This makes the task much harder than classical object detection and segmentation, which can be posed as a supervised learning problem. Our approach uses an ensemble of three different models which are trained by different objectives, have different failure modes, and therefore complement each other. We demonstrate the usefulness of our approach on a custom dataset containing 18 classes of organic objects. Using our method, we were able to recover the classes of objects in a fully unsupervised way.

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Separating generic objects from a background in monocular videos is a challenging task. We believe that this problem is essential to Computer Vision and, as such, has received a disproportionately small amount of attention from the research community. The ability to separate objects from the background would vastly simplify other tasks, as it can be viewed as a kind of dimensionality reduction onto the relevant features.

In image classification, object separation prevents a classifier from learning spurious correlations, which could arise when a certain class is often captured on a particular background. Separating an object from the background automatically restricts a classifier to features directly tied to the class label, as opposed to features that are only correlated with it.

As generic object separation is a rather nonstandard task and still vaguely defined – it is not clear what should be considered an independent object – we focus on a simplified setting in which a camera captures a single salient object. Our solution is an ensemble of three different models trained for different objectives.

Furthermore, we study the impact of background removal on the clustering properties of the resulting representations. The representations are obtained by training a neural network in an unsupervised way. Using our custom dataset containing videos of organic objects, we are able to recover the categories of objects in a fully unsupervised way.

Our main contributions are:

• We present an ensemble model that can separate objects from the background in monocular videos containing one salient object. The ensembled models compensate for each other's failure modes.

• We demonstrate the benefits of object separation by comparing the classification accuracy of objects with and without background using clustering.

Section 2 contains related work. Section 3 describes our approach for detecting and segmenting generic objects within monocular videos. Section 4 describes how the detected objects enable unsupervised discovery of object classes. In Section 5 we describe our experiments and the dataset we test our approach on, and finally we provide a conclusion in Section 6.
2 Related Work

Generic object separation is largely an unexplored area, and therefore similar works are scarce. Our approach is most related to the DINO method [1], which uses self-supervised learning with Vision Transformers. The authors introduce DINO as a form of self-distillation with no labels. They emphasize that DINO automatically learns an interpretable representation and separates the main object from the background clutter.

Lu et al. introduced an approach called CO-attention Siamese Network (COSNet) [2] for unsupervised video object segmentation. It is based on two ideas: the first is the importance of the inherent correlation among video frames, and the second is a global co-attention mechanism responsible for learning motion in short-term temporal segments. COSNet is trained on pairs of video frames, which increases the learning capacity.

The task of class discovery is marginally related to current self-supervised approaches using large amounts of unlabeled data, such as [3, 4], and to approaches that try to exploit coherency in the data [5].

Lastly, our approach to class discovery can be seen as a version of clustering with constraints, which has been heavily studied in the past, for example by [6, 7].

3 Generic Object Detection and Segmentation

This section describes our approach to generic object detection and segmentation. By a generic object we mean an object of an unknown class. We use this term to distinguish the task from classical object detection and segmentation, which can deal only with a concrete set of specified classes. Classical object detection and segmentation is much easier because it can be approached as a supervised learning problem on a dataset with annotated bounding boxes and segmentation masks. With generic objects, it is not that straightforward, because it is not known in advance what kind of objects we will encounter at test time.

Moreover, at first it may not be obvious how to define what should be considered a separate object. One useful definition is that an object is anything that can move independently from the rest of the environment. In this view, we can understand generic object segmentation as a way to factorise the visual stream into independent components. We need to mention that this definition does not cover all cases in which we would like to detect something as a separate object; examples include buildings, letters on a sheet of paper, and other "entities" which cannot move independently. Nonetheless, we consider this our working definition because it allows us to make progress in generic object detection and segmentation.

3.1 Ensemble of Models Trained for Different Objectives

Our approach to the problem of generic object detection and segmentation is based on an ensemble of three models which are trained by different objectives. Even though each of these models has its own failure modes, together they constitute a robust ensemble. Concretely, we use one model trained for depth map prediction, one model trained for optical flow estimation, and one model trained for object tracking. Using the model for depth prediction, we can separate foreground objects based on depth; using the model trained for optical flow estimation, we can separate moving objects; and finally, using the tracker, we can verify the temporal coherency of our predictions. The tracker is initialised with a bounding box obtained from the predictions of the two other models in the frames where these predictions are most consistent. The following paragraphs provide a high-level description of these three models. For a more complete description, see the respective publications.

Depth Prediction. For the depth prediction, we use the model introduced by Ranftl et al. [8], available from the authors' repository (https://github.com/intel-isl/DPT). This transformer-based model predicts a scalar value for each pixel, which represents the distance of the surface captured by that pixel from the camera center.

Optical Flow Estimation. For optical flow estimation, we use the model introduced by Teed et al. [9]. It is also a transformer-based model; it requires two consecutive video frames to produce the optical flow field. The optical flow field assigns two scalar values to each pixel, representing the pixel displacement on the x and y axes relative to the previous frame. To obtain one scalar value per pixel, we take the magnitude of the displacement. We used the implementation with trained weights provided in the authors' repository (https://github.com/princeton-vl/RAFT).

Object Tracking. For tracking objects, we use a model called SiamMask [10], a neural network trained as a Siamese architecture which simultaneously performs visual object tracking and object segmentation in a video. We used the implementation available online (https://github.com/foolwood/SiamMask).
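For illustration, converting the two-channel optical flow field into a single per-pixel magnitude map and rescaling the depth and flow maps to the interval [0, 1] (as used later for the bounding box initialization) could look like the following sketch. This is a minimal NumPy-only sketch; the function names and the commented-out model calls are our assumptions, not code from the authors' repositories.

```python
import numpy as np

def flow_magnitude(flow):
    """Collapse an (H, W, 2) optical-flow field into an (H, W) map by taking
    the Euclidean norm of the per-pixel displacement."""
    return np.linalg.norm(flow, axis=-1)

def rescale01(prediction, eps=1e-8):
    """Min-max rescale a prediction map to the interval [0, 1]."""
    return (prediction - prediction.min()) / (prediction.max() - prediction.min() + eps)

# Hypothetical usage (model calls are placeholders, not the real APIs):
# depth = dpt_model(frame)                    # (H, W) depth map
# flow = raft_model(prev_frame, frame)        # (H, W, 2) flow field
# f1 = rescale01(depth)                       # f1(x) in the notation below
# f2 = rescale01(flow_magnitude(flow))        # f2(x) in the notation below
```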
3.2 Segmenting out Known Classes

In our dataset, each video captures a hand holding one object. Using our approach for generic object segmentation, which is based on predicted depth maps and optical flow, our model segments out the hand together with the object. We fix this issue by segmenting out hands separately with a model trained specifically for hand segmentation.

To obtain training data for hand segmentation, we downloaded the following four datasets: GTEA, HandOverFace, GTEA_GAZE_PLUS, and EgoHands (see http://cbs.ic.gatech.edu/fpv/, https://www.cl.cam.ac.uk/research/rainbow/emotions/hand.html, and http://vision.soic.indiana.edu/projects/egohands/). The architecture of the model is a UNet with timm_regnetty_160 [11] as encoder and softmax2D as activation. The encoder weights were pretrained on ImageNet.

Using only the freely available hand-segmentation datasets mentioned above, the trained model did not work well on our dataset, probably because of a large distribution shift (most of the images in these public datasets contain hands in front of the face or were captured indoors). To mitigate this problem, we used a simple trick to enlarge the training data with images of hands similar to the images in our target dataset. Concretely, we captured our hands from a similar viewpoint as in our dataset and then used the same model for depth prediction to produce depth maps for every 10th frame of the video. Finally, we thresholded the predicted depth maps to obtain reliable segmentation masks of hands. In this way, we obtained hundreds of labeled images of hands with minimal effort. After adding this dataset to the other datasets, we obtained an accurate model for hand segmentation.

We use this model to remove hands from the mask predicted by our ensemble. More precisely, we remove the hands from the outputs of the model predicting the depth map and the model predicting the optical flow before we initialise the bounding box for the tracker. In this way we obtain masks only for the object, ignoring the hands.
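A minimal sketch of this hand-segmentation setup, assuming the segmentation_models_pytorch API cited as [11]. The exact encoder identifier, the number of output classes, and the depth threshold are our assumptions and may differ from the authors' configuration.

```python
import numpy as np
import segmentation_models_pytorch as smp

# UNet with an ImageNet-pretrained RegNetY encoder and a 2D softmax output.
# The encoder name is an assumption; the paper reports a timm RegNetY-160 encoder.
hand_model = smp.Unet(
    encoder_name="timm-regnety_160",
    encoder_weights="imagenet",
    in_channels=3,
    classes=2,               # background vs. hand
    activation="softmax2d",
)

def pseudo_hand_mask(depth_map, threshold=0.5):
    """Pseudo-labeling trick from Section 3.2: threshold a predicted depth map
    to get a rough hand mask, assuming larger values mean closer to the camera
    (as with inverse-depth/disparity outputs). The threshold is illustrative."""
    depth = (depth_map - depth_map.min()) / (depth_map.max() - depth_map.min() + 1e-8)
    return (depth > threshold).astype(np.uint8)
```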
3.3 Bounding Box Initialization

As mentioned above, we initialize the tracker with a bounding box obtained from the predictions of the depth and optical flow models within each frame. These predictions are two rectangular matrices of the same shape. We rescale their values to the interval between 0 and 1 and denote the resulting matrices by f_1(x) and f_2(x), respectively.

We first choose the k frames where the predictions of these two models are most consistent. To measure this consistency, we devise the following heuristic. We detect edges in both predictions using a Canny edge detector [12] and then measure the overlap of the resulting edges. To account for small deviations between the edges of the two predictions, we blur them with a Gaussian kernel of width 7 px, so that they overlap if they are close to each other. Then we compute the consistency score c of frame x as

c(x) = \sum_{i \in \mathrm{Pixels}(x)} e^{\varepsilon_{1i} + \varepsilon_{2i}},    (1)

where \varepsilon_1 and \varepsilon_2 are the two blurred edge maps of the two predictions and i indexes individual pixels. Next, we obtain the aggregated prediction for each pixel i as

y(x)_i = e^{f_1(x)_i + f_2(x)_i} - \frac{1}{|\mathrm{Pixels}(x)|} \sum_{j \in \mathrm{Pixels}(x)} e^{f_1(x)_j + f_2(x)_j}.    (2)

We exponentiate the sum of the predictions from the two models because we want these predictions to interact superlinearly. We also subtract the mean of this value over all pixels within the image to make the aggregated predictions centered at zero; therefore, pixels where no object was predicted will contain negative values.

Once we have the aggregated predictions for the selected frames, we initialize the bounding boxes for the tracker. For this, we again devise a score which captures how well a given bounding box (bbox) covers pixels with high values (signifying that an object is present) and at the same time excludes pixels with low values. It has the following form:

\mathrm{bboxScore}(\mathrm{bbox}, y(x)) = \sum_{i \in \mathrm{Pixels}(x)} \mathrm{isInBbox}(i, \mathrm{bbox}) \cdot y(x)_i,    (3)

where the function isInBbox returns −1 if the pixel is not contained in the bounding box and 1 otherwise.

Finally, we optimize the coordinates of the bounding box using CMA-ES [13], a derivative-free algorithm for the optimization of continuous parameters. The optimization tries to find coordinates which maximize this score. At the end of this procedure, we obtain k frames with bounding boxes in each video. The quality of the predictions and the resulting bounding box is shown in Figure 2.

Figure 2: Predictions from the model for optical flow estimation (middle) and depth estimation (right). The initialized bounding box is depicted in the RGB image.
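The consistency score (Eq. 1), the aggregated map (Eq. 2), the bounding-box score (Eq. 3), and the CMA-ES search could be implemented roughly as in the sketch below, using OpenCV for the Canny/Gaussian steps and the cma package for the derivative-free optimization. The Canny thresholds, the initial box, the step size, and all helper names are our assumptions.

```python
import cv2
import numpy as np
import cma

def consistency_score(f1, f2):
    """Eq. 1: overlap of blurred Canny edges of the two prediction maps in [0, 1]."""
    e1 = cv2.Canny((f1 * 255).astype(np.uint8), 100, 200) / 255.0
    e2 = cv2.Canny((f2 * 255).astype(np.uint8), 100, 200) / 255.0
    e1 = cv2.GaussianBlur(e1, (7, 7), 0)   # 7 px kernel as in the paper
    e2 = cv2.GaussianBlur(e2, (7, 7), 0)
    return float(np.exp(e1 + e2).sum())

def aggregate(f1, f2):
    """Eq. 2: superlinear combination of the two maps, centered at zero."""
    y = np.exp(f1 + f2)
    return y - y.mean()

def bbox_score(bbox, y):
    """Eq. 3: +1 weight for covered pixels, -1 for uncovered ones, times y."""
    h, w = y.shape
    x0, x1 = sorted(int(v) for v in np.clip([bbox[0], bbox[2]], 0, w))
    y0, y1 = sorted(int(v) for v in np.clip([bbox[1], bbox[3]], 0, h))
    inside = np.full(y.shape, -1.0)
    inside[y0:y1, x0:x1] = 1.0
    return float((inside * y).sum())

def init_bbox(y):
    """Maximize Eq. 3 with CMA-ES (which minimizes, hence the sign flip)."""
    h, w = y.shape
    start = [w * 0.25, h * 0.25, w * 0.75, h * 0.75]   # initial guess: centered box
    es = cma.CMAEvolutionStrategy(start, 0.1 * max(h, w))
    es.optimize(lambda b: -bbox_score(b, y))
    return es.result.xbest
```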
Figure 3: The whole pipeline of our approach, from preprocessing of the input video to the final sets of similar objects: input video → ensemble of 3 models → segmented objects → low-dimensional representation → filter correlated frames → compute similarities → create similarity graph → community detection.

3.4 Verification with the Tracker

Using the chosen frames and their bounding boxes, we initialize the tracker and let it track the object between the selected frames. The tracker provides another layer of consistency checking. Once we have the predictions of the three models (denoted by f_1(x), f_2(x), and f_3(x)), we can treat the consistency of these predictions as the certainty of the whole ensemble. To measure this certainty, we again compute the consistency score, as we did when selecting reliable frames with Equation 1. Using an empirically estimated threshold, we filter out frames with low consistency, and for each pixel i of the remaining frames we aggregate the predictions of the ensemble with the following formula:

\mathrm{output}(x)_i = \frac{\min\left(e^{\sum_{j=1}^{3} f_j(x)_i} - 1,\; e^2 - 1\right)}{e^2 - 1}.    (4)

The subtraction of 1 ensures that we get 0 when all three models predict 0. Thresholding and dividing by e^2 − 1 ensures that we obtain a value close to 1 when at least two models predict values close to 1. Finally, we obtain a bounding box for each frame using the same method as in the bounding box initialization, i.e., optimizing the bounding-box score of Equation 3 with CMA-ES.

The whole process can be viewed as certainty propagation. We first select a few frames where the first two models agree on their predictions, and from these the tracker propagates the certainty to the other frames.
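Per pixel, the aggregation of Eq. 4 is a small function. A sketch, vectorized over the whole frame and assuming the three maps are already rescaled to [0, 1]:

```python
import numpy as np

def ensemble_output(f1, f2, f3):
    """Eq. 4: exponentiate the summed predictions, shift so that all-zero inputs
    give 0, and clip/normalize so that two confident models already give ~1."""
    s = np.exp(f1 + f2 + f3) - 1.0
    return np.minimum(s, np.e ** 2 - 1.0) / (np.e ** 2 - 1.0)
```

When all three maps are 0 the output is 0; when at least two maps are close to 1, the numerator reaches the cap e^2 − 1 and the output saturates at 1.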
Evaluating the Quality of the Aggregated Predictions. Our final task is the discovery of classes of objects within monocular videos. We view generic object detection and segmentation as an intermediate step towards this goal. Therefore, we evaluate the usefulness of our approach on this target task. That is, we test how well we are able to recover the classes of objects from images where the objects were segmented out by our approach. We compare this with a setup in which we use the same algorithm for class discovery but on the original images with background. We also note that we do not require pixel-perfect segmentation masks, as our goal is only to focus on the relevant parts of the image, so that the measured similarity between images mostly reflects the similarity of objects and not of backgrounds. The next section describes our pipeline for the task of class discovery.

4 Class Discovery

The algorithm for class discovery was proposed in [14]. Its input is a set of videos, each containing one object and each represented as a sequence of images. The goal is to find clusters of videos based on the similarity between them. Our algorithm works in three steps:

1. Measure the similarity between every pair of videos with the method described in Section 4.1.

2. Construct a similarity graph by connecting each video to its five most similar videos.

3. Apply the Louvain community detection algorithm [15] to detect the highly interconnected parts of the graph and consider these the discovered classes.

The advantage of the Louvain algorithm is that it needs no a priori knowledge of the number of clusters/communities.

The accuracy of our approach is measured in two ways: first, by counting how many times a video was assigned to an incorrect cluster, and second, by checking whether the algorithm discovered all clusters. Clearly, the final accuracy mostly reflects the measured similarity between individual videos.

4.1 Computing Similarity between a Pair of Videos

Each video is represented by a sequence of images, but to compute the similarity, we ignore the ordering and treat the sequence as a set. The similarity between two videos is computed in the following four steps (a code sketch of the whole procedure is given after Section 4.2):

1. Train an autoencoder using images from all videos to obtain a low-dimensional representation z_i of each image x_i.

2. In each video, select n representative frames which are not correlated, as described in Section 4.2.

3. For each pair of videos, compute all pairwise similarities d(z_i, z_j) with cosine distance.

4. Finally, select the l most similar pairs of images and average their similarities to obtain the final similarity between the two videos.

The intuition behind step 4 is that videos of similar objects may contain only a few frames where these objects are captured from the same angle or in the same situation.

Figure 4: Top: a graph of the cosine similarity between the first and the other frames of a video. Bottom: selected frames from the video with their corresponding frame numbers. The two highlighted frames correspond to the two arrows in the graph.

4.2 Filtering out Correlated Frames

Step 2 of the similarity computation takes n representative frames. If we simply used all frames from each video, the distribution of the dataset could end up skewed, because some parts of a video may be more static than others, and these static parts would produce many correlated frames. Therefore, the correlated frames need to be filtered out of each video. We first test whether the subjective visual similarity of images is captured by the cosine similarity between their low-dimensional representations obtained in step 1 of the similarity computation. As can be seen in Figure 4, it captures the visual similarity well enough.

To extract n uncorrelated frames from each video, we run k-means clustering with k = n on the low-dimensional representations and take the frame most similar to each centroid of the resulting clusters. This simple heuristic produces uncorrelated images.

To conclude this section, if the low-dimensional representations of individual images obtained by the autoencoder reflect the similarity between the captured objects and not some other irrelevant factors, we may expect to obtain meaningful clusters. Moreover, note the benefit of creating the similarity graph over videos instead of individual images: all images in one video are automatically linked together, so if a few images in two videos are similar, this similarity is propagated to the other frames of the videos, which would otherwise not be linked based on similarity alone. The video acts as a constraint saying that the object cannot change its class in time.

Figure 5: Samples from the Organic Objects dataset. Images are cropped and have a blurred background.
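The sketch below condenses Sections 4–4.2: select n uncorrelated frames per video with k-means on the autoencoder embeddings, score each pair of videos by the mean of their l most similar frame pairs, build a 5-nearest-neighbour similarity graph, and run Louvain community detection. It assumes scikit-learn, a networkx version (≥ 2.8) that ships louvain_communities, and embeddings that were already computed; the helper names and the default values of n, l, and k_neighbours are ours.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import louvain_communities
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def representative_frames(embeddings, n=10):
    """Sec. 4.2: k-means with k = n on frame embeddings; keep the frame closest
    to each centroid to remove correlated frames."""
    km = KMeans(n_clusters=n, n_init=10).fit(embeddings)
    closest = cosine_similarity(km.cluster_centers_, embeddings).argmax(axis=1)
    return embeddings[np.unique(closest)]

def video_similarity(za, zb, l=5):
    """Steps 3-4 of Sec. 4.1: average the l highest cosine similarities
    among all frame pairs of the two videos."""
    pairwise = cosine_similarity(za, zb).ravel()
    return float(np.sort(pairwise)[-l:].mean())

def discover_classes(videos_embeddings, n=10, l=5, k_neighbours=5):
    """Sec. 4: similarity graph over videos followed by Louvain communities."""
    reps = [representative_frames(z, n) for z in videos_embeddings]
    m = len(reps)
    sim = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            sim[i, j] = sim[j, i] = video_similarity(reps[i], reps[j], l)
    graph = nx.Graph()
    graph.add_nodes_from(range(m))
    for i in range(m):
        for j in np.argsort(sim[i])[::-1][:k_neighbours]:
            graph.add_edge(i, int(j), weight=sim[i, j])
    return louvain_communities(graph, weight="weight")
```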
5 Experiments

To test the algorithm for object discovery, we assembled a custom dataset of organic objects. The dataset contains 18 classes of organic objects, some of which are depicted in Figure 5. We have chosen organic objects because they naturally produce large variability between instances. For every class, we captured ten different samples from different viewpoints. The final dataset can be downloaded at github.com/Jan21/Organic-objects-dataset.

Using the ensemble described in Section 3, we segment the object in every frame of each video. Using the resulting bounding boxes and segmentation masks, we crop each image and blur the background to suppress the distinctive features present in the background.

To obtain the low-dimensional representations used to filter out correlated frames and to construct the similarity graph, we resize all images to a fixed resolution of 64 × 64 pixels and train a convolutional autoencoder. The autoencoder has five convolutional layers (16, 32, 64, 128, 256 filters with stride 2) and one fully-connected layer with dimensions 1024 → 96. (The image is downscaled to 2 × 2 × 256 feature maps, giving a 1024-dimensional input vector to the fully-connected layer. We also tried to extract representations using a VGG16 pre-trained on ImageNet; these representations better discriminated very similar objects, e.g., two types of red flowers.)
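A PyTorch sketch of an autoencoder matching the description above (five stride-2 convolutions with 16–256 filters and a 1024 → 96 fully-connected bottleneck); the kernel sizes, activations, and decoder layout are our assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """64x64x3 input -> 96-dimensional code -> 64x64x3 reconstruction."""
    def __init__(self, code_dim=96):
        super().__init__()
        chans = [3, 16, 32, 64, 128, 256]
        enc = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            enc += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                    nn.ReLU(inplace=True)]
        self.encoder = nn.Sequential(*enc)          # 64x64 -> 2x2x256 = 1024 features
        self.to_code = nn.Linear(1024, code_dim)    # 1024 -> 96 bottleneck
        self.from_code = nn.Linear(code_dim, 1024)
        dec = []
        for c_in, c_out in zip(chans[:0:-1], chans[-2::-1]):
            dec += [nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                    nn.ReLU(inplace=True)]
        dec[-1] = nn.Sigmoid()                      # final layer outputs pixels in [0, 1]
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        code = self.to_code(self.encoder(x).flatten(1))
        hidden = self.from_code(code).view(-1, 256, 2, 2)
        return self.decoder(hidden), code

# Example: reconstruction, code = ConvAutoencoder()(torch.randn(1, 3, 64, 64))
```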
Figure 6: Visualization of the communities detected in the Organic Objects dataset with the Louvain method. Nodes are colored according to the component (community) they are assigned to. The method discovered 18 components which belong to the 18 different classes.

5.1 Results

After running community detection on the similarity graph, we inspected how many image bundles were assigned to the wrong component. Out of 173 videos, only 5 in the training set were assigned to the wrong component. The result of community detection on the constructed similarity graph is shown in Figure 6. Numerical results and a comparison with clustering of non-segmented images are presented in Table 1. From the accuracy and the number of discovered classes, it is clear that background removal created a significant accuracy difference of 56.1% and a difference in the number of correctly discovered classes.

Table 1: Comparison of the clustering of videos with and without the background removed. The accuracy is computed by checking how many times a video was assigned to an incorrect cluster.

| Pre-segmenting the objects | Number of discovered classes | Accuracy |
|---|---|---|
| Yes | 18 | 97.1% |
| No | 15 | 41.0% |

6 Conclusion

In this contribution, we present a method for generic object detection and segmentation which uses an ensemble of three different models trained for three different objectives. To demonstrate the effectiveness of the approach, we have created a custom dataset of organic objects. The dataset was used in our pipeline to remove the background, create low-dimensional representations, and perform class discovery and classification. We have shown that background removal significantly increases the accuracy of the classification and the number of correctly discovered classes. In future work, we plan to optimize our approach for speed, generalize it to work with videos containing multiple objects, and make the class discovery an online process that can discover new classes on-the-fly.

References

[1] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, "Emerging properties in self-supervised vision transformers," arXiv preprint arXiv:2104.14294, 2021.

[2] X. Lu, W. Wang, C. Ma, J. Shen, L. Shao, and F. Porikli, "See more, know more: Unsupervised video object segmentation with co-attention siamese networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3623–3632, 2019.

[3] Q. Xie, E. Hovy, M.-T. Luong, and Q. V. Le, "Self-training with noisy student improves ImageNet classification," arXiv preprint arXiv:1911.04252, 2019.

[4] H. Bagherinezhad, M. Horton, M. Rastegari, and A. Farhadi, "Label refinery: Improving ImageNet classification through label progression," arXiv preprint arXiv:1805.02641, 2018.

[5] P. Bhattacharjee and S. Das, "Temporal coherency based criteria for predicting video frames using deep multi-stage generative adversarial networks," in Advances in Neural Information Processing Systems, pp. 4268–4277, 2017.

[6] I. Davidson and S. Ravi, "Clustering with constraints: Feasibility issues and the k-means algorithm," in Proceedings of the 2005 SIAM International Conference on Data Mining, pp. 138–149, SIAM, 2005.

[7] S. Basu, M. Bilenko, A. Banerjee, and R. J. Mooney, "Probabilistic semi-supervised clustering with constraints," Semi-Supervised Learning, pp. 71–98, 2006.

[8] R. Ranftl, A. Bochkovskiy, and V. Koltun, "Vision transformers for dense prediction," arXiv preprint arXiv:2103.13413, 2021.

[9] Z. Teed and J. Deng, "RAFT: Recurrent all-pairs field transforms for optical flow," in European Conference on Computer Vision, pp. 402–419, Springer, 2020.

[10] Q. Wang, L. Zhang, L. Bertinetto, W. Hu, and P. H. Torr, "Fast online object tracking and segmentation: A unifying approach," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1328–1338, 2019.

[11] P. Yakubovskiy, "Segmentation models pytorch," https://github.com/qubvel/segmentation_models.pytorch, 2020.

[12] J. Canny, "A computational approach to edge detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 6, pp. 679–698, 1986.

[13] N. Hansen, "The CMA evolution strategy: A comparing review," Towards a New Evolutionary Computation, pp. 75–102, 2006.

[14] J. Hůla, "Unsupervised object-aware learning from videos," in 2020 IEEE Third International Conference on Data Stream Mining & Processing (DSMP), pp. 237–242, IEEE, 2020.

[15] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, "Fast unfolding of communities in large networks," Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, p. P10008, 2008.