                       Segmenting out Generic Objects in Monocular Videos

                                  Jan Hula, David Adamczyk, David Mojzisek, and Vojtech Molek

                                            CE IT4I - IRAFM, University of Ostrava
                                         30. dubna 22, 701 03 Ostrava, Czech Republic
                           {jan.hula, vojtech.molek, david.adamczyk, david.mojzisek}@osu.cz




Figure 1: A schema depicting the improvement of clustering based on features from an autoencoder trained on objects segmented by our approach vs. one trained on the original images.

Abstract: We present an approach for generic object detection and segmentation in monocular videos. In this task, we want to segment objects from a background with no prior knowledge about the possible classes of objects which we may encounter. This makes the task much harder than classical object detection and segmentation, which can be posed as a supervised learning problem. Our approach uses an ensemble of three different models which are trained with different objectives and have different failure modes and therefore complement each other. We demonstrate the usefulness of our approach on a custom dataset containing 18 classes of organic objects. Using our method, we were able to recover the classes of objects in a fully unsupervised way.

   Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


1    Introduction

Separating generic objects from a background in monocular videos is a challenging task. We believe that this problem is essential to Computer Vision and, as such, has received a disproportionately small amount of attention from the research community. The ability to separate objects from the background would vastly simplify other tasks, as it can be viewed as a kind of dimensionality reduction onto the relevant features.
   In image classification, object separation prevents a classifier from learning spurious correlations, which could arise when a certain class is often captured on a particular background. Object separation from a background automatically restricts a classifier to consider only features directly tied to the class label, as opposed to features only correlated with it.
   As generic object separation is a rather nonstandard task and still vaguely defined – it is not clear what should be considered an independent object – we focus on a simplified setting in which a camera captures a single salient object. Our solution is an ensemble of three different models trained for different objectives.
   Furthermore, we study the impact of background removal on the clustering properties of the resulting representations. The representations are obtained by training a neural network in an unsupervised way. We are able to recover the categories of objects in a fully unsupervised way, using our custom dataset containing videos of organic things.
   Our main contributions are:

   • We present an ensemble model that can separate objects from the background in monocular videos containing one salient object. The ensembled models compensate for each other's failure modes.

   • We demonstrate the benefits of object separation by comparing the classification accuracy of objects with and without background using clustering.

   Section 2 contains related work. Section 3 describes our approach for detecting and segmenting generic objects within monocular videos. Section 4 describes how the detected objects enable unsupervised discovery of object classes.
In Section 5 we describe our experiments and the dataset we test our approach on, and finally we provide a conclusion in Section 6.


2   Related Work

Generic object separation is largely an unexplored area, and therefore similar works are scarce. Our approach is most related to the DINO method [1], which uses self-supervised learning with Vision Transformers. The authors introduced DINO as a form of self-distillation with no labels. They emphasize that DINO automatically learns an interpretable representation and separates the main object from the background clutter.
   Lu et al. introduced an approach called CO-attention Siamese Network (COSNet) [2] for unsupervised video object segmentation. It is based on two ideas. The first is the importance of inherent correlation among video frames, and the second is the global co-attention mechanism responsible for learning motion in short-term temporal segments. COSNet is trained on pairs of video frames, which increases the learning capacity.
   The task of class discovery is marginally related to current self-supervised approaches using large amounts of unlabeled data such as [3, 4] and approaches that try to exploit coherency in the data [5].
   Lastly, our approach for class discovery can be seen as a version of clustering with constraints, which has been heavily studied in the past, for example, by [6, 7].


3   Generic Object Detection and Segmentation

This section describes our approach to generic object detection and segmentation. By a generic object we mean an object of an unknown class. We use this term to distinguish it from classical object detection and segmentation, which can deal only with a concrete set of specified classes. Classical object detection and segmentation is much easier because it can be approached as a supervised learning problem on a dataset with annotated bounding boxes and segmentation masks. With generic objects, it is not that straightforward because it is not known in advance what kind of objects we will encounter at test time.
   Moreover, at first it may not be obvious how to define what should be considered a separate object. One useful definition would be that an object is anything that can move independently from the rest of the environment. In this view, we can understand generic object segmentation as a way to factorise the visual stream into independent components. We need to mention that this definition does not cover all cases in which we would like to detect something as a separate object. Examples include buildings, letters on a sheet of paper, and other “entities” which cannot move independently. Nonetheless, we consider this as our working definition because it allows us to make progress in generic object detection and segmentation.


3.1     Ensemble of Models Trained for Different Objectives

Our approach to the problem of generic object detection and segmentation is based on an ensemble of three models which are trained with different objectives. Even though each of these models has its own failure modes, together they constitute a robust ensemble. Concretely, we use one model trained for depth map prediction, one model trained for optical flow estimation, and one model trained for object tracking. Using the model for depth prediction, we can separate foreground objects based on depth; using the model trained for optical flow estimation, we can separate moving objects; and finally, using the tracker, we can verify the temporal coherency of our predictions. The tracker is initialised with a bounding box obtained from the predictions of the two other models in the frames where these predictions are most consistent. The following paragraphs provide a high-level description of these three models. For a more complete description of these models, see the respective publications.

Depth Prediction For the depth prediction, we use the model introduced by Ranftl et al. [8], available from the authors’ repository¹. This transformer-based model predicts a scalar value for each pixel, which represents the distance of the surface captured by that pixel from the camera center.

Optical Flow Estimation For optical flow estimation, we use the model introduced by Teed et al. [9]. It is also a transformer-based model, which requires two consecutive frames of video to produce the optical flow field. The optical flow field assigns two scalar values to each pixel. These values represent the pixel displacement on the x and y axes, relative to the previous frame. To obtain one scalar value for each pixel, we take the magnitude of the displacement. We used the implementation of the model with trained weights provided in the authors’ repository².

Object Tracking For tracking objects, we use a model called SiamMask [10], a neural network trained as a Siamese architecture which simultaneously performs both visual object tracking and object segmentation in a video. We used the implementation available online³.

      1 https://github.com/intel-isl/DPT
      2 https://github.com/princeton-vl/RAFT
      3 https://github.com/foolwood/SiamMask
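To make the later steps concrete, the following is a minimal sketch (not the authors' released code) of how the raw outputs of the depth and optical flow models could be reduced to the single-channel maps, rescaled to [0, 1], that are denoted f1(x) and f2(x) in Section 3.3. The min-max rescaling is our assumption for the rescaling step described there.

import numpy as np

def rescale01(m: np.ndarray) -> np.ndarray:
    """Min-max rescale a 2-D map to the interval [0, 1]."""
    return (m - m.min()) / (m.max() - m.min() + 1e-8)

def flow_magnitude(flow: np.ndarray) -> np.ndarray:
    """flow has shape (H, W, 2): per-pixel displacement on the x and y axes."""
    return np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)

def prepare_maps(depth: np.ndarray, flow: np.ndarray):
    """depth and flow are the raw per-frame predictions (our assumed shapes)."""
    f1 = rescale01(depth)
    f2 = rescale01(flow_magnitude(flow))
    return f1, f2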
3.2     Segmenting out Known Classes

In our dataset, each video captures a hand holding one object. Using our approach for generic object segmentation, which is based on predicted depth maps and optical flow, our model segments out the hand together with the object. We fix this issue by segmenting out hands separately with a model trained specifically for hand segmentation.
   To obtain training data for hand segmentation, we downloaded the following four datasets: GTEA, HandOverFace, GTEA_GAZE_PLUS, and EgoHands⁴,⁵,⁶. The architecture of the model is a UNet with timm_regnetty_160 [11] as the encoder and softmax2D as the activation. The encoder weights were pretrained on ImageNet.
   When trained only on the freely available hand-segmentation datasets mentioned above, the model did not work well on our dataset, probably because of a large distribution shift (most of the images in these public datasets contained hands in front of the face or were captured indoors).
   To mitigate this problem, we used a simple trick to enlarge the training data with images of hands which are similar to the images in our target dataset. Concretely, we captured our hands from a similar viewpoint as in our dataset, and then used the same model for depth prediction to produce depth maps for every 10th frame within the video. Finally, we thresholded the predicted depth maps to obtain reliable segmentation masks of hands. In this way, we obtained hundreds of labeled images of hands with minimal effort. After adding this dataset to the other datasets, we obtained an accurate model for hand segmentation.
   We use this model to remove hands from the mask predicted by our ensemble. More precisely, we remove the hands from the outputs of the model predicting the depth map and the model predicting the optical flow before we initialise the bounding box for the tracker. In this way we obtain masks only for the object, ignoring the hands.

      4 http://cbs.ic.gatech.edu/fpv/
      5 https://www.cl.cam.ac.uk/research/rainbow/emotions/hand.html
      6 http://vision.soic.indiana.edu/projects/egohands/
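As an illustration of the model described above, the hand-segmentation network could be instantiated with the segmentation_models_pytorch library [11] roughly as follows. This is a hedged sketch: the exact encoder identifier in the library ("timm-regnety_160") and the number of output classes are our assumptions, not taken verbatim from the paper.

import segmentation_models_pytorch as smp

hand_model = smp.Unet(
    encoder_name="timm-regnety_160",   # RegNetY-160 encoder (assumed identifier)
    encoder_weights="imagenet",        # encoder pretrained on ImageNet
    in_channels=3,
    classes=2,                         # hand vs. background (assumed)
    activation="softmax2d",            # per-pixel softmax over the classes
)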
3.3     Bounding Box Initialization

As mentioned above, we initialize the tracker with a bounding box obtained from the predictions of the depth and optical flow maps within each frame. These predictions are two rectangular matrices of the same shape. We rescale the range of values to the interval between 0 and 1 and denote the final matrices by f1(x) and f2(x), respectively.
   We first choose k frames where the predictions from these two models are most consistent. To measure this consistency, we devise the following heuristic. We first detect edges with a Canny edge detector [12] in both predictions and then measure the overlap of the resulting edges. To account for small deviations of edges in the two predictions, we blur them with a Gaussian kernel of width 7px so that they overlap if they are close to each other. Then we compute the consistency score c of frame x using the following formula:

   c(x) = \sum_{i \in \mathrm{Pixels}(x)} e^{\varepsilon_{1i} + \varepsilon_{2i}},    (1)

where ε1 and ε2 are the two blurred edge maps from the two predictions and i indexes individual pixels. Next, we obtain the aggregated predictions for each pixel i by:

   y(x)_i = e^{f_1(x)_i + f_2(x)_i} - \frac{1}{|\mathrm{Pixels}(x)|} \sum_{j \in \mathrm{Pixels}(x)} e^{f_1(x)_j + f_2(x)_j}.    (2)

We exponentiate the sum of the predictions from the two models because we want these predictions to interact superlinearly. We also subtract the mean of this value taken over all pixels within the image to make the aggregated predictions centered at zero. Therefore, the pixels where no object was predicted will contain negative values.
   Once we have the aggregated predictions for the selected frames, we initialize the bounding boxes for the tracker. For this, we again devise a score which captures how well a given bounding box (bbox) covers pixels with high values (signifying that an object is present) and at the same time excludes pixels with low values. It has the following form:

   \mathrm{bboxScore}(\mathrm{bbox}, y(x)) = \sum_{i \in \mathrm{Pixels}(x)} \mathrm{isInBbox}(i, \mathrm{bbox}) \cdot y(x)_i,    (3)

where the function isInBbox returns −1 if the pixel is not contained in the bounding box and 1 otherwise.
   Finally, we optimize the coordinates of the bounding box using CMA-ES [13], a derivative-free optimization algorithm for continuous parameters. The optimization tries to find coordinates which maximize this score. At the end of this procedure, we obtain k frames with bounding boxes in each video. The quality of the predictions and the resulting bounding box is shown in Figure 2.


Figure 2: Predictions from the model for optical flow estimation (middle) and depth estimation (right). The initialized bounding box is depicted in the RGB image.
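The following sketch illustrates one possible implementation of Equations 1–3 and the CMA-ES search for the initial bounding box, under our reading of the formulas. It uses OpenCV for the Canny edges and the cma package [13]; the Canny thresholds, the blur parameters, and the initial box are illustrative assumptions rather than the authors' settings.

import cv2
import numpy as np
import cma

def consistency(f1, f2):
    """Eq. 1: overlap of the blurred Canny edge maps of the two prediction maps."""
    e1 = cv2.Canny((f1 * 255).astype(np.uint8), 100, 200) / 255.0
    e2 = cv2.Canny((f2 * 255).astype(np.uint8), 100, 200) / 255.0
    e1 = cv2.GaussianBlur(e1, (7, 7), 0)   # 7px Gaussian blur, as in the text
    e2 = cv2.GaussianBlur(e2, (7, 7), 0)
    return float(np.exp(e1 + e2).sum())

def aggregate(f1, f2):
    """Eq. 2: exponentiated joint prediction, centred by its mean over the image."""
    y = np.exp(f1 + f2)
    return y - y.mean()

def bbox_score(bbox, y):
    """Eq. 3: pixels inside the box contribute +y, pixels outside contribute -y."""
    h, w = y.shape
    x0, y0, x1, y1 = [int(round(v)) for v in bbox]
    x0, x1 = np.clip([x0, x1], 0, w)
    y0, y1 = np.clip([y0, y1], 0, h)
    mask = -np.ones_like(y)
    mask[y0:y1, x0:x1] = 1.0
    return float((mask * y).sum())

def init_bbox(y):
    """Maximise Eq. 3 with CMA-ES (cma minimises, hence the sign flip)."""
    h, w = y.shape
    x0 = [w * 0.25, h * 0.25, w * 0.75, h * 0.75]   # illustrative initial box
    best, _ = cma.fmin2(lambda b: -bbox_score(b, y), x0, sigma0=0.2 * min(h, w))
    return best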
Figure 3: This figure depicts the whole pipeline of our approach (input video → ensemble of 3 models → segmented objects → low-dimensional representation → filter correlated frames → compute similarities → create similarity graph → community detection). The diagram describes the process from the first step, where we preprocess the input video, to the last step, where we obtain sets of similar objects.


3.4   Verification with the Tracker

Using the chosen frames and their bounding boxes, we initialize the tracker and let it track the object in between the selected frames. The tracker provides another layer of consistency check. Once we have the predictions from the three models (denoted by f1(x), f2(x) and f3(x)), we can treat the consistency of these predictions as a certainty of the whole ensemble. To measure this certainty, we again compute the consistency score as we did in the selection of reliable frames in Equation 1. Using an empirically estimated threshold, we filter out frames with low consistency, and for each pixel i in the filtered frames, we aggregate the predictions of the ensemble with the following formula:

   \mathrm{output}(x)_i = \frac{\min\left(e^{\sum_{j=1}^{3} f_j(x)_i} - 1,\; e^2 - 1\right)}{e^2 - 1}.    (4)

The subtraction of 1 ensures that we get 0 when all three models predict 0. Thresholding and dividing by e² − 1 ensures that we obtain a value close to 1 when at least two models predict values close to 1. Finally, we obtain a bounding box for each frame using the same method as in the bounding box initialization (optimization using CMA-ES), i.e., maximizing the objective in Equation 3.
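A small sketch of the aggregation in Equation 4, assuming the three prediction maps f1, f2, f3 are already rescaled to [0, 1] and stacked along the first axis:

import numpy as np

def aggregate_ensemble(f: np.ndarray) -> np.ndarray:
    """f has shape (3, H, W); returns the per-pixel ensemble output in [0, 1]."""
    e2 = np.exp(2.0) - 1.0
    return np.minimum(np.exp(f.sum(axis=0)) - 1.0, e2) / e2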
   The whole process can be viewed as certainty propagation. We first select a few frames where the first two models agree on their predictions, and from these the tracker propagates the certainty to other frames.

Evaluating the Quality of the Aggregated Predictions Our final task is the discovery of classes of objects within monocular videos. We view generic object detection and segmentation as an intermediate step towards this goal. Therefore, we evaluate the usefulness of our approach on this target task. That is, we test how well we are able to recover classes of objects from images where the objects were segmented out by our approach. We compare it to the setup where we use the same algorithm for class discovery but where we use the original images with a background. We also mention that we do not require pixel-perfect segmentation masks, as our goal is only to focus on the relevant parts of the image, so that the measured similarity between images will mostly reflect the similarity of objects and not of backgrounds. The next section describes our pipeline for the task of class discovery.


4     Class Discovery

The algorithm for class discovery was proposed in [14]. Its input is a set of videos, each containing one object and each represented as a sequence of images. The goal is to find clusters of videos based on the similarity between them. Generally, our algorithm works in three steps (a sketch of steps 2 and 3 is given below):

   1. Measure the similarity between every pair of videos with the method described in Section 4.1.

   2. Construct a similarity graph by connecting each video to its five most similar videos.

   3. Apply the Louvain community detection algorithm [15] to detect the highly interconnected parts of the graph and consider these as the discovered classes.

The advantage of the Louvain algorithm is that it needs no a priori knowledge of the number of clusters/communities.
   The accuracy of our approach is measured in two ways. First, by counting how many times a video was assigned
to an incorrect cluster. Second, by checking whether the algorithm
discovered all clusters. It is clear that the final accuracy
mostly reflects the measured similarity between individual
videos.
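As a sketch of steps 2 and 3 above, the similarity graph and the Louvain communities could be obtained as follows. The use of networkx's Louvain implementation is our assumption (any Louvain implementation would do), and sim stands for the symmetric video-similarity matrix computed with the method in Section 4.1.

import numpy as np
import networkx as nx

def discover_classes(sim: np.ndarray, k: int = 5):
    """Build a k-nearest-neighbour similarity graph over videos and detect communities."""
    n = sim.shape[0]
    graph = nx.Graph()
    graph.add_nodes_from(range(n))
    for i in range(n):
        # the k most similar other videos (excluding the video itself)
        neighbours = [j for j in np.argsort(-sim[i]) if j != i][:k]
        for j in neighbours:
            graph.add_edge(i, j, weight=float(sim[i, j]))
    # Louvain needs no a-priori number of communities
    communities = nx.community.louvain_communities(graph, weight="weight")
    return [sorted(c) for c in communities]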


4.1   Computing Similarity between a Pair of Videos

Each video is represented by a sequence of images, but to
compute the similarity, we ignore the ordering and treat the
sequence as a set. The similarity between the two videos
is computed in the following four steps:

  1. Train an autoencoder using images from all videos
     to obtain a low-dimensional representation zi of each
     image xi .

  2. In each video, select n representative frames which
     are not correlated, as described in Section 4.2.

  3. For each pair of videos, compute all pairwise similar-
     ities d(zi , z j ) with cosine distance.

  4. Finally, select the l most similar pairs of images and
     average their similarities to obtain the final similarity
     between two videos.


Figure 4: Top: A graph of cosine similarity between the first and other frames in the video. Bottom: Selected frames from the video with their corresponding frame numbers. The two highlighted frames correspond to two arrows in the graph.


   The intuition behind step 4 is that videos of similar objects may contain only a few frames where these objects are captured from the same angle or in the same situation.
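Steps 3 and 4 of this computation can be sketched as follows; za and zb are the embeddings of the representative frames of the two videos, and the value of l is a hyperparameter we leave unspecified.

import numpy as np

def video_similarity(za: np.ndarray, zb: np.ndarray, l: int = 5) -> float:
    """za, zb have shape (n, dim); returns the averaged top-l cosine similarity."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    pairwise = za @ zb.T                     # cosine similarity of every frame pair
    top_l = np.sort(pairwise.ravel())[-l:]   # the l most similar image pairs
    return float(top_l.mean())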

4.2   Filtering out Correlated Frames

Step 2 of similarity computation takes n representative
frames. If we simply used all frames from each video,
the distribution of the dataset could end up skewed
because some parts of a video may be more static than
others. These static parts would produce many correlated
frames. Therefore, the correlated frames need to be filtered
out from a given video. We first test whether the subjec-
tive visual similarity of images can be captured by cosine
similarity between their low-dimensional representations obtained in step 1 of the similarity computation. As can be seen in Figure 4, it captures the visual similarity well enough.
   To extract n uncorrelated frames from each video, we run k-means clustering, where k = n, on the low-dimensional representations and take the frame most similar to every centroid of the resulting clusters. This simple heuristic produces uncorrelated images.
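A minimal sketch of this frame-selection heuristic, assuming the per-frame embeddings from the autoencoder are already available:

import numpy as np
from sklearn.cluster import KMeans

def select_representative_frames(z: np.ndarray, n: int):
    """z has shape (num_frames, dim); returns the indices of n uncorrelated frames."""
    km = KMeans(n_clusters=n, n_init=10, random_state=0).fit(z)
    selected = []
    for c in range(n):
        dists = np.linalg.norm(z - km.cluster_centers_[c], axis=1)
        selected.append(int(dists.argmin()))   # frame closest to the c-th centroid
    return selected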
   To conclude this section, if the low-dimensional representation of individual images obtained by the autoencoder reflects the similarity between the captured objects and not some other irrelevant factors, we may expect to obtain meaningful clusters. Moreover, note the benefit of creating the similarity graph over videos instead of individual images. All images in one video are automatically linked together. If a few images in two videos are similar, this similarity is propagated to other frames within the videos, which would otherwise not be linked based only on the similarity. Each video also encodes the constraint that the object cannot change its class in time.


Figure 5: Samples from the Organic Objects dataset. Images are cropped and have a blurred background.


5   Experiments

To test the algorithm for object discovery, we assembled a custom dataset of organic objects. The dataset contains 18 classes of organic objects, some of which are depicted in Figure 5. We have chosen organic objects because they naturally produce large variability between instances. For every class, we capture ten different samples from different viewpoints. The final dataset can be downloaded at the following address – github.com/Jan21/
Organic-objects-dataset.
   Using our ensemble described in Section 3, we segment the object in every frame of each video. Using the resulting bounding boxes and segmentation masks, we crop each image and blur the background to suppress the distinctive features present in the background.
   To obtain the low-dimensional representations used to filter out correlated frames and construct the similarity graph, we resize all images to a fixed resolution of 64 × 64 pixels and train a convolutional autoencoder. The autoencoder has five convolutional layers (16, 32, 64, 128, 256 filters with stride 2) and one fully-connected layer⁷ with dimensions 1024 → 96.⁸

      7 The image is downscaled to 2 × 2 × 256 filters, i.e., a 1024-dimensional input vector to the fully-connected layer.
      8 We also tried to extract representations using VGG16 pre-trained on ImageNet. These representations better discriminated very similar objects (e.g., two types of red flowers).
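A hedged PyTorch sketch of such an autoencoder is given below: five stride-2 convolutions (16, 32, 64, 128, 256 filters) reduce a 64 × 64 × 3 image to a 2 × 2 × 256 tensor (1024 values), followed by a 96-dimensional fully-connected bottleneck. The decoder simply mirrors the encoder, which is our assumption, as its exact form is not specified in the paper.

import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self, latent_dim: int = 96):
        super().__init__()
        channels = [3, 16, 32, 64, 128, 256]
        enc = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            enc += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1), nn.ReLU()]
        self.encoder = nn.Sequential(*enc)                    # 3x64x64 -> 256x2x2
        self.to_latent = nn.Linear(2 * 2 * 256, latent_dim)   # 1024 -> 96
        self.from_latent = nn.Linear(latent_dim, 2 * 2 * 256)
        dec = []
        for c_in, c_out in zip(channels[:0:-1], channels[-2::-1]):
            dec += [nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1), nn.ReLU()]
        dec[-1] = nn.Sigmoid()                                 # reconstruction in [0, 1]
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        z = self.to_latent(self.encoder(x).flatten(1))         # low-dimensional representation
        out = self.decoder(self.from_latent(z).view(-1, 256, 2, 2))
        return out, z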


Figure 6: Visualization of detected communities in the Organic Objects dataset with the Louvain method. Nodes are colored according to the component (community) they are assigned to. The method discovered 18 components which belong to 18 different classes.


5.1     Results

After running community detection on the similarity graph, we inspected how many image bundles were assigned to the wrong component. Out of 173 videos, only 5 in the training set were assigned to the wrong component. The result of community detection on the constructed similarity graph is shown in Figure 6. Numerical results and a comparison with clustering of non-segmented images are presented in Table 1. From the accuracy and the number of discovered classes, it is clear that background removal created a significant accuracy difference of 56.1% and a difference in the number of correctly discovered classes.

                          Pre-segmenting the objects       Number of discovered classes   Accuracy
                          Yes                              18                             97.1%
                          No                               15                             41.0%

Table 1: Comparison of the clustering of videos with and without the background removed. The accuracy is computed by checking how many times a video was assigned to an incorrect cluster.


6      Conclusion

In this contribution, we present a method for generic object detection and segmentation, which uses an ensemble of three different models trained with three different objectives. To demonstrate the effectiveness of the approach, we have created a custom dataset of organic objects. The dataset was used in our pipeline to remove the background, create low-dimensional representations, and perform class discovery and classification. We have shown that background removal significantly increases the accuracy of the classification and the number of correctly discovered classes.
   In future work, we plan to optimize our approach for speed, generalize it to work with videos containing multiple objects, and make the class discovery an online process that can discover new classes on-the-fly.


References

 [1] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” arXiv preprint arXiv:2104.14294, 2021.
 [2] X. Lu, W. Wang, C. Ma, J. Shen, L. Shao, and F. Porikli, “See more, know more: Unsupervised video object segmentation with co-attention siamese networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3623–3632, 2019.
 [3] Q. Xie, E. Hovy, M.-T. Luong, and Q. V. Le, “Self-training with noisy student improves imagenet classification,” arXiv preprint arXiv:1911.04252, 2019.
 [4] H. Bagherinezhad, M. Horton, M. Rastegari, and A. Farhadi, “Label refinery: Improving imagenet classification through label progression,” ArXiv, vol. abs/1805.02641, 2018.
 [5] P. Bhattacharjee and S. Das, “Temporal coherency based criteria for predicting video frames using deep multi-stage generative adversarial networks,” in Advances in Neural Information Processing Systems, pp. 4268–4277, 2017.
 [6] I. Davidson and S. Ravi, “Clustering with constraints: Feasibility issues and the k-means algorithm,” in Proceedings of the 2005 SIAM International Conference on Data Mining, pp. 138–149, SIAM, 2005.
 [7] S. Basu, M. Bilenko, A. Banerjee, and R. J. Mooney, “Probabilistic semi-supervised clustering with constraints,” Semi-supervised learning, pp. 71–98, 2006.
 [8] R. Ranftl, A. Bochkovskiy, and V. Koltun, “Vision transformers for dense prediction,” arXiv preprint arXiv:2103.13413, 2021.
 [9] Z. Teed and J. Deng, “Raft: Recurrent all-pairs field transforms for optical flow,” in European Conference on Computer Vision, pp. 402–419, Springer, 2020.
[10] Q. Wang, L. Zhang, L. Bertinetto, W. Hu, and P. H. Torr, “Fast online object tracking and segmentation: A unifying approach,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1328–1338, 2019.
[11] P. Yakubovskiy, “Segmentation models pytorch.” https://github.com/qubvel/segmentation_models.pytorch, 2020.


[12] J. Canny, “A computational approach to edge detection,”
     IEEE Transactions on pattern analysis and machine intel-
     ligence, no. 6, pp. 679–698, 1986.
[13] N. Hansen, “The cma evolution strategy: a comparing re-
     view,” Towards a new evolutionary computation, pp. 75–
     102, 2006.
[14] J. Hula, “Unsupervised object-aware learning from videos,”
     in 2020 IEEE Third International Conference on Data
     Stream Mining & Processing (DSMP), pp. 237–242, IEEE,
     2020.
[15] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefeb-
     vre, “Fast unfolding of communities in large networks,”
     Journal of statistical mechanics: theory and experiment,
     vol. 2008, no. 10, p. P10008, 2008.