S. Krajči (ed.): ITAT 2018 Proceedings, pp. 100–107
CEUR Workshop Proceedings Vol. 2203, ISSN 1613-0073, © 2018 Oliver Kerul’-Kmec, Petr Pulc, and Martin Holeňa



                                    Semisupervised Segmentation of UHD Video

                                             Oliver Kerul’-Kmec1 , Petr Pulc1,2 , Martin Holeňa2
                      1  Faculty of Information Technology, Czech Technical University, Thákurova 7, Prague, Czech Republic
                 2   Institute of Computer Science, Czech Academy of Sciences, Pod vodárenskou věží 2, Prague, Czech Republic

Abstract: One of the key preprocessing tasks in information retrieval from video is the segmentation of the scene, primarily its segmentation into foreground objects and the background. This is actually a classification task, but with the specific property that it is very time consuming and costly to obtain human-labelled training data for classifier training. That suggests using semisupervised classifiers to this end. The presented work in progress reports the investigation of semisupervised classification methods based on cluster regularization and on fuzzy c-means in connection with the foreground / background segmentation task. To classify as many video frames as possible using only a single human-labelled frame, the semisupervised classification is combined with a frequently used keypoint detector based on a combination of a corner detection method with a visual descriptor method. The paper experimentally compares both methods, and for the first of them, also classifiers with different delays between the human-labelled video frame and classifier training.

1 Introduction

For the indexing of multimedia content, it is beneficial to have annotations of actors, objects or any other information that can occur in a video. A vital preprocessing task to prepare such annotations is the segmentation of the scene into foreground objects and the background.

Traditional methods, such as Gaussian mixture modelling, work on the pixel level and are time consuming on higher resolution video [1]. Another simple method models the background through image averaging; however, it requires a static camera [6]. Our approach, on the other hand, is based on the level of detected interest points, and uses semi-supervised classification to assign those points as belonging either to the foreground objects or to the background.

In the next section, we introduce the keypoint detector we employed for the detection of points of interest. Section 3 recalls two methods of semi-supervised classification we used in our approach. The approach itself is outlined in Section 4. Finally, Section 5 presents the results of its experimental validation performed so far.

2 Scene Segmentation in the Context of Video Preprocessing

In each frame of the video, a keypoint detector is used to detect points of interest and compute their descriptors. In our research, a combination of the corner detection method FAST (Features from Accelerated Segment Test) with the visual descriptor method BRIEF (Binary Robust Independent Elementary Features) is used to this end, known as ORB (Oriented FAST and Rotated BRIEF) [7]. Points of interest detected in a frame are always attempted to be matched to those detected in the next frame. Such matching points are searched in a two-step fashion:

(i) Only the points of interest in the spatial neighbourhood of the expected position are considered. That position is based on the last known interest point position and its past motion (if available).

(ii) Among the points of interest resulting from (i), as well as among all those detected in the current frame for which no information about their past motion is available, points in the previous frame are searched based on the Hamming distance between the descriptors of both points.

Whereas the dependence of matching success on the difference between positions of the points and on the movement of the first point has a straightforward geometric meaning, its dependence on the Hamming distance between their descriptors has a probabilistic character. In [7], this dependence was investigated, and it was found that if the Hamming distance between the 256-bit binary descriptors of the points is greater than 64, then the probability of a successful match is less than 5%.

If two points of interest in subsequent frames are considered matching, the point in the later frame is added to the history vector of the point in the previous frame. In this way, we get the motion description of each point of interest.

3 Semi-supervised Classification

Traditional supervised classification techniques use only labelled instances in the learning phase. In situations where the number of available labelled instances is insufficient, or labelling is expensive and time consuming, semi-supervised classification can be employed, which uses both labelled and unlabelled instances for learning.

In the reported research, we used the following two methods for semisupervised classification.
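As a concrete illustration, the two-step matching of Section 2 can be sketched in Python with NumPy. This is a minimal sketch, not the actual implementation (which builds on the ORB implementation of [7]): descriptors are stand-ins for ORB's 256-bit binary descriptors, the expected position is simplified to the point's previous position, and all helper names are ours.

```python
import numpy as np

# Threshold from [7]: beyond Hamming distance 64, a successful match
# has probability below 5%.
HAMMING_MAX = 64

def hamming(d1, d2):
    """Hamming distance between binary descriptors stored as uint8 arrays
    (32 bytes = 256 bits, the size of an ORB descriptor)."""
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())

def match_points(prev_pts, prev_desc, cur_pts, cur_desc, radii):
    """Two-step matching: (i) keep only candidates in the spatial
    neighbourhood of the expected position, (ii) among them take the
    candidate with the smallest descriptor distance, if it is <= 64.
    Returns a list of (index in previous frame, index in current frame)."""
    matches = []
    for i, (p, dp) in enumerate(zip(prev_pts, prev_desc)):
        # step (i): spatial gate around the expected position
        candidates = [j for j, q in enumerate(cur_pts)
                      if np.linalg.norm(q - p) < radii[i]]
        # step (ii): best Hamming match among the candidates
        best, best_dist = None, HAMMING_MAX + 1
        for j in candidates:
            dist = hamming(dp, cur_desc[j])
            if dist < best_dist:
                best, best_dist = j, dist
        if best is not None:
            matches.append((i, best))
    return matches
```

In the actual pipeline, the expected position and the search radii come from the motion model of Section 4.2.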
3.1 Semisupervised Classification with Cluster Regularization

The principle of this method, described in detail in [8], consists in clustering all labelled and unlabelled instances and estimating, for each instance x_k, k = 1, ..., N, its probability distribution q_k on the set of clusters. In addition, the following penalty function is proposed for the differences between the pairs (q_k, q_n) of probability distributions of the instances:

   P(q_k, q_n) = sin((π/2) (r(q_k, q_n) s(q_k, q_n))^κ),   k, n = 1, ..., N, k ≠ n,   (1)

where r(q_k, q_n) denotes the Pearson correlation coefficient between q_k and q_n, κ is a parameter controlling the steepness of the mapping from similarity to penalty, and s(q_k, q_n) is a normalized similarity of the probability distributions q_k and q_n, defined as

   s(q_k, q_n) = 1 − (‖q_k − q_n‖ − d_min) / (d_max − d_min)   (2)

using the notation

   d_min = min Q,  d_max = max Q,  with Q = {‖q_k − q_n‖ | k, n = 1, ..., N, k ≠ n}.   (3)

The results of clustering allow assigning pseudolabels to unlabelled instances. In particular, the pseudolabel assigned for the j-th among the M considered clusters to an unlabelled instance x_n in a cluster Ψ is

   ŷ_{n,j} = exp(Σ_{x_k ∈ Ψ labelled} y_{k,j}) / Σ_{i=1}^M exp(Σ_{x_k ∈ Ψ labelled} y_{k,i}),   (4)

where y_{k,i}, i = 1, ..., M, is a crisp or fuzzy label of the labelled instance x_k for the class i. For uniformity of notation, the symbol ŷ_{k,j}, j = 1, ..., M, can also be used for y_{k,j} if x_k is labelled.

The penalty function (1) can be used as a regularization modifier in some loss function L : [0, 1]² → [0, +∞) measuring the discrepancy between the classifier outputs F(x_n) = ((F(x_n))_1, ..., (F(x_n))_M) for an instance x_n and the corresponding labels (y_{n,1}, ..., y_{n,M}) or pseudolabels (ŷ_{n,1}, ..., ŷ_{n,M}):

   E = (1/N) Σ_{j=1}^M [ Σ_{x_n labelled} L((F(x_n))_j, y_{n,j})
         + Σ_{x_n unlabelled} (λ max(q_n) / |φ(x_n)|) Σ_{x_k ∈ φ(x_n)} P(q_k, q_n) L((F(x_k))_j, ŷ_{k,j}) ],   (5)

where λ > 0 is a given parameter determining the tradeoff between supervised loss and unsupervised regularization, and the set of instances x_k ≠ x_n with the highest value of P(q_k, q_n) is denoted φ(x_n).

In [8], the following design decisions have been made for the loss function and the classifier in (5):

1. The employed loss function can be derived from D_KL((ŷ_{n,1}, ..., ŷ_{n,M}) ‖ F(x_n)), the Kullback-Leibler divergence from classifier outputs to labels or pseudolabels. If both the labels or pseudolabels and the classifier outputs form probability distributions on classes, then

      D_KL((ŷ_{n,1}, ..., ŷ_{n,M}) ‖ F(x_n)) = Σ_{j=1}^M ŷ_{n,j} ln((F(x_n))_j / ŷ_{n,j}),   n = 1, ..., N.   (6)

   Therefore, the considered loss function is

      L((F(x_n))_j, ŷ_{n,j}) = ŷ_{n,j} ln((F(x_n))_j / ŷ_{n,j}),   n = 1, ..., N, j = 1, ..., M.   (7)

2. As a classifier, a multilayer perceptron with one hidden layer is used, such that the activation function g in its hidden layer is smooth and includes no bias, and its output layer performs the softmax normalization of the hidden layer. Hence,

      (F(x))_j = exp(g(w_j·^⊤ x)) / Σ_{i=1}^M exp(g(w_i·^⊤ x)).   (8)

The weight vectors w_1·, ..., w_M· in (8) are learned through the minimization of the error function (5).

3.2 Semi-supervised Kernel-Based Fuzzy C-means

This method, described in detail in [9], originated from the fuzzy c-means clustering algorithm [2]. Similarly to the original fuzzy c-means, the method is parametrized by a parameter m > 1. What makes this method more general than the original fuzzy c-means is its dependence on the choice of some kernel K, i.e., a symmetric function on pairs (x, y) of clustered vectors which has positive semidefinite Gram matrices (e.g., Gaussian or polynomial kernels). In fact, the fuzzy c-means algorithm corresponds to the choice K(x, y) = x^⊤ y.

First, the membership matrix U^l is constructed for clustering the n_l labelled instances x_1^l, ..., x_{n_l}^l into as many clusters as there are classes, i.e., M. For j = 1, ..., M, k = 1, ..., n_l,

   U^l_{j,k} = 1 if the instance x_k^l is labelled with the class j, 0 else.   (9)

From U^l, the initial cluster centers are constructed as

   v_j^0 = Σ_{k=1}^{n_l} U^l_{j,k} x_k^l / Σ_{k=1}^{n_l} U^l_{j,k},   j = 1, ..., M.   (10)
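The constructions (9)–(10) and the membership computation (11) can be sketched in Python with NumPy. This is a sketch under the assumption of a Gaussian kernel (the choice also used in Section 4.3); all function names are ours, not from the actual implementation.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel K(x, y) = exp(-||x - y||^2 / sigma^2)."""
    return np.exp(-np.linalg.norm(x - y) ** 2 / sigma ** 2)

def initial_centers(X_l, labels, M):
    """Eqs. (9)-(10): crisp membership matrix U^l of the labelled
    instances and the initial cluster centers v^0 (class means)."""
    n_l = len(X_l)
    U_l = np.zeros((M, n_l))
    U_l[labels, np.arange(n_l)] = 1.0
    centers = (U_l @ X_l) / U_l.sum(axis=1, keepdims=True)
    return U_l, centers

def membership_unlabelled(X_u, centers, m=2.0, sigma=1.0):
    """Eq. (11): fuzzy memberships U^{u,t} of the unlabelled instances
    with respect to the current cluster centers."""
    # (1 - K(x, v)) for every center/instance pair; clipped away from 0
    # to avoid division by zero when an instance coincides with a center
    D = np.array([[max(1.0 - gaussian_kernel(x, v, sigma), 1e-12)
                   for x in X_u] for v in centers])
    W = D ** (-1.0 / (m - 1.0))
    return W / W.sum(axis=0, keepdims=True)
```

Each column of the returned membership matrix sums to one, so an unlabelled instance distributes its membership over the M clusters, with more mass on closer centers.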
If for some t = 0, 1, ..., the cluster centers v_1^t, ..., v_M^t are available, e.g., from (10), then they are used together with the chosen kernel K to construct the membership matrix U^{u,t} for clustering the n_u unlabelled instances x_1^u, ..., x_{n_u}^u, as follows:

   U^{u,t}_{j,k} = (1 − K(x_k^u, v_j^t))^{−1/(m−1)} / Σ_{i=1}^M (1 − K(x_k^u, v_i^t))^{−1/(m−1)},   j = 1, ..., M, k = 1, ..., n_u.   (11)

Finally, the cluster centers are updated, for t = 0, 1, ..., by calculating

   v_j^{t+1} = [Σ_{k=1}^{n_l} (U^l_{j,k})^m K(x_k^l, v_j^t) x_k^l + Σ_{k=1}^{n_u} (U^{u,t}_{j,k})^m K(x_k^u, v_j^t) x_k^u]
                / [Σ_{k=1}^{n_l} (U^l_{j,k})^m K(x_k^l, v_j^t) + Σ_{k=1}^{n_u} (U^{u,t}_{j,k})^m K(x_k^u, v_j^t)].   (12)

The computations (11)–(12) are iterated until at least one of the following termination criteria is reached:

(i) ‖U^{u,t} − U^{u,t−1}‖ < ε, t ≥ 1, for a given matrix norm ‖·‖ and a given ε > 0;

(ii) a given maximal number of iterations t_max.

4 Proposed Approach

4.1 Overall Strategy

Our methodology for the segmentation of video frames into foreground objects and background relies on the assumption that the user typically assigns corresponding labels to points of interest only in the first frame, and not necessarily to all detected points of interest.

No matter whether the considered method of semisupervised classification is semisupervised classification with cluster regularization or semi-supervised kernel-based fuzzy c-means, the methodology always proceeds in the following steps:

1. In the first frame, the user labels some of the points of interest detected by the ORB detector.

2. Using the considered method of semisupervised classification, the remaining detected points of interest are labelled.

3. Matching points detected in the next frame are assigned the same labels as the points to which they are matched.

4. Using the considered method of semisupervised classification, the remaining points of interest detected in the next frame are labelled.

5. Steps 3 and 4 are repeated till either the points of interest in all frames have been classified, or the scene has been so much disrupted between two frames that no points of interest could be matched between them (in such a case, new labelling by the user is needed).

4.2 Implementation of Object Segmentation

The Cartesian coordinates ([p]_1, [p]_2) of a point of interest p are expressed with respect to the top left corner of the frame, using the frame height and width as units. Due to that, [p]_1 and [p]_2 are normalized to [0, 1].

For a match between points of interest p_k and p_{k+1} in subsequent frames k and k+1, the following criteria have been used:

(i) The point p_{k+1} must lie within the radius r_k^p from the estimated new position p̂_k of the point:

      ‖p_{k+1} − p̂_k‖ < r_k^p.   (13)

   Here, the estimated position p̂_k is calculated as

      p̂_k = p_k + c_1 (p_k − p_{k−1}) if p_{k−1} is available, p̂_k = p_k else,   (14)

   where c_1 > 0, and the radius r_k^p is calculated as

      r_k^p = (u_k^p W)²,   (15)

   where u_k^p quantifies the uncertainty pertaining to the point p_k in the k-th frame, and W denotes the frame width (in the units in which point positions are expressed). The uncertainty u^p is set to u_1^p = c_2 > 0 in the first frame and is then evolved from frame to frame through linear scaling above a lower limit c_3 > 0:

      u_{k+1}^p = max(c_3, c_4 u_k^p) if p_k is matched, u_{k+1}^p = c_5 u_k^p if p_k is not matched,   (16)

   where 0 < c_4 < 1, c_5 > 1. Moreover, if the evolution (16) leads to u_{k+1}^p > c_6 for some c_6 > c_3, then the point p is deactivated and no longer considered for matching.

(ii) The Hamming distance between the 256-bit binary descriptors of the points is at most 64.

The choice of the real-valued constants in criterion (i) has been based on the resolution of the video (4K), on the frame rate (25) and on the defaults of the ORB implementation based on [7]. They have been set to the following values: c_1 = 0.6, c_2 = 0.02, c_3 = 0.009, c_4 = 0.9, c_5 = 1.1, c_6 = 0.03.

In each frame, the described implementation was used to find the 500 most interesting points. On a Linux computer with a 3.3 GHz Intel Xeon E3-1230 processor, this took 95.32 ms.
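The matching criteria (13)–(16) can be sketched in Python as follows. The constants are the values reported in Section 4.2; the function names are ours, and the sketch assumes normalized coordinates (so W = 1), not the actual implementation built on the ORB defaults.

```python
import numpy as np

# Constants from Section 4.2, chosen for 4K video at 25 fps;
# coordinates are normalized to [0, 1], so the frame width W is 1.
C1, C2, C3, C4, C5, C6 = 0.6, 0.02, 0.009, 0.9, 1.1, 0.03
W = 1.0

def predict_position(p_k, p_prev=None):
    """Eq. (14): linear motion prediction, falling back to the
    current position when no past position is available."""
    return p_k if p_prev is None else p_k + C1 * (p_k - p_prev)

def search_radius(u_k):
    """Eq. (15): the search radius grows with the point's uncertainty."""
    return (u_k * W) ** 2

def update_uncertainty(u_k, matched):
    """Eq. (16): shrink towards the floor c3 on a match, grow by c5 otherwise."""
    return max(C3, C4 * u_k) if matched else C5 * u_k

def is_active(u_k):
    """A point whose uncertainty exceeds c6 is deactivated."""
    return u_k <= C6

def is_match(p_next, p_k, u_k, p_prev=None, hamming_dist=0):
    """Criteria (i) and (ii): the spatial gate (13) plus a descriptor
    Hamming distance of at most 64."""
    within = np.linalg.norm(p_next - predict_position(p_k, p_prev)) < search_radius(u_k)
    return within and hamming_dist <= 64
```

Repeated misses thus inflate a point's uncertainty until it crosses c_6 and the point drops out of matching, while a run of matches shrinks the uncertainty towards the floor c_3.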
4.3 Implementation of Semi-supervised Classifiers

As input features for both classification methods, the Cartesian coordinates ([p_k]_1, [p_k]_2) of the point in the k-th frame and the polar coordinates ([p_k − p_{k−1}]_‖, [p_{k+1} − p_k]_φ) of its movement with respect to the previous frame are used.

In the implementation of the semisupervised classification with cluster regularization method described in 3.1, we used k-means clustering for an initial clustering of all instances. Although this method allows choosing the number of clusters independently of the number of classes, we have set it to the same value for comparability with semi-supervised kernel-based fuzzy c-means, i.e., to the value 2 corresponding to the classes of foreground objects and background. Hence, we performed k-means clustering with k = 2. Since the k-means algorithm does not output a probability distribution on the set of clusters, we employed a simple procedure proposed in [8] to transform the original distances from an instance x_n to the cluster centers v_1, ..., v_k into a probability distribution q_n, which assures that x_n more likely belongs to clusters to whose centers it is closer:

   (q_n)_i = (1 − ‖x_n − v_i‖ / Σ_{j=1}^k ‖x_n − v_j‖) / (k − 1).   (17)

Consequently, for our case k = 2:

   (q_n)_1 = ‖x_n − v_2‖ / (‖x_n − v_1‖ + ‖x_n − v_2‖),   (18)
   (q_n)_2 = ‖x_n − v_1‖ / (‖x_n − v_1‖ + ‖x_n − v_2‖).   (19)

The remaining parameters pertaining to semisupervised classification with cluster regularization were set as proposed in [8]: λ = 0.2, κ = 2, |φ(x_n)| = 10.

For the semi-supervised kernel-based fuzzy c-means algorithm described in 3.2, we used a Gaussian kernel function K(x, y) = exp(−‖x − y‖²/σ²) for updating the membership matrix, where the parameter σ is computed as proposed in [9]:

   σ = √((1/M) Σ_{n=1}^N ‖x_n − v‖² / N),   (20)

where v is the center of all instances. The remaining parameters were set as follows: m = 2, ε = 0.001, t_max = 50.

5 Experimental Validation

5.1 Employed Data

For the validation of the proposed approach, we prepared 12 short videos. In all videos, there is a yellow or blue balloon as a foreground object and a green background. On the background, there are a few small red sticky notes to help detect some key points. The videos were recorded in UHD resolution.

Here is a brief characterization of all employed videos:

• a handheld camera, both the foreground object and the background are sharp,
• a handheld camera, only the foreground object is sharp (2 videos),
• a static camera, only the background is sharp (2 videos),
• a static camera, only the background is sharp, the foreground object is close to the camera,
• a static camera, only the foreground object is sharp, a hand is interfering with the background (2 videos),
• a static camera, only the foreground object is sharp, it is moving towards the camera,
• a static camera, only the foreground object is sharp, it is moving away from the camera,
• a static camera, only the foreground object is sharp, it passes the scene multiple times (2 videos).

For the testing, labels were available for all points of interest. Unfortunately, those labels were often unreliable.

5.2 Results and Their Analysis

On all the employed videos, we measured the quality of classification by means of the accuracy, sensitivity, specificity and F-measure of both implemented classification methods.

For the fuzzy c-means method, the accuracy and specificity on the unlabelled data are illustrated for four particular videos in Figure 1.

For the cluster-regularization method, we compared the values of the considered four quality measures obtained with five classifiers trained in each of the five first video frames, with respect to the delay between classifier training and measuring its quality. The results of their comparison are summarized, for three particular delays of 1 frame, 5 frames and 10 frames, in Table 1. In addition, for delays up to 50 frames, they are again illustrated for accuracy and sensitivity on the four videos used already in connection with the fuzzy c-means classifier, in Figures 2–5.

Figures 2–5 indicate that classifiers trained in a later frame tend to have higher accuracy and specificity, but in general, the differences between classifiers trained in different frames are small. This is confirmed by the Friedman test for delays of 1, 5 and 10 frames between classifier training and measuring its quality, and for all four considered quality measures. The hypothesis of equality of all five classifiers is rejected (p-value < 5%) only for the delay of 1 frame and the F-measure, and weakly rejected (p-value < 10%) for the delay of 1 frame and the sensitivity, as well as for the delay of 5 frames and the F-measure. A posthoc test expectedly reveals that the equality of all five classifiers was rejected mainly due to differences between classifiers trained in the early and in later frames; in particular between those trained in the 1st and 4th frame (delay 1, both sensitivity and F-measure), classifiers trained
Table 1: Comparison of the values of the considered quality measures obtained with classifiers trained in each of the 5 first video frames, for different delays between classifier training and testing, obtained on data from the 12 employed videos. The result in a cell of the table indicates on how many videos the considered measure of classifier quality (accuracy, sensitivity, specificity, F-measure) was higher for the row classifier : on how many videos it was higher for the column classifier. A result in italic, respectively bold italic, indicates that, provided the Friedman test at least weakly rejected (p-value < 10%) the hypothesis that the considered quality measure is equal for all classifiers (cf. Table 2), the post-hoc test according to [3, 4] weakly rejects, respectively rejects (p-value < 5%), the hypothesis that it is equal for the particular row and column classifiers. All simultaneously tested hypotheses were corrected in accordance with Holm [5].

                                            Delay between the frame on which the classifier is trained and the frame on which it is tested
                                    1 frame                                        5 frames                                            10 frames
        Frame #1 (rows)                           Frame #2 in which the compared classifier was trained (columns)
                 1          2          3         4        5         1         2         3           4         5        1          2        3        4                 5
                                                                                  Accuracy
         1                 5:7        5:7       7:5      6:6                 4:8       7:5         7:5      10:2                 4:8      4:8      4:8              2:10
         2      7:5                   4:8       8:4      6:6       8:4                 6:6         7:5      10:2      8:4                 7:5      5:7               5:7
         3      7:5        8:4                  9:3      6:6       5:7       6:6                   7:5      11:1      8:4        5:7               6:6               5:7
         4      5:7        4:8        3:9                5:7       5:7       5:7       5:7                  10:2      8:4        7:5      6:6                        5:7
         5      6:6        6:6        6:6       7:5               2:10      2:10      1:11        2:10               10:2        7:5      7:5      7:5
                                                                                 Sensitivity
         1                 8:4      8.5:3.5  9.5:2.5 8.5:3.5                 8:4       7:5         7:5       9:3              6.5:5.5     8:4    8.5:3.5           8.5:3.5
         2      4:8                   8:4      10:2      9:3       4:8                 6:6       7.5:4.5 9.5:2.5 5.5:6.5                  6:6      8:4               8:4
         3    3.5:8.5      4:8               10.5:1.5 9.5:2.5      5:7       6:6                 7.5:4.5 8.5:3.5      4:8        6:6             8.5:3.5           9.5:2.5
         4    2.5:9.5     2:10     1.5:10.5            6.5:5.5     5:7    4.5:7.5 4.5:7.5                    8:4    3.5:8.5      4:8    3.5:8.5                    8.5:3.5
         5    3.5:8.5      3:9      2.5:9.5  5.5:6.5               3:9    2.5:9.5 3.5:8.5          4:8              3.5:8.5      4:8    2.5:9.5 3.5:8.5
                                                                                 Specificity
         1               7.5:4.5    6.5:5.5     7:5      7:5              3.5:8.5      5:7         5:7       4:8              6.5:5.5 3.5:8.5     2:10             3.5:8.5
         2    4.5:7.5               4.5:7.5     6:6    6.5:5.5 8.5:3.5                 5:7       4.5:7.5     4:8    5.5:6.5               4:8      4:8               4:8
         3    5.5:6.5    7.5:4.5                6:6      7:5       7:5       7:5                   7:5    6.5:5.5 8.5:3.5        8:4             4.5:7.5             6:6
         4      5:7        6:6        6:6              4.5:7.5     7:5    7.5:4.5      5:7                   4:8     10:2        8:4    7.5:4.5                    6.5:5.5
         5      5:7      5.5:6.5      5:7     7.5:4.5              8:4       8:4    5.5:6.5        8:4              8.5:3.5      8:4      6:6    5.5:6.5
                                                                                 F-measure
         1                 8:4        9:3      10:2      8:4                 6:6       7:5         8:4      11:1              5.5:6.5     9:3    8.5:3.5           9.5:2.5
         2       4:8                  7:5      12:0      9:3       6:6              6.5:5.5        7:5      10:2    6.5:5.5             6.5:5.5 7.5:4.5            9.5:2.5
         3       3:9       5:7                 11:1      8:4       5:7    5.5:6.5                  8:4      11:1      3:9     5.5:6.5              8:4               9:3
         4      2:10      0:12       1:11                6:6       4:8       5:7       4:8                8.5:3.5 3.5:8.5 4.5:7.5         4:8                        9:3
         5       4:8       3:9        4:8       6:6               1:11      2:10      1:11       3.5:8.5            2.5:9.5 2.5:9.5       3:9      3:9
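The four quality measures compared in the table follow from a per-keypoint confusion matrix of the foreground/background classification (foreground taken as the positive class). A minimal sketch; the counts in the usage example are purely hypothetical:

```python
def quality_measures(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and F-measure from the counts of
    true/false positives and negatives (foreground = positive class)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)      # recall on foreground points
    specificity = tn / (tn + fp)      # recall on background points
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, f_measure

# Hypothetical counts for keypoints classified as foreground/background:
acc, sens, spec, f1 = quality_measures(tp=80, fp=10, tn=100, fn=20)
```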


     Table 2: Results of the Friedman test of the hypothesis
     that, for a given delay between classifier training and mea-
     suring its quality, a given quality measure is equal for the
     classifiers trained in each of the 5 first video frames, for
     the 12 combinations of delays and quality measures con-
     sidered in Table 1. The combinations for which the tested
     hypothesis was weakly rejected (p-value < 10%) are in
     italic; the single combination for which it was rejected (p-
     value < 5%) is in bold italic. All simultaneously tested hy-
     potheses were corrected in accordance with Holm [5].

                Quality measure     Delay     p-Value
                   accuracy           1          1
                   accuracy           5        0.117
                   accuracy          10          1
                  sensitivity         1        0.052
                  sensitivity         5        0.428
                  sensitivity        10        0.238
                  specificity         1          1
                  specificity         5          1
                  specificity        10         0.25
                  F-measure           1        0.043
                  F-measure           5        0.089
                  F-measure          10        0.238



     in the 1st and 4th frame (delay 1, F-measure), and classi-
     fiers trained in the 1st–3rd frames and in the 5th frame (delay 5,
     F-measure).
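The statistical methodology underlying Tables 1 and 2 (a Friedman test per combination of delay and quality measure, followed by Holm-corrected pairwise post-hoc comparisons) can be sketched in Python. The scores below are randomly generated placeholders, and pairwise Wilcoxon signed-rank tests merely stand in for the exact post-hoc procedure of [3, 4]; `scipy` is assumed to be available:

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical data: one quality measure (e.g. F-measure) for classifiers
# trained in the 5 first frames, evaluated on each of the 12 videos.
rng = np.random.default_rng(1)
scores = rng.uniform(0.6, 0.9, size=(12, 5))  # rows: videos, columns: classifiers

# Friedman test of the hypothesis that all five classifiers perform equally.
stat, p_friedman = friedmanchisquare(*scores.T)

def holm(pvals):
    """Holm's sequentially rejective correction of simultaneously tested p-values."""
    m = len(pvals)
    order = np.argsort(pvals)
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * pvals[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted

# Pairwise post-hoc comparisons (here illustrated with Wilcoxon signed-rank
# tests); the Holm-corrected p-values decide which pairs of classifiers differ.
pairs = [(i, j) for i in range(5) for j in range(i + 1, 5)]
raw = [wilcoxon(scores[:, i], scores[:, j]).pvalue for i, j in pairs]
adjusted = holm(raw)
```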


     6    Conclusion

     The presented research integrates two comparatively re-
     cent approaches: the keypoint detector ORB, which is a
     combination of a corner detection method with a visual
     descriptor method, and two semi-supervised classification
     methods. To our knowledge, this is the first time these ap-
     proaches have been used together for the task of scene seg-
     mentation into the foreground objects and the background.

     Figure 1: The evolution of accuracy (top) and specificity
     (bottom) of the c-means method on the unlabelled data for
     four particular videos
        On the other hand, this is a work in progress and the pre-
     sented results are still rather preliminary, being obtained
     on 12 artificially created videos with a quite simple scene
     segmentation. Both approaches should be investigated in
     the context of more complex segmentations and more re-
     alistic scenes. To this end, however, especially the combi-
     nation of the ORB detector with methods of semisuper-
     vised classification needs to be elaborated more deeply.


      Acknowledgement

     The research reported in this paper has been supported by
     the Czech Science Foundation (GAČR) grant 18-18080S.


      References

     [1] M.S. Allili, N. Bouguila, and D. Ziou. Finite general Gaussian mixture modeling and application to image and video foreground segmentation. Journal of Electronic Imaging, 17:paper 013005, 2008.
     [2] J.C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York, 1981.
     [3] J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.
     [4] S. Garcia and F. Herrera. An extension on “Statistical Comparisons of Classifiers over Multiple Data Sets” for all pairwise comparisons. Journal of Machine Learning Research, 9:2677–2694, 2008.




       Figure 2: The evolution of accuracy (top) and specificity
       (bottom) of the classifiers trained in each of the 5 first
       video frames for a handheld-camera video with both the
       foreground object and the background sharp

       Figure 3: The evolution of accuracy (top) and specificity
       (bottom) of the classifiers trained in each of the 5 first
       video frames for a handheld-camera video with only the
       foreground object sharp


       [5] S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65–70, 1979.
       [6] L. Li, W. Huang, I.Y.H. Gu, and Q. Tan. Foreground object detection from videos containing complex background. In 11th ACM Conference on Multimedia, pages 2–10, 2003.
       [7] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In International Conference on Computer Vision, pages 2564–2571, 2011.
       [8] R.G.F. Soares, H. Chen, and X. Yao. Semisupervised classification with cluster regularization. IEEE Transactions on Neural Networks and Learning Systems, 23:1779–1792, 2012.
       [9] D. Zhang, K. Tan, and S. Chen. Semi-supervised kernel-based fuzzy c-means. In ICONIP’04, pages 1229–1234. Springer, 2004.




     Figure 4: The evolution of accuracy (top) and specificity
     (bottom) of the classifiers trained in each of the 5 first
     video frames for a static-camera video, in which only the
     foreground object is sharp and is moving towards the cam-
     era

     Figure 5: The evolution of accuracy (top) and specificity
     (bottom) of the classifiers trained in each of the 5 first
     video frames for a static-camera video, in which only the
     foreground object is sharp and passes the scene multiple
     times