Unsupervised Video Semantic Partitioning Using IBM Watson and Topic Modelling

Alexandre Quemy
IBM Software Lab, Cracow, Poland
Faculty of Computing, Poznań University of Technology, Poznań, Poland
aquemy@pl.ibm.com

Krzysztof Jamrog
IBM Software Lab, Cracow, Poland
krzysztof.jamrog1@pl.ibm.com

Marcin Janiszewski
IBM Software Lab, Cracow, Poland
marcin.janiszewski@pl.ibm.com
ABSTRACT
In this paper, we present the problem of Video Semantic Partitioning, which consists in breaking a video into semantically independent blocks. After defining the problem within an optimization framework, we present a preliminary heuristic approach to solve it, called Split-and-Merge. The algorithm itself is unsupervised, but some of the mechanisms used to extract data from videos are supervised, since they rely on IBM Watson services. Finally, we demonstrate the capabilities of our prototype on a few videos and discuss its limitations and future improvements. From the experiments, we draw two conclusions: (i) the optimal solution to the problem varies from human to human, with a large variability from video to video; (ii) Split-and-Merge shows encouraging qualitative results in finding the average optimal solution, defined as the average of the solutions given by humans.

© 2018 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna, Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1 INTRODUCTION
The problem of text segmentation, that is to say partitioning a text into semantic blocks, has been widely studied (e.g. in [2, 5]) but, as far as we know, never extended to videos. Indeed, video segmentation usually refers to the process of extracting objects from a video, not to breaking it into semantic chunks. In this paper, we define the Video Semantic Partitioning problem as an extension of the text segmentation problem and present our first attempt to solve it. Among the possible applications, we can highlight a better understanding of video content for tagging and indexing in a database, or the creation of educational paths by joining semantic blocks from different videos into a new one.

A video is a complex, multidimensional signal holding a lot of information in a very short amount of time. In addition, the raw information is totally unstructured and has to be extracted, contrary to text segmentation where this information is mostly structured (the text is provided as is). By raw, we mean the primary source of semantics. For a given text, it mainly lies in the sentences and words themselves, while the support brings some secondary elements (e.g. in [2] the support is an HTML page and the tags are used to delimit some sections). For a video, the primary source is not as well defined: is it the audio track or the visual information? Most likely a combination of both. Additionally, this raw information is not directly suitable for most algorithms and requires a pre-treatment to extract exploitable data. These techniques are known as Cognitive Computing, i.e. processing and analyzing large unstructured signals. Information extraction plays a preponderant role in the algorithm's performance, because even the best possible algorithm would return poor results on badly extracted information, which is equivalent to noise.

The plan of this paper is as follows: Section 2 formalizes the Video Semantic Partitioning problem, Section 3 presents the feature extraction mechanisms, and Section 4 presents the Split-and-Merge algorithm. In Section 5 we discuss an optimized version of the algorithm under a restrictive assumption about the videos. In Section 6, we present some results obtained on real videos, and we conclude in Section 7.

2 VIDEO SEMANTIC PARTITIONING
A video is a temporal signal on a time interval [0, T]. We denote by P([0, T], m) the set of partitions of [0, T] into m elements, i.e. the elements of P([0, T], m) are of the form π = (p_0, ..., p_m, p_{m+1}), with p_0 = 0, p_{m+1} = T and p_i < p_{i+1} for all i ∈ {0, ..., m}, such that

    ∪_{i=0}^{m} [p_i, p_{i+1}] = [0, T].

Our goal is to find a partition such that each [p_{i−1}, p_i] is an independent semantic block, i.e. holds a different semantics than its adjacent blocks. Let π* denote the optimal partition. The first formulation of the Semantic Partitioning problem is to find a minimizing sequence π_k such that π_k → π*, with a notion of convergence implying that the number of cutting points of π_k converges to that of π* [note 1] and that each cutting point of π* is the limit of a cutting point of π_k.

[note 1] We may not accept the case where two cutting points are identical, creating an empty interval and thus an empty semantics. However, it may be convenient for practical algorithms, and thus we assume that the coinciding cutting points of π_k are merged at any time if necessary.

As an element of any partition π is contained in a closed interval defined by two cutting points of π*, we can define the following cost function: for all π ∈ ∪_{m∈ℕ*} P([0, T], m) and all p_i ∈ π,

    c(p_i) = |p*_j − p_i| · |p_i − p*_{j+1}|,  where p_i ∈ [p*_j, p*_{j+1}].

Additionally, we define

    J(π, π*) = Σ_{i=1}^{m−1} c(p_i) + c(p_{i+1}).

Under the assumption that π has as many cutting points as π*, J(π, π*) = 0 implies π = π*, and the problem can be seen as finding a minimizing sequence π_k s.t. J(π_k, π*) → 0. In practice, as π* is unknown, there are two problems: (i) the optimal number of cutting points is not known, and (ii) J is not available.

For (i), assuming J is available, if π_0 contains enough cutting points, with some in each interval induced by π*, we can minimize J given π_0 by removing the unnecessary cutting points. Assuming we know how to remove the unnecessary cutting points, if π_{0,h} is the uniform partition s.t. p_{i+1} − p_i = h for all p_i, p_{i+1} ∈ π_0, then for all h_1, h_2 s.t. h_1 < h_2, J(π_{0,h_1}, π*) < J(π_{0,h_2}, π*). More generally, we can control J through h.
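To make the cost function concrete, the following is a minimal Python sketch of c and J for explicit partitions. It is our own illustration, not the paper's implementation: the representation of a partition as a sorted list of cutting points, and the simplified J that sums c over the interior points, are assumptions.

```python
import bisect

def cost(p, pi_star):
    """c(p) = |p*_j - p| * |p - p*_{j+1}|, where [p*_j, p*_{j+1}] is the
    interval of the optimal partition pi_star that contains p."""
    j = bisect.bisect_right(pi_star, p) - 1
    j = min(max(j, 0), len(pi_star) - 2)  # clamp p = T into the last interval
    return abs(pi_star[j] - p) * abs(p - pi_star[j + 1])

def J(pi, pi_star):
    """Total cost of the interior cutting points of pi (a simplified
    variant of the paper's J, summing c over p_1, ..., p_m)."""
    return sum(cost(p, pi_star) for p in pi[1:-1])

# Toy optimal partition of [0, 10] with two interior cutting points.
pi_star = [0.0, 3.3, 7.1, 10.0]
print(cost(3.3, pi_star))   # a point on an optimal cut costs nothing -> 0.0
print(J(pi_star, pi_star))  # the optimal partition has zero total cost -> 0.0
```

The product form of c makes the cost vanish exactly when a cutting point coincides with one of the two surrounding optimal cuts, which is why J(π, π*) = 0 characterizes π = π* when both partitions have the same number of cutting points.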
For (ii), we need to find an approximation of J. As π* is a partition into semantic blocks, we will rely on the semantics contained in the video to approximate J. To do so, we identify n feature streams from the video, that is to say, n sequences of features on [0, T]. For each i ∈ {1, ..., n}, we have a similarity metric d_i : [0, T] × [0, T] → [0, 1] quantifying how related two blocks are w.r.t. a particular feature stream. In particular, for all p_1, p_2, d_i([p_1, p_2], [p_1, p_2]) = 1, because the semantics of a block is obviously equal to its own semantics. However, [p_1, p_2] ∩ [p_3, p_4] = ∅ does not imply d_i([p_1, p_2], [p_3, p_4]) = 0: it is not because two blocks do not overlap in time that they share no semantic similarity. To ease notation, given a partition π, we denote by d_i^j the similarity between the blocks [p_{j−1}, p_j] and [p_j, p_{j+1}], i.e. d_i^j := d_i([p_{j−1}, p_j], [p_j, p_{j+1}]).

The video partitioning problem is to find a partition π ∈ ∪_{m∈ℕ*} P([0, T], m) that minimizes the adjacent block similarity, that is to say

    π* = argmin_{π ∈ ∪_{m∈ℕ*} P([0,T],m)}  ⊕_{1≤i≤n} Σ_{j=1}^{m} d_i^j                    (1)

with ⊕ an aggregation operator. For the purpose of this paper, we consider ⊕ as a convex combination of the block similarities, allowing us to reformulate the problem as follows:

    π* = argmin_{π ∈ ∪_{m∈ℕ*} P([0,T],m)}  Σ_{1≤i≤n} Σ_{j=1}^{m} ω_i d_i^j,  with Σ_{1≤i≤n} ω_i = 1.    (2)

The choice of the operator ⊕ is an interesting problem in itself, because it is known to greatly influence an algorithm's capacity to find the optimal solution [7]. In the future, we plan to investigate the choice of this operator, in particular treating the problem as multi-objective (optimizing each feature stream independently) and using the Hypervolume [1] or ε-dominance [4] metric as the aggregation function.

As said before, one major difficulty comes from the fact that the optimal number of elements in the partition is not known a priori. Notice that, in practice, if an interval is too small the information it holds is considered null. For instance, a very small interval holds no more than a few words, or no words at all, no screenshot, etc. As a result, its similarity with another very small interval would be one or close to one. In other words, dividing [0, T] too many times does not minimize the objective function. However, having two large blocks separated by a very small block would be a minimizing strategy: the small block holds no semantics, contrary to the large blocks, and as a result the similarity between adjacent blocks is (close to) zero. To handle this problem, we see several possibilities: (i) taking this phenomenon into account in the definition of the metrics d_i, (ii) adding a regularization term on the block sizes to (2), to make sure they are homogeneous enough, (iii) fixing the set of possible values for m and a distribution of block sizes w.r.t. T. As (i) is too ad hoc and we do not see any practical justification for (ii), we will make a stronger assumption on the optimal partition in Section 5 that defines both the maximal number of blocks and their respective sizes.

3 FEATURE STREAMS
In this section, we review the feature streams we extract from a video, as well as the associated block similarities.

We are able to extract a frame at any moment of the video. For two frames, we can calculate two metrics: a perceptual hash d_hash, using the pHash open-source library (http://phash.org/), and a pixel-based Hamming distance d_pixel. For two blocks [p_{j−1}, p_j] and [p_j, p_{j+1}], we chose to take into account only the last frame of the first block and the first frame of the second block. With Watson Visual Recognition, for each frame we obtain a list of tags on concepts and entities, with confidences. For a block [p_j, p_{j+1}], we aggregate the tags of all the frames it contains into the sets W_j and C_j, respectively for the words and the confidence levels. For two blocks, we denote by W_{j,j+1} the set W_j ∩ W_{j+1}, and by C^L_{j,j+1} and C^R_{j,j+1} the confidences associated to the left and right block.

Example: Let us assume W_j = (w_1, w_2, w_3), composed of three words, associated to the confidence vector C_j = (c_{1,j}, c_{2,j}, c_{3,j}). Similarly, let us assume W_{j+1} = (w_1, w_3, w_4) with C_{j+1} = (c_{1,j+1}, c_{3,j+1}, c_{4,j+1}). Then W_{j,j+1} = (w_1, w_3), and the left and right confidence vectors are given by C^L_{j,j+1} = (c_{1,j}, c_{3,j}) and C^R_{j,j+1} = (c_{1,j+1}, c_{3,j+1}).

If |W_{j−1} ∩ W_j| = 0, the similarity is zero: d_tags^j = 0. Otherwise, the similarity measure is defined as the Jaccard index weighted by the confidences:

    d_tags^j = (|W_{j−1} ∩ W_j| / |W_{j−1} ∪ W_j|) · (1 − ||C^L_{j−1,j} − C^R_{j−1,j}||_1 / |W_{j−1} ∩ W_j|)
             = (|W_{j−1} ∩ W_j| − ||C^L_{j−1,j} − C^R_{j−1,j}||_1) / |W_{j−1} ∪ W_j|

where |A| denotes the cardinality of A. In other words, two blocks are similar not only if they have the same tags, but also if their confidences are close enough.

Example: Considering the vectors defined in the previous example, the similarity between the two blocks is

    d_tags^{j+1} = (2 − [(c_{1,j} − c_{1,j+1}) + (c_{3,j} − c_{3,j+1})]) / 4.

In this preliminary work, we did not consider OCR to extract text from the frames, nor image segmentation techniques to obtain the location of objects in a frame. However, using Watson Speech-to-Text, we extract the transcript from the audio track. The service provides timestamps, so that the text corresponding to a given block can be located. Similarly to Watson Visual Recognition, Watson Natural Language Understanding can extract from a text a list of keywords and concepts with a confidence level (expressing the relevance within the text). This allows us to define d_keywords^j and d_concepts^j in the same way as d_tags^j.

We normalized the transcript using NLTK [6]: tokenization, stopword removal, lemmatization, computation of n-grams from 1 to 4, and filtering of the n-grams by a minimal number of occurrences.
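The confidence-weighted Jaccard index d_tags defined earlier in this section can be sketched as follows; representing each block's tags as a dict mapping a tag to its confidence is our own choice, and the numeric confidence values are invented for illustration.

```python
def d_tags(conf_left: dict, conf_right: dict) -> float:
    """Jaccard index of two tag sets, weighted by how close the
    confidences on the shared tags are (the d_tags formula above)."""
    common = set(conf_left) & set(conf_right)
    if not common:
        return 0.0  # no shared tag: zero similarity
    union = set(conf_left) | set(conf_right)
    # L1 distance between left and right confidences on the shared tags
    l1 = sum(abs(conf_left[w] - conf_right[w]) for w in common)
    return (len(common) - l1) / len(union)

# Worked example from the text: W_j = (w1, w2, w3), W_{j+1} = (w1, w3, w4).
left = {"w1": 0.9, "w2": 0.5, "w3": 0.7}
right = {"w1": 0.8, "w3": 0.6, "w4": 0.4}
# |W ∩ W'| = 2, |W ∪ W'| = 4, ||C^L − C^R||_1 = 0.1 + 0.1 = 0.2
print(round(d_tags(left, right), 4))  # -> (2 - 0.2) / 4 = 0.45
```

Note that identical tag sets with identical confidences give a similarity of 1, and that the similarity decreases both when the tag overlap shrinks and when the confidences on the shared tags diverge, which matches the intuition stated after the formula.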
A given partition π ∈ P([0, T], m) can be seen as a corpus of m documents represented as bags-of-words. Using Gensim [8], we applied a TF-IDF transformation on each block and used Latent Dirichlet Allocation [3] to express each block in a latent topic space. LDA is a generative probabilistic model that represents documents as a mixture of topics. Each topic is a (sparse) distribution over the dictionary of words used in the documents, expressing the idea that a topic is defined by a high frequency of a few terms. The model parameters are then estimated from the real data, in particular the word distribution per topic and the topic distribution of each document. For each pair of adjacent blocks, we use the LDA model to determine the distance between the blocks, denoted by d_topic^j. LDA has three hyperparameters: K, the number of topics; α, the document-topic prior distribution; and η, the topic-word prior distribution. The last two parameters control the sparsity of the distributions. In this work, we used the online hyperparameter strategy provided by Gensim, i.e. the values of α and η are deduced from the real data. As LDA can be seen as a dimensionality reduction method, the number of dimensions K influences the quality of the model; it should thus be fixed not too low, to keep enough information, but not too high, so that the documents can be summarized. For this paper, knowing the approximate size of the bag-of-words representation of our data, we fixed K = 10. Further work should focus on hyperparameter tuning taking into account the length of the video and the number of cutting points at a given moment in a partition (because it influences the transcript length of a block).

We would like to draw attention to the fact that, while the algorithm we use is unsupervised, the feature extraction is supervised for some feature streams. The transcript obtained with Watson Speech-to-Text is rectified (including abbreviation and specific term annotations) and inserted into a custom model. As a result, from video to video, the feature extraction becomes more and more relevant, which we believe is the primary requirement to achieve a good semantic partitioning. Similarly, the Watson Visual Recognition service is trained to recognize certain objects related to the theme of the test videos (e.g. trained to recognize specific software displayed in the video).

4 SPLIT-AND-MERGE ALGORITHM
The algorithm is composed of two phases. In the first one, called Split, we break the video into regular blocks small enough to be sure they are smaller than the smallest semantic block of the video; thus, in particular, there are several cutting points within each of the intervals defined by the optimal partition π*. The Merge phase consists in merging two adjacent blocks if they are related enough according to the metrics defined in the previous section. Iteratively, we merge the blocks with the highest normalized combined distance, defined by

    F(π) = (1 / (n|π|)) Σ_{1≤i≤n} Σ_{j=1}^{|π|} ω_i d_i^j = (1 / (n|π|)) Σ_{j=1}^{|π|} F^j(π),  with F^j(π) = Σ_{i=1}^{n} ω_i d_i^j.

However, without a stopping criterion, all the blocks would eventually be merged. To prevent this, we take into consideration the improvement implied by removing p_j between steps k and k + 1: R(π_k, j) = F(π_{k+1}) / F(π_k), with π_{k+1} = π_k \ {p_j}. Split-and-Merge is detailed in Algorithm 1. It has four hyperparameters: m > 0, ω_i ∈ [0, 1] s.t. Σ_i ω_i = 1, δ ∈ [0, 1] and η ∈ [0, 1), resp. the initial number of cutting points, the weight of each feature, a similarity threshold under which we do not want to merge two blocks, and an improvement threshold under which we consider the merge not interesting enough.

[Figure 1: two number lines over [0, 1], showing the cutting points p_1, ..., p_m against the optimal cuts p*_1, p*_2, p*_3, before and after merging.]
Figure 1: Illustration of Split-and-Merge. On top, the initial uniform partition, finer than the optimal partition π*. At the bottom, the partition after a few steps: several cutting points have been removed. Notice that, as p*_1 and p*_2 are not on the initial partition, it is impossible to obtain them in the output of Split-and-Merge.

Algorithm 1 Split-and-Merge
 1: π_0 ← {(T/m) i}_{i=0}^{m}
 2: do
 3:   Calculate the family {d_i^j}_{i,j} for π_k
 4:   F_j ← (1/n) Σ_{i=1}^{n} ω_i d_i^j
 5:   for F_j taken in decreasing order do
 6:     if F_j ≥ δ and R(π_k, j) ≤ η then
 7:       π_{k+1} = π_k \ {p_j}
 8:       break
 9:     end if
10:   end for
11: while π_k ≠ π_{k+1}

5 RESTRICTING THE SEARCH SPACE
If the audio of the video is intended to describe the video content (e.g. a documentary, or a commented video for educational purposes), we can make a stronger assumption: a cutting point between two semantic blocks can only occur between two sentences. Not only does this assumption turn the continuous problem into an (almost) discrete one, it also provides an upper bound on the number of cutting points. If we denote by S = {s_1, ..., s_K} the set of sentences in the perfect transcript, the number of cutting points is bounded by K. The problem is only almost discrete because, in many cases, the time interval between two sentences may be quite long, and there is still a need to determine where exactly to cut.

Let [ε⁺(s_j), ε⁻(s_{j+1})] ⊂ [0, T] denote the time interval between two consecutive sentences s_j and s_{j+1}. In other words, ∀p ∈ π*, ∃j ∈ {1, ..., K−1} s.t. p ∈ [ε⁺(s_j), ε⁻(s_{j+1})]. To take advantage of this assumption, we change the initial uniform partition π_0 defined in Algorithm 1 to

    π_0 = {p_i | ∀i ∈ {1, ..., K−1}, p_i = ε⁺(s_i) + |ε⁻(s_{i+1}) − ε⁺(s_i)| / 2}.

Contrary to the previous version, there is no parameter h to control the (best-case) precision. However, as we know there is a unique cutting point in [ε⁺(s_j), ε⁻(s_{j+1})], we can split this interval into an arbitrary number of potential cutting points and select the one that minimizes the aggregated similarity. This step is applied after Split-and-Merge, as a refinement.
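As an illustration, here is a minimal Python sketch of the merge phase of Algorithm 1. It is a sketch under assumptions, not the authors' implementation: the `adjacent_similarities` callback and the toy similarity model stand in for the feature streams of Section 3, and we read the stopping rule as "remove p_j only if F_j ≥ δ and F(π_{k+1}) / F(π_k) ≤ η".

```python
def split_and_merge(T, m, adjacent_similarities, delta, eta):
    """Merge phase of Split-and-Merge (sketch).
    `adjacent_similarities(pi)` returns [F_1, ..., F_{|pi|-2}], the
    combined similarity around each interior cutting point of pi."""
    pi = [T * i / m for i in range(m + 1)]  # Split: uniform initial partition
    while True:
        F = adjacent_similarities(pi)
        if not F:
            return pi                        # a single block: nothing to merge
        current = sum(F) / len(F)            # objective F(pi_k)
        merged = False
        # visit interior points by decreasing similarity (Algorithm 1, line 5)
        for j in sorted(range(len(F)), key=lambda j: -F[j]):
            candidate = pi[:j + 1] + pi[j + 2:]   # pi_{k+1} = pi_k \ {p_{j+1}}
            Fc = adjacent_similarities(candidate)
            ratio = (sum(Fc) / len(Fc)) / current if Fc else 0.0
            if F[j] >= delta and ratio <= eta:
                pi = candidate
                merged = True
                break
        if not merged:
            return pi                        # fixed point: pi_{k+1} = pi_k

# Toy model: blocks on the same side of t = 5 are similar, across it they are not.
def toy_similarities(pi):
    side = lambda a, b: (a + b) / 2 < 5
    return [0.95 if side(pi[j - 1], pi[j]) == side(pi[j], pi[j + 1]) else 0.1
            for j in range(1, len(pi) - 1)]

print(split_and_merge(10.0, 10, toy_similarities, delta=0.9, eta=0.99))
# -> [0.0, 5.0, 10.0]: all cuts merged away except the semantic boundary at 5
```

Note that η interacts with how fast the objective decreases per merge: the toy model above needs a loose η, while the experiments in Section 6 use η = 0.1 on real feature streams.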
6 EXPERIMENTS AND RESULTS
The plan of this section is as follows: we present the data we used in Section 6.1, then we elaborate on the protocol in Section 6.2. We analyse the results in Section 6.3. Last but not least, in Section 6.4, we address some problems and possible criticisms of our protocol.

6.1 Videos
We used three short videos (2:16 to 4:00 minutes) available on Youtube:
   (1) Running an experiment in the IBM Quantum Experience³
   (2) IBM Bluemix - 3 minutes demo (NodeJS, SSO)⁴
   (3) IBM Watson for Oncology Demo⁵

Apart from being short, the videos have some specific characteristics. The first video presents how to create, run and access the result of a quantum algorithm. The video breaks down relatively naturally into parts, except for the moment when it explains the algorithm, where it switches between displaying a screen with the software used to perform the experiment and an illustration using cards. Also, it seems to us that there is a long block in the middle of the video, which would require Split-and-Merge to really understand how to remove inappropriate cutting points. The second video is very naturally broken down into several parts by a short notice with the subject of the next part, displayed during the transitions. Between transitions, the speaker is visually present in the video, which can perturb the visual metrics: he tends to move, thus modifying the perceptual hash and pixel-based distance more than expected. Finally, the third video has been

In a second step, we used Split-and-Merge (without the assumption made in Section 5) on the three videos with the following settings: a uniform initial partition with a length of three seconds per block, ω_i = 1/n for all i (i.e. each feature has the same importance), δ = 0.9 and η = 0.1. Finally, we reported the partition found in the form of three-second intervals centered on the cutting points. The choice to assimilate a cutting point to a three-second interval results both from the protocol (people may agree on the same cutting point but report times differing by one or two seconds) and from the algorithm (for computational reasons, extracting and processing the screenshots is relatively costly). In the analysis, we consider that two cutting points coincide if their intervals intersect. In other words, two cutting points are assimilated if they are separated by less than three seconds, which corresponds to the selected precision of the algorithm.

6.3 Results
The results are summarized in Figure 2. Let us first analyze the panel answers. For the first video, the cutting points are relatively numerous, with 11 in total. A lot of them are mostly in blue, indicating doubt within the panel. Around 1:00, a ten-second interval is formed, indicating variability in the choice of cutting points. This was expected, as mentioned previously in the video description. The large block from around 1:00 to 2:25 is also confirmed by the answers. For the second video, and as we expected, there is a neat consensus: every person from the panel answered the
selected because it is harder to break down since the speaker flow                      same that is to say, putting a cutting point when the transition is
and presentation seem more "continuous" without neat transition                         visually indicated. There is one exception with a weak transition,
between screens or topics. Also, it integrated a visual transition                      in one answer, located on the transition between the black screen
(from black screen to a web page and the contrary) at the begin-                        at the beginning and the real content of the video. Regarding
ning and at the end, which can trick both humans and the visual                         the third video, the participants seem to agree on whether or
metrics.                                                                                not a transition is weak or strong. For the second distinct area
                                                                                        in the timeline, around 24s, the union of intervals made of the
6.2 Protocol                                                                            cutting points has a nine second length. Checking the answer
We asked people through a public form to indicate their optimal                         individually confirms that there is no one adding more than one
partition with the following instruction: For each video, describe                      cutting point in this area. The results globally reflect our own
how you would break it into "steps" (a bit like a chapter in a book).                   personal answers on the cutting points and more important, the
We added that in case of doubt, one can indicate that a transition                      panel answers reflect the difficulties we described in the previous
is weak. We respectively collected seven, five and five answers                         section. Globally, the huge amount of weak transitions in the first
for each video6 . For each answer we obtained, we plot the cutting                      and last video confirms that the semantic partitioning problem is
points ti as a three second interval centered on ti (in blue for the                    not well defined as the perception of what exactly is a semantic
weak cutting points, red otherwise) such that we can visually                           block varies between persons and videos.
observe the areas on the timeline such that people agree there is                           In the first video, only two out of eleven cutting points have
a cutting point.                                                                        been found. There are two pairs of red and blue cutting points
                                                                                        separated by only four seconds (in resp. 0:08 and 0.30) and for
Example of answer for the first video:                                                  both, Split-and-Merge identified a cutting point in the middle. We
                                                                                        explain this by the fact that the algorithm did not find relevant
 0:12
                                                                                        the blue cutting points (the selected cutting points are closer
 0:20 W
                                                                                        to the red ones). One good point to notice is the absence of
 0:31 W
                                                                                        cutting point in the large middle block. In general, the results
 0:38
                                                                                        on this video are disappointing because the video seemed easy
 0:59
                                                                                        to break for humans, and none of the most consensual cutting
 2:35 W
                                                                                        points (red at 1:10 and 2:28) has been identified. In the second
 2:56 W
                                                                                        video, the size of the final partition matches the size of the panel
 3:49 W
                                                                                        optimal partition with seven cutting points. However, except for
3 https://www.youtube.com/watch?v=pYD6bvKLI_c                                           two points, there are all misplaced: the one around 0:25 is about
4 https://www.youtube.com/watch?v=xE_I8BgDroA                                           five second late and the ones 2:20 and 2:40 are three seconds
5 https://www.youtube.com/watch?v=338CIHlVi7A
                                                                                        in advance. The two others are clearly misplaced and almost in
6 Due to the required time to complete the form, people were able to answer only
                                                                                        the worst possible location, i.e. in the middle of an interval (at
for some videos. On top of that, a fourth video was initially planned, but as we
received less than five answers, we decided not to report the results.                  least not the same). It is quite surprising that the video with the
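The assimilation rule used in the analysis (two cutting points coincide when their three second intervals intersect) can be sketched as follows. This is an illustrative reimplementation, not the authors' code, and the timestamps are hypothetical:

```python
def assimilated(t1, t2, radius=1.5):
    """Two cutting points coincide when their three second
    intervals [t - 1.5, t + 1.5] intersect, i.e. when the
    points are separated by less than three seconds."""
    return abs(t1 - t2) < 2 * radius

def recovered(panel_points, algo_points):
    """Count how many panel cutting points coincide with at
    least one cutting point found by the algorithm."""
    return sum(
        any(assimilated(p, a) for a in algo_points)
        for p in panel_points
    )

# Hypothetical timestamps in seconds, not the paper's actual data.
panel = [12, 20, 31, 38, 59, 155, 176, 229]
algo = [13, 40, 60, 150]
print(recovered(panel, algo))  # -> 3
```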




Figure 2: Timelines for the three videos. From top to bottom: videos 1, 2 and 3. The distribution of weak cutting points is shown in blue, the normal ones in red. A cutting point is represented by a three second interval, and the opacity reflects how often an area has been selected as a cutting point by the panel.
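Measuring the interval unions discussed in the analysis (e.g. the nine second area around 0:24 in the third video) amounts to merging overlapping intervals and summing the covered length. A minimal sketch, with hypothetical timestamps rather than the actual panel data:

```python
def union_length(points, radius=1.5):
    """Length covered by the union of the three second intervals
    [t - radius, t + radius] centered on the given cutting points."""
    intervals = sorted((t - radius, t + radius) for t in points)
    total = 0.0
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is None or start > cur_end:
            # Disjoint from the current run: close it and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching: extend the current run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

# Three hypothetical answers clustered around 0:24 merge into one area.
print(union_length([21.0, 24.0, 27.0]))  # -> 9.0
```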


It is quite surprising that the video with the clearest consensus on the optimal partition does not yield better results. The fact that the speaker appears in the video may be an explanation: between two frames, the number of pixels that changed is likely to be higher without the semantics changing. This would imply that the visual metrics play a more important role than we first expected, at least for videos with this characteristic. Further experiments with different weights ωi and higher thresholds for the visual metrics need to be carried out. In the last video, Split-and-Merge successfully identified seven out of nine cutting points, among which two are identified with a zero second difference. However, the most important cutting point (not identified as weak, and present in the five answers with at most one second difference) is not identified. In the large interval around 0:24, Split-and-Merge identified the cutting point at the extreme right side. Surprisingly, Split-and-Merge performed best on the third video, which seems to the authors to be the hardest to naturally break into semantically distinct parts.

6.4 Discussion on the protocol
The small number of answers collected to create the optimal partition may be a threat to validity. However, as shown in Figure 2, a consensus seems to emerge, especially for the second video, which mitigates this criticism. Another drawback of our protocol is that the results are more qualitative than quantitative: we do not measure any distance from the Split-and-Merge result to the optimal partition. We justify this approach by the fact that such a metric may be useful to compare algorithms to each other, but is too ad hoc to be interesting by itself. In a preliminary evaluation, and taking into account the "fuzziness" of the optimal partition, a qualitative description of the results appeared sufficient to us. A last criticism that may arise is the lack of a clear definition of what a cutting point is or should be in the instructions given to the panel. This has been done on purpose, since a more precise indication would probably contain explicit references to some feature streams or metrics on feature streams and thus indirectly influence the answers.

7 CONCLUSION AND FUTURE WORK
In this paper we presented the Video Semantic Partitioning problem, defined as finding an optimal partition w.r.t. an unknown and not so well defined function measuring the semantics. We reformulated the problem as the optimization of a convex combination supposedly approximating the unknown function. We justified this construction by the extraction of feature streams, each of them holding a part of the semantics contained in the video. To solve the problem, we introduced a heuristic called Split-and-Merge. Its main idea is to cut the video into too many pieces before greedily merging the blocks until the improvement becomes negligible. Under a restrictive assumption, we reduced the continuous problem to an (almost) discrete problem and adapted the algorithm accordingly. Finally, we presented the results obtained on a few test videos and discussed the protocol.

Despite the variability in their quality, the results are encouraging considering that no hyperparameter tuning was performed. Surprisingly, the algorithm obtains the best results on what we considered to be the hardest video and the worst results on the easiest one.

In future work, we will investigate how to tune the hyperparameters to get the best out of the algorithm on a set of videos. We are also interested in comparing the convex combination approach presented in this paper with a multi-objective approach. Also, we need to investigate how well the modified algorithm under the restrictive assumption performs compared to the initial version. Last but not least, we plan to incorporate more advanced feature streams such as objects obtained by video segmentation.

ACKNOWLEDGMENT
The authors warmly thank IBM BTO and IBM Krakow Software Lab for their financial support.

AUTHOR CONTRIBUTIONS
Marcin Janiszewski pointed out the problem and guided its resolution using IBM Watson Services.
Krzysztof Jamrog implemented the algorithm and taught IBM Watson Speech-to-Text to increase the transcript quality.
Alexandre Quemy conceived the algorithm, wrote the article, selected the data, conceived and ran the experiments, and analyzed the data.



