<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Unsupervised Video Semantic Partitioning Using IBM Watson And Topic Modelling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Krzysztof Jamrog</string-name>
          <email>krzysztof.jamrog1@pl.ibm.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcin Janiszewski</string-name>
          <email>marcin.janiszewski@pl.ibm.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexandre Quemy</string-name>
          <email>aquemy@pl.ibm.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IBM Software Lab</institution>
          ,
          <addr-line>Cracow</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IBM Software Lab, Cracow, Poland, Faculty of Computing</institution>
          ,
          <addr-line>Poznań</addr-line>
          ,
          <institution>University of Technology</institution>
          ,
          <addr-line>Poznań</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <fpage>44</fpage>
      <lpage>49</lpage>
      <abstract>
<p>In this paper, we present the problem of Video Semantic Partitioning, which consists in breaking a video into semantically independent blocks. After defining the problem within an optimization framework, we present a preliminary heuristic approach to solve it, called Split-and-Merge. The algorithm itself is unsupervised, but some of the mechanisms used to extract data from videos are supervised, since they rely on IBM Watson services. Finally, we demonstrate the capabilities of our prototype on a few videos and discuss its limitations and future improvements. From the experiments, we draw two conclusions: (i) the optimal solution to the problem varies from human to human, with a large variability from video to video; (ii) Split-and-Merge shows encouraging qualitative results in finding the average optimal solution, defined as the average solution given by humans.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The problem of text segmentation, that is to say partitioning a
text into semantic blocks, has been widely studied (e.g. in [
        <xref ref-type="bibr" rid="ref2 ref5">2, 5</xref>
        ])
but, as far as we know, never extended to videos. Indeed, video
segmentation usually refers to the process of extracting objects
from a video and not breaking it into semantic chunks. In this
paper, we define the Video Semantic Partitioning problem as
an extension of the text segmentation problem and present our
primary attempt to solve it. Among the possible applications,
we can highlight a better understanding of the video content for
tagging and indexing in a database or the creation of educational
paths by joining semantic blocks from different videos to create
a new one.
      </p>
      <p>
        A video is a complex and multidimensional signal holding a
lot of information in a very short amount of time. In addition,
the raw information is totally unstructured and has to be
extracted, contrary to text segmentation where this information
is mostly structured (the text is already provided as it is). By
raw, we mean the primary source of semantics. For a given text, it
mainly holds in the sentences and words themselves, while the
support brings some secondary elements (e.g. in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] the support
is an HTML page and the tags are used to delimit some sections).
For a video, the primary source is not as well defined: is it the
audio track or the visual information? Most likely a combination
of both. Additionally, this raw information is not directly
suitable for most algorithms and requires a pre-treatment to extract
exploitable data. Those techniques are known as Cognitive
Computing, i.e. processing and analyzing large unstructured signals.
      </p>
      <p>Information extraction plays a preponderant role in the
algorithm's performance: even the best possible algorithm
would return poor results on badly extracted information, which
is equivalent to noise.</p>
      <p>The plan of this paper is as follows: Section 2 formalizes the
Video Semantic Partitioning problem, Section 3 presents the
feature extraction mechanisms while in Section 4 the algorithm
Split-and-Merge is presented. In Section 5 we discuss an
optimized version of the algorithm under a restrictive assumption
about the videos. In Section 6, we present some results obtained
on real videos and conclude this paper in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>VIDEO SEMANTIC PARTITIONING</title>
      <p>A video is a temporal signal on a time interval [0, T]. We denote by
P([0, T], m) the set of partitions of [0, T] into m elements, i.e. the
elements of P([0, T], m) are of the form π = (p0, ..., pm, pm+1),
with p0 = 0, pm+1 = T and ∀i ∈ {0, ..., m}, pi &lt; pi+1, such that
∪_{i=0}^{m} [pi, pi+1] = [0, T].</p>
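As a sketch, the partition definition above can be mirrored programmatically (illustrative helpers, not part of the paper's implementation; the names `is_partition` and `blocks` are ours):

```python
def is_partition(points, T):
    """Check that `points` is a valid partition of [0, T]:
    strictly increasing cutting points starting at 0 and ending at T."""
    if len(points) >= 2 and points[0] == 0 and points[-1] == T:
        # Strict monotonicity rules out empty (zero-length) intervals.
        return all(b > a for a, b in zip(points, points[1:]))
    return False

def blocks(points):
    """The closed intervals [p_i, p_{i+1}] whose union covers [0, T]."""
    return list(zip(points, points[1:]))
```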
      <p>Our goal is to find a partition such that each [pi−1, pi] is an
independent semantic block, i.e. holds a different semantic than
its adjacent blocks. Let π* denote the optimal partition. The
first formulation of the Semantic Partitioning problem is to find a
minimizing sequence πk such that πk →k π*, with a notion of
convergence implying that the number of cutting points of πk
converges to that of π* and that each cutting point of π* is the
limit of a cutting point of πk.</p>
      <p>As an element of any partition π is contained in a closed
interval defined by two cutting points of π*, we can define the
following cost function: ∀π ∈ ∪_{m∈ℕ*} P([0, T], m), ∀pi ∈ π,
c(pi) = |p*_j − pi| · |pi − p*_{j+1}| with pi ∈ [p*_j, p*_{j+1}].
Additionally, we define J(π, π*) = Σ_{i=1}^{m−1} c(pi) + c(pi+1).</p>
      <p>Under the assumption that π has as many cutting points as π*,
J(π, π*) = 0 implies π = π*, and the problem can be seen as
finding a minimizing sequence πk s.t. J(πk, π*) →k 0. In practice,
as π* is unknown, there are two problems: (i) the optimal number
of cutting points is not known, and (ii) J is not available.</p>
      <p>For (i), assuming J is available, if π0 contains enough cutting
points, with some in each interval induced by π*, we can minimize
J given π0 by removing the unnecessary cutting points. Assuming
we know how to remove the unnecessary cutting points, if
π0,h is defined as the uniform partition s.t. ∀pi, pi+1 ∈ π0,
pi+1 − pi = h, then ∀h1, h2 s.t. h1 &lt; h2, J(π0,h1, π*) &lt; J(π0,h2, π*).
More generally, we can control J by h. (We may not accept the
case where two cutting points are identical, creating an empty
interval and thus an empty semantic. However, it may be convenient
for practical algorithms, so we assume we merge identical cutting
points of πk at any time if necessary.)</p>
      <p>
        For (ii), we need to find an approximation of J. As π* is a
partition into semantic blocks, we will try to rely on the
semantics contained in the video to approximate J. To do so, we are
able to identify n feature streams from the video, that is to say,
n sequences of features on [0, T]. For each i ∈ {1, ..., n}, we
have a similarity metric di : [0, T] × [0, T] → [0, 1] to quantify
how related two blocks are w.r.t. a particular feature stream. In
particular, ∀p1, p2, di([p1, p2], [p1, p2]) = 1 because the
semantics of a block is obviously equal to its own semantics. However,
[p1, p2] ∩ [p3, p4] = ∅ does not imply di([p1, p2], [p3, p4]) = 0, i.e.
it is not because two blocks do not overlap in time that they
have no semantic similarity. To ease the notation, given a
partition π, we denote by d_i^j the similarity between the blocks
[pj−1, pj] and [pj, pj+1], i.e. d_i^j := di([pj−1, pj], [pj, pj+1]).
      </p>
      <p>The problem of video partitioning is to find a partition
π ∈ ∪_{m∈ℕ*} P([0, T], m) that minimizes the adjacent block
similarity, that is to say</p>
      <p>π* = argmin_{π ∈ ∪_{m∈ℕ*} P([0,T],m)} L_{1≤i≤n} Σ_{j=1}^{m} d_i^j, (1)</p>
      <p>with L an aggregation operator.</p>
      <p>For the purpose of this paper, we will consider L as a convex
combination of the block similarities, allowing us to reformulate
the problem as follows:</p>
      <p>π* = argmin_{π ∈ ∪_{m∈ℕ*} P([0,T],m)} Σ_{1≤i≤n} ω_i Σ_{j=1}^{m} d_i^j, with Σ_{1≤i≤n} ω_i = 1. (2)</p>
      <p>
        The choice of the operator L is an interesting problem in
itself because it is known to greatly influence an algorithm's
capacity to find the optimal solution [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In the future, we plan to
investigate the choice of this operator, and in particular treating
the problem as multi-objective (optimizing on each feature stream
independently) using the Hypervolume [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or ε-dominance [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
metric as aggregation function.
      </p>
      <p>As said before, one major difficulty comes from the fact that the
optimal number of elements in the partition is not known a priori.
Notice that in practice, if an interval is too small, the information it
holds is considered null. For instance, a very small interval does
not hold more than a few words or no words at all, no screenshot,
etc. As a result, its similarity with another very small interval
would be one or close to one. In other words, dividing [0, T] too
many times does not minimize the objective function. However,
having two large blocks separated by a very small block would
be a minimizing strategy: the small block holds no semantics,
contrary to the large blocks, and as a result the similarity between
adjacent blocks is (close to) zero. To handle this problem, we see
several possibilities: (i) taking this phenomenon into account in
the definition of the metrics di, (ii) adding a regularization term
on the block sizes to (2) to be sure they are homogeneous enough,
(iii) fixing the set of possible values for m and a distribution of
block sizes w.r.t. T. As (i) is too ad hoc and we do not see
any practical justification to support (ii), we will make a stronger
assumption on the optimal partition in Section 5 that will define
both the maximal number of blocks and their respective sizes.</p>
    </sec>
    <sec id="sec-3">
      <title>FEATURE STREAMS</title>
      <p>In this section, we review the feature streams we extract from a
video as well as the associated block similarity.</p>
      <p>We are able to extract a frame at any moment of the video. For
two frames, we can compute two metrics: a perceptual
hash distance dhash, using the phash open-source library
(http://phash.org/), and a pixel-based Hamming distance dpixel.
For two blocks [pj−1, pj] and [pj, pj+1],
we made the choice to take into account only the last frame
contained in the first block and the first frame contained in the
second block. With Watson Visual Recognition, for each frame,
we obtain a list of tags on concepts and entities with confidence
levels. For a block [pj, pj+1], we aggregate the tags of all frames
it contains into the sets Wj and Cj, respectively for the words
and the confidence levels. For two blocks, we denote by Wj,j+1
the set defined by Wj ∩ Wj+1, with C^L_{j,j+1} and C^R_{j,j+1}
the confidences associated to the left and right block.
Example: let us assume Wj = (w1, w2, w3), composed of three
words, associated to the confidence vector Cj = (c1,j, c2,j, c3,j).
Similarly, let us assume Wj+1 = (w1, w3, w4) with
Cj+1 = (c1,j+1, c3,j+1, c4,j+1). Then Wj,j+1 = (w1, w3), and the
left and right confidence vectors are given by
C^L_{j,j+1} = (c1,j, c3,j) and C^R_{j,j+1} = (c1,j+1, c3,j+1).
If |Wj−1 ∩ Wj| = 0, the similarity is zero: d^j_tags = 0. Otherwise,
the similarity measure is defined as the Jaccard index weighted
by the confidence:</p>
      <p>d^j_tags = (|Wj−1 ∩ Wj| / |Wj−1 ∪ Wj|) · (1 − ||C^L_{j−1,j} − C^R_{j−1,j}||_1 / |Wj−1 ∩ Wj|)
= (|Wj−1 ∩ Wj| − ||C^L_{j−1,j} − C^R_{j−1,j}||_1) / |Wj−1 ∪ Wj|,</p>
      <p>where |A| denotes the cardinality of A. In other words, two blocks
are similar not only if they have the same tags but also if their
confidences are close enough.</p>
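The confidence-weighted Jaccard similarity can be sketched as follows (an illustrative helper; representing each block's tags as a dict from tag to confidence is our choice, not the paper's):

```python
def tag_similarity(tags_left, tags_right):
    """Confidence-weighted Jaccard similarity between two blocks.

    tags_left / tags_right: dicts mapping a tag (word) to its
    confidence in the left / right block (illustrative representation).
    """
    common = set(tags_left) & set(tags_right)
    if not common:
        # No shared tag: the similarity is defined as zero.
        return 0.0
    union = set(tags_left) | set(tags_right)
    # L1 distance between the confidence vectors restricted to shared tags.
    l1 = sum(abs(tags_left[w] - tags_right[w]) for w in common)
    return (len(common) - l1) / len(union)
```

On the worked example above (three tags per block, two shared), this yields (2 − ||C^L − C^R||_1) / 4.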
      <p>Example: considering the vectors defined in the previous
example, the similarity between the two blocks is
d^{j+1}_tags = (2 − (|c1,j − c1,j+1| + |c3,j − c3,j+1|)) / 4.
In this preliminary work, we did not consider OCR to extract
text from the frames, nor image segmentation techniques to
obtain the location of objects in a frame. However, using Watson
Speech-to-Text, we extract the transcript from the audio track.
The service provides timestamps, making it possible to locate the
text corresponding to a given block. Similarly to Watson Visual
Recognition, Watson Natural Language Understanding is able to extract
from a text a list of keywords and concepts with a confidence
level (expressing their relevance in the text). This allows us to
define d^j_keywords and d^j_concepts in a similar way as d^j_tags.</p>
      <p>
        We normalized the transcript using NLTK [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which includes
tokenization, stopword removal, lemmatization, the computation
of n-grams from 1 to 4, and the filtering of n-grams by a minimal
number of occurrences. A given partition π ∈ P([0, T], m) can be
seen as a corpus of m documents represented as Bags-of-Words.
Using Gensim [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], we applied a TF-IDF
transformation on each block and used a Latent Dirichlet
Allocation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to express each block in a latent topic space. LDA is
a generative probabilistic model that represents documents as
mixtures of topics. Each topic is a (sparse) distribution over
the dictionary of words used in the documents, expressing
the idea that a topic is defined by a high frequency of a few terms.
The model parameters are then estimated from the real data,
in particular the word distribution per topic and the topic
distribution of each document. For each pair of adjacent blocks,
we use the LDA model to determine the distance between the
blocks, denoted by d^j_topic. LDA has three hyperparameters: K,
the number of topics, α, the document-topic prior distribution,
and η, the topic-word prior distribution. The last two parameters
control the sparsity of the distributions. In this work, we used
the online hyperparameter strategy provided by Gensim, i.e. the
values of α and η are deduced from the real data. As LDA can
be seen as a dimensionality reduction method, the number of
dimensions K influences the quality of the model and thus should
not be fixed too low, to keep enough information, nor too high,
for the documents to be summarized. For this paper, knowing
the approximate size of the Bag-of-Words representation of our
data, we fixed K = 10. Further work should focus on
hyperparameter tuning, taking into account the length of the video and
the number of cutting points at a given moment in a partition
(because it influences the transcript length of a block).
      </p>
      <p>We would like to draw attention to the fact that although the
algorithm we used is unsupervised, the feature extraction is
supervised for some feature streams. The transcript obtained with
Watson Speech-to-Text is rectified (including abbreviations and
specific term annotations) and inserted into a custom model. As a
result, from video to video, the feature extraction becomes more
and more relevant, which we believe is the primary requirement
to achieve a good semantic partitioning. Similarly, the Watson Visual
Recognition service is trained to recognize certain objects related
to the theme of the test videos (e.g. trained to recognize
specific software displayed in the video).</p>
    </sec>
    <sec id="sec-4">
      <title>SPLIT-AND-MERGE ALGORITHM</title>
      <p>
        The algorithm is composed of two phases. In the first one, called
Split, we break the video into regular blocks that are small enough
to be sure they are smaller than the smallest semantic block of
the video; thus, in particular, there are several cutting points
within each of the intervals defined by the optimal partition
π*. The Merge phase consists in merging two adjacent blocks
if they are related enough according to the metrics defined
in the previous section. Iteratively, we merge the blocks with
the highest normalized combined similarity, defined by
F(π) = (1/(n|π|)) Σ_{j=1}^{|π|} Σ_{1≤i≤n} ω_i d_i^j
= (1/(n|π|)) Σ_{j=1}^{|π|} F_j(π), with F_j(π) = Σ_{i=1}^{n} ω_i d_i^j.
However, without a stopping criterion all the blocks would be
merged. To prevent this, we take into consideration the
improvement implied by removing pj between steps k and k + 1:
R(πk, j) = F(πk+1)/F(πk), with πk+1 = πk \ {pj}. Split-and-Merge is
detailed in Algorithm 1. It has four hyperparameters: m &gt; 0,
ωi ∈ [0, 1] s.t. Σi ωi = 1, δ ∈ [0, 1] and η ∈ [0, 1[, resp. the initial
number of cutting points, the weights of each feature, a
similarity threshold under which we do not want to merge two blocks,
      </p>
      <p>and an improvement threshold under which we consider it is not
interesting enough to merge.</p>
      <p>Algorithm 1 Split-and-Merge
1: π0 ← {iT/m}, i = 0, ..., m
2: do
3:   Calculate the family {d_i^j}_{i,j} for πk
4:   F_j ← (1/n) Σ_{i=1}^{n} ωi d_i^j
5:   for F_j taken in decreasing order do
6:     if F_j ≥ δ and R(πk, j) ≤ η then
7:       πk+1 = πk \ {pj}
8:       break
9:     end if
10:  end for
11: while πk ≠ πk+1</p>
    </sec>
    <sec id="sec-5">
      <title>RESTRICTING THE SEARCH SPACE</title>
      <p>If the audio of the video is intended to describe the video content
(e.g. a documentary or a commented video for educational
purposes), we make a stronger assumption: a cutting point between
two semantic blocks can only occur between two sentences. Not
only does this assumption turn the continuous problem into an
(almost) discrete problem, but it also provides an upper bound on
the number of cutting points. If we denote by S = {s1, ..., sK } the
set of sentences in the perfect transcript, the number of cutting
points is bounded by K . It is only almost discrete, because in
many cases the time interval between two sentences may be
quite long and there is still a need to determine where to cut
exactly.</p>
      <p>Let [ε+(sj), ε−(sj+1)] ⊂ [0, T] denote the time interval
between two consecutive sentences sj and sj+1. In other words,
∀p ∈ π*, ∃j ∈ {1, ..., K − 1} s.t. p ∈ [ε+(sj), ε−(sj+1)]. To
take advantage of this assumption, we change the initial
uniform partition π0 defined in Algorithm 1 by π0 = {pi | ∀i ∈
{1, ..., K − 1}, pi = ε+(si) + (ε−(si+1) − ε+(si))/2}. Contrary to the
previous version, there is no parameter h to control the (best case)
precision. However, as we know there is a unique cutting point
in [ε+(sj), ε−(sj+1)], we can split it into an arbitrary number of
potential cutting points and select the one that minimizes the
aggregated similarity. This step is done after applying
Split-and-Merge, as a refinement step.</p>
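The modified initial partition can be sketched as follows (illustrative; `sentence_spans` standing in for sentence timestamps derived from Speech-to-Text is our assumption):

```python
def initial_partition_from_sentences(sentence_spans, T):
    """Initial cutting points at the midpoint of each inter-sentence gap.

    sentence_spans: list of (start, end) timestamps, one per sentence,
    in chronological order (names are illustrative).
    """
    pts = [0.0]
    for (_, end_j), (start_next, _) in zip(sentence_spans, sentence_spans[1:]):
        # Midpoint of [eps+(s_j), eps-(s_{j+1})], the silence between
        # two consecutive sentences.
        pts.append(end_j + (start_next - end_j) / 2)
    pts.append(T)
    return pts
```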
    </sec>
    <sec id="sec-6">
      <title>EXPERIMENTS AND RESULTS</title>
      <p>The plan of this section is as follows: we present the data we used
in Section 6.1, then we elaborate the protocol in Section 6.2. We
analyse the results in Section 6.3. Last but not least, in Section
6.4, we address some problems and possible criticisms of our protocol.</p>
    </sec>
    <sec id="sec-7">
      <title>6.1 Videos</title>
      <p>We used three short videos (2:16 to 4:00 minutes) available on
Youtube:
(1) Running an experiment in the IBM Quantum Experience3
(2) IBM Bluemix - 3 minutes demo (NodeJS, SSO)4
(3) IBM Watson for Oncology Demo5
Apart from being short, the videos have some specific
characteristics. The first video presents how to create, run and access
the results of a quantum algorithm. The video breaks down
relatively naturally into parts, except for the moment when it
explains the algorithm, where the video switches between displaying
a screen with the software used to perform the experiment and an
illustration using cards. Also, it seems to us that there is a long
block in the middle of the video which would require
Split-and-Merge to really understand how to remove inappropriate
cutting points. The second one breaks down very naturally into
several parts, thanks to a short notice with the subject of the next
part displayed during the transitions. Between transitions, the
speaker is visually present in the video, which can perturb the
visual metrics since he tends to move, thus modifying the
perceptual hash and pixel-based distance more than expected.
Finally, the third video was selected because it is harder to break
down, since the speaker's flow and presentation seem more
"continuous", without neat transitions between screens or topics.
Also, it integrates a visual transition (from black screen to a web
page and back) at the beginning and at the end, which can trick
both humans and the visual metrics.</p>
    </sec>
    <sec id="sec-8">
      <title>Protocol</title>
      <p>We asked people through a public form to indicate their optimal
partition with the following instruction: For each video, describe
how you would break it into "steps" (a bit like chapters in a book).
We added that in case of doubt, one can indicate that a transition
is weak. We respectively collected seven, five and five answers
for each video6. For each answer we obtained, we plot the cutting
points ti as three second intervals centered on ti (in blue for the
weak cutting points, red otherwise) so that we can visually
observe the areas of the timeline where people agree there is
a cutting point.</p>
      <p>Example of answer for the first video:
0:12
0:20 W
0:31 W
0:38
0:59
2:35 W
2:56 W
3:49 W
3https://www.youtube.com/watch?v=pYD6bvKLI_c
4https://www.youtube.com/watch?v=xE_I8BgDroA
5https://www.youtube.com/watch?v=338CIHlVi7A
6Due to the required time to complete the form, people were able to answer only
for some videos. On top of that, a fourth video was initially planned, but as we
received less than five answers, we decided not to report the results.</p>
      <p>In a second step, we used Split-and-Merge (without the
assumption made in Section 5) on the three videos with the following
settings: a uniform initial partition with a length of three seconds
per block, ωi = 1/n, ∀i (i.e. each feature has the same importance),
δ = 0.9 and η = 0.1. Finally, we reported the partition found
in the form of three second intervals centered on the cutting
points. The choice to assimilate a cutting point to a three second
interval results both from the protocol (people may agree on
the same cutting point but report times that differ by one or two
seconds) and from the algorithm (for computational reasons,
extracting and processing the screenshots is relatively costly).
In the analysis, we consider that two cutting points coincide
if there is an intersection between their intervals. In other
words, two cutting points are assimilated if they are separated
by less than three seconds, which corresponds to the selected
precision for the algorithm.</p>
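The interval-intersection rule used in the analysis can be made concrete (a small sketch; the helper names are ours):

```python
def as_interval(t, half_width=1.5):
    """A cutting point reported at time t becomes a three-second interval."""
    return (t - half_width, t + half_width)

def coincide(t1, t2, half_width=1.5):
    """Two cutting points are assimilated when their intervals intersect,
    i.e. when the points are less than three seconds apart."""
    a, b = as_interval(t1, half_width), as_interval(t2, half_width)
    # Standard interval-overlap test.
    return b[1] >= a[0] and a[1] >= b[0]
```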
    </sec>
    <sec id="sec-9">
      <title>Results</title>
      <p>The results are summarized in Figure 2. Let us first analyze the
panel answers. For the first video, the cutting points are relatively
numerous, with 11 in total. A lot of them are mostly blue,
indicating doubt from the panel. Around 1:00, a ten second interval
is formed, indicating variability in the choice of cutting points.
This was expected, as mentioned previously in the video
description. The large block from around 1:00 to 2:25 is also confirmed
by the answers. For the second video, and as we expected, there
is a neat consensus: every person from the panel answered the
same, that is to say, putting a cutting point when the transition is
visually indicated. There is one exception, with a weak transition,
in one answer, located on the transition between the black screen
at the beginning and the real content of the video. Regarding
the third video, the participants seem to agree on whether
a transition is weak or strong. For the second distinct area
in the timeline, around 24s, the union of intervals made of the
cutting points is nine seconds long. Checking the answers
individually confirms that no one added more than one
cutting point in this area. The results globally reflect our own
personal answers on the cutting points and, more importantly, the
panel answers reflect the difficulties we described in the previous
section. Globally, the large number of weak transitions in the first
and last videos confirms that the semantic partitioning problem is
not well defined, as the perception of what exactly a semantic
block is varies between persons and videos.</p>
      <p>In the first video, only two out of eleven cutting points have
been found. There are two pairs of red and blue cutting points
separated by only four seconds (at resp. 0:08 and 0:30) and for
both, Split-and-Merge identified a cutting point in the middle. We
explain this by the fact that the algorithm did not find the blue
cutting points relevant (the selected cutting points are closer
to the red ones). One good point to notice is the absence of
cutting points in the large middle block. In general, the results
on this video are disappointing because the video seemed easy
for humans to break, and none of the most consensual cutting
points (red at 1:10 and 2:28) has been identified. In the second
video, the size of the final partition matches the size of the panel
optimal partition, with seven cutting points. However, except for
two points, they are all misplaced: the one around 0:25 is about
five seconds late and the ones at 2:20 and 2:40 are three seconds
early. The two others are clearly misplaced, almost in the worst
possible location, i.e. in the middle of an interval (at least not the
same one). It is quite surprising that the video with the clearest
consensus on the optimal partition does not yield better results.
The fact that the speaker appears in the video may be an
explanation, as between two frames the number of pixels that
change is likely to be higher without any change in the semantics.
This would imply that the visual metrics play a more important
role than we first expected, at least for videos with this
characteristic. Further experiments with different weights ωi and
higher thresholds for the visual metrics need to be carried out.
In the last video, Split-and-Merge successfully identified seven out
of nine cutting points, among which two are identified with a zero
second difference. However, the most important cutting point
(not identified as weak, and present in all five answers with at
most one second difference) is not identified. In the large interval
around 0:24, Split-and-Merge identified the cutting point at the
extreme right side. Surprisingly, Split-and-Merge performed
best on the third video, which seems to the authors to be the
hardest to naturally break into semantically distinct parts.</p>
    </sec>
    <sec id="sec-10">
      <title>Discussion on the protocol</title>
      <p>The small number of answers collected to create the optimal
partition may be a threat to validity. However, as shown in Figure
2, a consensus seems to emerge, especially for the second video,
which mitigates this criticism. Another drawback of our protocol
is that the results are more qualitative than quantitative: we do
not measure any distance from the Split-and-Merge result to the
optimal partition. We justify this approach by the fact that such a
metric may be useful to compare algorithms to each other, but is
too ad hoc to be interesting by itself. In a preliminary evaluation,
and taking into account the "fuzziness" of the optimal partition, a
qualitative description of the results appeared sufficient to us.
A last criticism that may arise is the lack of a clear definition of
what is or should be a cutting point in the instructions for the
panel. This was done on purpose, since a more precise indication
would probably contain explicit references to some feature streams
or metrics on feature streams and indirectly influence the answers.</p>
    </sec>
    <sec id="sec-11">
      <title>CONCLUSION AND FUTURE WORK</title>
      <p>In this paper we presented the Video Semantic Partitioning
problem, defined as finding an optimal partition w.r.t. an unknown
and not so well defined function measuring the semantics. We
reformulated the problem as the optimization of a convex
combination supposedly approximating the unknown function. We
justified this construction by the extraction of feature streams, each
of them holding a part of the semantics contained in the video. To
solve the problem, we introduced a heuristic called Split-and-Merge.
Its main idea is to cut the video into too many pieces before greedily
merging the blocks until the improvement is negligible. Under
a restrictive assumption, we reduced the continuous problem
to an (almost) discrete problem and adapted the algorithm
accordingly. Finally, we presented the results obtained on a few test
videos and discussed the protocol.</p>
      <p>Despite the variability in their quality, the results are
encouraging, considering that no hyperparameter tuning was
performed. Surprisingly, the algorithm obtains the best results
on what we considered to be the hardest video and the worst
results on the easiest video.</p>
      <p>In future work, we will investigate how to tune the
hyperparameters to get the best out of the algorithm on a set of videos.
We are also interested in comparing the convex combination
approach presented in this paper with a multi-objective approach.
Also, we need to investigate how well the modified algorithm
under the restrictive assumption performs compared to the initial
version. Last but not least, we plan to incorporate more advanced
feature streams, such as objects obtained by video segmentation.</p>
    </sec>
    <sec id="sec-12">
      <title>ACKNOWLEDGMENT</title>
      <p>The authors warmly thank IBM BTO and IBM Krakow Software
Lab for their financial support.</p>
    </sec>
    <sec id="sec-13">
      <title>AUTHOR CONTRIBUTIONS</title>
      <p>Marcin Janiszewski pointed out the problem and guided its
resolution using IBM Watson Services.</p>
      <p>Krzysztof Jamrog implemented the algorithm and trained IBM
Watson Speech-to-Text to increase the transcript quality.
Alexandre Quemy conceived the algorithm, wrote the article,
selected the data, conceived and ran the experiments, and analyzed
the data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Anne</given-names>
            <surname>Auger</surname>
          </string-name>
          , Johannes Bader, Dimo Brockhoff, and
          <string-name>
            <given-names>Eckart</given-names>
            <surname>Zitzler</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Hypervolume-based multiobjective optimization: Theoretical foundations and practical implications</article-title>
          .
          <source>Theoretical Computer Science</source>
          <volume>425</volume>
          ,
          Supplement C
          (
          <year>2012</year>
          ),
          <fpage>75</fpage>
          -
          <lpage>103</lpage>
          . https://doi.org/10.1016/j.tcs.2011.03.012. Theoretical Foundations of Evolutionary Computation.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Nyein</given-names>
            <surname>Myint Myint Aung</surname>
          </string-name>
          and Su Su Maung.
          <year>2013</year>
          .
          <article-title>Semantic Based Text Block Segmentation Using WordNet</article-title>
          .
          <source>International Journal of Computer and Communication Engineering</source>
          <volume>2</volume>
          ,
          <issue>5</issue>
          (
          <year>2013</year>
          ),
          <fpage>601</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>David M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Michael I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Latent Dirichlet Allocation</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          <volume>3</volume>
          (March
          <year>2003</year>
          ),
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          . http://dl.acm.org/citation.cfm?id=944919.944937
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Kalyanmoy</given-names>
            <surname>Deb</surname>
          </string-name>
          , Manikanth Mohan, and
          <string-name>
            <given-names>Shikhar</given-names>
            <surname>Mishra</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Evaluating the ε-Domination Based Multi-Objective Evolutionary Algorithm for a Quick Computation of Pareto-Optimal Solutions</article-title>
          .
          <source>Evol. Comput</source>
          .
          <volume>13</volume>
          ,
          <issue>4</issue>
          (Dec.
          <year>2005</year>
          ),
          <fpage>501</fpage>
          -
          <lpage>525</lpage>
          . https://doi.org/10.1162/106365605774666895
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Jisheng</given-names>
            <surname>Liang</surname>
          </string-name>
          , Ihsin T. Phillips, and
          <string-name>
            <given-names>Robert M.</given-names>
            <surname>Haralick</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Consistent Partition and Labelling of Text Blocks</article-title>
          .
          <source>Pattern Anal. Appl</source>
          .
          <volume>3</volume>
          (
          <year>2000</year>
          ),
          <fpage>196</fpage>
          -
          <lpage>208</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Edward</given-names>
            <surname>Loper</surname>
          </string-name>
          and
          <string-name>
            <given-names>Steven</given-names>
            <surname>Bird</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>NLTK: The Natural Language Toolkit</article-title>
          .
          In
          <source>Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1</source>
          (ETMTNLP '02). Association for Computational Linguistics, Stroudsburg, PA, USA,
          <fpage>63</fpage>
          -
          <lpage>70</lpage>
          . https://doi.org/10.3115/1118108.1118117
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Achille</given-names>
            <surname>Messac</surname>
          </string-name>
          , Cyriaque Puemi-Sukam, and
          <string-name>
            <given-names>Emanuel</given-names>
            <surname>Melachrinoudis</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Aggregate Objective Functions and Pareto Frontiers: Required Relationships and Practical Implications</article-title>
          .
          <source>Optimization and Engineering</source>
          <volume>1</volume>
          ,
          <issue>2</issue>
          (
          <year>01 Jul 2000</year>
          ),
          <fpage>171</fpage>
          -
          <lpage>188</lpage>
          . https://doi.org/10.1023/A:1010035730904
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Radim</given-names>
            <surname>Řehůřek</surname>
          </string-name>
          and
          <string-name>
            <given-names>Petr</given-names>
            <surname>Sojka</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Software Framework for Topic Modelling with Large Corpora</article-title>
          .
          <source>In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA</source>
          , Valletta, Malta,
          <fpage>45</fpage>
          -
          <lpage>50</lpage>
          . http://is.muni.cz/publication/884893/en.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>