=Paper=
{{Paper
|id=Vol-2083/paper-07
|storemode=property
|title=Unsupervised Video Semantic Partitioning Using IBM Watson and Topic Modelling
|pdfUrl=https://ceur-ws.org/Vol-2083/paper-07.pdf
|volume=Vol-2083
|authors=Alexandre Quemy,Krzysztof Jamrog,Marcin Janiszewski
|dblpUrl=https://dblp.org/rec/conf/edbt/QuemyJJ18
}}
==Unsupervised Video Semantic Partitioning Using IBM Watson and Topic Modelling==
Alexandre Quemy, IBM Software Lab, Cracow, Poland, and Faculty of Computing, Poznań University of Technology, Poznań, Poland (aquemy@pl.ibm.com)
Krzysztof Jamrog, IBM Software Lab, Cracow, Poland (krzysztof.jamrog1@pl.ibm.com)
Marcin Janiszewski, IBM Software Lab, Cracow, Poland (marcin.janiszewski@pl.ibm.com)
ABSTRACT

In this paper, we present the problem of Video Semantic Partitioning, which consists in breaking a video into semantically independent blocks. Defined within an optimization framework, we present a preliminary heuristic approach to solve the problem, called Split-and-Merge. The algorithm itself is unsupervised, but some of the mechanisms used to extract data from videos are supervised, since they rely on IBM Watson Services. Finally, we demonstrate the capabilities of our prototype on a few videos and discuss its limitations and future improvements. From the experiments, we draw two conclusions: (i) the optimal solution to the problem varies from human to human, with a large variability from video to video; (ii) Split-and-Merge demonstrates encouraging qualitative results in finding the average optimal solution, defined as the average solution given by humans.

1 INTRODUCTION

The problem of text segmentation, that is to say partitioning a text into semantic blocks, has been widely studied (e.g. in [2, 5]) but, as far as we know, never extended to videos. Indeed, video segmentation usually refers to the process of extracting objects from a video, not breaking it into semantic chunks. In this paper, we define the Video Semantic Partitioning problem as an extension of the text segmentation problem and present our first attempt to solve it. Among the possible applications, we can highlight a better understanding of video content for tagging and indexing in a database, or the creation of educational paths by joining semantic blocks from different videos to create a new one.

A video is a complex and multidimensional signal holding a lot of information in a very short amount of time. In addition, the raw information is totally unstructured and has to be extracted, contrary to text segmentation where this information is mostly structured (the text is already provided as such). By raw, we mean the primary source of semantics. For a given text, it mainly resides in the sentences and words themselves, while the support brings some secondary elements (e.g. in [2] the support is an HTML page and the tags are used to delimit some sections). For a video, the primary source is not as well defined: is it the audio track or the visual information? Most likely a combination of both. Additionally, this raw information is not directly suitable for most algorithms and requires a pre-treatment to extract exploitable data. Those techniques are known as Cognitive Computing, i.e. processing and analyzing large unstructured signals. Information extraction plays a preponderant role in algorithm performance, because even the best possible algorithm would return poor results on badly extracted information, which is equivalent to noise.

The plan of this paper is as follows: Section 2 formalizes the Video Semantic Partitioning problem, Section 3 presents the feature extraction mechanisms, while in Section 4 the algorithm Split-and-Merge is presented. In Section 5 we discuss an optimized version of the algorithm under a restrictive assumption about the videos. In Section 6, we present some results obtained on real videos, and we conclude this paper in Section 7.

© 2018 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna, Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

2 VIDEO SEMANTIC PARTITIONING

A video is a temporal signal on a time interval [0,T]. We denote by P([0,T], m) the set of partitions of [0,T] into m elements, i.e. the elements of P([0,T], m) are of the form π = (p_0, ..., p_m, p_{m+1}), with ∀i ∈ {0, ..., m}, p_i < p_{i+1}, p_0 = 0 and p_{m+1} = T, such that ∪_{i=0}^{m} [p_i, p_{i+1}] = [0,T].

Our goal is to find a partition such that each [p_{i−1}, p_i] is an independent semantic block, i.e. holds a different semantic than its adjacent blocks. Let π* denote the optimal partition. The first formulation of the Semantic Partitioning problem is to find a minimizing sequence π_k such that π_k → π*, with a convergence notion that implies that the number of cutting points of π_k converges to that of π*¹ and that each cutting point of π* is the limit of a cutting point of π_k.

As an element of any partition π is contained in a closed interval defined by two cutting points of π*, we can define the following cost function: ∀π ∈ ∪_{m∈ℕ*} P([0,T], m), ∀p_i ∈ π,

    c(p_i) = |p*_j − p_i| · |p_i − p*_{j+1}|, with p_i ∈ [p*_j, p*_{j+1}].

Additionally, we define

    J(π, π*) = Σ_{i=1}^{m−1} [c(p_i) + c(p_{i+1})].

Under the assumption that π has as many cutting points as π*, if J(π, π*) = 0 then π = π*, and the problem can be seen as finding a minimizing sequence π_k s.t. J(π_k, π*) → 0. In practice, as π* is unknown, there are two problems: (i) the optimal number of cutting points is not known, and (ii) J is not available.

For (i), assuming J is available, if π_0 contains enough cutting points, with some in each interval induced by π*, we can minimize J given π_0 by removing the unnecessary cutting points. Assuming we know how to remove the unnecessary cutting points, if π_{0,h} is the uniform partition s.t. ∀p_i, p_{i+1} ∈ π_0, p_{i+1} − p_i = h, then ∀h_1, h_2 s.t. h_1 < h_2, J(π_{0,h_1}, π*) < J(π_{0,h_2}, π*). More generally, we can control J by h.

¹ We may not accept the case where two cutting points are identical, creating an empty interval, thus with an empty semantics. However, it may be convenient for practical algorithms, and thus we assume we merge the similar cutting points of π_k at any time if necessary.
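To make the cost function concrete, here is a minimal Python sketch of c and J (function names are ours; the optimal partition π* is assumed to be known here, which in practice it is not):

```python
from bisect import bisect_right

def cut_cost(p, pi_star):
    """Cost of one cutting point p w.r.t. the optimal partition pi_star:
    c(p) = |p*_j - p| * |p - p*_{j+1}| where p lies in [p*_j, p*_{j+1}].
    The cost vanishes exactly when p coincides with an optimal cutting point."""
    j = bisect_right(pi_star, p) - 1
    j = max(0, min(j, len(pi_star) - 2))  # clamp so p == T falls in the last interval
    left, right = pi_star[j], pi_star[j + 1]
    return abs(left - p) * abs(p - right)

def J(pi, pi_star):
    """Aggregate cost of a candidate partition: sum of the costs of its
    interior cutting points (the endpoints 0 and T are fixed)."""
    return sum(cut_cost(p, pi_star) for p in pi[1:-1])
```

For instance, with π* = (0, 0.3, 1.0), the partition (0, 0.5, 1.0) has J = 0.2 · 0.5 = 0.1, while π* itself has J = 0.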
For (ii), we need to find an approximation of J. As π* is a partition into semantic blocks, we will try to rely on the semantics contained in the video to approximate J. To do so, we are able to identify n feature streams from the video, that is to say, n sequences of features on [0,T]. For each i ∈ {1, ..., n}, we have a similarity metric d_i : [0,T] × [0,T] → [0, 1] to quantify how related two blocks are w.r.t. a particular feature stream. In particular, ∀p_1, p_2, d_i([p_1, p_2], [p_1, p_2]) = 1, because the semantics of a block is obviously equal to its own semantics. However, [p_1, p_2] ∩ [p_3, p_4] = ∅ does not imply d_i([p_1, p_2], [p_3, p_4]) = 0, i.e. it is not because two blocks do not overlap in time that they have no semantic similarity. To ease the notation, given a partition π, we denote by d_i^j the similarity between the blocks [p_{j−1}, p_j] and [p_j, p_{j+1}], i.e. d_i^j = d_i([p_{j−1}, p_j], [p_j, p_{j+1}]).

The problem of video partitioning is to find a partition π ∈ ∪_{m∈ℕ*} P([0,T], m) that minimizes the adjacent block similarity, that is to say

    π* = argmin_{π ∈ ∪_{m∈ℕ*} P([0,T],m)} ⊕_{1≤i≤n} Σ_{j=1}^{m} d_i^j    (1)

with ⊕ an aggregation operator.

For the purpose of this paper, we will consider ⊕ as a convex combination of the block similarities, allowing us to reformulate the problem as follows:

    π* = argmin_{π ∈ ∪_{m∈ℕ*} P([0,T],m)} Σ_{j=1}^{m} Σ_{1≤i≤n} ω_i d_i^j, with Σ_{1≤i≤n} ω_i = 1    (2)

The choice of the operator ⊕ is an interesting problem by itself, because it is known to greatly influence an algorithm's capacity to find the optimal solution [7]. In the future, we plan to investigate the choice of this operator, and in particular to treat the problem as multi-objective (optimizing on each feature stream independently), using the Hypervolume [1] or ε-dominance [4] metric as aggregation function.

As said before, one major difficulty comes from the fact that the optimal number of elements in the partition is not known a priori. Notice that in practice, if an interval is too small, the information it holds is considered null. For instance, a very small interval does not hold more than a few words or no word at all, no screenshot, etc. As a result, its similarity with another very small interval would be one or close to one. In other words, dividing [0,T] too many times does not minimize the objective function. However, having two large blocks separated by a very small block would be a minimizing strategy: the small block holds no semantics, contrary to the large blocks, and as a result the similarity between adjacent blocks is (close to) zero. To handle this problem, we see several possibilities: (i) taking this phenomenon into account in the definition of the metrics d_i; (ii) adding a regularization term on the block sizes to (2) to be sure they are homogeneous enough; (iii) fixing the set of possible values for m and a distribution of block sizes w.r.t. T. As (i) is too ad hoc and we do not see any practical justification to support (ii), we will make a stronger assumption on the optimal partition in Section 5 that will define both the maximal number of blocks and their respective sizes.

3 FEATURE STREAMS

In this section, we review the feature streams we extract from a video as well as the associated block similarities.

We are able to extract a frame at any moment of the video. For two frames, we can calculate two metrics: a perceptual hash d_hash using the pHash open-source library² and a pixel-based Hamming distance d_pixel. For two blocks [p_{j−1}, p_j] and [p_j, p_{j+1}], we made the choice to take into account only the last frame contained in the first block and the first frame contained in the second block, respectively. With Watson Visual Recognition, for each frame we are able to obtain a list of tags with confidence levels on concepts and entities. For a block [p_j, p_{j+1}], we aggregate the tags of all the frames it contains into the sets W_j and C_j, respectively for the words and the confidence levels. For two blocks, we denote by W_{j,j+1} the set defined by W_j ∩ W_{j+1}, with C^L_{j,j+1} and C^R_{j,j+1} the confidences associated to the left and right block.

Example: Let us assume W_j = (w_1, w_2, w_3), composed of three words, associated to the confidence vector C_j = (c_{1,j}, c_{2,j}, c_{3,j}). Similarly, let us assume W_{j+1} = (w_1, w_3, w_4) with C_{j+1} = (c_{1,j+1}, c_{3,j+1}, c_{4,j+1}). Then W_{j,j+1} = (w_1, w_3), and the left and right confidence vectors are given by C^L_{j,j+1} = (c_{1,j}, c_{3,j}) and C^R_{j,j+1} = (c_{1,j+1}, c_{3,j+1}).

If |W_{j−1} ∩ W_j| = 0, the similarity is zero: d_tags^j = 0. Otherwise, the similarity measure is defined as the Jaccard index weighted by the confidences:

    d_tags^j = (|W_{j−1} ∩ W_j| / |W_{j−1} ∪ W_j|) · (1 − ||C^L_{j−1,j} − C^R_{j−1,j}||_1 / |W_{j−1} ∩ W_j|)
             = (|W_{j−1} ∩ W_j| − ||C^L_{j−1,j} − C^R_{j−1,j}||_1) / |W_{j−1} ∪ W_j|

where |A| denotes the cardinality of A. In other words, two blocks are similar not only if they have the same tags, but also if their confidences are close enough.

Example: Considering the vectors defined in the previous example, the similarity between the two blocks is

    d_tags^{j+1} = (2 − [|c_{1,j} − c_{1,j+1}| + |c_{3,j} − c_{3,j+1}|]) / 4

In this preliminary work, we did not consider OCR to extract the text from the frames, nor image segmentation techniques to obtain the location of objects in a frame. However, using Watson Speech-to-Text, we extract the transcript from the audio track. The service provides timestamps, making it possible to locate the text corresponding to a given block. Similarly to Watson Visual Recognition, Watson Natural Language Understanding is able to extract from a text a list of keywords and concepts with a confidence level (expressing the relevance within the text). This allows us to define d_keywords^j and d_concepts^j in a similar way as d_tags^j.

² http://phash.org/
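The confidence-weighted Jaccard index above can be sketched in a few lines of Python (a sketch with names of our own; tags are assumed to be already aggregated per block into tag → confidence mappings):

```python
def d_tags(tags_left, tags_right):
    """Confidence-weighted Jaccard similarity between two adjacent blocks.
    tags_left / tags_right map each tag (word) to its confidence in [0, 1]."""
    common = set(tags_left) & set(tags_right)
    if not common:
        return 0.0                      # no shared tag: similarity is zero
    union = set(tags_left) | set(tags_right)
    # L1 distance between the confidence vectors restricted to the common tags
    l1 = sum(abs(tags_left[w] - tags_right[w]) for w in common)
    return (len(common) - l1) / len(union)
```

With the vectors of the worked example (two common tags out of four), this returns (2 − [|c_{1,j} − c_{1,j+1}| + |c_{3,j} − c_{3,j+1}|]) / 4, matching the formula in the text.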
We normalized the transcript using NLTK [6], which includes tokenization, stopword removal, lemmatization, the computation of n-grams from 1 to 4, and the filtering of n-grams by a minimal number of occurrences. A given partition π ∈ P([0,T], m) can be seen as a corpus of m documents represented by bags of words. Using Gensim [8], we applied a TF-IDF transformation on each block and used Latent Dirichlet Allocation [3] to express each block in a latent topic space. LDA is a generative probabilistic model that represents documents as a mixture of topics. Each topic represents a (sparse) distribution over the dictionary of words used in the documents, expressing the idea that a topic is defined by a high frequency of a few terms. The model parameters are then estimated from the real data, in particular the word distribution per topic and the topic distribution for each document. For each pair of adjacent blocks, we use the LDA model to determine the distance between the blocks, denoted by d_topic^j. LDA has three hyperparameters: K, the number of topics, α, the document-topic prior distribution, and η, the topic-word prior distribution. The last two parameters control the sparsity of the distributions. In this work, we used the online hyperparameter strategy provided by Gensim, i.e. the values for α and η are deduced from the real data. As LDA can be seen as a dimensionality reduction method, the number of dimensions K influences the quality of the model; it should thus be fixed not too low, to keep enough information, but not too high, for the documents to be summarized. For this paper, knowing the approximate size of the bag-of-words representation of our data, we fixed K = 10. Further work should focus on hyperparameter tuning that takes into account the length of the video and the number of cutting points at a given moment in a partition (because it influences the transcript length of a block).

We would like to draw attention to the fact that while the algorithm we used is not supervised, the feature extraction is supervised for some feature streams. The transcript obtained with Watson Speech-to-Text is rectified (including abbreviation and specific term annotations) and inserted into a custom model. As a result, from video to video, the feature extraction becomes more and more relevant, which we believe is the primary requirement to achieve a good semantic partitioning. Similarly, the Watson Visual Recognition service is trained to recognize certain objects related to the theme of the test videos (e.g. trained to recognize specific software displayed in the videos).

4 SPLIT-AND-MERGE ALGORITHM

The algorithm is composed of two phases. In the first one, called Split, we break the video into regular blocks that are small enough to be sure they are smaller than the smallest semantic block of the video; thus, in particular, there are several cutting points within each of the intervals defined by the optimal partition π*. The Merge phase consists in merging two adjacent blocks if they are related enough according to the metrics defined in the previous section. Iteratively, we merge the blocks with the highest normalized combined similarity, defined by

    F(π) = (1 / (n|π|)) Σ_{j=1}^{|π|} Σ_{1≤i≤n} ω_i d_i^j = (1 / |π|) Σ_{j=1}^{|π|} F^j(π), with F^j(π) = (1/n) Σ_{i=1}^{n} ω_i d_i^j.

However, without a stopping criterion, all the blocks would eventually be merged. To prevent this, we take into consideration the improvement implied by removing p_j between steps k and k+1: R(π_k, j) = F(π_{k+1}) / F(π_k), with π_{k+1} = π_k \ {p_j}. Split-and-Merge is detailed in Algorithm 1. It has four hyperparameters: m > 0, ω_i ∈ [0, 1] s.t. Σ_i ω_i = 1, δ ∈ [0, 1] and η ∈ [0, 1), resp. the initial number of cutting points, the weights for each feature, a similarity threshold under which we do not want to merge two blocks, and an improvement threshold under which we consider it is not interesting enough to merge.

Algorithm 1 Split-and-Merge
 1: π_0 ← {(T/m) · i}_{i=0}^{m}
 2: do
 3:   Calculate the family {d_i^j}_{i,j} for π_k
 4:   F^j ← (1/n) Σ_{i=1}^{n} ω_i d_i^j
 5:   for F^j taken in decreasing order do
 6:     if F^j ≥ δ and R(π_k, j) ≤ η then
 7:       π_{k+1} ← π_k \ {p_j}
 8:       break
 9:     end if
10:   end for
11: while π_k ≠ π_{k+1}

Figure 1: Illustration of Split-and-Merge. On top, the initial uniform partition, finer than the optimal partition π*. At the bottom, the partition after a few steps: several cutting points have been removed. Notice that, as p*_1 and p*_2 are not on the initial partition, it is impossible to have them in the output of Split-and-Merge.

5 RESTRICTING THE SEARCH SPACE

If the audio of the video is intended to describe the video content (e.g. a documentary or a commented video for educational purposes), we can make a stronger assumption: a cutting point between two semantic blocks can only occur between two sentences. Not only does this assumption turn the continuous problem into an (almost) discrete one, but it also provides an upper bound on the number of cutting points. If we denote by S = {s_1, ..., s_K} the set of sentences in the perfect transcript, the number of cutting points is bounded by K. It is only almost discrete because, in many cases, the time interval between two sentences may be quite long and there is still a need to determine where to cut exactly.

Let [ε⁺(s_j), ε⁻(s_{j+1})] ⊂ [0,T] denote the time interval between two consecutive sentences s_j and s_{j+1}. In other words, ∀p ∈ π*, ∃j ∈ {1, ..., K−1} s.t. p ∈ [ε⁺(s_j), ε⁻(s_{j+1})]. To take advantage of this assumption, we change the initial uniform partition π_0 defined in Algorithm 1 to π_0 = {p_i | ∀i ∈ {1, ..., K−1}, p_i = ε⁺(s_i) + |ε⁻(s_{i+1}) − ε⁺(s_i)| / 2}. Contrary to the previous version, there is no parameter h to control the (best case) precision. However, as we know there is a unique cutting point in [ε⁺(s_j), ε⁻(s_{j+1})], we can split this interval into an arbitrary number of potential cutting points and select the one that minimizes the aggregated similarity. This step is done after applying Split-and-Merge, as a refinement step.
6 EXPERIMENTS AND RESULTS

The plan of this section is as follows: we present the data we used in Section 6.1, then we elaborate on the protocol in Section 6.2. We analyse the results in Section 6.3. Last but not least, in Section 6.4, we address some possible criticisms of our protocol.

6.1 Videos

We used three short videos (2:16 to 4:00 minutes) available on Youtube:

(1) Running an experiment in the IBM Quantum Experience³
(2) IBM Bluemix - 3 minutes demo (NodeJS, SSO)⁴
(3) IBM Watson for Oncology Demo⁵

Apart from being short, the videos have some specific characteristics. The first video presents how to create, run and access the result of a quantum algorithm. The video is relatively naturally broken down into parts, except for the moment when it explains the algorithm, where the video switches between displaying a screen with the software used to perform the experiment and an illustration using cards. Also, it seems to us that there is a long block in the middle of the video, which would require Split-and-Merge to really understand how to remove inappropriate cutting points. The second one is very naturally broken down into several parts by a short notice with the subject of the next part displayed during the transitions. Between transitions, the speaker is visually present in the video, which can perturb the visual metrics since he tends to move, thus modifying the perceptual hash and pixel-based distance more than expected. Finally, the third video has been selected because it is harder to break down, since the speaker's flow and presentation seem more "continuous", without neat transitions between screens or topics. Also, it integrates a visual transition (from black screen to a web page and back) at the beginning and at the end, which can trick both humans and the visual metrics.

6.2 Protocol

We asked people through a public form to indicate their optimal partition with the following instruction: For each video, describe how you would break it into "steps" (a bit like chapters in a book). We added that in case of doubt, one can indicate that a transition is weak. We respectively collected seven, five and five answers for each video⁶. For each answer we obtained, we plot the cutting points t_i as three second intervals centered on t_i (in blue for the weak cutting points, red otherwise), such that we can visually observe the areas on the timeline where people agree there is a cutting point.

Example of an answer for the first video:
0:12
0:20 W
0:31 W
0:38
0:59
2:35 W
2:56 W
3:49 W

In a second step, we used Split-and-Merge (without the assumption made in Section 5) on the three videos with the following settings: a uniform initial partition with a length of three seconds per block, ω_i = 1/n, ∀i (i.e. each feature has the same importance), δ = 0.9 and η = 0.1. Finally, we reported the partition found in the form of three second intervals centered on the cutting points. The choice to assimilate a cutting point to a three second interval results both from the protocol (people may agree on the same cutting point but report times that differ by one or two seconds) and from the algorithm (for computational reasons, extracting and processing the screenshots is relatively costly). In the analysis, we consider that two cutting points coincide if there is an intersection between their intervals. In other words, two cutting points are assimilated if they are separated by less than three seconds, which corresponds to the selected precision for the algorithm.

6.3 Results

The results are summarized in Figure 2. Let us first analyze the panel answers. For the first video, the cutting points are relatively numerous, with 11 in total. A lot of them are mostly in blue, indicating a doubt from the panel. Around 1:00, a ten second interval is formed, indicating variability in the choice of cutting points. This was expected, as mentioned previously in the video description. The large block from around 1:00 to 2:25 is also confirmed by the answers. For the second video, and as we expected, there is a neat consensus: every person from the panel answered the same, that is to say, putting a cutting point where the transition is visually indicated. There is one exception, a weak transition in one answer, located on the transition between the black screen at the beginning and the real content of the video. Regarding the third video, the participants seem to agree on whether a transition is weak or strong. For the second distinct area on the timeline, around 24s, the union of the intervals made of the cutting points is nine seconds long. Checking the answers individually confirms that no one added more than one cutting point in this area. The results globally reflect our own personal answers on the cutting points and, more importantly, the panel answers reflect the difficulties we described in the previous section. Globally, the large amount of weak transitions in the first and last videos confirms that the semantic partitioning problem is not well defined, as the perception of what exactly is a semantic block varies between persons and videos.

In the first video, only two out of eleven cutting points have been found. There are two pairs of red and blue cutting points separated by only four seconds (at resp. 0:08 and 0:30) and, for both, Split-and-Merge identified a cutting point in the middle. We explain this by the fact that the algorithm did not find the blue cutting points relevant (the selected cutting points are closer to the red ones). One good point to notice is the absence of cutting points in the large middle block. In general, the results on this video are disappointing, because the video seemed easy to break for humans, and none of the most consensual cutting points (red at 1:10 and 2:28) have been identified. In the second video, the size of the final partition matches the size of the panel's optimal partition, with seven cutting points. However, except for two points, they are all misplaced: the one around 0:25 is about five seconds late and the ones at 2:20 and 2:40 are three seconds early. The two others are clearly misplaced, almost in the worst possible location, i.e. in the middle of an interval (at least not the same one).

³ https://www.youtube.com/watch?v=pYD6bvKLI_c
⁴ https://www.youtube.com/watch?v=xE_I8BgDroA
⁵ https://www.youtube.com/watch?v=338CIHlVi7A
⁶ Due to the required time to complete the form, people were able to answer only for some videos. On top of that, a fourth video was initially planned but, as we received fewer than five answers, we decided not to report its results.
Figure 2: Timelines for the three videos. From top to bottom: videos 1, 2 and 3. The distribution of weak cutting points is indicated in blue, the normal ones in red. A cutting point is materialized by a three second interval, and the opacity represents how much an area has been selected as a cutting point by the panel.

It is quite surprising that the video with the clearest consensus on the optimal partition does not provide better results. The fact that the speaker appears in the video may be an explanation, as between two frames the number of pixels that changed is most likely to be higher without the semantics changing. This would imply that the visual metrics play a more important role than we first expected, at least for videos with this characteristic. Further experiments with different weights ω_i and higher thresholds for the visual metrics need to be carried out. In the last video, Split-and-Merge successfully identified seven out of nine cutting points, among which two are identified with a zero second difference. However, the most important cutting point (not identified as weak, and present in all five answers with at most one second difference) is not identified. In the large interval around 0:24, Split-and-Merge identified the cutting point on the extreme right side. Surprisingly, Split-and-Merge performed best on the third video, which seems to the authors to be the hardest to naturally break into semantically distinct parts.

6.4 Discussion on the protocol

The small number of answers collected to create the optimal partition may be a threat to validity. However, as shown in Figure 2, a consensus seems to emerge, especially for the second video, which can mitigate this criticism. Another drawback of our protocol is that the results are more qualitative than quantitative: we do not measure any distance from the Split-and-Merge result to the optimal partition. We justify this approach by the fact that such a metric may be useful to compare algorithms to each other, but too ad hoc to be interesting by itself. In a preliminary evaluation, and taking into account the "fuzziness" of the optimal partition, a qualitative description of the results appeared to be enough to us. A last criticism that may arise is the lack of a clear definition of what is or should be a cutting point in the instruction given to the panel. This was done on purpose, since a more precise indication would probably contain explicit references to some feature streams or metrics on feature streams and indirectly influence the answers.

7 CONCLUSION AND FUTURE WORK

In this paper we presented the Video Semantic Partitioning problem, defined as finding an optimal partition w.r.t. an unknown and not so well defined function measuring the semantics. We reformulated the problem as the optimization of a convex combination supposedly approximating the unknown function. We justified this construction by the extraction of feature streams, each of them holding a part of the semantics contained in the video. To solve the problem, we introduced a heuristic called Split-and-Merge. Its main idea is to cut the video into too many pieces before greedily merging the blocks until the improvement is negligible. Under a restrictive assumption, we reduced the continuous problem to an (almost) discrete problem and adapted the algorithm accordingly. Finally, we presented the results obtained on a few test videos and discussed the protocol.

Despite the variability in the quality of the results, they are encouraging, taking into account that no hyperparameter tuning was performed. Surprisingly, the algorithm obtains the best results on what we considered to be the hardest video and the worst results on the easiest one.

In future work, we will investigate how to tune the hyperparameters to get the best out of the algorithm on a set of videos. We are also interested in comparing the convex combination approach presented in this paper with a multi-objective approach. Also, we need to investigate how well the modified algorithm under the restrictive assumption performs compared to the initial version. Last but not least, we plan to incorporate more advanced feature streams, such as objects obtained by video segmentation.

ACKNOWLEDGMENT

The authors warmly thank IBM BTO and IBM Krakow Software Lab for their financial support.

AUTHOR CONTRIBUTIONS

Marcin Janiszewski pointed out the problem and guided its resolution using IBM Watson Services.
Krzysztof Jamrog implemented the algorithm and taught IBM Watson Speech-to-Text to increase the transcript quality.
Alexandre Quemy conceived the algorithm, wrote the article, selected the data, conceived and ran the experiments, and analyzed the data.
REFERENCES
[1] Anne Auger, Johannes Bader, Dimo Brockhoff, and Eckart Zitzler. 2012. Hypervolume-based multiobjective optimization: Theoretical foundations and practical implications. Theoretical Computer Science 425, Supplement C (2012), 75–103. https://doi.org/10.1016/j.tcs.2011.03.012
[2] Nyein Myint Myint Aung and Su Su Maung. 2013. Semantic Based Text Block Segmentation Using WordNet. International Journal of Computer and Communication Engineering 2, 5 (2013), 601.
[3] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. J. Mach. Learn. Res. 3 (March 2003), 993–1022. http://dl.acm.org/citation.cfm?id=944919.944937
[4] Kalyanmoy Deb, Manikanth Mohan, and Shikhar Mishra. 2005. Evaluating the ε-Domination Based Multi-Objective Evolutionary Algorithm for a Quick Computation of Pareto-Optimal Solutions. Evol. Comput. 13, 4 (Dec. 2005), 501–525. https://doi.org/10.1162/106365605774666895
[5] Jisheng Liang, Ihsin T. Phillips, and Robert M. Haralick. 2000. Consistent Partition and Labelling of Text Blocks. Pattern Anal. Appl. 3 (2000), 196–208.
[6] Edward Loper and Steven Bird. 2002. NLTK: The Natural Language Toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1 (ETMTNLP '02). Association for Computational Linguistics, Stroudsburg, PA, USA, 63–70. https://doi.org/10.3115/1118108.1118117
[7] Achille Messac, Cyriaque Puemi-Sukam, and Emanuel Melachrinoudis. 2000. Aggregate Objective Functions and Pareto Frontiers: Required Relationships and Practical Implications. Optimization and Engineering 1, 2 (01 Jul 2000), 171–188. https://doi.org/10.1023/A:1010035730904
[8] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en