Identification of Narrative Peaks in Clips: Text Features Perform Best

Joep J.M. Kierkels1, Mohammad Soleymani1, Thierry Pun1
1 University of Geneva, Computer Science Department, Battelle Building A, 7 Route de Drize, CH-1227 Carouge, Geneva, Switzerland
{Joep.Kierkels, Mohammad.Soleymani, Thierry.Pun}@Unige.ch

Abstract. A methodology is proposed to identify narrative peaks in video clips. Three basic clip properties are evaluated, reflecting video-, audio- and text-related features of the clip. Furthermore, the expected distribution of narrative peaks throughout a clip is determined and exploited for prediction. Results show that only the text-related feature, which is based on the usage of distinct words throughout the clip, and the expected peak distribution are of use in finding the peaks. On the training set, our best detector found narrative peaks with an accuracy of 47%. On the test set, this accuracy dropped to 24%.

Keywords: Feature detection, Video analysis, Attention.

1 Introduction

A challenging issue in content-based video analysis is the detection of sections that evoke increased levels of interest or attention in viewers of video clips or documentaries. Once such sections are detected, less relevant sections can be removed from a recording. This yields a summary of the clip that allows for faster browsing through the relevant sections, saving valuable time for any viewer who merely wants an overview of the clip. Past studies on highlight detection often focus on analyzing sports videos [1], in which highlights usually show specific patterns related to ball or player movement. Although clips usually contain audio, video and spoken text content, many existing approaches use only one of these modalities [2;3]. In this paper we present results for all three modalities and try to identify which modality is actually the most valuable for detecting segments that supposedly contain narrative peaks.

For our participation in the VideoCLEF 2009 subtask on “Affect and Appeal” [4], we propose a methodology to identify narrative peaks in video clips. The clips used in this subtask were all taken from the Dutch program “Beeldenstorm”. They were in Dutch, had durations between seven and nine minutes, consisted of video and audio, and came with speech transcripts. Detection accuracy was determined by comparison against manual annotations of narrative peaks provided by three (Dutch-speaking) annotators.

While viewing the clips, we failed to see any clear indication of which specific audiovisual features could be used to identify narrative peaks, even when looking at the provided annotations. Furthermore, we noticed that there was little consistency among the annotators, as more than three distinct narrative peaks were indicated for every clip. This led to our belief that tailoring a detection method to a single person's view of narrative peaks would not be fruitful, and hence we decided to work only with basic features. We expect these features to indicate narrative peaks that are common to most observers, including the annotators. Our approach to detecting peaks is a top-down search for relevant features, i.e., we first computed possibly relevant features and then established which of these features actually enhance detection accuracy. We treated three modalities separately:
• Video, taken from the available .mpg files, was used to determine at what points in the clip the frames showed the largest change compared to the preceding frame.
• Audio, taken from an mp3 conversion of the .mpg files, was used to determine at what points in the clip the speaker has an elevated pitch or an increased speech volume.
• Text, taken from the available .mp7 transcript files, was used to determine at what points in the clip the speaker introduces a new topic.

In addition to these features, we considered the expected distribution of narrative peaks over the clips. Details on how these steps were implemented are given in Section 2, followed by the results of our approach on the given training data in Section 3. These results are discussed in Section 4, and conclusions are drawn in Section 5.

In the VideoCLEF subtask, the focus of detecting segments of increased interest is on the data, i.e., we analyze parts of the shown video clip to predict their impact on a viewer. Although it is outside the scope of the subtask, it is worth mentioning that there exists a second approach to identifying segments of increased interest. This second approach focuses not on the data but directly on the reactions of a viewer, e.g., by monitoring his physiological activity such as heart rate [5] or by filming his facial expressions [6]. Based on such reactions, the affective state of a viewer can be estimated, along with levels of excitation, attention and interest [7]. Physiological activity measures can thus be used by themselves to estimate interest, but they could also be used to validate the outcomes of data-based techniques. Because the evaluation in the VideoCLEF task is performed against the explicit annotations provided by the three annotators, we did not include recordings of physiological activity in our VideoCLEF contribution.

2 Feature extraction

Feature extraction is described separately for each modality in the following subsections. As the detection of affective peaks is a largely unexplored topic, we decided to implement only basic features. This provides an initial idea of which features are useful; future studies could then focus on enhancing the relevant basic features. Feature extraction was implemented in Matlab.

2.1 Video features

Our key assumption for the video features was that dramatic tension is related to large changes in the video. It is a film editor's choice to include changes between frames [8], and we believe that this choice may be used to stress the importance of certain parts of the clip. Video is recorded at 25 frames/second. Because our narrative peak detector outputs a 10 s window of enhanced dramatic tension, this temporal resolution is unnecessarily fine and merely slows down computations; furthermore, changes between directly consecutive frames are often not very pronounced. Our treatment of the video frames therefore starts at frame 1 and subsequently jumps about 0.5 s ahead to frame 13, frame 25, and so on. All frames are converted to grayscale. The difference between two subsequently sampled frames is computed as

\Delta_{vid} = \sum_{n=1}^{N} \sum_{m=1}^{M} \left| F_{13}(n,m) - F_{1}(n,m) \right| ,    (1)

in which N and M are the width and the height of the frames and F_{13} denotes the matrix containing the pixel values of the 13th frame (the equation is shown for the first pair of sampled frames). Next, the change in Δvid over subsequent observations, denoted dΔvid, is determined and compared against a threshold value that reflects how much change in this quantity is observed when a scene changes. This threshold was determined on the training set and set to 5·10^5. If dΔvid is below the threshold, it is set to zero. As a final step, the resulting dΔvid is smoothed by averaging over a 10 s window, and the smoothed signal is scaled to have a maximum absolute value of one and subsequently shifted to have a mean of zero. It is then down-sampled by a factor of 2, resulting in the vector video, which contains one value per second, as illustrated in Fig. 1A.
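For concreteness, the following is a minimal sketch of this video feature. The original implementation was in Matlab; the Python/OpenCV code below, its function names, and the handling of the signed change dΔvid are our assumptions, while the 0.5 s sampling step, the 10 s smoothing window, the normalization and the 5·10^5 threshold are taken from the text.

```python
# Illustrative sketch of the Section 2.1 video feature (the paper's original
# implementation was in Matlab). Requires numpy and opencv-python.
import cv2
import numpy as np

FRAME_STEP = 12   # ~0.5 s at 25 frames/second: frames 1, 13, 25, ... (0, 12, 24 zero-based)
THRESHOLD = 5e5   # scene-change threshold reported in the paper (tuned on the training set)

def smooth_and_normalize(x, samples_per_second):
    # 10 s moving average, scale to a maximum absolute value of one,
    # shift to zero mean, then down-sample to one value per second.
    win = 10 * samples_per_second
    y = np.convolve(x, np.ones(win) / win, mode="same")
    peak = np.max(np.abs(y))
    if peak > 0:
        y = y / peak
    return (y - y.mean())[::samples_per_second]

def video_feature(mpg_path):
    cap = cv2.VideoCapture(mpg_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % FRAME_STEP == 0:   # keep roughly two frames per second
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float64))
        idx += 1
    cap.release()

    # Eq. (1): summed absolute grayscale difference between subsequently sampled frames.
    delta = np.array([np.abs(b - a).sum() for a, b in zip(frames[:-1], frames[1:])])

    # dDelta_vid: change over subsequent observations; values below the
    # scene-change threshold are set to zero (sign handling is our guess).
    d_delta = np.diff(delta)
    d_delta[d_delta < THRESHOLD] = 0.0

    return smooth_and_normalize(d_delta, samples_per_second=2)
```

The same smooth_and_normalize post-processing is reused in the sketches of the audio and text features below.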
[Figure 1 shows three panels of feature values (arbitrary score, y axis) over time in seconds (x axis).]
Figure 1. Illustration of single-modality feature values computed over time. A: Video feature, B: Audio features, C: Text feature. All panels are based on clip BG_37016.

2.2 Audio features

The key assumption for the audio features was that a speaker raises his pitch or increases his speech volume when applying dramatic tension, as suggested in [9;10]. Audio is recorded at 44.1 kHz. The audio signal is divided into 0.5 s segments, for each of which the average pitch of the speaker's voice is computed by imposing a Kaiser window and applying a Fast Fourier Transform. In the transformed signal, the frequency with maximum power is determined and is assumed to be the average pitch of the speaker's voice over this window. Next, the difference in average pitch between subsequent segments is computed. If a segment's average pitch is less than 2.5 times the pitch of the preceding segment, its pitch value is set to zero. This way, only segments with a strong increase in pitch (a supposed indicator of dramatic tension) are kept. Speech volume is determined by computing the average absolute value of the audio signal within each 0.5 s segment. As a final step, the resulting pitch and volume signals are both smoothed by averaging over a 10 s window, scaled to have a maximum absolute value of one, and subsequently shifted to have a mean of zero. They are then down-sampled by a factor of 2, resulting in the vectors audio1 and audio2, which both contain one value per second, as illustrated in Fig. 1B.

2.3 Text features

The main assumption for the text feature is that dramatic tension starts with the introduction of a new topic, and hence involves the introduction of new vocabulary related to this topic. Text transcripts are obtained from the available .mp7 files. A histogram was computed over all unique words occurring in the clip, counting their number of occurrences. Words that occurred only once were considered non-specific and were ignored; words that occurred more than five times were considered too general and were also ignored. The remaining set of words is considered topic specific. Based on this set, we estimated where the changes in the used vocabulary are largest. A vector v filled with zeros was initialized, with a length equal to the number of seconds in the clip. For each remaining word, its first and last appearance in the .mp7 file was determined and rounded to whole seconds, and all elements of v between the elements corresponding to these two timestamps were increased by one. Again, the resulting vector v was smoothed by averaging over a 10 s window, scaled, and shifted to zero mean. The resulting vector text is illustrated in Fig. 1C.
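Along the same lines, a minimal sketch of the pitch and volume features of Section 2.2 and the topic-word feature of Section 2.3 is given below. Decoding the mp3 to a mono sample array and parsing the .mp7 transcript into (word, time) pairs are assumed to happen elsewhere; the Kaiser window parameter and the helper names are likewise our assumptions.

```python
# Illustrative sketches of the Section 2.2 audio features and the Section 2.3
# text feature (again, the original implementation was in Matlab).
from collections import Counter
import numpy as np

def audio_features(signal, sr=44100, seg_s=0.5):
    """Pitch-jump (audio1) and volume (audio2) features from a mono signal."""
    seg = int(sr * seg_s)
    window = np.kaiser(seg, 14.0)   # Kaiser window; the beta value is not given in the paper
    pitch, volume = [], []
    for i in range(len(signal) // seg):
        chunk = signal[i * seg:(i + 1) * seg]
        spectrum = np.abs(np.fft.rfft(chunk * window)) ** 2
        freqs = np.fft.rfftfreq(seg, d=1.0 / sr)
        pitch.append(freqs[np.argmax(spectrum)])   # frequency with maximum power
        volume.append(np.mean(np.abs(chunk)))      # average absolute amplitude
    pitch, volume = np.array(pitch), np.array(volume)
    # Keep only segments whose pitch is at least 2.5 times that of the previous segment.
    jump = np.zeros_like(pitch)
    jump[1:] = np.where(pitch[1:] >= 2.5 * pitch[:-1], pitch[1:], 0.0)
    return smooth_and_normalize(jump, 2), smooth_and_normalize(volume, 2)

def text_feature(word_times, clip_len_s):
    """Per-second count of topic-specific words (2 to 5 occurrences) in use."""
    counts = Counter(w for w, _ in word_times)
    v = np.zeros(int(clip_len_s))
    for word, n in counts.items():
        if 1 < n <= 5:                             # ignore one-off and overly common words
            times = [int(round(t)) for w, t in word_times if w == word]
            v[min(times):max(times) + 1] += 1      # span from first to last occurrence
    return smooth_and_normalize(v, 1)              # already one sample per second

def smooth_and_normalize(x, samples_per_second):
    # Same post-processing as in the video sketch: 10 s moving average,
    # scale to max |value| = 1, zero mean, down-sample to one value per second.
    win = 10 * samples_per_second
    y = np.convolve(x, np.ones(win) / win, mode="same")
    peak = np.max(np.abs(y))
    if peak > 0:
        y = y / peak
    return (y - y.mean())[::samples_per_second]
```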
2.4 Distribution of narrative peaks

A clip is directed by a program director and is intended to hold the attention of the viewer. It is therefore expected that points of dramatic tension are spread over the duration of the whole clip, and that not all moments during a clip are equally likely to contain dramatic tension. For each dramatic tension point indicated by the annotators, its time of occurrence was determined as the mean of its start and stop timestamps, and a histogram of these occurrences was created, illustrated in Fig. 2. Based on this histogram, a weighting vector w was created for each recording. Vector w contains one element for each second of the clip, and each element's value is determined according to the histogram.

[Figure 2 shows a histogram of peak counts (y axis) over time in seconds (x axis).]
Figure 2. Histogram illustrating when dramatic tension points occur in the clips according to the annotators. Note that during the first several seconds there are no tension points at all.

2.5 Fusion and selection

For the fusion of the features, our approach simply gives equal importance to all used features. After fusion, the weights vector w is applied and the final indicator of dramatic tension, drama, is derived as (shown here for all three features)

drama = w^{T} \cdot \left( video + \frac{audio_1 + audio_2}{2} + text \right) .    (2)

The three estimated points of increased dramatic tension are then obtained by selecting the three maxima of drama. Our estimates for the three dramatic points are constructed by selecting the intervals starting 5 s before these peaks and ending 5 s after them. If the second or third highest point in drama lies within 10 s of the highest point, that point is ignored in order to avoid overlap between the detected segments of increased dramatic tension. In such cases, the next highest point is used instead (provided that this new point is not within 10 s of an already selected point either).
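A sketch of the fusion of Eq. (2) and of the subsequent peak selection is shown below. It reflects our reading of this section: the weighting by w is applied element-wise so that drama remains a per-second time series, and the non-overlap rule is implemented greedily; both details, as well as the function names, are our interpretation.

```python
# Sketch of the fusion step (Eq. 2) and the selection of three non-overlapping
# 10 s segments of maximal dramatic tension. All inputs hold one value per second.
import numpy as np

def fuse(video, audio1, audio2, text, w):
    # Equal importance for the three modalities; the two audio features are averaged.
    return w * (video + (audio1 + audio2) / 2.0 + text)

def select_peaks(drama, n_peaks=3, min_dist_s=10):
    """Greedily pick the highest values of drama, skipping candidates within
    10 s of an already chosen peak, and return (start, end) windows of +/- 5 s."""
    chosen = []
    for t in np.argsort(drama)[::-1]:            # candidate seconds, highest first
        if all(abs(int(t) - c) >= min_dist_s for c in chosen):
            chosen.append(int(t))
        if len(chosen) == n_peaks:
            break
    return [(max(t - 5, 0), t + 5) for t in sorted(chosen)]
```

For the single-feature schemes evaluated below, the unused inputs can simply be set to zero vectors.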
3 Evaluation schemes and Results

Different combinations of the derived features were formed and subsequently evaluated against the training data. The schemes we tested are listed in Table 1. If no weights are used (Scheme 8), the vector w contains only ones.

Table 1. Schemes for feature combinations.

Scheme  Used features        Weights
1       Video                Yes
2       Audio                Yes
3       Text                 Yes
4       Video, Audio         Yes
5       Video, Text          Yes
6       Audio, Text          Yes
7       Video, Audio, Text   Yes
8       Text                 No

Scoring of the evaluation results is based on agreement with the annotators' annotations. Each time a peak that we detected coincides with (at least) one annotator's annotation, a point is added. A maximum of three points can thus be scored per clip, and since there are five clips in the training set, the maximum score for any scheme is 15. The obtained scores are shown in Table 2.

Table 2. Results on the training set.

Scheme  BG_36941  BG_37007  BG_37016  BG_37036  BG_37111  Total
1       0         0         1         1         1         3
2       2         1         1         1         1         6
3       2         1         1         2         1         7
4       0         1         2         1         1         5
5       1         2         2         1         0         6
6       2         1         1         2         1         7
7       1         1         2         1         0         5
8       0         1         1         1         0         3

4 Discussion

As can be seen in Table 2, the best performing schemes are Scheme 3 and Scheme 6, which both result in 7 accurately predicted narrative peaks and hence in an accuracy of 47%. Both schemes include the text-based feature and the weights vector. Scheme 6 additionally contains the audio-based feature but does not gain any accuracy from this inclusion. Considering that there is also strong disagreement between the annotators, an accuracy of 47% (measured against the joint annotations of three annotators) shows the potential of the automated narrative peak detector.

The fact that the best performing scheme is based only on a text feature corresponds well to our initial observation that there is no clear audiovisual characteristic of a narrative peak when viewing the clips; all non-Dutch-speaking observers failed to see indicators of narrative peaks. The observation that narrative peaks seem to correspond to the introduction of new topics in the clip can only be made by an observer who understands the spoken content. We expect this observation to hold for clips in other languages as well.

For our contribution to VideoCLEF, we submitted five runs, mainly corresponding to some of the schemes in Table 1. The results of our runs on the test data, and their explanations, are given in Table 3. For run 5, all narrative peaks were selected randomly (for comparison). These runs were evaluated in two ways: Peak-based, similar to our scoring system on the training data, and Point-based, which can be explained as follows. If a peak that we detected coincides with the annotations of more than one annotator, multiple points are added. Hence, the maximum achievable score for a clip is nine when the annotators fully agree on the segments, and remains three when the annotators fully disagree. The difference between the two scoring systems is that the Point-based system awards more than one point to segments that were selected by more than one annotator. If annotators agree on segments of increased dramatic tension, there are (in total, over three annotators) fewer annotated segments, and hence the probability that our automated approach selects an annotated segment by chance decreases. Awarding more points for detecting these less probable segments therefore seems logical; moreover, a segment on which all annotators agree must be a truly relevant segment of increased tension. On the other hand, the Point-based approach awards the same number of points for correctly detecting just one segment in a clip (annotated by all three annotators) as for correctly detecting all three segments (each annotated by a single annotator). Considering that annotators may have different tastes and that one annotator could fully disagree with the other two, it is unsatisfying that a system that fully matches this one annotator's taste is rewarded no more than a system that predicts merely one correct segment. In this view, a 100% correspondence with a single human annotator should lead to an optimal score for that clip, since the program cannot be expected to outperform the people hired to evaluate it. Because our runs were selected based on the results obtained with the Peak-based scoring system, the results on the test data are mainly compared using this scoring.

Table 3. Results on the test set.

Run number  Scheme number  Score (Peak-based)^a  Score (Point-based)
1           3              33                    39
2           7              30                    41
3           6              33                    42
4           8              32                    43
5           --             32                    43
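For clarity, the two scoring schemes as we understand them are sketched below; the interval representation and the overlap test are our assumptions, and the official VideoCLEF Peak-based scoring additionally merges nearby annotator peaks (see the footnote below).

```python
# Sketch of the Peak-based and Point-based scoring described above. Detected
# peaks and annotated peaks are represented as (start, end) intervals in seconds.
def overlaps(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

def peak_based_score(detected, annotations):
    """One point per detected peak that coincides with at least one annotator
    (at most three points per clip)."""
    return sum(any(overlaps(d, seg) for annotator in annotations for seg in annotator)
               for d in detected)

def point_based_score(detected, annotations):
    """One point per (detected peak, annotator) pair that coincides
    (at most nine points per clip)."""
    return sum(any(overlaps(d, seg) for seg in annotator)
               for d in detected for annotator in annotations)

# Hypothetical example: three detected peaks, three annotators.
detected = [(50, 60), (200, 210), (400, 410)]
annotations = [[(48, 58), (300, 310)], [(52, 62), (198, 208)], [(405, 415)]]
print(peak_based_score(detected, annotations))   # 3
print(point_based_score(detected, annotations))  # 4
```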
First of all, it should be noted that the results are never far above random level, as can be seen by comparing against run 5. Surprisingly, the Peak-based and Point-based scores show distinctly different rankings of the runs: run 1 performed worst under the Point-based scoring, yet best under the Peak-based scoring system. Based on the results obtained on the training clips, runs 1 and 3 were expected to perform best. This is indeed reflected in the results obtained with the same evaluation method on the test clips, the Peak-based evaluation. With the Point-based scoring system, however, this effect disappears. This may indicate that the main feature we used, the text-based feature built on the introduction of a new topic, does not properly reflect the notion of dramatic tension for all annotators, but is biased towards a single annotator. In Fig. 3, the scores are shown when they are calculated based on the annotations of only a single annotator. Note that in this setting, the Point-based and Peak-based scoring systems are identical.

[Figure 3 shows five per-run histograms and one joint histogram of scores (y axis) per annotator 1-3 (x axis).]
Figure 3. Histograms illustrating scoring based on single annotators (x axis). The upper row of histograms shows the scores (y axis) for the different runs. The lower histogram shows the joint histogram of all runs except run 5, which involved random selection of segments.

The figure shows that scoring based only on annotator 3 performs considerably worse than scoring based only on annotator 1 or annotator 2. This indicates that our approach is biased towards the opinions of some annotators. Because our runs were selected from the schemes that performed best under the Peak-based metric, one should mainly compare against the Peak-based results on the test clips. Now that the exact scoring system eventually employed in the VideoCLEF evaluation is known, it would be advisable to also compute the results on the training set using the Point-based metric. Possibly, this would lead to a different ranking of the evaluated schemes, more similar to the ranked Point-based results reported by VideoCLEF. However, such a reflection on the training data was not performed because of the strict deadline for submitting the working notes (seven working days after the release of the results).

^a The Peak-based score reported here deviates slightly from the official Peak-based score reported by the VideoCLEF organizers. In the official score, nearby peaks of different annotators were merged to create a new 10 s segment. This implies that when an estimated peak is close to a peak indicated ONLY by annotator 2, the estimated peak scores points; however, if annotator 1 or 3 also selected a nearby peak, the interval in which points can be scored shifts, and the same estimated peak may no longer score a point. Since the second situation actually involves an estimated peak that is close to two annotated peaks, it seems contradictory to the authors not to award points in this case. For the training data, points were awarded under such circumstances, and hence we also awarded points to such peaks for the test data.

5 Conclusions

The subtask described in the VideoCLEF 2009 benchmark evaluation has proven to be a challenging and difficult one. Failing to see obvious features when viewing the clips, and seeing only a mild connection between new topics and dramatic tension peaks, we resorted to detecting the start of new topics in the text annotations of the provided video clips, complemented by some basic video- and audio-based features. In our initial evaluation on the training clips, the text-based feature proved to be the most relevant one, and hence our submitted evaluation runs were centred around this feature.
When a consistent evaluation of training and test clips is used, the text-based feature also led to our best results on the test data. The overall detection accuracy based on the text-based feature dropped from 47% correct detections on the training data to 24% on the test data; it should be noted that the results on the test data were only mildly above random level. The results based on the Point-based scoring differed strongly from those obtained using the scoring system employed on the training data. It was shown that this is probably caused by a bias of our method towards the annotations given by annotators 1 and 2. Given the challenging nature of the task, it is our strong belief that the indication that text-based features (related to the introduction of new topics) perform well is a valuable contribution in the search for an improved dramatic tension detector.

Acknowledgement

The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2011] under grant agreement n° 216444 (see Article II.30 of the Grant Agreement), NoE PetaMedia.

References

1. Liu, C.X., Huang, Q.M., Jiang, S.Q., Xing, L.Y., Ye, Q.X., Gao, W.: A framework for flexible summarization of racquet sports video using multiple modalities. Computer Vision and Image Understanding 113(3), 415--424 (2009)
2. Gao, Y., Wang, W.B., Yong, J.H., Gu, H.J.: Dynamic video summarization using two-level redundancy detection. Multimedia Tools and Applications 42(2), 233--250 (2009)
3. Otsuka, I., Nakane, K., Divakaran, A., Hatanaka, K., Ogawa, M.: A highlight scene detection and video summarization system using audio feature for a personal video recorder. IEEE Transactions on Consumer Electronics 51(1), 112--116 (2005)
4. VideoCLEF website, http://www.cdvp.dcu.ie/VideoCLEF/
5. Soleymani, M., Chanel, G., Kierkels, J.J.M., Pun, T.: Affective characterization of movie scenes based on multimedia content analysis and user's physiological emotional responses. In: IEEE International Symposium on Multimedia (2008)
6. Valstar, M.F., Gunes, H., Pantic, M.: How to distinguish posed from spontaneous smiles using geometric features. In: ACM International Conference on Multimodal Interfaces (ICMI'07), pp. 38--45 (2007)
7. Kierkels, J.J.M., Pun, T.: Towards detection of interest during movie scenes. In: PetaMedia Workshop on Implicit, Human-Centered Tagging (HCT'08), abstract only (2008)
8. May, J., Dean, M.P., Barnard, P.J.: Using film cutting techniques in interface design. Human-Computer Interaction 18(4), 325--372 (2003)
9. Alku, P., Vintturi, J., Vilkman, E.: Measuring the effect of fundamental frequency raising as a strategy for increasing vocal intensity in soft, normal and loud phonation. Speech Communication 38(3--4), 321--334 (2002)
10. Wennerstrom, A.: Intonation and evaluation in oral narratives. Journal of Pragmatics 33(8), 1183--1206 (2001)