<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploiting Speech Recognition Transcripts for Narrative Peak Detection in Short-Form Documentaries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martha Larson</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bart Jochems</string-name>
          <email>fb.e.h.jochems@student</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ewine Smits</string-name>
          <email>e.a.p.smitsg@tudelft.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Human Media Interaction, University of Twente</institution>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Mediamatics, Delft University of Technology</institution>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Roeland Ordelman</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Narrative peaks are points at which the viewer perceives a spike in the level of dramatic tension within the narrative flow of a video. This paper reports on four approaches to narrative peak detection in television documentaries that were developed by a joint team consisting of members from Delft University of Technology and the University of Twente within the framework of the VideoCLEF 2009 Affect Detection task. The approaches make use of speech recognition transcripts and seek to exploit various sources of evidence in order to automatically identify narrative peaks. These sources include speaker style (word choice), stylistic devices (use of repetitions), strategies strengthening viewers' feelings of involvement (direct audience address) and emotional speech. These approaches are compared to a challenging baseline that predicts the presence of narrative peaks at fixed points in the video, presumed to be dictated by natural narrative rhythm or production convention. Two approaches are tied in delivering top narrative peak detection results. One uses counts of first and second person pronouns to identify points in the video where viewers feel most directly involved. The other uses affective word ratings to calculate scores reflecting emotional language.</p>
      </abstract>
      <kwd-group>
        <kwd>Spoken content</kwd>
        <kwd>Narrative</kwd>
        <kwd>Dramatic tension</kwd>
        <kwd>Speech recognition transcripts</kwd>
        <kwd>Short-form documentaries</kwd>
        <kwd>Dutch language</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        While watching video content, viewers feel fluctuations in their emotional response that can be attributed to
their perception of changes in the level of dramatic tension. In the literature on affective analysis of video,
two types of content have received particular attention: sports games and movies [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These two cases
differ with respect to the source of viewer-perceived dramatic tension. In the case of sports, tension spikes
arise as a result of the unpredictable interactions of the players within the rules and physical constraints of
the game. In the case of movies, dramatic tension is carefully crafted into the content by a team including
scriptwriters, performers, special effects experts, directors and producers. The difference between the two
cases is the amount and nature of human intention – i.e., premeditation, planning, intervention – involved
in the creation of the sequence of events that plays out over time (and space). We refer to that sequence
as a narrative and to high points in the dramatic tension within that narrative as narrative peaks. We are
interested in investigating a third case of video content, namely television documentaries. We consider
documentaries to be a form of “edu-tainment,” whose purpose is both to inform and entertain the audience.
The approaches described and tested here have been developed in order to detect narrative peaks within
documentary videos.
      </p>
      <p>Our work differs in an important respect from previous work in the domains of sports and movies.
Dramatic tension in documentaries is never completely spontaneous – the narrative curve follows a
previously laid out plan, for example a script or an outline, that is carried out during the process of production.
However, dramatic tension is characteristically less tightly controlled in a documentary than it would be
in a movie. In a movie, the entire content is subordinated to the plot, whereas a documentary may follow
one or more story lines, but it simultaneously pursues the goal of providing the viewer with factual subject
matter. Because of these differences, we chose to dedicate separate and specific attention to the affective
analysis of documentaries and in particular to the automatic detection of narrative peaks.</p>
      <p>This area of investigation is quite challenging since fluctuations in dramatic tension in television
documentaries are not associated with conventionalized events. If an event is a conventional trigger, a broad
spectrum of viewers will agree about its contribution to the drama of the video content – think of goals in
the game of soccer or a kiss in a romantic comedy. The subtleness with which narrative peaks manifest
themselves in video documentaries makes the task challenging with respect to the difficulty of both
automatically detecting such peaks and also evaluating the detection algorithm. Our interest is contextualized
within the broader goal of automatic prediction of topic-independent viewer preference. Given two videos
with comparable informational content, viewers will often choose to watch one over the other.
Our ultimate research aim is to explore the contribution that analysis of affective aspects of video content
can make to the automatic prediction of viewer preference.</p>
      <p>This paper reports on joint work carried out by research groups at two universities in the Netherlands,
Delft University of Technology1 and the University of Twente, on the Affect Detection task of the
VideoCLEF2 track of the 2009 Cross-Language Evaluation Forum (CLEF)3 benchmark evaluations. The Affect
Detection task involves automatically identifying narrative peaks in short-form documentaries. In the rest
of this paper, we first give a brief description of the data and the task. Then, we present the approach that
we took to the task and give the details of the algorithms used in each of the five runs that we submitted.
We report the results achieved by these runs and then conclude with a summary and outlook.
</p>
    </sec>
    <sec id="sec-2">
      <title>Experimental Setup</title>
      <sec id="sec-2-1">
        <title>Data Set and Task Definition</title>
        <p>The data set for the VideoCLEF 2009 Affect Detection task consisted of 45 episodes from the
Dutch-language short-form documentary series called Beeldenstorm (in English, ‘Iconoclasm’). The series treats
topics in the visual arts, integrating elements from history, culture and current events. Beeldenstorm is
hosted by Prof. Henk van Os, known not only for his art expertise, but also for his narrative ability. Henk
van Os is highly acclaimed and appreciated in the Netherlands, where he has established his ability to
appeal to a broad audience.4</p>
        <p>Constraining the corpus to contain episodes from Beeldenstorm limits the spoken content to a single
speaker speaking within the style of a single documentary series. This limitation is imposed in order to help
control effects that could be introduced by variability in style or skill. Experimentation on the ability of
algorithms to transfer performance to other domains is planned for future years. An additional advantage
of using the Beeldenstorm series is that the episodes are relatively short, approximately eight minutes in
length. Because they are short, the assessors who create the ground truth for the test collection are able to
watch each video in its entirety. In short, the Beeldenstorm program provides a highly suitable corpus for
developing and evaluating algorithms for narrative peak detection.
1Delft University of Technology and Dublin City University are the coordinators of VideoCLEF
2http://www.cdvp.dcu.ie/VideoCLEF/
3http://www.clef-campaign.org/
4http://www.avro.nl/tv/programmas az/beeldenstorm/</p>
        <p>Ground truth was created for the Beeldenstorm episodes by a team of assessors who speak Dutch natively or at
an advanced level. The assessors were told that the Beeldenstorm series is known to contain humorous and
moving moments and told that they could use that information to formulate an opinion of what constitutes
a narrative peak. They were asked to mark the three points in the video where their perception of the level
of dramatic tension reached the highest peaks. Peaks were required to be a maximum of ten seconds in
length.</p>
        <p>
          For the Affect Detection task of VideoCLEF 2009, task participants were supplied with an example set
containing five Beeldenstorm episodes in which example narrative peaks had been identified by a human
assessor. On the basis of their observations and generalizations concerning the peaks marked in the
example set, the task participants designed algorithms capable of automatically detecting similar peaks in the test
set. The test set contained 45 videos and was mutually exclusive with the example set. Participants were
required to identify the three highest peaks in each episode. Up to five different runs (i.e., system outputs
created according to different experimental conditions) could be submitted. Further details about the data
set and the Affect Detection task for VideoCLEF 2009 can be found in the track overview paper [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
Participants were provided with additional resources accompanying the test data, including transcripts generated
by an automatic speech recognition system [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Our approaches, described in the next section, focus on
exploiting the contents of the speech transcripts for the purpose of automatically detecting narrative peaks.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Narrative Peak Detection Approaches</title>
        <p>Our approaches consist of a sophisticated baseline and four other techniques for using speech recognition
transcripts to automatically detect narrative peaks. We describe each algorithm in turn.
</p>
        <sec id="sec-2-2-1">
          <title>Fixing Time Points (duotu09fix)</title>
          <p>Our baseline approach duotu09fix5 hypothesizes fixed time points for three narrative peaks in each episode.
These points are completely independent of episode content and are the same for every episode. This
approach attempts to exploit regularities that exist in the narrative structure of every episode of a documentary
series as a result of production conventions or of general documentary structure (i.e., a documentary
consists of an opening, a body and a conclusion). We chose this approach in order to establish a challenging
baseline against which our speech-transcript-based peak detection algorithms can be compared. In order
to choose the three fixed time points we analyzed the peak positions in the example set. In the examples,
the midpoint of the first peak occurred between 28 secs and 1 min 6 secs after the start of the video. The
midpoint of the final peak occurred between 6 mins 42 secs and 7 mins 40 secs into the video. We fixed
a peak at the average position of the initial peak (44 secs) and the final peak (7 mins 9 secs). We added a
third located at the average midpoint of the episode: 3 mins 40 secs. The fact that four of the five example
episodes have a peak within 10 seconds of this point confirmed that we had made a good choice for the
third fixed point peak.
</p>
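          <p>A minimal sketch of how the fixed times can be derived from the example annotations follows; the example-set midpoints in the code are hypothetical values inside the ranges reported above, not the actual annotations.

```python
# Sketch of the duotu09fix baseline: the same three peak times are
# hypothesized for every episode. The midpoint lists below are hypothetical
# values within the ranges reported in the text; the averages used in the
# paper are 44 secs, 3 mins 40 secs and 7 mins 9 secs.

def average(values):
    return sum(values) / len(values)

first_peak_midpoints = [28, 35, 44, 52, 66]       # between 0:28 and 1:06
final_peak_midpoints = [402, 415, 430, 440, 460]  # between 6:42 and 7:40

fixed_peaks_secs = [
    average(first_peak_midpoints),  # first fixed peak
    3 * 60 + 40,                    # average episode midpoint (3 mins 40 secs)
    average(final_peak_midpoints),  # final fixed peak
]
```
</p>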
        </sec>
        <sec id="sec-2-2-2">
          <title>Counting Indicator Words (duotu09ind)</title>
          <p>We viewed the example videos and examined the words that were spoken during the narrative peaks that
the assessor had marked in these videos. We formulated the hypothesis that the speaker applies a narrow
range of strategies for creating narrative peaks in the documentary. These strategies might be reflected in
a relatively limited vocabulary of words that could be used as indicators in order to predict the position of
narrative peaks.</p>
          <p>We compiled a list of narrative peak indicators by analyzing the words spoken during each of the
example peaks and compiled a list of words and word-stems that seemed relatively independent of the
topic at the point in the video and which could be plausibly characteristic of the general word use of the
speaker during peaks. The indicator words selected are listed in Table 1. It is noteworthy that most of these
words are adjectives or adverbs and that they have a basic positive or negative meaning, or they serve as
an intensifier. The word maar, ‘but’, appears to be an exception to this generalization. We included this
word because it occurred in 20% of the peaks in the example set. ‘But’ is a lexical item frequently used to
indicate a contrast with a previously established state of knowledge or expectations. We hypothesize that its
importance is related to the suspense introduced by statements that contrast with established knowledge or
viewer expectations.
5duotu is an acronym indicating the combined efforts of Delft University of Technology and the University of Twente</p>
          <p>The duotu09ind algorithm detects narrative peaks using the following sequence of steps. First, a set
of all possible peak candidates was established by moving a 10-second sliding window over the speech
recognition transcripts, advancing the window by one word at each step. Each peak candidate is maximally
10 seconds in length, but can be shorter if the speech in the window lasts for less than the 10-second
duration of the window. Peak candidates of less than three seconds in length are discarded. Then, the peak
candidates are ranked with respect to the raw count of the indicator words (cf. Table 1) that they contain.
The size limitation of the sliding window already introduces a normalizing effect and for this reason we do
not undertake further normalization of the raw counts. Finally, peak candidates are chosen from the ranked
list, starting at the top, until a total of three peaks has been selected. If a candidate has a midpoint that falls
within eight seconds of the midpoint of a previously selected candidate occurring in the list, that candidate
is discarded and the next candidate from the list is considered instead.
</p>
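          <p>The steps above can be sketched as follows; the transcript representation (word, start, end tuples) and the indicator entries shown are assumptions for illustration, with Table 1 listing the actual words.

```python
# Sketch of the duotu09ind pipeline over a speech recognition transcript,
# assumed here to be a list of (word, start_sec, end_sec) tuples.

INDICATORS = {"mooi", "heel", "maar"}  # hypothetical stand-ins for Table 1

def peak_candidates(transcript, window=10.0, min_len=3.0):
    """Slide a 10-second window over the transcript, one word at a time."""
    cands = []
    for i in range(len(transcript)):
        start = transcript[i][1]
        inside = [t for t in transcript[i:] if t[2] <= start + window]
        if inside and inside[-1][2] - start >= min_len:  # drop spans < 3 secs
            cands.append((start, inside[-1][2], [t[0] for t in inside]))
    return cands

def select_top(ranked, n=3, min_gap=8.0):
    """Pick peaks from the ranked list, skipping any candidate whose midpoint
    falls within eight seconds of an already selected candidate's midpoint."""
    chosen = []
    for start, end, words in ranked:
        mid = (start + end) / 2
        if all(abs(mid - (s + e) / 2) > min_gap for s, e, _ in chosen):
            chosen.append((start, end, words))
            if len(chosen) == n:
                break
    return chosen

def duotu09ind(transcript):
    cands = peak_candidates(transcript)
    ranked = sorted(cands, key=lambda c: sum(w in INDICATORS for w in c[2]),
                    reverse=True)  # rank by raw indicator count, unnormalized
    return select_top(ranked)
```
</p>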
        </sec>
        <sec id="sec-2-2-3">
          <title>Counting Word Repetitions (duotu09rep)</title>
          <p>Analysis of the word distributions in the example set suggested that repetition may be a stylistic device that
is deployed to create peaks. Particular examples of the use of repetition during narrative peaks in the
example episodes include, ...kunsttulpen mooie kunsttulpen... (‘...artificial-tulips, beautiful artificial-tulips...’),6
...het is werkelijk een ervaring, een ervaring van stilte... (‘...it is really an experience, an experience of
tranquility...’),7 and ...wordt belangrijk, is altijd belangrijk geweest... ( ‘...will be important, has always
been important...’).8 We do not attempt to measure repetition of phrases or of morphologically related
words, but rather assume that counting repeated word forms will yield an adequate indicator of places in
the documentary where repetition is being applied as a stylistic device.
6from Beeldenstorm episode Tulpomanie, ‘Tulip mania’
7from Beeldenstorm episode Rust bij Rothko, ‘Peace with Rothko’
8from Beeldenstorm episode Maria Magdalena, ‘Mary Magdalene’</p>
          <p>The duotu09rep algorithm uses the same list of peak candidates described in the previous section in
the explanation of duotu09ind. The peak candidates are ranked by the number of occurrences they contain
of words that occur multiple times. In order to eliminate the impact of function words, stop word removal
is performed before the peak candidates are scored. Three peaks are selected starting from the top of the
ranked list of peak candidates, using the same procedure as was described above.
</p>
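          <p>A sketch of the repetition score, assuming simple surface word-form matching; the stop list shown is a small illustrative sample rather than the one actually used.

```python
# Sketch of the duotu09rep candidate score: remove stop words, then count
# every occurrence of a word form that appears more than once in the window.
from collections import Counter

STOP_WORDS = {"de", "het", "een", "en", "van", "is"}  # illustrative sample

def repetition_score(words):
    content = [w for w in words if w not in STOP_WORDS]
    counts = Counter(content)
    return sum(n for n in counts.values() if n > 1)
```

On the example above, repetition_score("het is werkelijk een ervaring een ervaring van stilte".split()) counts the two occurrences of ervaring and returns 2.
</p>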
        </sec>
        <sec id="sec-2-2-4">
          <title>Counting First and Second Person Pronouns (duotu09pro)</title>
          <p>We conjecture that dramatic tension rises along with the level to which the viewers feel that they are
directly involved in the video content they are watching. The duotu09pro approach identifies two possible
conditions of heightened viewer involvement: first, when viewers feel that the speaker in the video is
addressing them directly as individuals, or, second, when viewers feel that the speaker is sharing something
personal. Although we do not examine this aspect more closely here, it is possible that the importance of
personal connection or personal revelation in documentary video is related to the fact that viewers perceive
it to be a relatively rare event, which triggers them to sit up and take notice.</p>
          <p>In the duotu09pro approach we use second person pronominal forms (e.g., u, ‘you’; uw ‘your’) to
identify audience directed speech and first person pronominal forms (e.g., ik, ‘I’) to identify personal revelation
of the speaker. Notice that first person plural forms (e.g., wij ‘we’) might actually be correlated with either
case, serving generally to draw the audience into the narrative. Cases of narrative peaks that support the
viability of this approach occur in the example set, e.g., ...ziet u hoe diep de tulp in ons nationale volksziel is
ingedrongen... (‘...you see how deeply the tulip has penetrated our national consciousness...’).9 In the case
of Beeldenstorm, second person informal pronominal forms (e.g., je, ‘you, your’) should also be attributed
this general role, since they are used as impersonal pronouns to describe the thoughts and actions
of a hypothetical person, rather than the viewer directly. This point is illustrated by the following narrative
peak from the example set ...en als je nou naar Amsterdam gaat, naar het Museum Willet-Holthuysen, kijk,
daar heb je wat ik ‘total design’ zou willen noemen... (‘...and if you (informal) go to Amsterdam to the
Willet-Holthuysen Museum, that’s where you’ll (informal) find what I call total design.’)10 Dutch usage
conventions prevent Prof. van Os from addressing his audience using the informal, although it must also
be kept in mind that his ability to stretch conventions is part of his narrative talent.</p>
          <p>The duotu09pro algorithm uses the same list of peak candidates and the same method of choosing from
the ranked candidate lists that was used in duotu09ind and duotu09rep. For duotu09pro, the candidates
are ranked according to the raw count of first and second person pronominal forms that they contain.
Again, no normalization was applied to the raw count. It should also be noted that in this case no stop
word removal was applied since first and second person pronouns are themselves function words and are
included in standard formulations of stop word lists.
</p>
        </sec>
        <sec id="sec-2-2-5">
          <title>Calculating Affective Ratings (duotu09rat)</title>
          <p>
            Our final approach to narrative peak detection is based on the hypothesis that dramatic tension rises when
the speaker in the video uses speech made vivid by emotion. We conjecture that narrative peaks contain
more emotion than other parts of the narrative. Human speech is an important conduit for the
communication of emotions. Although emotion can be conveyed by prosodic variation, including changes in loudness,
pitch and speed, emotion is also conveyed by the choice of lexical items. People tend to use specific words
to express their emotions because there is a conventionalized relationship between certain words and
certain emotions. In the field of psychology, one way of establishing the connection between word forms and
emotions is to ask subjects to list the English words that describe specific emotions [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ].
          </p>
          <p>
            The duotu09rat approach uses an affective rating score that is calculated in a straightforward manner
using known affective levels of words in order to identify narrative peaks. The approach makes use of
Whissell’s Dictionary of Affect in Language as deployed in the implementation of [
            <xref ref-type="bibr" rid="ref5">5</xref>
], which is available
online.11 This dictionary of words and scores focuses on the scales of pleasantness and arousal levels. The
scales are alternately called evaluation and activation. Dietz and Lang [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] transformed these two scales into
the two-dimensional emotion space depicted in Figure 1. Under our approach, narrative peaks are identified
with a high-arousal emotion combined with either a very pleasant or a very unpleasant emotion. In order to score
words, we combine the evaluation and activation scores into an overall affective word score. In order
to apply the dictionary, we first translate the Dutch-language speech recognition transcripts into English
using the Google Language API.12
9from Beeldenstorm episode Tulpomanie, ‘Tulip mania’
10from Beeldenstorm episode Leven met kunst, ‘Living with art’
          </p>
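          <p>The text above does not pin down the exact formula combining evaluation and activation; one plausible reading, which rewards high arousal paired with strongly pleasant or strongly unpleasant valence, can be sketched as follows. The neutral midpoint of 2.0 and the distance-from-neutral product are assumptions, not the authors' formula.

```python
# Hypothetical combination of dictionary ratings into one affective word
# score; ratings are assumed to lie on scales centred on a neutral 2.0.

NEUTRAL = 2.0  # assumed scale midpoint

def affective_word_score(evaluation, activation):
    # strong (un)pleasantness and high arousal both raise the score
    return abs(evaluation - NEUTRAL) * max(activation - NEUTRAL, 0.0)

def candidate_score(ratings):
    """Average over the candidate's words found in the dictionary;
    out-of-vocabulary words are simply absent from `ratings`."""
    scores = [affective_word_score(e, a) for e, a in ratings]
    return sum(scores) / len(scores) if scores else 0.0
```
</p>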
          <p>The duotu09rat algorithm uses the same list of peak candidates used in duotu09ind, duotu09rep and
duotu09pro. Candidates are ranked according to the average affective word score of the words that they
contain. Words that are not contained in the dictionary are excluded from the calculation. Selection of
peaks proceeds as in the other approaches, with the exception that the peak proximity condition was
set to be more stringent: the edges of peaks are required to be at least 4 secs apart from each other. The imposition
of the more stringent condition reflects an incidental difference in the experimental set-up and does not
represent an optimized value.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental Results</title>
      <p>We tested our five experimental approaches on the 45 videos in the test set. Evaluation of results was carried
out by comparing the peak positions hypothesized by each experimental system with peak positions that
were set by human assessors. In total, three assessors viewed each of the test videos and set peaks at the
three points where he or she felt most highly affected by narrative tension created by the video content. In
total the assessors identified 293 distinct narrative peaks in the 45 test episodes. Peaks identified by different
assessors were considered to be the same peak if they overlapped by at least two seconds. This value was
set on the basis of observations by the assessors on characteristic distances between peaks. Overlapping
peaks were merged by fitting the overlapped region with a ten second window. This process was applied
so that merged peaks could never exceed the specified peak length of ten seconds.</p>
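      <p>The merging rule can be sketched as follows; centring the ten-second window on the overlapped region is an assumption about how the window was fitted.

```python
# Sketch of the ground-truth merging rule: peaks from different assessors
# that overlap by at least two seconds are treated as the same peak, and the
# merged peak is re-fitted with a ten-second window centred on the overlapped
# region, so merged peaks never exceed the ten-second peak length.
# Peaks are (start_sec, end_sec) pairs.

MIN_OVERLAP, PEAK_LEN = 2.0, 10.0

def overlap(a, b):
    return min(a[1], b[1]) - max(a[0], b[0])

def refit(a, b):
    lo, hi = max(a[0], b[0]), min(a[1], b[1])  # overlapped region
    mid = (lo + hi) / 2
    return (mid - PEAK_LEN / 2, mid + PEAK_LEN / 2)

def merge_peaks(peaks):
    merged = []
    for p in sorted(peaks):
        if merged and overlap(merged[-1], p) >= MIN_OVERLAP:
            merged[-1] = refit(merged[-1], p)
        else:
            merged.append(p)
    return merged
```
</p>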
      <p>Two methods of scoring the experiments were applied, the point-based approach and the peak-based
approach. Under point-based scoring, a peak hypothesis scores a point for each assessor who selected a
reference peak that is within eight seconds of that hypothesis peak. The total number of points returned
by the run is the reported run score. A single episode can earn a run between three points (assessors chose
completely different peaks) and nine points (assessors all chose the same peaks). In reality, however, no episode
falls at either of these extremes. The distribution of the peaks in the files is such that a perfect run
would earn 246 points. Under peak-based scoring, the total number of correct peaks is reported as the run
score. Three different types of reference peaks are defined for peak-based scoring. The difference is related
to the number of assessors required to agree for a point in the video to be counted as a peak. Of these 293
total peaks identified, 203 peaks are “personal peaks” (peaks identified by only one assessor), 90 are “pair
peaks” (peaks that are identified by at least two assessors) and 22 are “general peaks” (peaks upon which
all three assessors agreed). Peak-based scores are reported separately for each of these types of peaks. A
summary of the results of the evaluation of our five approaches, under the point-based measure and the
peak-based measures for personal, pair and general peaks, is given in Table 2.
11http://technology.calumet.purdue.edu/met/gneff/Publications/ica02/affectdictionary.html
12http://code.google.com/intl/nl/apis/ajaxlanguage/</p>
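      <p>Point-based scoring can be sketched as follows, reducing peaks to their midpoints for simplicity (an assumption; the task works with ten-second spans).

```python
# Sketch of point-based scoring: a hypothesis peak earns one point per
# assessor who set a reference peak within eight seconds of it. References
# are (assessor_id, midpoint_sec) pairs; hypotheses are midpoints.

TOLERANCE = 8.0

def point_score(hypothesis_mids, references):
    points = 0
    for h in hypothesis_mids:
        # a set of assessor ids, so each assessor counts at most once per peak
        points += len({a for a, r in references if abs(r - h) <= TOLERANCE})
    return points
```
</p>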
      <p>From these results it can be seen that duotu09pro, the approach that counted first and second person
pronouns, and duotu09rat, the approach that made use of affective word scores are the best performing
approaches. The approach relying on a list of peak indicator words, i.e., duotu09ind, performed surprisingly
well considering that the list was formulated on the basis of a very limited number of examples.</p>
      <p>It should be kept in mind, that the performance of a random classifier on the narrative peak detection
task reaches a relatively high level since the videos are relatively short. Via simulation we calculated that an
approach that randomly picks points at which to hypothesize three peaks in a file will automatically score,
on average, approximately 40 points under the point-based scoring method. Under the peak-based method
it would score on average 28 correct “personal peaks”, nine correct “pair peaks” and two correct “general
peaks.” In light of these statistics, the approach duotu09rep, which counted use of repeated words, deserves
further comment. This approach failed to achieve the performance level of the random baseline detector,
which indicates that repetitions, as they are counted by our implementation of the algorithm, actually are
a negative indicator for the existence of a peak. We believe that this result may be due to the fact that
assessors tend not to set peaks at places where there might be disfluencies or unintentional repetitions. The
speech recognition transcripts contain a high level of noise and it is conceivable that this noise contributes
to creating word repetitions where none existed in the original speech. Such an effect could further prevent
stylistic repetition from being effectively exploited for the purpose of peak detection.
</p>
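      <p>The random-baseline estimate can be reproduced in outline with a small simulation; the episode length and the synthetic reference peaks below are assumptions, so the figure of roughly 40 points only emerges when the real ground truth is used.

```python
# Monte Carlo sketch of the random-baseline estimate: hypothesize three
# uniformly random peak midpoints per episode and score them point-based
# against per-episode reference midpoints.
import random

EPISODE_LEN, TOLERANCE = 480.0, 8.0  # ~8-minute episodes

def random_baseline_score(reference_sets, trials=1000, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        for refs in reference_sets:   # one list of reference midpoints per episode
            hyps = [rng.uniform(0.0, EPISODE_LEN) for _ in range(3)]
            total += sum(1 for h in hyps for r in refs
                         if abs(h - r) <= TOLERANCE)
    return total / trials  # expected points per run
```
</p>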
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Outlook</title>
      <p>We have proposed five approaches to the automatic detection of narrative peaks in short-form
documentaries and have evaluated these approaches within the framework of the VideoCLEF 2009 Affect Detection
task, which uses a test set consisting of episodes from the Dutch-language documentary series on the visual arts
called Beeldenstorm. Our proposed approaches exploit speech recognition transcripts. The two most
successful algorithms are based on the idea that narrative peaks are perceived where particularly emotional
speech is being used (duotu09rat) or when the viewer feels specifically addressed by or involved in the
video (duotu09pro). These two approaches easily beat both the random baseline and also a challenging
baseline approach hypothesizing narrative peaks at set positions in the video. Approaches based on
capturing speaking style, either by using a set of indicator words typical for the speaker, or by trying to determine
where repetition is being used as a stylistic device, proved less helpful. However, the experiments reported
here are not extensive enough to exclude the possibility that they would perform well given a different
implementation.</p>
      <p>Future work will involve returning to many of the questions opened here. For example, while selecting
peak-indicator words, we noticed that contrasts introduced by the word ‘but’ appear often to be associated
with narrative peaks. Stylistic devices in addition to repetition, for example, use of questions, could also
prove to be helpful. Under our approach, peak candidates are represented by their spoken content. We
would also like to investigate the enrichment of the representations of peak candidates using words derived
from surrounding regions in the speech transcripts or from an appropriate external text collection. Finally,
we intend to develop peak detection methods based on the combination of information sources, in
particular, exploring whether pronoun occurrence information can enhance affect-based
rating.
</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>The research leading to these results was carried out to a substantial degree within the PetaMedia Network
of Excellence and has received funding from the European Commission’s 7th Framework Program under
grant agreement no. 216444.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Richard</given-names>
            <surname>Dietz</surname>
          </string-name>
          and
          <string-name>
            <given-names>Annie</given-names>
            <surname>Lang</surname>
          </string-name>
          .
          <article-title>Affective agents: Effects of agent affect on arousal, attention, liking and learning</article-title>
          .
          <source>In Proceedings of the Third Annual Cognitive Technology Conference</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Alan</given-names>
            <surname>Hanjalic</surname>
          </string-name>
          and
          <string-name>
            <given-names>Li-Qun</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <article-title>Affective video content representation and modeling</article-title>
          . IEEE Transactions on Multimedia,
          <volume>7</volume>
          (
          <issue>1</issue>
          ):
          <fpage>143</fpage>
          -
          <lpage>154</lpage>
          , Feb.
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Marijn</given-names>
            <surname>Huijbregts</surname>
          </string-name>
          , Roeland Ordelman, and Franciska de Jong.
          <article-title>Annotation of heterogeneous multimedia content using automatic speech recognition</article-title>
          .
          <source>In Proceedings of SAMT</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Martha</given-names>
            <surname>Larson</surname>
          </string-name>
          , Eamonn Newman, and
          <string-name>
            <given-names>Gareth</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>Overview of VideoCLEF 2009: New perspectives on speech-based multimedia content enrichment</article-title>
          . In Francesca Borri, Alessandro Nardi, and Carol Peters, editors,
          <source>Working Notes of CLEF 2009</source>
          ,
          <year>September 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Gregory</given-names>
            <surname>Neff</surname>
          </string-name>
          , Bonita Neff, and
          <string-name>
            <given-names>Paul</given-names>
            <surname>Crandon</surname>
          </string-name>
          .
          <article-title>Assessing the affective aspect of languaging: the development of software for public relations</article-title>
          .
          <source>In The 52nd Annual Conference of the International Communication Association</source>
          ,
          <year>July 2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Robert</given-names>
            <surname>Plutchik</surname>
          </string-name>
          .
          <source>The Psychology and Biology of Emotion</source>
          . New York: HarperCollins,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>