<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of VideoCLEF 2009: New Perspectives on Speech-based Multimedia Content Enrichment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martha Larson</string-name>
          <email>m.a.larson@tudelft.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eamonn Newman</string-name>
          <email>eamonn.newman@computing.dcu.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gareth J. F. Jones</string-name>
          <email>gareth.jones@computing.dcu.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Digital Video Processing, Dublin City University</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Mediamatics, Delft University of Technology</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>VideoCLEF 2009 offered three tasks related to enriching video content for improved multimedia access in a multilingual environment. For each task, video data (Dutch-language television, predominantly documentaries) accompanied by speech recognition transcripts were provided. The Subject Classification Task involved automatic tagging of videos with subject theme labels. The best performance was achieved by approaching subject tagging as an information retrieval task and using both speech recognition transcripts and archival metadata. Alternatively, classifiers were trained using either the training data provided or data collected from Wikipedia or via general Web search. The Affect Task involved detecting narrative peaks, defined as points where viewers perceive heightened dramatic tension. The task was carried out on the “Beeldenstorm” collection containing 45 short-form documentaries on the visual arts. The best runs exploited affective vocabulary and audience-directed speech. Other approaches included using topic changes, elevated speaking pitch, increased speaking intensity and radical visual changes. The Linking Task, also called “Finding Related Resources Across Languages,” involved linking video to material on the same subject in a different language. Participants were provided with a list of multimedia anchors (short video segments) in the Dutch-language “Beeldenstorm” collection and were expected to return target pages drawn from English-language Wikipedia. The best performing methods used the transcript of the speech spoken during the multimedia anchor to build a query to search an index of the Dutch-language Wikipedia. The Dutch Wikipedia pages returned were used to identify related English pages. Participants also experimented with pseudo-relevance feedback, query translation and methods that targeted proper names.</p>
      </abstract>
      <kwd-group>
        <kwd>H.3 [Information Storage and Retrieval]</kwd>
        <kwd>H.3.1 Content Analysis and Indexing</kwd>
        <kwd>H.3.3 Information Search and Retrieval</kwd>
        <kwd>H.3.4 Systems and Software</kwd>
        <kwd>H.3.7 Digital Libraries</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>VideoCLEF 2009 (http://www.cdvp.dcu.ie/VideoCLEF) is a track of the CLEF (http://www.clef-campaign.org/) benchmark campaign and is devoted to tasks aimed at
improving access to video content in multilingual environments. VideoCLEF develops new video-retrieval-related
tasks and data sets with which to evaluate these tasks. During VideoCLEF 2009, three tasks were carried
out. The Subject Classification Task required participants to automatically tag videos with subject theme
labels (e.g., ‘factories,’ ‘physics,’ ‘poverty,’ ‘cultural identity’ and ‘zoos’). The Affect Task, also called
“Narrative peak detection,” involved automatically detecting dramatic tension in short-form documentaries.
Finally, “Finding Related Resources Across Languages,” referred to as the Linking Task, required
participants to automatically link video to Web content that is in a different language, but on the same subject. The
data sets for these tasks contained Dutch-language television content supplied by the Netherlands Institute
for Sound and Vision (http://www.beeldengeluid.nl; in Dutch, Beeld &amp; Geluid), which is one of the largest audio/video archives in
Europe. Each participating site had access to video data, speech recognition transcripts, shot boundaries,
shot-level keyframes and archival metadata supplied by VideoCLEF. Sites developed their own approaches
to the tasks and were allowed to choose the methods and features that they found most appropriate. Seven
groups submitted task results for evaluation.</p>
      <p>
        In 2009, the VideoCLEF track ran for the first time as a full track within the Cross-Language
Evaluation Forum (CLEF) evaluation campaign. The track was piloted last year as VideoCLEF 2008 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The
VideoCLEF track is successor to the Cross-Language Speech Retrieval (CL-SR) track, which ran at CLEF
from 2005 to 2007 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. VideoCLEF seeks to extend the results of CL-SR to the broader challenge of video
retrieval. VideoCLEF is intended to complement the TRECVid benchmark [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] by running tasks related to
the subject matter treated by video and emphasizing the importance of spoken content (via speech
recognition transcripts). TRECVid has traditionally focused on what is depicted in the visual channel. In contrast,
VideoCLEF concentrates on what is described in a video, in other words, what a video is about.
      </p>
      <p>This paper describes the data sets and the tasks of VideoCLEF 2009 and summarizes the results
achieved by the participating sites. We finish with a conclusion and an outlook for VideoCLEF 2010.
For additional information concerning individual approaches used in 2009, please refer to the working
notes papers of the individual sites.
</p>
    </sec>
    <sec id="sec-2">
      <title>Data</title>
      <p>VideoCLEF 2009 used two data sets both containing Dutch-language television programs. Note that these
programs are predominantly documentaries with the addition of some talk shows. This means that the data
contains a great deal of conversational speech, including opinionated and subjective speech and speech that
has been only loosely planned. In this way, the VideoCLEF data is different from, and more challenging than,
broadcast news data, which largely involves scripted speech.</p>
      <p>
        The VideoCLEF 2009 Subject Classification Task ran on TRECVid 2007/2008 data from Beeld &amp;
Geluid. The Affect Task and Linking Task both ran on a data set containing material from the short-form
documentary Beeldenstorm, also supplied by Beeld &amp; Geluid. For both data sets, Dutch-language speech
recognition transcripts were supplied by the University of Twente [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The shot segmentation and the
shot-level keyframe data were provided by Dublin City University [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Further details are given in the
following.
      </p>
      <sec id="sec-2-1">
        <title>TRECVid 2007/2008 data set</title>
        <p>In 2009, VideoCLEF attempted to encourage cross-over from the TRECVid community by recycling the
TRECVid data set for the Subject Classification Task. Notice that the Subject Classification Task is
fundamentally different from the tasks that ran at TRECVid in 2007 and 2008. Subject Classification involves
automatically assigning subject labels to videos at the episode level. The subject matter of the entire video
is important, not just the concepts visible in the visual channel and not just the shot-level topic.</p>
        <p>Classifying video, i.e., taking a video and assigning it a topic class subject label, is exactly what the
archive staff does at Sound and Vision when they annotate video material that is to be stored in the archive.
The class labels used for the VideoCLEF 2009 Subject Classification Task are a subset of labels that are
used by archive staff. As a result, we (1) have gold standard topic class labels with which to evaluate
classification, and (2) can be relatively certain that if these labels are already used for retrieval of material from
the archive, then they are relevant for video search in an archive setting and, we assume, beyond. Original
Dutch-language examples of subject labels can be examined in the archive search engine at http://zoeken.beeldengeluid.nl.</p>
        <p>In the VideoCLEF 2009 Subject Classification Task, archivist-assigned subject labels were used as
ground truth. The training set is a large subset of TRECVid 2007 and contains 212 videos. The test set
is a large subset of TRECVid 2008 and contains 206 videos. Each video is an individual episode of a
television show. Their length varies widely with the average length being around 30 minutes. Participants
were also free to collect their own training data, if they wished. Note that the VideoCLEF 2009 Subject
Classification set excludes several videos in the TRECVid collection for which archival metadata was not
available.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Beeldenstorm data set</title>
        <p>For both the Affect Task and the Linking Task a data set consisting of 45 episodes of the documentary
series Beeldenstorm (Eng. Iconoclasm) was used. The Beeldenstorm series consists of short-form
Dutch-language video documentaries about the visual arts. Each episode lasts approximately eight minutes.
Beeldenstorm is hosted by Prof. Henk van Os, known and widely appreciated, not only for his art
expertise, but also for his narrative ability. This data set is also supplied by Beeld &amp; Geluid, but it is mutually
exclusive with the TRECVid 2007/2008 data set. The narrative ability of Prof. van Os makes the
Beeldenstorm set an interesting corpus to use for affect detection and the domain of visual arts offers a wide number
of possibilities for interesting multimedia links for the linking task. Finally, the fact that each episode is
short makes it possible for assessors to watch the entire episode when creating the ground truth. Knowledge
of the complete context is important for relevance judgments for cross-language related resources and also
for defining narrative peaks.</p>
        <p>
          The ground truth for the Affect Task and Linking Task was created by a team of three Dutch-speaking
assessors during a nine-day assessment and annotation event at Dublin City University referred to as Dublin
Days. The videos were annotated with the ground truth with the support of the Anvil (http://www.anvil-software.de/) Video Annotation
Research Tool [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Anvil makes it possible to generate frame-accurate video annotations in a graphic
interface. Particularly important for our purposes was the support offered by Anvil for user-defined annotation
schemes. Details of the ground truth creation are included in the discussions of the individual tasks in the
following section.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Task</title>
      <sec id="sec-3-1">
        <title>Subject Classification Task</title>
        <p>The goal of the Subject Classification Task is automatic subject tagging. Theme-based subject tags are
assigned automatically to videos. The purpose of these tags is to make the videos findable to users that are
searching and browsing the collection. The information needs (i.e., queries) of the users are not specified
at the time of tagging.</p>
        <p>In total, 46 labels were used: aanslagen (attacks), armoede (poverty), burgeroorlogen (civil wars), criminaliteit (crime),
culturele identiteit (cultural identity), dagelijks leven (daily life), dieren (animals), dierentuinen (zoos), economie (economy), etnische
minderheden (ethnic minorities), fabrieken (factories), families (families), gehandicapten (disabled), geneeskunde (medicine),
geneesmiddelen (pharmaceutical drug), genocide (genocide), geschiedenis (history), gezinnen (families), havens (harbors), hersenen
(brain), illegalen (undocumented immigrants), journalisten (journalist), kinderen (children), landschappen (landscapes), media
(media), militairen (military personnel), musea (museums), muziek (music), natuur (nature), natuurkunde (physics), ouderen (seniors),
pers (press), politiek (politics), processen (lawsuits), rechtszittingen (court hearings), reizen (travel), taal (language),
verkiezingen (elections), verkiezingscampagnes (electoral campaigns), voedsel (food), voetbal (soccer), vogels (birds), vrouwen (women),
wederopbouw (reconstruction), wetenschappelijk onderzoek (scientific research), ziekenhuizen (hospitals).</p>
        <p>
In VideoCLEF 2009, the Subject Classification Task had the specific goal of
reproducing the subject labels that were hand assigned to the test set videos by archivists at Beeld &amp; Geluid.
Since these subject labels are currently in use to archive and retrieve video in the setting of a large archive,
we are confident in their usefulness for search and browsing in real-world information retrieval scenarios.
The Subject Classification Task was introduced during the VideoCLEF 2008 pilot [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In 2009, the number
of videos in the collection was increased from 50 to 418 and the number of subject labels increased from
10 to 46.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>The Subject Classification Task is evaluated using Mean Average Precision (MAP). This choice of score is
motivated by the popularity of techniques that approach the subject tagging task as an information retrieval
problem. These techniques return, for each subject label, a ranked list of videos that should receive that
label. MAP is calculated by taking the mean of the Average Precision over all subject labels. For each
subject label, precision scores are calculated by moving down the results list and calculating precision at
each position where a relevant document is retrieved. Average Precision is calculated by taking the average
of the precision at each position. Calculations were performed using version 8.1 of the trec_eval
scoring package (http://trec.nist.gov/trec_eval).</p>
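      <p>The Average Precision and MAP computation described above can be sketched in a few lines of Python (an illustrative sketch: the video identifiers and relevance judgments below are invented, and official scoring used trec_eval):</p>

```python
def average_precision(ranked_videos, relevant):
    """Average Precision for one subject label: the mean of the precision
    values at each rank where a relevant video is retrieved."""
    hits, precisions = 0, []
    for rank, vid in enumerate(ranked_videos, start=1):
        if vid in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: the mean of Average Precision over all subject labels.
    `runs` maps label -> (ranked video list, set of relevant video ids)."""
    scores = [average_precision(ranked, rel) for ranked, rel in runs.values()]
    return sum(scores) / len(scores)

# Hypothetical run with two subject labels:
runs = {
    "physics": (["v1", "v2", "v3"], {"v1", "v3"}),
    "zoos":    (["v4", "v5"], {"v5"}),
}
print(round(mean_average_precision(runs), 4))  # AP = 5/6 and 1/2, so MAP = 2/3
```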
    </sec>
    <sec id="sec-5">
      <title>Techniques</title>
      <p>Computer Science, Chemnitz University of Technology, Germany: The task was treated as an
information retrieval task. The test set was indexed using an information retrieval system and was queried using
the subject labels as queries. Documents returned as relevant to a given subject label were tagged with that
label. The number of documents receiving a given label was controlled by a threshold. The submitted runs
varied with respect to whether or not the archival metadata was indexed in addition to the speech
recognition transcripts. They also varied with respect to whether expansion was applied to the class label (i.e., the
query). Expansion was performed by augmenting the original query with the most frequent term occurring
in the top five documents returned by an initial retrieval round. If fewer than two documents were returned,
queries were expanded using a thesaurus.</p>
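      <p>The expansion heuristic can be sketched as follows (a simplified illustration with invented document strings; naive whitespace tokenization is assumed and the thesaurus fallback is omitted):</p>

```python
from collections import Counter

def expand_query(query, retrieved_docs):
    """Augment the query with the single most frequent term occurring in
    the top five documents returned by an initial retrieval round."""
    query_terms = set(query.lower().split())
    counts = Counter()
    for doc in retrieved_docs[:5]:
        counts.update(t for t in doc.lower().split() if t not in query_terms)
    if counts:
        best_term, _ = counts.most_common(1)[0]
        return query + " " + best_term
    return query  # nothing to expand with

print(expand_query("physics", ["quantum physics lecture quantum", "quantum theory"]))
# -> "physics quantum"
```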
    </sec>
    <sec id="sec-6">
      <title>SINAI Research Group, University of Jaén, Spain</title>
      <p>The SINAI (Sistemas Inteligentes de Acceso a la Información) group approached the task as a
categorization problem, training SVMs using the training data provided. One run, SINAI svm nometadata,
extracted feature vectors from the speech transcripts alone and one run, SINAI svm withmetadata,
made use of both speech recognition transcripts and metadata.</p>
      <sec id="sec-6-1">
        <title>Computer Science, Alexandru Ioan Cuza University, Romania</title>
        <p>Classifiers were trained using data collected from Wikipedia or via general Web search. Results are
not reported here, however, since the submitted runs were not carried out on the current test data set.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Results</title>
      <p>
        The MAP scores of the runs are reported in Table 1. The results confirm the viability of techniques
that approach the Subject Classification Task as an information retrieval task. Such techniques proved
useful in VideoCLEF 2008 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and also provide the best results in 2009 where the size of the collection and
the label set increased. Also, consistent with VideoCLEF 2008 observations, performance is better when
archival metadata is used in addition to speech recognition transcripts. Finally, after VideoCLEF 2008, we
decided to provide a training data set of speech transcripts generated from same-domain video in 2009, to see
whether training classifiers on data from the same domain as the test data would improve performance.
[Table 1. MAP scores for the runs cut1 sc asr baseline, cut2 sc asr expanded, cut3 sc asr meta baseline,
cut4 sc asr meta expanded, cut5 sc asr meta expanded, SINAI svm nometadata and SINAI svm withmetadata.]
The results of the runs submitted this year suggest that training classifiers on speech transcripts of
same-domain video does not provide significantly better performance than exploiting Web data to support an
information-retrieval-based approach.
      </p>
      <p>There is general awareness shared by VideoCLEF participants that although MAP is a useful tool, it
may not be the ideal evaluation metric for this task. The reader can refer to the working notes papers of
the individual participants for discussion. The ultimate goal of subject tagging is to generate a set of tags
for each video that will allow users to find that video while searching or browsing. The utility of a tag
assigned to a given video is therefore not entirely independent of the other tags assigned. Under the current
formulation of the task, the presence or absence of the tag is the only information that is of use to the
searcher. The ranking of a video in a list of videos that are assigned the same tag is for this reason not
directly relevant to the utility of that tag for the user.
</p>
      <sec id="sec-7-1">
        <title>Affect Task</title>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Task</title>
      <p>The goal of the Affect Task at VideoCLEF 2009 was to automatically detect narrative peaks in
documentaries. Narrative peaks were defined to be those places in a video where viewers report feeling a heightened
emotional effect due to dramatic tension. This task was new in 2009. The ultimate aim of the Affect Task is
to move beyond the information content of the video and to analyze the video with respect to characteristics
that are important for viewers, but not related to the video topic.</p>
      <p>
        Narrative peak detection builds on and extends work in affective analysis of video content carried out
in the areas of sports and movies, cf. e.g., [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Viewers perceive an affective peak in sports videos due to
tension arising from the spontaneous interaction of players within the constraints of the physical world and
the rules and conventions of the game. Viewers perceive an affective peak in a movie due to the action or
the plot line, which is carefully planned by the script writer and the filmmaker.
      </p>
      <p>Narrative peaks in documentaries are a new domain in that they do not fall into either category.
Documentaries convey information and often have storylines, but don’t have the all-dominating plot trajectory
of a movie. Documentaries often include extemporaneous narrative or interviews, and therefore also have
a spontaneous component. The affective curve experienced by a viewer watching a documentary can be
expected to be relatively subtly modulated.</p>
      <p>
        It is important to differentiate narrative peak detection from other cases of affect detection, such as
hotspot detection in meetings. Hotspots are moments during meetings where people are highly involved in
the discussion [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Hotspots can be self-reported by meeting participants or annotated in meeting video by
viewers. In either case, it is the participant and not the viewer whose affective reaction is being detected.
      </p>
      <p>We chose the Beeldenstorm series for the narrative peak detection task in order to make the task
as simple and straightforward as possible in its initial year. Beeldenstorm features a single speaker, the
host Prof. van Os, and covers a topical domain, the visual arts, that is rich enough to be interesting, yet
is relatively constrained. These characteristics help us to control for the effects of personal style of the
host and of viewer familiarity with topic in the affect and appeal task. Further, as mentioned above, the
fact that the documentaries are short makes it possible for annotators to watch them in their entirety when
annotating narrative peaks.
</p>
    </sec>
    <sec id="sec-9">
      <title>Evaluation</title>
      <p>For the purposes of evaluation, as mentioned above, three Dutch speakers annotated the Beeldenstorm
collection by each identifying the three top narrative peaks in each video. Annotators were asked to mark
the peaks where they felt the dramatic tension reached its highest level. They were not supplied with
an explicit definition of a narrative peak. Instead, all annotators needed to form independent opinions of
where they perceived narrative peaks. In order to make the task less abstract, they were supplied with the
information that the Beeldenstorm series is associated with humorous and moving moments. They were
told that they could use that information to formulate their notion of what constitutes a narrative peak.
Peaks were required to be a maximum of ten seconds in length.</p>
      <p>Although the annotators did not consult with each other about specific peaks, the team did engage in
discussion during the definition process. The discussion ensured that there was underlying consensus about
the approach to the task. In particular, it was necessary to check that annotators understood that a peak must
be a high point in the storyline as measured by their perceptions of their own emotional reaction. Dramatic
objects or facts in the spoken or visual content that were not part of the storyline as it was created by the
narrator/producer were not considered narrative peaks. Regions in the video where the annotator guessed
that the speaker or producer had intended there to be a peak, but where the annotator did not feel any
dramatic tension were not considered to be peaks. An example of this would be a joke that the annotator
did not understand completely.</p>
      <p>The first two episodes for which the annotators defined peaks were discarded in order to assure that the
annotators’ perception of a narrative peak had stabilized. This warm-up exercise was particularly important
in light of the fact that at the end of the annotation effort, assessors reported that it was necessary to become
familiar with the style and allow an affinity for the series to develop before they started to feel an emotional
reaction to narrative peaks in the video.</p>
      <p>The peaks identified by the assessors were considered to be a reflection of underlying “true” peaks in
the narrative of the video. We assumed that the variation between assessors is the result of noise due to
effects such as personal idiosyncrasies. In order to generate a ground truth most highly reflective of “true”
peaks, the peaks identified by the assessors were merged. The assessment team consisted of three members
who each identified three peaks in 45 videos for a total of 405 peaks. The assessors were able to give
a rough estimate of the minimum distance between peaks and on the basis of their observations, it was
decided to consider two peaks that overlapped by at least two seconds to be realizations of the same peak.
After merging the peaks, 292 of the 405 peaks turned out to be distinct. The merging process was carried
out by fitting a 10 second window to overlapping assessor peaks in order to ensure that merged peaks could
never exceed the specified peak length of 10 seconds.</p>
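      <p>The merging procedure can be sketched as follows (a simplified sketch assuming peaks are given as (start, end) times in seconds; the exact window-fitting used by the organizers may differ):</p>

```python
def merge_peaks(peaks, min_overlap=2.0, max_len=10.0):
    """Merge assessor peaks that overlap by at least `min_overlap`
    seconds into a single reference peak, capping the merged peak
    at `max_len` seconds."""
    merged = []
    for start, end in sorted(peaks):
        if merged and start <= merged[-1][1] - min_overlap:
            # Overlaps the previous merged peak by >= min_overlap seconds.
            m_start = merged[-1][0]
            m_end = min(max(merged[-1][1], end), m_start + max_len)
            merged[-1] = (m_start, m_end)
        else:
            merged.append((start, end))
    return merged

print(merge_peaks([(10.0, 20.0), (17.0, 27.0), (40.0, 50.0)]))
# -> [(10.0, 20.0), (40.0, 50.0)]
```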
      <p>Evaluation involved the application of two scoring methods, the point-based approach and the
peak-based approach. Under point-based scoring, the peaks chosen by each assessor are assessed without
merging. A hypothesized peak receives a point in every case in which it falls within eight seconds of an assessor
peak. The run score is the total number of points earned by all peak hypotheses in the run. A single
episode can earn a run between three points (assessors chose completely different peaks) and nine points
(assessors all chose the same peaks). There are no episodes in the set that fall at either of these extremes.
The distribution of the peaks in the files is such that a perfect run would earn 246 points. Under peak-based
scoring, a hypothesis is counted as correct if it falls within an eight-second window of a peak representing a
merger of assessor annotations. Three different types of merged reference peaks are defined for peak-based
scoring. Three different peak-based scores are reported that differ in the number of assessors required to
agree in order for a region in the video to be considered a peak. Of the 293 total peaks identified, 203 peaks
are “personal peaks” (peaks identified by only one assessor), 90 are “pair peaks” (peaks identified
by at least two assessors) and 22 are “general peaks” (peaks upon which all three assessors agreed).</p>
      <p>Narrative peak detection techniques were developed that used the visual channel, the audio channel and the
speech recognition transcript. Each group took a different approach.</p>
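      <p>The point-based scoring scheme described above can be sketched as follows (an illustrative simplification that represents peaks by their center times; the hypothesis and assessor times are invented):</p>

```python
def point_score(hypotheses, assessor_peaks, window=8.0):
    """Point-based scoring: a hypothesized peak time earns one point for
    every assessor peak it falls within `window` seconds of."""
    return sum(
        1
        for hyp in hypotheses
        for peak in assessor_peaks
        if abs(hyp - peak) <= window
    )

# Hypothetical episode: three hypotheses, three assessors with three peaks each.
hyps = [30.0, 120.0, 400.0]
assessor = [28.0, 125.0, 300.0, 33.0, 126.0, 299.0, 60.0, 124.0, 301.0]
print(point_score(hyps, assessor))  # -> 5
```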
      <sec id="sec-9-1">
        <title>Computer Science, Alexandru Ioan Cuza University, Romania</title>
        <p>Based on the hypothesis that speakers raise their voices at narrative peaks, three runs were
developed that made use of the intensity of the audio
signal. A score was computed for each group of words that involved a comparison of intensity means and
other statistics for sequential groups of words. The top three scoring points were hypothesized as peaks.</p>
      </sec>
      <sec id="sec-9-2">
        <title>Computer Vision and Multimedia Laboratory, University of Geneva, Switzerland</title>
        <p>The assumption was made that dramatic peaks correspond to the introduction of a new topic and thus to a change
in word use as reflected in the speech recognition transcripts. Additionally, the video and audio channel
effects assumed to be indicative of peaks were explored. Finally, a weighting was deployed that gave
more emphasis to positions at which peaks were expected to occur based on the distribution of peaks in
the development data. The weighting is used in unige-cvml1 , unige-cvml2 and unige-cvml3 .
Run unige-cvml1 uses text features alone. Run unige-cvml3 uses text plus elevated speaker pitch.
Run unige-cvml2 uses text, elevated pitch and quick changes in the video. Run unige-cvml4 uses
text only and no weighting. Run unige-cvml5 sets peaks randomly to provide a random baseline for
comparison.</p>
        <p>Delft University of Technology and University of Twente, Netherlands: Only features extracted from
the speech transcripts were exploited. Run duotu09fix predicted peaks at fixed points chosen by
analyzing the development data. Run duotu09ind used indicator words as cues of narrative peaks.
Indicator words were chosen by analyzing the development data. Run duotu09rep applied the assumption
that word repetition, reflecting the use of an important rhetorical device, would indicate a peak. Run
duotu09pro used pronouns as indicators of audience directed speech and assumed that high pronoun
densities would correspond to points where viewers feel maximum involvement. Run duotu09rat
exploited the affective scores of words, building on the hypothesis that use of affective speech characterizes
narrative peaks.
</p>
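        <p>The pronoun-density idea behind duotu09pro can be sketched as follows (a hedged illustration: the pronoun list is an invented subset of Dutch audience-directed pronouns, and the window size is arbitrary):</p>

```python
# Illustrative subset of Dutch pronouns taken as cues of audience-directed speech.
PRONOUNS = {"u", "je", "jij", "jullie", "we", "wij"}

def pronoun_peaks(words, window=20, n_peaks=3):
    """Score each window of transcript words by the density of
    audience-directed pronouns and return the start indices of the
    top-scoring windows as hypothesized narrative peaks."""
    scores = []
    for i in range(max(1, len(words) - window + 1)):
        segment = words[i:i + window]
        density = sum(w.lower() in PRONOUNS for w in segment) / window
        scores.append((density, i))
    scores.sort(reverse=True)
    return [i for _, i in scores[:n_peaks]]
```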
      </sec>
    </sec>
    <sec id="sec-10">
      <title>Results</title>
      <p>The results of the task are reported in Table 2. The results make clear that it is quite challenging to
effectively support the detection of narrative peaks using audio and video features. Recall that unige-cvml5
is a randomly generated run. Most runs failed to yield results appreciably better than this random baseline.
The best scoring approaches exploited the speech recognition transcripts, in particular, the occurrence of
pronouns reflecting audience-directed speech and the use of words with high affective ratings.</p>
      <p>Because of the newness of the Narrative Peak Detection Task, the method of scoring is still a subject of
discussion. The scoring method was designed such that algorithms were given as much credit as possible
for agreement between the peaks they hypothesized and the peaks chosen by the annotators. See the
working notes papers of individual participants for some additional discussion.
</p>
      <sec id="sec-10-1">
        <title>Linking Task</title>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>Task</title>
      <p>The Linking Task, also called “Finding Related Resources Across Languages,” involves linking episodes of
the Beeldenstorm documentary (Dutch language) to Wikipedia articles about related subject matter (English
language). This task was new in 2009. Participants were supplied with 165 multimedia anchors, short
(ca. 10 seconds) segments, pre-defined in the 45 episodes that make up the Beeldenstorm collection. For
each anchor, participants were asked to automatically generate a list of English language Wikipedia pages
relevant to the anchor, ordered from the most to the least relevant.</p>
      <p>[Table 2. Scores for the runs duotu09fix, duotu09ind, duotu09rep, duotu09pro, duotu09rat,
unige-cvml1, unige-cvml2, unige-cvml3, unige-cvml4, unige-cvml5, uaic-run1, uaic-run2 and uaic-run3.]</p>
      <p>Notice that this task is designed such that it goes beyond a named-entity linking task. Although a
multimedia anchor may contain a named entity (e.g., a person, place or organization) that is mentioned in
the speech channel, this is not always the case. The topic being discussed in the video at the point of the
anchor may not be explicitly named. Also, the representation of a topic in the video may be split between
the visual and the speech channel.</p>
    </sec>
    <sec id="sec-12">
      <title>Evaluation</title>
      <p>
        The ground truth for the linking task was created by the assessors. We adapted the four graded relevance
levels used in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for application in the Linking Task. Level 3 links are referred to as primary links and are
defined as “highly relevant – the page is the single page most relevant for supporting understanding of the
video in the region of the anchor.” There is only a single primary link per multimedia anchor representing
the one best page to which that anchor can be linked. Level 2 links are referred to as secondary links and
are defined as “fairly relevant – the page treats a subtopic (aspects) of the video in the region of the anchor.”
The final two levels, Level 1 (defined as “marginally relevant – the page is not appropriate for the anchor”)
and Level 0 (defined as “irrelevant – the page is unrelated to the anchor”), were conflated and regarded as
irrelevant. Links classified as Level 1 are generic links, e.g., “painting,” or links involving a specific word
that is mentioned but is not really central to the topic of the video at that point.
      </p>
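The conflation described above can be sketched as a one-line rule (an illustration only, not the assessment tooling used by the task organizers):

```python
# Map the four graded relevance levels to the binary judgment used in
# scoring: Level 3 (primary) and Level 2 (secondary) count as relevant;
# Level 1 (marginal) and Level 0 (irrelevant) are conflated as irrelevant.
def is_relevant(level: int) -> bool:
    return level >= 2

print([is_relevant(level) for level in (3, 2, 1, 0)])
# [True, True, False, False]
```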
      <p>Primary link evaluation For each anchor, the primary link was defined by consensus among three
assessors. The assessors were required to watch the entire episode so as to have the context needed to decide the
primary link. Primary links were evaluated using recall (correct links/total links) and Mean Reciprocal
Rank (MRR).</p>
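The two primary-link measures can be sketched as follows (an illustrative reimplementation, not the official evaluation script; the anchor-to-result data structures are hypothetical stand-ins):

```python
def primary_link_scores(runs, primary_links):
    """Score ranked result lists against a single primary link per anchor.

    runs: anchor id -> ranked list of page titles returned by a system.
    primary_links: anchor id -> the one primary link in the ground truth.
    Returns (recall, mean reciprocal rank) averaged over all anchors.
    """
    found = 0
    rr_sum = 0.0
    for anchor, ranked in runs.items():
        target = primary_links[anchor]
        if target in ranked:
            found += 1
            rr_sum += 1.0 / (ranked.index(target) + 1)  # ranks are 1-based
    n = len(runs)
    return found / n, rr_sum / n

# Toy data: the primary link for a1 is found at rank 2; a2's is missed.
runs = {"a1": ["Painting", "Rembrandt"], "a2": ["Tulip", "Delft", "Vermeer"]}
truth = {"a1": "Rembrandt", "a2": "Van Gogh"}
print(primary_link_scores(runs, truth))  # (0.5, 0.25)
```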
      <p>Related resource evaluation For each anchor, a set of related resources was defined. This set necessarily
includes the primary link. It also includes other secondary links that the assessors found relevant. Only
one assessor needed to find a secondary link relevant for it to be included; however, the assessors agreed
on the general criteria to be applied when choosing a secondary link. Related resources were evaluated with
MRR. The list of secondary links is not exhaustive; for this reason, no recall score is reported.</p>
      <sec id="sec-12-1">
        <title>Approaches</title>
        <p>Centre for Digital Video Processing, Dublin City University, Ireland The words spoken between the
start point and the end point of the multimedia anchor (as transcribed in the speech recognition transcript)
were used as a query and fired off against an index of Wikipedia. For dcu run1 and dcu run2 the
Dutch Wikipedia was queried and the corresponding English page was returned. Stemming was applied in
dcu run2. Dutch pages did not always have corresponding English pages. For dcu run3, the query was
first translated and then fired off against an English-language Wikipedia index. For dcu run4, a Dutch query
expanded using pseudo-relevance feedback was used.</p>
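A rough sketch of this transcript-driven linking strategy follows. The transcript, the Dutch Wikipedia search function and the interlanguage-link table are all hypothetical stand-ins, not the actual DCU components:

```python
def words_in_anchor(transcript, start, end):
    """Collect transcript words whose timestamps fall inside the anchor."""
    return [word for (time, word) in transcript if start <= time <= end]

def link_anchor(transcript, start, end, search_nl, nl_to_en):
    """Query Dutch Wikipedia with the anchor's speech, then follow
    interlanguage links to the corresponding English pages."""
    query = " ".join(words_in_anchor(transcript, start, end))
    ranked_nl = search_nl(query)  # ranked Dutch Wikipedia page titles
    # Keep only pages that have a corresponding English page.
    return [nl_to_en[page] for page in ranked_nl if page in nl_to_en]

# Toy data standing in for the transcript, the search index and the links.
transcript = [(1.0, "schilderij"), (2.5, "Rembrandt"), (12.0, "museum")]
nl_to_en = {"Rembrandt van Rijn": "Rembrandt"}
search_nl = lambda q: ["Rembrandt van Rijn", "Nachtwacht"] if "Rembrandt" in q else []

print(link_anchor(transcript, 0.0, 10.0, search_nl, nl_to_en))  # ['Rembrandt']
```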
        <p>TNO Information and Communication Technology, Netherlands A set of existing approaches was
combined in order to implement a sophisticated baseline to provide a starting point for future research. A
wikify tool was used to find links in the Dutch speech recognition transcripts and in English translations
of the transcripts. Particular attention was given to proper names, with one strategy giving preference to
links to articles with proper-name titles and another ensuring that proper-name information was
preserved under translation.</p>
      </sec>
    </sec>
    <sec id="sec-13">
      <title>Results</title>
      <p>The results of the task are reported in Table 3 (primary link evaluation) and Table 4 (related resource
evaluation). The best run used a combination of different strategies, referred to by TNO as a “cocktail.” The
techniques applied by DCU achieved a lower overall score, but proved more robust
across queries. Details can be found in the working notes papers.</p>
      <sec id="sec-13-1">
        <title>Conclusions and Outlook</title>
        <p>In 2009, VideoCLEF participants carried out three tasks, Subject Classification, Narrative Peak Detection
and Finding Related Resources Across Languages. These tasks generate enrichment for spoken content
that can be used to improve multimedia access and retrieval.</p>
        <p>With the exception of the Narrative Peak Detection Task, participants concentrated largely on features
derived from the speech recognition transcripts and did not exploit other audio information or information
derived from the visual channel. Looking towards next year, we will continue to suggest that participants
use a wider range of features. We plan to keep up our efforts to encourage cross-over from the TRECVid
community, for example, by recycling the TRECVid data set for the Subject Classification Task.</p>
        <p>We see the Subject Classification Task as developing increasingly towards a tag recommendation task,
where systems are required to assign tags to videos. The tag set might not necessarily be known in advance.
We expect that the formulation of this task as an information retrieval task will continue to prove useful
and helpful, although we wish to move to metrics for evaluation that will better reflect the utility of the
assigned tags in a real-world search or browsing situation.</p>
        <p>In 2010, we intend to continue working with the collections from Beeld &amp; Geluid, but also to add an
additional data set of social video. Using this data set, we hope to offer a task that will allow participants
to make use of social information, i.e., friendship relationships between users in an online community, in
order to improve video retrieval.</p>
      </sec>
      <sec id="sec-13-2">
        <title>Acknowledgements</title>
        <p>We are grateful to TrebleCLEF, a Coordination Action of the European Commission’s Seventh Framework
Programme, for a grant that made possible the creation of a data set for the Narrative Peak Detection
Task and the Linking Task. Thank you to the University of Twente for supplying the speech recognition
transcripts and to the Netherlands Institute for Sound and Vision for supplying the video. Thank you to
Dublin City University for providing the shot segmentation and keyframes and also for hosting the team of
Dutch-speaking video assessors during the Dublin Days event. We would also like to express our appreciation to
Michael Kipp for use of the Anvil Video Annotation Research Tool. The work that went into VideoCLEF
2009 has been supported, in part, by the PetaMedia Network of Excellence and has received funding from the
European Commission’s Seventh Framework Programme under grant agreement no. 216444.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Calic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Izquierdo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marlow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Murphy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>O'Connor</surname>
          </string-name>
          .
          <article-title>Temporal video segmentation for real-time key frame extraction</article-title>
          .
          <source>In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanjalic</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.-Q.</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <article-title>Affective video content representation and modeling</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          ,
          <volume>7</volume>
          (
          <issue>1</issue>
          ):
          <fpage>143</fpage>
          -
          <lpage>154</lpage>
          , Feb.
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Huijbregts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>de Jong</surname>
          </string-name>
          .
          <article-title>Annotation of heterogeneous multimedia content using automatic speech recognition</article-title>
          .
          <source>In Proceedings of the International Conference on Semantic and Digital Media Technologies (SAMT)</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kekäläinen</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Järvelin</surname>
          </string-name>
          .
          <article-title>Using graded relevance assessments in IR evaluation</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          ,
          <volume>53</volume>
          (
          <issue>13</issue>
          ):
          <fpage>1120</fpage>
          -
          <lpage>1129</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kipp</surname>
          </string-name>
          .
          <article-title>Anvil - a generic annotation tool for multimodal dialogue</article-title>
          .
          <source>In Proceedings of Eurospeech</source>
          , pages
          <fpage>1367</fpage>
          -
          <lpage>1370</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Newman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>Overview of VideoCLEF 2008: Automatic generation of topic-based feeds for dual language audio-visual content</article-title>
          .
          <source>In Proceedings of the CLEF 2008 Workshop</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Pecina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hoffmannová</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          .
          <article-title>Overview of the CLEF 2007 cross-language speech retrieval track</article-title>
          .
          <source>In Proceedings of the CLEF 2007 Workshop</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Over</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Kraaij</surname>
          </string-name>
          .
          <article-title>Evaluation campaigns and TRECVid</article-title>
          .
          <source>In Proceedings of the ACM International Workshop on Multimedia Information Retrieval (MIR)</source>
          , pages
          <fpage>321</fpage>
          -
          <lpage>330</lpage>
          , New York, NY, USA,
          <year>2006</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wrede</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Shriberg</surname>
          </string-name>
          .
          <article-title>Spotting “hot spots” in meetings: Human judgments and prosodic cues</article-title>
          .
          <source>In Proceedings of Eurospeech</source>
          , pages
          <fpage>2805</fpage>
          -
          <lpage>2808</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>