=Paper=
{{Paper
|id=Vol-1184/paper11
|storemode=property
|title=Weaving the Web(VTT) of Data
|pdfUrl=https://ceur-ws.org/Vol-1184/ldow2014_paper_11.pdf
|volume=Vol-1184
|dblpUrl=https://dblp.org/rec/conf/www/SteinerMVCEP14
}}
==Weaving the Web(VTT) of Data==
https://ceur-ws.org/Vol-1184/ldow2014_paper_11.pdf
Weaving the Web(VTT) of Data

Thomas Steiner* (CNRS, Université de Lyon, LIRIS UMR5205, Université Lyon 1, France) – tsteiner@liris.cnrs.fr
Hannes Mühleisen (Database Architectures Group, CWI, Science Park 123, 1098 XG Amsterdam, NL) – hannes@cwi.nl
Ruben Verborgh (Multimedia Lab, Ghent University – iMinds, B-9050 Gent, Belgium) – ruben.verborgh@ugent.be
Pierre-Antoine Champin (CNRS, Université de Lyon, LIRIS UMR5205, Université Lyon 1, France) – pachampin@liris.cnrs.fr
Benoît Encelle (CNRS, Université de Lyon, LIRIS UMR5205, Université Lyon 1, France) – bencelle@liris.cnrs.fr
Yannick Prié (LINA – UMR 6241 CNRS, Université de Nantes, 44322 Nantes Cedex 3) – yannick.prie@univ-nantes.fr

*Second affiliation: Google Germany GmbH, Hamburg, DE
ABSTRACT
Video has become a first class citizen on the Web with broad support in all common Web browsers. Where with structured mark-up on webpages we have made the vision of the Web of Data a reality, in this paper we propose a new vision that we name the Web(VTT) of Data, alongside concrete steps to realize this vision. It is based on the evolving standards WebVTT for adding timed text tracks to videos and JSON-LD, a JSON-based format to serialize Linked Data. Just like the Web of Data that is based on the relationships among structured data, the Web(VTT) of Data is based on relationships among videos based on WebVTT files, which we use as Web-native spatiotemporal Linked Data containers with JSON-LD payloads. In a first step, we provide necessary background information on the technologies we use. In a second step, we perform a large-scale analysis of the 148 terabyte size Common Crawl corpus in order to get a better understanding of the status quo of Web video deployment and address the challenge of integrating the detected videos in the Common Crawl corpus into the Web(VTT) of Data. In a third step, we open-source an online video annotation creation and consumption tool, targeted at videos not contained in the Common Crawl corpus and at integrating future video creations, allowing for weaving the Web(VTT) of Data tighter, video by video.

Categories and Subject Descriptors
H.5.1 [Multimedia Information Systems]: Video

Keywords
JSON-LD, Linked Data, media fragments, Semantic Web, video annotation, Web of Data, WebVTT, Web(VTT) of Data

1. INTRODUCTION

1.1 From <object> to <video>
In the "ancient" times of HTML 4.01 [25], the <object> tag1 was intended for allowing authors to make use of multimedia features like including images, applets (programs that were automatically downloaded and ran on the user's machine), video clips, and other HTML documents in their pages. The tag was seen as a future-proof all-purpose solution to generic object inclusion. In an <object> tag, HTML authors can specify everything required by an object for its presentation by a user agent: source code, initial values, and run-time data. While most user agents have "built-in mechanisms for rendering common data types such as text, GIF images, colors, fonts, and a handful of graphic elements", to render data types they did not support natively—namely videos—user agents generally ran external applications and depended on plugins like Adobe Flash.2

While the above paragraph is provocatively written in past tense and while the <object> tag is still part of both the current World Wide Web Consortium (W3C) HTML5 specification [2] and the Web Hypertext Application Technology Working Group (WHATWG) "Living Standard",3 more and more Web video is now powered by the native and well-standardized <video> tag that no longer depends on plugins. What currently still hinders the full adoption of <video>, besides some licensing challenges around video codecs, is its lack of Digital Rights Management (DRM) support and the fierce debate around it, albeit the Director of the W3C has confirmed4 that work in form of the Encrypted Media Extensions [8] on "playback of protected content" was in the scope of the HTML Working Group. However, it can well be said that HTML5 video has finally become a first class Web citizen that all modern browsers fully support.

1 HTML 4.01 <OBJECT> tag (uppercased in the spirit of the epoch): http://www.w3.org/TR/REC-html40/struct/objects.html#edef-OBJECT
2 Adobe Flash: http://get.adobe.com/flashplayer/
3 HTML5 <object> tag in the "Living Standard" (now lowercased): http://www.whatwg.org/specs/web-apps/current-work/#the-object-element
4 New Charter for the HTML Working Group: http://lists.w3.org/Archives/Public/public-html-admin/2013Sep/0129.html

Copyright is held by the author/owner(s). LDOW2014, April 8, 2014, Seoul, Korea.
1.2 Contributions and Paper Structure
We are motivated by the vision of a Web(VTT) of Data, a global network of videos and connected content that is based on relationships among videos based on WebVTT files, which we use as Web-native spatiotemporal containers of Linked Data with JSON-LD payloads. The paper makes four contributions, including transparent code and data.

i) Large-Scale Common Crawl study of the state of Web video: we have examined the 148 terabyte size Common Crawl corpus and determined statistics on the usage of the <video>, <track>, and <source> tags and their implications for Linked Data.

ii) WebVTT conversion to RDF-based Linked Data: we propose a general conversion process for "triplifying" existing WebVTT, i.e., for turning WebVTT into a specialized concrete syntax of RDF. This process is implemented in form of an online conversion tool.

iii) Online video annotation format and editor: we have created an online video annotation format and an editor prototype implementing it that serves for the creation and consumption of semantic spatiotemporal video annotations, turning videos into Linked Data.

iv) Data and code: source code and data are available.

The remainder of the paper is structured as follows. Section 2 provides an overview of the enabling technologies that we require for our approach. Section 3 describes a large-scale study of the state of Web video deployment based on the Common Crawl corpus. Section 4 deals with the integration of existing videos into the Web(VTT) of Data through a tool called LinkedVTT. Section 5 presents an online video annotation format and an editor that implements this format. We look at related work in Section 6 and close with conclusions and an outlook on future work in Section 7.

2. TECHNOLOGIES OVERVIEW
In this section, we lay the foundations of the set of technologies that enable our vision of the Web(VTT) of Data. The <track> tag allows authors to specify explicit external timed text tracks for videos. With the <source> tag, authors can specify multiple alternative media resources for a video. Both do not represent anything on their own and are only meaningful as direct child nodes of a <video> tag.

Web Video Text Tracks format (WebVTT).
The Web Video Text Tracks format (WebVTT, [24]) is intended for marking up external text track resources, mainly for the purpose of captioning video content. The recommended file extension is vtt, the MIME type is text/vtt. WebVTT files are encoded in UTF-8 and start with the required string WEBVTT. Each file consists of items called cues that are separated by an empty line. Each cue has a start time and an end time in hh:mm:ss.milliseconds format, separated by a stylized ASCII arrow -->. The cue payload follows in the line after the cue timings part and can span multiple lines. Typically, the cue payload contains plain text, but it can also contain textual data serialization formats like JSON, which, as we will show later in the paper, is essential for our proposed approach to semantic video annotation. Cues optionally can have unique WebVTT identifiers. WebVTT-compliant Web browsers [9] support five different kinds of WebVTT tracks: subtitles, captions, descriptions, chapters, and metadata, detailed in Table 1 and specified in HTML5 [2]. In this paper, we are especially interested in text tracks of kind metadata that are meant to be used from a scripting context and that are not displayed by user agents. For scripting purposes, the video element has a property called textTracks that returns a TextTrackList of TextTrack members, each of which corresponds to a track element. A TextTrack has a cues property that returns a TextTrackCueList of individual TextTrackCue items. Important for us, both TextTrack and TextTrackCue elements can be dynamically generated. Listing 1 shows a sample WebVTT file.

WEBVTT

00:01.000 --> 00:04.000
Never drink liquid nitrogen.

00:05.000 --> 00:09.000
It will perforate your stomach.

Listing 1: Example WebVTT file with two cues

JSON-LD.
The JavaScript Object Notation5 (JSON) is a (despite the name) language-independent textual syntax for serializing objects, arrays, numbers, strings, booleans, and null. Linked Data [4] describes a method of publishing structured data so that it can be interlinked and become more useful, which builds upon standard Web technologies such as HTTP, RDF, and URIs. Based on top of JSON, the JavaScript Object Notation for Linked Data (JSON-LD, [27]) is a method for transporting Linked Data with a smooth upgrade path from JSON to JSON-LD. JSON-LD properties like title can be mapped to taxonomic concepts (like dc:title from Dublin Core6) via so-called data contexts.

5 JavaScript Object Notation: http://json.org/
6 Dublin Core: http://dublincore.org/documents/dces/

WebVTT Kind    Description and Default Behavior
subtitles      Transcription or translation of speech, suitable for when sound is available but not understood. Overlaid on the video.
captions       Transcription or translation of the dialogue, sound effects, and other relevant audio information, suitable for when sound is unavailable or not clearly audible. Overlaid on the video; labeled as appropriate for the hard-of-hearing.
descriptions   Textual descriptions of the video component of the media resource, intended for audio synthesis when the visual component is obscured, unavailable, or unusable. Synthesized as audio.
chapters       Chapter titles, intended to be used for navigating the media resource. Displayed as an interactive (potentially nested) list in the user agent's interface.
metadata       Metadata intended for use from script context. Not displayed by user agent.

Table 1: WebVTT text track kinds in HTML5 [2]
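The scripting interface and the metadata track kind described above can be exercised with a few lines of JavaScript. The following is a minimal sketch, not taken from the paper: the selector and the assumption that a hidden metadata track with JSON cue payloads already exists are illustrative.

// Enumerate the cues of a metadata text track and parse their JSON payloads.
var video = document.querySelector('video');
for (var i = 0; i < video.textTracks.length; i++) {
  var track = video.textTracks[i];        // a TextTrack
  if (track.kind !== 'metadata') continue;
  track.mode = 'hidden';                  // load cues without rendering them
  var cues = track.cues;                  // a TextTrackCueList (may be null until loaded)
  if (!cues) continue;
  for (var j = 0; j < cues.length; j++) {
    var cue = cues[j];                    // a TextTrackCue
    console.log(cue.id, cue.startTime, cue.endTime, JSON.parse(cue.text));
  }
}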
Media Fragments URI.
Media Fragments URI [30] specifies a syntax for constructing URIs of media fragments and explains how to handle them over the HTTP protocol. The syntax is based on the specification of name-value pairs that can be used in URI query strings and URI fragment identifiers to restrict a media resource to a certain fragment. Media Fragments URI supports temporal and spatial media fragments. The temporal dimension is denoted by the parameter name t and specified as an interval with begin time and end time, with the begin time defaulting to 0 seconds and the end time defaulting to the media item's duration. The spatial dimension selects a rectangular area of pixels from media items. Rectangles can be specified as pixel coordinates or percentages. Rectangle selection is denoted by the parameter name xywh. The value is either pixel: or percent: followed by four comma-separated integers. The integers denote x, y, width, and height respectively, with x = 0 and y = 0 being the top left corner of the media item. If percent: is used, x and width are interpreted as a percentage of the width of the original media item, y and height of the original height.
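To make the syntax concrete, a few illustrative fragment URIs following the rules above (the host and file name are made up) could look like this:

http://example.org/video.webm#t=10,20                              (seconds 10 to 20)
http://example.org/video.webm#t=,30                                (from the start to second 30)
http://example.org/video.webm#xywh=160,120,320,240                 (pixel rectangle; pixel: is the default)
http://example.org/video.webm#xywh=percent:25,25,50,50             (the central quarter of the frame)
http://example.org/video.webm#t=10,20&xywh=pixel:160,120,320,240   (combined spatiotemporal fragment)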
Ontology for Media Resources.
The Ontology for Media Resources [17] serves to bridge different description methods of media resources and to provide a core set of descriptive properties. It also defines mappings to common metadata formats. Combined with Media Fragments URI, this allows for making ontologically anchored statements about media items and fragments thereof.

3. LARGE-SCALE COMMON CRAWL STUDY OF THE STATE OF WEB VIDEO
Part of the objectives behind the Web(VTT) of Data is to create a truly interconnected global network of and between videos containing Linked Data pointers to related content of all sorts, where diverse views are not filtered by the network bubble, but where serendipitously new views can be discovered by taking untrodden Linked Data paths. In order to get there, we have conducted a large-scale study based on the Common Crawl corpus to get a better understanding of the status quo of Web video and timed text track deployment.

3.1 Common Crawl
The Common Crawl Foundation7 is a non-profit organization founded in 2008 by Gil Elbaz. Its objective is to democratize access to Web information by producing and maintaining an open repository of Web crawl data that is universally accessible and analyzable. All Common Crawl data is stored on Amazon Simple Storage Service (Amazon S3)8 and accessible to anyone via Amazon Elastic Compute Cloud (Amazon EC2),9 allowing the data to be downloaded in bulk, as well as directly accessed for map-reduce processing in EC2. The latest dataset at the time of writing was collected at the end of 2013, contains approximately 2.3 billion webpages, and is 148 terabyte in size [11]. Crawl raw data is stored in the Web ARChive format (WARC, [14]), an evolution of the previously used Archive File Format (ARC, [6]), which was developed at the Internet Archive.10 Each crawl run is hierarchically organized in segment directories that contain the WARC files with the HTTP requests and responses for each fetch, and individual Web Archive Metadata (WAT, [10]) files, which describe the metadata of each request and response. While the Common Crawl corpus gets bigger with each crawl run, it obviously does not represent the "whole Web", which is an elusive concept anyway, given that a simple calendar Web application can produce an infinite number of pages. Common Crawl decides on the to-be-included pages based on an implementation11 of the PageRank [23] algorithm, albeit the inclusion strategy is unknown, despite the foundation's focus on transparency.

3.2 On the Quest for WebVTT
We have analyzed the entire 148 terabytes of crawl data using an Elastic Compute Cloud job whose code was made available as open-source.12 Rather than parse each document as HTML, we have tested them for the regular expression <video[^>]*>(.*?)</video>, an approach that also in previous experiments proved very efficient [3, 22]. We tested exactly 2,247,615,323 webpages that had returned a successful HTTP response to the Common Crawl bot, and had to skip exactly 46,524,336 non-HTML documents. On these webpages, we detected exactly 2,963,766 <video> tags, resulting in a 1.37 gigabyte raw text file that we have made available publicly.13 This means that on average only ≈0.132% of all webpages contain HTML5 video. The whole job took five hours on 80 c1.xlarge machines and cost $555, consisting of $468 for Amazon EC2 plus an additional $87 for Amazon Elastic MapReduce (Amazon EMR).14

3.3 Text Track Statistics
From all 2,963,766 <video> tags, only 1,456 (≈0.049%) had a <track> child node. Upon closer examination of the kinds of these 1,456 <track> nodes (see Table 1 for an explanation of the various kinds), we saw that the overwhelming majority are unsurprisingly used for subtitles or captions. Almost no chapter usage was detected, and neither metadata nor description usage at all. The full details can be seen in Table 2. Looking at the languages used in the captions and subtitles, these were almost exclusively English and French, as can be seen in Table 3. The track labels listed in Table 4 indeed confirm this observation. In case of multiple tracks for one video, one track can be marked as the default track. This happens through a boolean attribute,15 whose value either needs to be the empty string or the attribute's name, which is "default" in the concrete case. Table 5 shows that this was used correctly in almost all cases. When we tried to determine the MIME type of the actual text tracks, we relied on the file extension of the values given in the src attributes.

7 Common Crawl: http://commoncrawl.org/
8 Amazon S3: http://aws.amazon.com/s3/
9 Amazon EC2: http://aws.amazon.com/ec2/
10 Internet Archive: https://archive.org/
11 Common Crawl PageRank code: https://github.com/commoncrawl/commoncrawl-crawler/tree/master/src/org/commoncrawl/service/pagerank
12 EC2 job: https://github.com/tomayac/postdoc/blob/master/demos/warczenschwein/
13 2,963,766 <video> tags: https://drive.google.com/file/d/0B9LlSNwL2H8YdWVIQmJDaE81UEk
14 Amazon EMR: http://aws.amazon.com/elasticmapreduce/
15 HTML boolean attributes: http://www.whatwg.org/specs/web-apps/current-work/#boolean-attributes
As a significant amount of text tracks seems to be dynamically generated on-the-fly—and thus had no file extension but a video identifier in the URL instead—we used an approximation to check if some part of the URL matched the regular expression /\bvtt\b/gi. Based on this approximation, a little over half of all text tracks are in WebVTT format with the extension .vtt or rarely .webvtt. The predecessor SubRip file format16 can still be encountered in about a quarter of all text tracks. In between SubRip and WebVTT, a format originally called WebSRT (Web Subtitle Resource Tracks) existed that shared the .srt file extension. The full distribution details are available in Table 6. Looking at the number of text tracks per video, almost all videos had exactly one text track rather than multiple, as detailed in Table 7, meaning that the broad majority of all videos are subtitled or captioned in only one language.

16 SubRip file format: http://www.matroska.org/technical/specs/subtitles/srt.html

            Count
captions    915
subtitles   525
chapters    2
undefined   10

Table 2: Distribution of values for kind of <track>

            Count
en          1,242
fr          117
de          8
Others      7
undefined   78

Table 3: Distribution of values for srclang of <track>

            Count
English     1,069
Français    117
Others      41
undefined   229

Table 4: Distribution of values for label of <track>

            Count
default     650
''          526
true        1
undefined   279

Table 5: Distribution of values for default of <track>

File extension of src   Count
probably .vtt           696
.srt                    390
.vtt or .webvtt         66
no extension            304

Table 6: Distribution of values for src of <track>

<track> tags   Count
1              1,446
0              9
9              1

Table 7: Number of <track> tags per <video> tag (zero tags means the <video> tag had an unparseable <track>)

3.4 Video Statistics
As in Section 5 we will report on ways to make semantic statements about videos on the Web, we have additionally compiled some video statistics. Unlike with images on the Web, where semantic statements in Resource Description Framework (RDF) can be made based on the image's URL [21], with Web video the situation is different. Due to different Web browsers supporting different video codecs, it is a common practice to provide videos in different encodings. The user's Web browser then dynamically selects a version it can play. This is realized through the <source> tag. Table 8 shows the observed numbers of <source> tag child nodes per <video> tag with <track> tag, with the result that up to four sources are given for essentially the "same" video. Table 9 confirms this observation for the entire collection of all <video> tags with or without <track> tag. Table 10 shows the distribution of values for the type attribute of <source> tags with <track> tag, the clear leaders being the MP4 format followed by WebM, a trend that again is also reflected in Table 11 within the entire collection of all <video> tags with or without <track> tag.

<source> tags   Count
1               826
3               404
2               173
0               49
4               4

Table 8: Number of <source> tags per <video> with <track> (zero tags means the video URL was provided via src; 1,405 videos did not have a src attribute, 51 videos had one)

<source> tags   Count
0               7,828,032
1               1,139,240
3               138,540
4               83,121
2               77,853
6               804
5               179
7               137
8               64
10              22
9               9
13              8
11              6

Table 9: Number of <source> tags per <video> with or without <track> (zero tags means the video URL was provided via src)

                  Count
video/mp4         1,285
video/webm        94
video/x-ms-wmv    10
video/ogg         5
Others            6
undefined         58

Table 10: Distribution of values for type of <source> tags with <track>

                                              Count
video/mp4                                     1,204,744
video/webm                                    163,715
video/mp4; codecs="avc1.42E01E, mp4a.40.2"    10,700
text/json                                     2,841
video/flv                                     2,281
video/x-ms-wmv                                2,105
video/flash                                   2,023
video/ogg                                     1,529
video/youtube                                 1,528
application/x-mpegURL                         1,257

Table 11: Distribution of values for type of <source> tags with or without <track> (with more than 1,000 occurrences)
3.5 Implications on Linked Data for Videos
The biggest issue with this practice of putting multiple sources is that rather than having one unique identifier (URL) per video, there can be multiple identifiers. Listing 2 shows a minimal example. Unless one repeats all statements for each source, there will always remain unclear sources without structured data. We note that a video in encoding A and the "same" video in encoding B may not be marked as owl:sameAs, because statements about the encoding format of one video do not apply to the other; the identity symmetry condition would thus be violated. In practice, a solution similar to specifying canonical URLs in Web search [15] seems feasible. Another approach is to require a unique identifier in the id attribute, which allows for addressing the video with fragment identifiers. More advanced approaches to the problem stemming from the bibliographic universe like FRBR [29] are possible, but for the concrete use case seem quite complex.

Listing 2: Specifying a license for an image and attempt to do the same for a video with two sources (the license of kitten.webm stays unclear)
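The markup of Listing 2 itself is not recoverable from the extracted text; the following is a minimal sketch of the situation its caption describes, assuming RDFa-style license markup in the spirit of [21]. Apart from kitten.webm, which the caption names, the file names and the license URI are illustrative.

<!-- An image has exactly one URL, so an RDFa license statement attaches cleanly. -->
<a about="kitten.jpg" rel="license"
   href="http://creativecommons.org/licenses/by/3.0/">
  <img src="kitten.jpg" alt="Kitten" />
</a>

<!-- A video offered in two encodings has two URLs; the license statement below
     names only one of them, so the license of kitten.webm stays unclear. -->
<video controls>
  <source src="kitten.mp4"  type="video/mp4" />
  <source src="kitten.webm" type="video/webm" />
</video>
<a about="kitten.mp4" rel="license"
   href="http://creativecommons.org/licenses/by/3.0/">License</a>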
4. WEBVTT CONVERSION TO RDF-BASED LINKED DATA
The WebVTT specification [24] defines a syntax for conveying timed video text tracks, and a semantics for this syntax in terms of how Web browsers should process such tracks. It achieves this by specifying an underlying data model for those tracks. The aim of this section is to show how this data model can easily be mapped to RDF-based Linked Data, thus allowing for many other usage scenarios for this data. For this purpose, we propose an RDF-Schema ontology17 conveying the WebVTT data model. In the rest of the paper, terms from this ontology will be preceded by the vtt: prefix. An online implementation of this interpretation process that we have titled LinkedVTT is likewise available online.18 It takes the URL of any WebVTT file, the contents of a raw WebVTT file, or a YouTube URL of any video with closed captions as an input, and applies the conversion from WebVTT to Linked Data on-the-fly.

4.1 Basic Interpretation
A WebVTT file defines a set of cues, which are described by a pair of timestamps and a payload. In other words, each cue is an annotation of the video, associating a temporal video fragment, delimited by the two timestamps, to the payload. As there is a standard way of identifying temporal and spatial video fragments with a URI [30], it is straightforward to represent this annotation as an RDF triple. We therefore propose a property vtt:annotatedBy to serve as predicate for those triples. To keep the context of each annotation, we use the notion of RDF dataset [7]. Each vtt:annotatedBy triple is enclosed in a named graph, whose name is either a URI, based on the cue identifier if it has one, or a blank node if the cue has no identifier. The default graph of the dataset describes its overall structure, linking the dataset URI to all the URIs and blank nodes identifying its cues with the vtt:hasCue property. In the default graph, each cue is also linked to the Media Fragments URI it describes, with the vtt:describesFragment property. As the notion of dataset is a recent addition to the RDF [...]

17 RDF-Schema ontology: http://champin.net/2014/linkedvtt/onto#
18 LinkedVTT: http://champin.net/2014/linkedvtt/
4.2 Advanced Interpretation
[...] candidate for cues of such tracks. JSON has a textual syntax that is easy to author and easy to process in a Web browser and elsewhere. Furthermore, JSON-LD [27] provides a standard way to interpret JSON data as Linked Data, which fits nicely with our approach. More precisely, whenever the payload of a cue successfully parses as a JSON object, we consider that this object is meant to represent the annotated media fragment itself, and interpret it as JSON-LD. In consequence, all properties of the JSON object are applied directly to the fragment, and embedded structures can be used to describe other resources related to that fragment, e.g., depicted persons, locations, topics, related videos or video fragments, or spatiotemporal video tags. In this case, all the triples generated from parsing the payload as JSON-LD replace the vtt:annotatedBy triple in the cue's named graph. Listing 3 gives an example of such a JSON-LD payload. We note that it includes the JSON-LD specific @context key, to allow its interpretation as Linked Data. This context can be specified in each cue, but below we also provide an alternative way to declare it once for the entire WebVTT file.

WEBVTT

cue1
00:00:00.000 --> 00:00:12.000
{
  "@context": "http://champin.net/2014/linkedvtt/demonstrator-context.json",
  "tags": ["wind scene", "opening credits"],
  "contributors": ["http://ex.org/sintel"]
}

Listing 3: Sample WebVTT metadata file with JSON-LD payload in a cue identified as "cue1"

4.3 Linked Data Related Metadata
In addition to the cues, WebVTT files can contain metadata headers described as key-value pairs. While the WebVTT specification defines a number of metadata headers, it leaves it open for extensions. We propose the three extended metadata headers listed below. Most WebVTT currently does not contain these metadata headers, but we argue that they allow for an easy transition from plain WebVTT to Linked Data WebVTT, just like JSON-LD makes it easy to turn plain JSON into Linked Data by adding a @context property. Furthermore, other metadata headers will be evaluated against the JSON-LD context, and can produce additional triples with the WebVTT file as their subject.

@base: Sets the base URI used for resolving relative URIs. This applies to any relative URIs that would be found in the JSON-LD descriptions, but is also used to generate URIs for cues based on their identifiers. It defaults to the URI of the WebVTT file.

@context: This key can be used multiple times; each value is the URI of a JSON-LD context that should be used to interpret the JSON payloads in the WebVTT file.

@video: Sets the URI of the video for generating media fragment URIs. If not present, the video URI must be provided externally, e.g., via the src attribute of the <video> tag containing the WebVTT track. This metadata header is a direct response to an issue that we have outlined in Subsection 3.5.
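A WebVTT file using the three proposed headers could then look as follows. This is a sketch rather than an example from the paper: all URIs are illustrative, and the key/value header syntax is an assumption in line with ordinary WebVTT metadata headers.

WEBVTT
@base: http://ex.org/metadata.vtt
@context: http://champin.net/2014/linkedvtt/demonstrator-context.json
@video: http://ex.org/video.mp4

NOTE
The URIs above are illustrative; with @context declared once here,
the per-cue "@context" key of Listing 3 can be omitted.

cue1
00:00:00.000 --> 00:00:12.000
{
  "tags": ["wind scene", "opening credits"],
  "contributors": ["http://ex.org/sintel"]
}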
4.4 Integrating Existing Videos Into the Web(VTT) of Data
Given the currently rather manageable amount of videos with captions or subtitles as outlined in Subsection 3.3, approaches for the automated semantic lifting based on timed text track data are feasible. These approaches extract the transcribed text snippets from cues and either convert them into one consistent block of text or treat each text snippet in isolation before applying named entity extraction on them. Representative examples based on this idea are [18, 19, 20] by Li et al. or also [28] by us. In combination with Media Fragments URI, spatiotemporal annotations can be created with good precision and reasonable time effort, both on-the-fly or in bulk for static storage in a triple store.

5. ONLINE VIDEO ANNOTATION FORMAT AND EDITOR
Complementary to the conversion process presented in Section 4, in this section we focus on facilitating the online creation and consumption of metadata tracks for future video creations and videos not contained in the Common Crawl corpus. We begin with the annotation model.

5.1 Annotation Model
Our annotation model is the same as the one produced by the interpretation process presented above. Annotations take the form of RDF statements (subject-predicate-object), where the subject is any temporal or spatiotemporal fragment of the video, identified by the corresponding Media Fragments URI. They are encoded as TextTrackCues with JSON-LD payloads such as the one shown in Listing 3. A dedicated data context defines their semantics.

5.2 WebVTT Editor
We have implemented this annotation model in form of an online demonstrator prototype. The demonstrator interprets the existing metadata track for a video and reacts to annotations when the currentTime of the media resource matches the startTime or endTime of a cue. We call existing annotations Read annotations. Users can add Write annotations by creating new TextTrackCues at the desired start and end times and by providing their JSON-LD payloads. The editor facilitates this task through a graphical user interface, abstracting the underlying details. Figure 1 shows a screenshot of the WebVTT editor. Newly generated annotations get directly interpreted and can be persistently stored locally or, in the future, remotely for collaborative editing. We have developed a WebVTT to JSON-LD converter, capable of transforming WebVTT metadata tracks following our annotation model into JSON-LD for the Web of Data. This allows for straightforward local annotation creation with Semantic Web compliance upon global publication.
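A Write annotation as described above boils down to a few DOM calls. The following sketch is illustrative rather than the editor's actual code; the selector, timings, and payload values are assumptions.

// Create a metadata track and add a cue whose payload is a JSON-LD object.
var video = document.querySelector('video');
var track = video.addTextTrack('metadata', 'annotations', 'en');
track.mode = 'hidden';                       // keep cues loaded but not rendered
var payload = {
  '@context': 'http://champin.net/2014/linkedvtt/demonstrator-context.json',
  'tags': ['wind scene']
};
// VTTCue(startTime, endTime, text); older browsers exposed TextTrackCue instead.
var cue = new VTTCue(0, 12, JSON.stringify(payload));
cue.id = 'cue1';
track.addCue(cue);
// React when playback enters or leaves the annotated fragment.
cue.onenter = function () { console.log('cue active:', JSON.parse(cue.text)); };
cue.onexit  = function () { console.log('cue done'); };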
5.2.1 Semantic Annotation Types
Our JSON-LD context eases common annotation tasks by defining the semantics of a few useful JSON properties described below. According to this context, Listing 3 is interpreted as in Listing 4 (RDF in JSON-LD syntax) and Listing 5 (RDF in N-Triples syntax). More advanced annotation tasks can be supported by extending the data context.

Plain Text Tags: Annotations of type tags allow for adding plain text tags to a media fragment. They are interpreted as Common Tag [13] format ctag:label.

Semantic Tags: Annotations of type semanticTags allow for adding semantic tags to a media fragment. Unlike plain text tags, semantic tags are references to well-defined concepts complete with their own URIs. They are interpreted as Common Tag [13] format ctag:means. Spatiotemporal semantic tags allow for interesting Linked Data experiences if the tags point to well-connected concepts.

Contributors: The contributors annotation type allows for denoting the contributors in a media fragment, like its actors. They are interpreted as Ontology for Media Resources [17] format ma:hasContributor.

Summary: The summary annotation type allows for summarizing a media fragment (note, not the whole video like kind description tracks) with plain text. They are interpreted as ma:description [17].

{
  "@id": "http://ex.org/metadata.vtt",
  "cues": [{
    "@context": "http://champin.net/2014/linkedvtt/demonstrator-context.json",
    [...] "#t=0:0.0,0:12.0",
    "tags": ["wind scene", "opening credits"],
    "contributors": ["http://ex.org/sintel"]
  }]
}

Listing 4: Generated JSON-LD file based on the WebVTT file shown in Listing 3 (flat interpretation)

Listing 5: RDF triples based on the JSON-LD code from Listing 4

5.2.2 Presentation-Oriented Annotation Types
Presentation-oriented annotations—similar to temporal style sheets—do not generate RDF data, but only impact the way videos get presented.

Visual Effect: Annotations of type visualEffect allow for applying visual effects in the syntax of Cascading Style Sheets19 (CSS) to a media fragment, e.g., filters, zoom, transparency, and 2D/3D transformations and animations.

Audial Effect: The audialEffect annotation type allows for applying audial effects to a media fragment. Currently, we support modifying the volume from 0 to 1.

Playback Rate: The playbackRate annotation type allows for specifying the effective playback rate of a media fragment. The playback rate is expressed as a floating point multiple or fraction of the intrinsic video speed.

HTML Overlay: Via the htmlOverlay annotation type, overlays in freeform HTML code can be added to a media fragment. Examples are graphical, textual, or combined overlays that can contain links to (temporal fragments of) other videos or within the current video.

19 Cascading Style Sheets: http://www.w3.org/Style/CSS/
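The annotation types above can be mixed in a single cue payload. The following JavaScript object literal is a hedged sketch of what such a payload could look like under our data context: the property names come from the descriptions above, while the exact value shapes (for instance the object form of audialEffect) and all URIs are assumptions.

// Sketch of a cue payload mixing semantic and presentation-oriented types.
// Property names follow Section 5.2; value shapes and URIs are illustrative.
var payload = {
  "@context": "http://champin.net/2014/linkedvtt/demonstrator-context.json",
  "tags": ["shaman"],
  "semanticTags": ["http://ex.org/actors/sintel/shaman"],
  "contributors": ["http://ex.org/sintel"],
  "summary": "The shaman character appears in this fragment.",
  "visualEffect": "filter: sepia(80%); transform: scale(1.1);",
  "audialEffect": { "volume": 0.5 },
  "playbackRate": 0.5,
  "htmlOverlay": "<a href='#t=120,130'>Jump to the related scene</a>"
};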
5.3 Interpretation Layer
In our WebVTT editor, we propose an interpretation layer capable of dealing with the herein defined annotation types. We thus make an open world assumption by supporting a set of pre-defined values for predicate and object listed below, and ignoring unknown ones. This permits others to extend—or even completely replace—our interpretation layer. If a TextTrackCue has a WebVTT identifier, we use it to address its annotations via the metadata track's URI and corresponding cue fragment identifier, allowing for meta annotations of annotations, e.g., to attach provenance or license information to them.

5.4 Evaluation
We evaluate our annotation model and related technology stack based on a state-of-the-art hypervideo model by Sadallah et al. [26] that builds on a careful study of prior art.

The CHM Hypervideo Model.
Sadallah et al. define hypervideo as an "interactive video-centric hypermedia document built upon audiovisual content". The authors identify three common hypervideo characteristics, namely (i) interactivity, which, e.g., can enable richer navigational possibilities, (ii) non-linearity, which allows for features like video montages, and finally (iii) enrichments that include all sorts of supplementary material besides and on top of hypervideos. The authors have examined hypervideo systems of recent years and found recurring patterns, summarized and compared to our approach in the following.

Video player and controls: Hypervideo systems by definition provide one or multiple video players; however, the corresponding video controls are not necessarily exposed.
✓ Our approach uses the (optionally customizable) default HTML5 player that includes hidable controls (Figure 1).

Timeline: A timeline is the spatial representation of temporally situated metadata in a video. The most common timeline pattern shows the time along the x-axis and corresponding metadata along the y-axis.
✓ Our approach supports temporal metadata. Customizable timeline visualizations exist20 and can be added.

Textual or graphical overlay: Additional textual or graphical information can be displayed in form of overlays on the video. Overlays can also serve as external or video-internal hyperlinks, referred to as hotspots.
✓ We realize overlays and links with htmlOverlay types. Figure 1 shows both a graphical (yellow box) and two textual overlays (red and green texts).

Textual or graphical table of contents: If a video is logically separated into different parts, a table of contents lists these in textual or graphical form, makes them navigable, or visually summarizes them, referred to as video map.
✓ Textual tables of contents are directly supported via WebVTT text tracks of type chapters. Graphical tables of contents can be created based thereon.

Transcript: The textual document of the transcribed audiovisual content of a video allows for following along the video by reading and also serves for in-video navigation.
✓ Subtitles and captions are natively supported by WebVTT tracks of the types subtitles and captions. Figure 1 shows active subtitles (white text).

6. RELATED WORK
With our annotation approach, we leverage WebVTT metadata tracks as a means for tying semantic JSON-LD annotations to temporal or spatiotemporal video fragments. As each <track> tag by pure definition is bound to exactly one <video> tag, and as modern search engines parse and interpret JSON-LD annotations, a unique relation of annotations to video content is made. In consequence, related work can be regarded under the angles of online annotation creation and large-scale Linked Data efforts for video. Many have combined Linked Data and video; typical examples are [16] by Lambert et al. and [12] by Hausenblas et al. We have already described the text track enriching approaches [18, 19, 20, 28] in Subsection 4.4, [20] being closest to our idea of a Web(VTT) of Data, albeit their approach is centered around their application Synote. The online video hosting platform YouTube lets video publishers add video annotations in a closed proprietary format. From 2009 to 2010, YouTube had a feature called Collaborative Annotations [1] that allowed video consumers to collaboratively create video annotations. Unlike the format of YouTube, our format is open and standards-based. In [31], Van Deursen et al. present a system that combines Media Fragments URI and the Ontology for Media Resources in an HTML5 Web application to convert rich media fragment annotations into a WebVTT file that can be used by HTML5-enabled players to show the annotations in a synchronized way. Building on their work, we additionally allow for writing annotations by letting annotators create WebVTT cues with an editor. The Component-based Hypervideo Model Popcorn.js21 is an HTML5 JavaScript media framework for the creation of media mixes by adding interactivity and context to online video by letting users link social media, feeds, visualizations, and other content directly to moving images. PopcornMaker22 is an interactive Web authoring environment that allows for videos to be annotated on a video timeline. While Popcorn media annotations are essentially JavaScript programs, our approach is based on directly indexable WebVTT files.

7. CONCLUSIONS AND FUTURE WORK
In this paper, we have introduced our vision of the Web(VTT) of Data, a global network of videos and connected content that is based on relationships among videos based on WebVTT files, which we use as Web-native spatiotemporal containers of Linked Data with JSON-LD payloads. With the recent graduation of the JSON-LD syntax as an official W3C Recommendation and a major search engine company23 supporting embedded JSON-LD documents in HTML documents,24 JSON-LD definitely is here to stay. Likewise for WebVTT, which in the more recent past has been natively implemented by all major Web browser vendors, the future is bright. We combine both technologies in a fruitful way that is focused both at common Web search engines as well as at the entire Linked Data stack of technologies. Using WebVTT as a container for JSON-LD is both innovative and natural. Making commonly understood semantic statements about video fragments on the Web has become feasible thanks to Media Fragments URI, a standard that allows for applying Linked Data approaches to moving images on a temporal and spatiotemporal axis. We have organized this paper in three major steps. (i) In order to get a better understanding of the status quo of Web video deployment, we have performed a large-scale analysis of the 148 terabyte size Common Crawl corpus, (ii) we have addressed the challenge of integrating existing videos in the Common Crawl corpus into the Web(VTT) of Data by proposing a WebVTT conversion to RDF-based Linked Data, and (iii) we have open-sourced an online video annotation creation and consumption tool, targeted at videos not contained in the Common Crawl corpus and at integrating future video creations. In this paper, we have combined Big Data and Small Data. On the Big Data side, we have learned from the Common Crawl corpus which kind of timed text tracks are out there, which allowed us to propose a realistic approach to integrating them into the Web(VTT) of Data. On the Small Data side, we have implemented an online editor for the creation of semantic video annotations that can be applied video by video, so that the Web(VTT) of Data gets woven tighter and tighter with each new addition.

Future work has several dimensions. Beginning from video annotation, a first concrete research task is to work on our editor prototype. While a lot of effort can be put in the editor itself, far more added value is created by proposing an extension to the most well-known online video annotation stack, the Popcorn.js and PopcornMaker projects.

20 D3 timeline implementation: https://github.com/jiahuang/d3-timeline
21 Popcorn.js: http://popcornjs.org/
22 PopcornMaker: https://popcorn.webmaker.org/
23 JSON-LD in Gmail: https://developers.google.com/gmail/actions/reference/formats/json-ld
24 Embedding JSON-LD in HTML Documents: http://www.w3.org/TR/json-ld/#embedding-json-ld-in-html-documents
A minimal Popcorn.js example annotation can be seen in Listing 6. Rather than storing the annotations as steps of a JavaScript program that "artificially" need to be aligned to the corresponding parts of the video, an extension to Popcorn.js could use our approach of leveraging naturally temporally aligned WebVTT cues with JSON-LD payloads for the annotations. We have been able to play video in Web browsers plugin-free for a couple of years now; the next step is adding resources to videos to make them more accessible and provide more options to the viewer. Straight- [...]

Listing 6: Popcorn.js example
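The body of Listing 6 did not survive extraction; a minimal Popcorn.js annotation in the spirit of what the text describes could look like the following sketch, in which the element id, timings, and plugin parameters are illustrative.

// Minimal Popcorn.js sketch: one annotation, aligned to the video by hand.
var pop = Popcorn('#video');
pop.footnote({
  start: 1,                       // seconds into the video
  end: 4,
  text: 'Never drink liquid nitrogen.',
  target: 'annotation-container'  // id of the element that shows the overlay
});
pop.play();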
Acknowledgments
The research presented in this paper was partially supported by the French National Agency for Research project Spectacle En Ligne(s), project reference ANR-12-CORP-0015.

Figure 1: WebVTT editor interpreting the spatiotemporal annotation "cue2" that identifies the highlighted spatial fragment as ex:actors/sintel/shaman, while in parallel modifying "cue3" with tag, volume, playback rate, and style (① left: Graphical User Interface with JSON-LD debug view; ② center: Chrome Developer Tools with highlighted <video> tag; ③ right: raw WebVTT file metadata.vtt with highlighted "cue2" and "cue3")
8. REFERENCES
[1] S. Bar et al. YouTube's Collaborative Annotations. In Webcentives '09, 1st International Workshop on Motivation and Incentives, pages 18–19, 2009.
[2] R. Berjon, S. Faulkner, T. Leithead, et al. HTML5, A Vocabulary and Associated APIs for HTML and XHTML. Candidate Recommendation, W3C, 2013. http://www.w3.org/TR/html5/.
[3] C. Bizer, K. Eckert, R. Meusel, H. Mühleisen, M. Schuhmacher, and J. Völker. Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis. In H. Alani, L. Kagal, A. Fokoue, P. Groth, C. Biemann, J. Parreira, L. Aroyo, N. Noy, C. Welty, and K. Janowicz, editors, The Semantic Web – ISWC 2013, volume 8219 of Lecture Notes in Computer Science, pages 17–32. Springer Berlin Heidelberg, 2013.
[4] C. Bizer, T. Heath, and T. Berners-Lee. Linked Data—The Story So Far. Int. J. Semantic Web Inf. Syst., 5(3):1–22, 2009.
[5] C. Bizer, T. Heath, T. Berners-Lee, and M. Hausenblas, editors. WWW2012 Workshop on Linked Data on the Web, Lyon, France, 16 April, 2012, volume 937 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.
[6] M. Burner and B. Kahle. Arc File Format. Technical report, Jan. 1996. http://archive.org/web/researcher/ArcFileFormat.php.
[7] R. Cyganiak, D. Wood, and M. Lanthaler. RDF 1.1 Concepts and Abstract Syntax. Proposed Recommendation, W3C, Jan. 2014. http://www.w3.org/TR/rdf11-concepts/.
[8] D. Dorwin, A. Bateman, and M. Watson. Encrypted Media Extensions. Working Draft, W3C, Oct. 2013. http://www.w3.org/TR/encrypted-media/.
[9] S. Dutton. Getting Started With the Track Element, Feb. 2012. http://www.html5rocks.com/en/tutorials/track/basics/.
[10] V. Goel. Web Archive Metadata File Specification. Technical report, Apr. 2011. https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Metadata+File+Specification.
[11] L. Green. Winter 2013 Crawl Data Now Available, Jan. 2014. http://commoncrawl.org/winter-2013-crawl-data-now-available/.
[12] M. Hausenblas, R. Troncy, Y. Raimond, and T. Bürger. Interlinking Multimedia: How to Apply Linked Data Principles to Multimedia Fragments. In Linked Data on the Web Workshop (LDOW 09), in conjunction with the 18th International World Wide Web Conference (WWW 09), 2009.
[13] A. Iskold, A. Passant, V. Miličić, et al. Common Tag Specification, June 2009. http://commontag.org/Specification.
[14] ISO 28500. Information and documentation – The WARC File Format. International Standard, 2008. http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf.
[15] J. Kupke and M. Ohye. Specify your canonical, Feb. 2009. http://googlewebmastercentral.blogspot.de/2009/02/specify-your-canonical.html.
[16] D. Lambert and H. Q. Yu. Linked Data based Video Annotation and Browsing for Distance Learning. In SemHE '10: The Second International Workshop on Semantic Web Applications in Higher Education, 2010.
[17] W. Lee, W. Bailer, T. Bürger, et al. Ontology for Media Resources 1.0. Recommendation, W3C, Feb. 2012. http://www.w3.org/TR/mediaont-10/.
[18] Y. Li, G. Rizzo, J. L. Redondo García, R. Troncy, M. Wald, and G. Wills. Enriching Media Fragments with Named Entities for Video Classification. In Proceedings of the 22nd International Conference on World Wide Web Companion, WWW '13 Companion, pages 469–476, Republic and Canton of Geneva, Switzerland, 2013. International World Wide Web Conferences Steering Committee.
[19] Y. Li, G. Rizzo, R. Troncy, M. Wald, and G. Wills. Creating Enriched YouTube Media Fragments with NERD Using Timed-Text. In 11th International Semantic Web Conference (ISWC2012), November 2012.
[20] Y. Li, M. Wald, T. Omitola, N. Shadbolt, and G. Wills. Synote: Weaving Media Fragments and Linked Data. In Bizer et al. [5].
[21] P. Linsley. Specifying an image's license using RDFa, Aug. 2009. http://googlewebmastercentral.blogspot.com/2009/08/specifying-images-license-using-rdfa.html.
[22] H. Mühleisen and C. Bizer. Web Data Commons – Extracting Structured Data from Two Large Web Corpora. In Bizer et al. [5].
[23] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab, Nov. 1999.
[24] S. Pfeiffer and I. Hickson. WebVTT: The Web Video Text Tracks Format. Draft Community Group Specification, W3C, Nov. 2013. http://dev.w3.org/html5/webvtt/.
[25] D. Raggett, A. Le Hors, and I. Jacobs. HTML 4.01 Specification. Recommendation, W3C, Dec. 1999. http://www.w3.org/TR/html401.
[26] M. Sadallah et al. CHM: An Annotation- and Component-based Hypervideo Model for the Web. Multimedia Tools and Applications, pages 1–35, 2012.
[27] M. Sporny, D. Longley, G. Kellogg, et al. JSON-LD 1.0, A JSON-based Serialization for Linked Data. Proposed Recommendation, W3C, Nov. 2013. http://www.w3.org/TR/json-ld/.
[28] T. Steiner. SemWebVid – Making Video a First Class Semantic Web Citizen and a First Class Web Bourgeois. In A. Polleres and H. Chen, editors, Proceedings of the ISWC 2010 Posters & Demonstrations Track: Collected Abstracts, Shanghai, China, November 9, 2010, volume 658 of CEUR Workshop Proceedings, ISSN 1613-0073, pages 97–100, Nov. 2010.
[29] B. Tillett. FRBR: A Conceptual Model for the Bibliographic Universe. Technical report, 2004. http://www.loc.gov/cds/downloads/FRBR.PDF.
[30] R. Troncy, E. Mannens, S. Pfeiffer, et al. Media Fragments URI 1.0 (basic). Recommendation, W3C, Sept. 2012. http://www.w3.org/TR/media-frags/.
[31] D. Van Deursen, W. Van Lancker, E. Mannens, et al. Experiencing Standardized Media Fragment Annotations Within HTML5. Multimedia Tools and Applications, pages 1–20, 2012.