Creating Enriched YouTube Media Fragments With NERD Using Timed-Text

Yunjia Li1, Giuseppe Rizzo2, Raphaël Troncy2, Mike Wald1, and Gary Wills1

1 University of Southampton, UK
  yl2@ecs.soton.ac.uk, mw@ecs.soton.ac.uk, gbw@ecs.soton.ac.uk
2 EURECOM, Sophia Antipolis, France
  giuseppe.rizzo@eurecom.fr, raphael.troncy@eurecom.fr

Abstract. This demo enables the automatic creation of semantically annotated YouTube media fragments. A video is first ingested into the Synote system, and a new method retrieves its associated subtitles or closed captions. Next, NERD is used to extract named entities from the transcripts, which are then temporally aligned with the video. The entities are disambiguated in the LOD cloud, and a user interface makes it possible to browse through the entities detected in a video and to get more information about them. We evaluated our application with 60 videos from 3 YouTube channels.

Keywords: Media fragment, media annotation, NERD

1 Introduction

New W3C standards such as HTML5, Media Fragment URI and the Ontology for Media Resources have finally made videos first class citizens on the Web. Indexing a video at a fine-grained level, such as the scene, is, however, not yet common practice on popular video sharing platforms. In this demo, we propose to use NERD for extracting named entities from the timed text associated with videos in order to generate media fragments annotated with resources from the LOD cloud. Our contributions include a new combined strategy for extracting named entities, the temporal alignment of the named entities with the video, and a user interface for browsing the enriched videos.

The LEMO multimedia annotation framework provides a unified model to annotate media fragments, where the annotations are enriched with contextually relevant information from the LOD cloud [1]. Yovisto provides both automatic video annotations based on video analysis and collaborative user-generated annotations, which are further linked to entities in the LOD cloud with the objective of improving the searchability of videos [5]. SemWebVid automatically generates RDF video descriptions using their closed captions [4]. The captions are analyzed by 3 web services (AlchemyAPI, OpenCalais and Zemanta), but they are chunked into blocks, which causes the NLP tools to lose the surrounding context. In this demo, we propose a new combined strategy using 10 different NER tools based on NERD [3]. In addition, we propose a new method to get the subtitles of a video and to analyze them globally while re-creating the temporal alignment.

2 Technical Architecture

This demo is powered by the integration and extension of two systems: Synote [2] and NERD [3] (Figure 1a).

[Figure 1. a) Synote and NERD integration architecture. b) The Synote UI enriched with NERD and DBpedia.]

A user creates a new recording in Synote from any YouTube video. (1) The system first extracts the metadata and the subtitles, if available, using the YouTube API:

GET api/timedtext?v=videoid&lang=en&format=srt&name=trackname

In this request, four parameters are required: the YouTube video id v, the language of the subtitles lang, the timed-text format format, and the track name name. (2) A prior request is necessary to obtain the track name, since it is specified by the video owner:

GET api/timedtext?v=videoid&type=list
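The two requests can be sketched as follows, as a minimal illustration in Python. It assumes the timedtext endpoint is served from www.youtube.com and that the track list is returned as XML with one <track> element per track; the function names are illustrative and do not reproduce the actual Synote code.

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE = "https://www.youtube.com/api/timedtext"

def list_tracks(video_id):
    # Step (2): ask YouTube which subtitle tracks the owner has defined.
    with urllib.request.urlopen(f"{BASE}?v={video_id}&type=list") as resp:
        root = ET.fromstring(resp.read())
    # Each <track> element carries the owner-chosen name and a language code.
    return [(t.get("name", ""), t.get("lang_code")) for t in root.iter("track")]

def fetch_srt(video_id, track_name, lang="en"):
    # Step (1): download the subtitles of one track in SRT format.
    query = urllib.parse.urlencode(
        {"v": video_id, "lang": lang, "format": "srt", "name": track_name})
    with urllib.request.urlopen(f"{BASE}?{query}") as resp:
        return resp.read().decode("utf-8")

# Usage: pick the first English track, then retrieve its SRT file.
# name, lang = next((n, l) for n, l in list_tracks("VIDEO_ID") if l == "en")
# srt_text = fetch_srt("VIDEO_ID", name, lang)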
(3) The timed text is passed to the NERD client API, which sends it to the NERD server. The named entity extraction is then performed on the entire content of the SRT file, so the full context is preserved. (4) NERD returns a list of named entities with their type, a URI that disambiguates them, and a temporal window reference (startNPT and endNPT) corresponding to the SRT block where the entity appears. NERD exploits a combined strategy where 10 different extractors are used together. The named entity types are aligned, yielding a classification into 8 main types plus the general Thing concept. (5) On receiving the NERD response, Synote constructs media fragment URIs and uses the Jena RDF API to serialize the fragment annotations in RDF. The vocabularies NERD (http://nerd.eurecom.fr/ontology), Ontology for Media Resources (http://www.w3.org/ns/ma-ont), Open Annotation (http://www.openannotation.org/spec/core) and the String Ontology in NIF (http://nlp2rdf.lod2.eu/schema/string) are used. Finally, the user interface shows the links between named entities and media fragments, together with the YouTube video and interactive subtitles. The named entities and related metadata extracted from the subtitles are retrieved through SPARQL queries (6.a, 6.b). If a named entity has been disambiguated with a DBpedia URI (6.c), a SPARQL query is sent to get further data about the entity (e.g. label, abstract, depiction), which is displayed alongside the named entities.
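For illustration, the fragment construction and serialization of step (5) can be sketched as below. The paper's implementation uses the Jena RDF API in Java; this sketch substitutes Python's rdflib as an analogue, and the annotation properties are a simplified assumption rather than the exact Synote schema.

from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

OA = Namespace("http://www.openannotation.org/ns/")

def media_fragment_uri(video_url, start_npt, end_npt):
    # A temporal Media Fragment URI over the SRT block's window, in NPT seconds.
    return URIRef(f"{video_url}#t={start_npt},{end_npt}")

def annotate(graph, video_url, idx, entity):
    # Anchor one NERD entity on the media fragment where it was spotted.
    target = media_fragment_uri(video_url, entity["startNPT"], entity["endNPT"])
    annotation = URIRef(f"{video_url}/annotation/{idx}")
    graph.add((annotation, RDF.type, OA.Annotation))
    graph.add((annotation, OA.hasTarget, target))
    # The body is the LOD resource that disambiguates the entity.
    graph.add((annotation, OA.hasBody, URIRef(entity["uri"])))

g = Graph()
annotate(g, "http://www.youtube.com/watch?v=VIDEO_ID", 1,
         {"uri": "http://dbpedia.org/resource/Tim_Berners-Lee",
          "startNPT": 12, "endNPT": 17})
print(g.serialize(format="turtle"))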
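Step (6.c) can likewise be sketched as a query against the public DBpedia SPARQL endpoint. The exact query Synote sends is not given in the paper; the one below simply retrieves the label, abstract and depiction properties mentioned above.

from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_card(entity_uri):
    # Fetch display data for a disambiguated entity from DBpedia.
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery(f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX dbo:  <http://dbpedia.org/ontology/>
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        SELECT ?label ?abstract ?depiction WHERE {{
          <{entity_uri}> rdfs:label ?label ; dbo:abstract ?abstract .
          OPTIONAL {{ <{entity_uri}> foaf:depiction ?depiction }}
          FILTER (lang(?label) = "en" && lang(?abstract) = "en")
        }}""")
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return rows[0] if rows else None

# card = dbpedia_card("http://dbpedia.org/resource/Tim_Berners-Lee")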
3 Walk Through Demo

A live demo can be found at http://linkeddata.synote.org (as credentials, please use "iswc2012" for both the username and the password). A user first logs in to Synote. On the recording creation page, the user can start the ingestion of a YouTube video. The recording is then available in the recording list. The "NERD Subtitle" button launches the named entity extraction process. When it completes, a "Preview Named Entities" button leads to the player page, where named entities can be used to seek to particular video fragments.

Figure 1b shows a screenshot of a preview page. The right column displays the named entities found, grouped according to the 8 main NERD categories. The YouTube video is included in the left column, together with the interactive subtitles. The named entities are highlighted in different colours according to their categories. If a media fragment is used in the preview page URI, the video starts playing from the media fragment start time and stops playing when the end time is reached. When clicking on a named entity, the video jumps to the media fragment that corresponds to the subtitle block where the named entity has been extracted. If a named entity has been disambiguated with a DBpedia URI, the entity is underlined. In addition, when the entity is hovered over, a pop-up window shows additional information such as the label, abstract and depiction properties. For named entities of type Person, the birth date is displayed, while latitude and longitude information is given for Location.

4 Evaluation

We filtered the videos that have subtitles in 3 different channels: People and Blogs, Sports, and Science and Technology, and collected 60 videos in total (the top 20 for each category). The videos have durations ranging from 32 to 4,505 seconds and popularity ranging from 18 to 2,836,535 views (as of July 30th, 2012). The corpus is available at http://goo.gl/YhchP and can be visually explored in Synote at http://goo.gl/XmMqp after logging in with the iswc2012 account. Video #16 is the only one discarded, because its subtitles are written in Romanian. The evaluation consisted of two steps: i) retrieving all subtitles and ii) performing named entity recognition using NERD.

We combined all the extractors supported by NERD and aligned the classification results to 8 main types (Event is only supported by OpenCalais, in beta) plus the general type Thing, used as a fallback when NERD cannot find a more specific type. We define the following variables: the number of documents per category n_d; the total number of words n_w; the ratio of words per document r_w; the total number of entities n_e; and the number of entities per document r_e (Table 1). We observe that Science and Technology videos tend to be more about people and organizations, while Sports videos mention locations, times and amounts more often. People and Blogs videos carry less useful information, although it is interesting to see that this type of video could be used to train event detection.

                 People and Blogs    Sports    Science and Technology
  n_d                  19              20              20
  n_w                 7,187          21,944          39,661
  r_w                378.26         1,097.20        1,983.05
  n_e                  610             897            1,303
  r_e                 32.11           44.85           65.15
  Thing                6.68           15.35           14.75
  Person               4.42            9.75           14.55
  Function             0.74            7.35            1.15
  Organization         3.63            9.20           12.25
  Location             3.89            8.05            6.40
  Product              3.26            2.60            6.40
  Time                 3.95           13.80            3.35
  Amount               5.47            9.30            6.30
  Event                0.05            0.00            0.00

Table 1. The upper part shows the corpus statistics and the average number of named entities extracted; the lower part shows the average number of entities per document for the 8 NERD top categories (plus Thing), grouped by video channel.

5 Conclusion

This demo paper presents a system that creates media fragments from YouTube videos and annotates their subtitles using NERD. The process includes named entity extraction from timed-text documents. These entities annotate and enrich media fragments with pointers to the LOD cloud. We provide a lightweight evaluation of the system in order to show that we are effectively able to retrieve the subtitles of YouTube videos and to run named entity extraction on them. Although a more thorough analysis will be needed, we already show that videos exhibit a very different behavior in terms of named entities depending on their genre.

Acknowledgments

The research leading to this paper was partially supported by the French National Agency under contract ANR.11.EITS.006.01, "Open Innovation Platform for Semantic Media" (OpenSEM), and by the European Union's 7th Framework Programme via the LinkedTV project (GA 287911).

References

1. Haslhofer, B., Jochum, W., King, R., Sadilek, C., Schellner, K.: The LEMO annotation framework: weaving multimedia annotations with the web. International Journal on Digital Libraries 10(1), 15-32 (2009)
2. Li, Y., Wald, M., Omitola, T., Shadbolt, N., Wills, G.: Synote: Weaving Media Fragments and Linked Data. In: 5th International Workshop on Linked Data on the Web (LDOW'12) (2012)
3. Rizzo, G., Troncy, R.: NERD: A Framework for Unifying Named Entity Recognition and Disambiguation Extraction Tools. In: 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL'12) (2012)
4. Steiner, T.: SemWebVid - Making Video a First Class Semantic Web Citizen and a First Class Web Bourgeois. In: 9th International Semantic Web Conference (ISWC'10) (2010)
5. Waitelonis, J., Ludwig, N., Sack, H.: Use what you have: Yovisto video search engine takes a semantic turn. In: 5th International Conference on Semantic and Digital Media Technologies (SAMT'10) (2011)