<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>April</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Automated interlinking of speech radio archives</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yves Raimond</string-name>
          <email>yves.raimond@bbc.co.uk</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chris Lowis</string-name>
          <email>chris.lowis@bbc.co.uk</email>
        </contrib>
        <aff>BBC R&amp;D, London, United Kingdom</aff>
      </contrib-group>
      <kwd-group>
        <kwd>Linked Data</kwd>
        <kwd>Concept Tagging</kwd>
        <kwd>Speech Processing</kwd>
      </kwd-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <volume>16</volume>
      <issue>2012</issue>
      <abstract>
        <p>The BBC is currently tagging programmes manually, using DBpedia as a source of tag identifiers, and a list of suggested tags extracted from their synopses. These tags are then used to help navigation and topic-based search of BBC programmes. However, given the very large number of programmes available in the archive, most of them having very little metadata attached to them, we need a way of automatically assigning tags to programmes. We describe a framework to do so, using speech recognition, text processing and concept tagging techniques. We evaluate this framework against manually applied tags, and compare it with related work. We find that this framework is good enough to bootstrap the interlinking process of archived content.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>The BBC (British Broadcasting Corporation) has
broadcast radio programmes since 1922. Over the years, it has
accumulated a very large archive of programmes. A number
of cataloguing efforts have been made to improve the ease
with which people can find content in this archive. This
cataloguing effort has been geared towards reuse, in other
words to enable programme makers to easily find snippets
of content to include in their own, newly commissioned,
programmes. The coverage of the catalogue is not uniform
across the BBC's archive; for example, it excludes the BBC
World Service, which has been broadcasting since 1932.
Creating this metadata is a time- and resource-intensive process:
a detailed analysis of a 30-minute programme can take a
professional archivist 8 to 9 hours. Moreover, as this data is
geared towards professional reuse, it is often not
appropriate for driving user-facing systems: it is either too shallow
(not all programmes are being classified) or too deep
(information about individual shots or rushes).</p>
      <p>
        Since 2009 the places, people, subjects or organisations
mentioned in new programmes have been "tagged" with
DBpedia [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] web identifiers. These tags allow the BBC's
audience to easily find programmes relating to particular
topics, by presenting them through a navigable web interface
at http://bbc.co.uk/programmes. The tool used by
editors to tag programmes suggests tags based on the textual
content, for example a synopsis or title, associated with a
programme. Tags are then manually associated with the
programme. The entire tagging process is described in more
detail in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. A benefit of using Linked Data web identifiers
as tags is that they are unambiguous, and that we can
retrieve more information about those tags when needed. For
example, programmes tagged with places can be plotted on a
map, or aggregation pages can be enriched with
information about the corresponding topic. By having these anchor
points in the Linked Data web, we can accommodate a wide
range of unforeseen use-cases.
      </p>
      <p>This process of manual tagging is naturally very
time-consuming, and with the emphasis on delivering new
content, would take considerable time to apply to the entire
archive. This problem is compounded by the lack of
availability of textual metadata for a significant percentage of
the archive, which prevents the bootstrapping of the tagging
process.</p>
      <p>On a more positive note, the full audio content is, in the
case of the World Service radio archive, available in
digital form. The archive currently holds around 70,000
programmes, which amounts to about two and a half years of
continuous audio. In this paper, we describe a framework
to automatically interlink such an archive with the Linked
Data Web, by automatically tagging individual programmes
with Linked Data web identifiers.</p>
      <p>We start by describing related work. We then describe a
novel approach which uses an open-source speech
recognition engine, and how we process the transcripts it generates
to extract relevant tags that can be used to annotate the
corresponding radio programme. We evaluate the results
by comparing the tags generated by this method with those
manually applied by editors to BBC programmes. We
compare the results of our method with those obtained by other,
existing methods.</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>This paper is concerned with two topics: the classification
of the BBC archive and, more generally, the problem of
automatically applying semantic labels to a piece of recorded
audio.</p>
      <p>
        There have been a number of attempts to
automatically classify the BBC archive. The THISL system [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
applies an automated speech recognition system (ABBOT)
to BBC news broadcasts and uses a bag-of-words model on
the resulting transcripts for programme retrieval. The Rich
News system [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] also uses ABBOT for speech recognition. It
then segments the transcripts using bag-of-words similarity
between consecutive segments using Choi's C99 algorithm
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For each segment, a set of keyphrases is extracted and
used, along with the broadcast date of the programme, to
find content within the BBC News website. Information
associated with retrieved news articles is then used to annotate
the topical segment. Recent work at the BBC classifies the
mood of archived programmes using their theme tunes [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
and ultimately intends to help users browse the archive by
mood.
      </p>
      <p>
        Several researchers have tried to automatically reproduce
the labelling task of a piece of speech audio. The first work
in that area [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] details a supervised automated classification
method which can assign a particular piece of audio to
one of six topic classes. Paaß et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] describe a
classifier that can assign speech audio to genre topics drawn
from the Media Topic taxonomy of the International Press
Telecommunications Council<sup>1</sup>. Makhoul et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] describe
a well-integrated system of technologies for indexing and
information retrieval on automatically transcribed audio.
Their topic assignment algorithm is a probabilistic Hidden
Markov Model whose parameters are trained on a corpus
of existing documents with human-assigned topic labels.
Olsson and Oard describe techniques for assigning topic
labels to automated transcripts [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Here the topic labels
are drawn from the CLEF CL-SR<sup>2</sup> English oral history
thesaurus. Their techniques leverage temporal aspects of the
target audio, such as the fact that it typically has a
chronological narrative. This means that labels can be assigned with
a greater probability based on their co-occurrence within the
transcript.
      </p>
      <p>In comparison to the technique presented in this paper,
the Olsson and Oard and Makhoul methods are supervised.
They require the models to be trained on an existing set
of transcripts and their corresponding topics as assigned by
a human indexer. The technique presented here attempts
topic classification in an unsupervised manner using the
automated transcript alone, as we will see later.</p>
      <p>
        There is a significant corpus of work on discovering the
main topics of textual documents. A number of possible
approaches have been investigated, including:
● Probabilistic topic models, e.g. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], where documents
are modelled as being drawn from a finite mixture of
underlying topics;
● Term assignment, e.g. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], where the Medical Subject
Headings (MeSH) vocabulary is used as a controlled
vocabulary, and a classifier is trained to associate
documents with terms in that vocabulary;
● Keyphrase extraction, e.g. [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], where a classifier is
trained to assign probabilities to possible keyphrases;
● Automated tagging, e.g. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], where similar and
already tagged documents are found, and used as a basis
for suggesting tags.
      </p>
      <p><sup>1</sup> See http://www.iptc.org, last accessed November 2011.</p>
      <p><sup>2</sup> Cross-Language Evaluation Forum, Cross-Language Speech Retrieval track.</p>
      <p>
        The work that is most related to ours is [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], where
Wikipedia web identifiers are used as tags and automatically
assigned to textual documents. A `keyphraseness' measure
is first used to identify words that are likely to be specific
to the topics expressed in the document. Each candidate is
then associated with a Wikipedia article capturing its
meaning. We use a similar workflow, but introduce a new
automated tagging algorithm based on structured data available
as Linked Data, and suitable for automatically generated
transcripts, which can be very noisy.
      </p>
    </sec>
    <sec id="sec-3">
      <title>BACKGROUND</title>
      <p>
        In order to find appropriate tags to apply to programmes
within the archive, we build on top of the Enhanced
Topic-based Vector Space Model proposed in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and further
described and evaluated in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. We describe this model in this
section.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Vector Space Model</title>
      <sec id="sec-4-1">
        <title>First, we define a few concepts:</title>
        <p>● term: a symbol, e.g. `cat' or `house';
● document: an ordered set of terms;
● corpus: a set of documents.</p>
        <p>We then consider a vector d⃗ per document, where each
dimension corresponds to a term t with a weighting <inline-formula><tex-math>w_{d,t}</tex-math></inline-formula>.
TF-IDF has proved a very popular way of deriving those weights,
and includes both local (the TF is relevant to the document)
and global (the IDF is relevant to the corpus) factors.
Document similarity can then be captured by the cosine similarity
between the two document vectors:</p>
        <p><disp-formula><tex-math>\cos(\vec{d}_i, \vec{d}_j) = \frac{\vec{d}_i \cdot \vec{d}_j}{\|\vec{d}_i\| \, \|\vec{d}_j\|}</tex-math></disp-formula></p>
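        <p>As an illustration of the vector space model above, the following is a minimal sketch of TF-IDF weighting and cosine similarity. The function names, the toy corpus and the exact TF-IDF variant (raw term frequency times the logarithm of inverse document frequency) are assumptions made for the example, not the weighting used in our system.</p>
        <preformat>
```python
import math
from collections import Counter


def tfidf_vectors(corpus):
    """Return one {term: weight} dict per document (documents are lists of terms)."""
    n_docs = len(corpus)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in corpus for term in set(doc))
    vectors = []
    for doc in corpus:
        tf = Counter(doc)
        # Local factor (TF) times global factor (IDF).
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: three short documents.
corpus = [
    ["cat", "house", "cat"],
    ["cat", "garden"],
    ["radio", "archive"],
]
vecs = tfidf_vectors(corpus)
print(cosine(vecs[0], vecs[1]))  # share the term 'cat': positive similarity
print(cosine(vecs[0], vecs[2]))  # no shared term: similarity is zero
```
        </preformat>
        <p>Documents that share no term always get a similarity of zero in this model, which is the limitation the topic-based models below address.</p>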
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Topic-based Vector Space Model</title>
      <p>A Topic-based Vector Space Model (TVSM) considers
documents as vectors in a vector space, in which all dimensions
are so-called fundamental topics, which are defined as being
mutually independent. We consider a vector t⃗ in that space for
each term. The normalised and weighted sum of all the term
vectors in a document gives us a document vector d⃗:</p>
      <p><disp-formula><tex-math>\vec{d} = \frac{1}{\left\| \sum_{t \in d} w_{d,t} \vec{t} \right\|} \sum_{t \in d} w_{d,t} \vec{t}</tex-math></disp-formula></p>
      <p>As above, we consider the similarity between two
documents as being the cosine similarity between the two
document vectors. We can compute this similarity by knowing
the length of the term vectors t⃗ and their angles between one
another. TVSM does not specify an approach for obtaining
those lengths and angles.</p>
    </sec>
    <sec id="sec-8">
      <title>Enhanced Topic-based Vector Space Model</title>
      <p>An Enhanced Topic-based Vector Space Model (eTVSM)
embeds ontological information in TVSM, by obtaining
document similarities not by using similarities between terms,
but by using mappings from those terms to an ontological
space, that is, a vector space capturing the structure of an
ontology.</p>
      <p>
        This is particularly relevant for us, as we want to tag
programmes with web identifiers, which can themselves link to
various web ontologies. For example, DBpedia web
identifiers link to the DBpedia ontology, to a SKOS
categorisation system [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] derived from the Wikipedia categories, and
to the YAGO ontology [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. In the following, we formalise
eTVSM in such a context.
● interpretation: a particular term can have
multiple interpretations. For example, the term bar
has at least two interpretations: d:Bar_(music) and
d:Bar_(unit)<sup>3</sup>;
● category: a particular interpretation has a number
of categories associated with it, which can be
considered as anchor points within the ontological space. For
example, d:Bar_(music) is associated with the
categories c:Musical_notation and c:Rhythm.
      </p>
      <p>We consider the following definitions:
● T is the set of all terms, with t being a specific term,
e.g. bar;
● I is the set of all interpretations, with i being a specific
interpretation, e.g. d:Bar_(music);
● C is the set of all categories, with c being a specific
category, e.g. c:Rhythm;
● I(t) ∈ <inline-formula><tex-math>\mathcal{P}(I)</tex-math></inline-formula> is the term to interpretations assignment,
where <inline-formula><tex-math>\mathcal{P}(I)</tex-math></inline-formula> is the powerset of all interpretations, e.g.
I(bar) = {d:Bar_(music), d:Bar_(unit)};
● g(i) is the interpretation weight;
● C(i) ∈ <inline-formula><tex-math>\mathcal{P}(C)</tex-math></inline-formula> is the interpretation to categories
assignment, where <inline-formula><tex-math>\mathcal{P}(C)</tex-math></inline-formula> is the powerset of all categories, e.g.
C(d:Bar_(music)) = {c:Musical_notation, c:Rhythm}.</p>
      <p><sup>3</sup> We use the namespaces defined in Section 9.</p>
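      <p>We assume that we have a vector space in which we can
assign to each category c a vector c⃗. We then define an
interpretation vector i⃗ as the normalised sum of its category
vectors, scaled by the interpretation weight:</p>
      <p><disp-formula><tex-math>\vec{i} = \frac{g(i)}{\left\| \sum_{c \in C(i)} \vec{c} \right\|} \sum_{c \in C(i)} \vec{c}</tex-math></disp-formula></p>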
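      <p>The interpretation vector can be sketched as follows, assuming category vectors are stored as sparse dictionaries. The two toy category vectors and the weight of 0.5 are hypothetical values chosen for the example; the actual category vectors are constructed in a later section.</p>
      <preformat>
```python
import math


def norm(u):
    """Euclidean length of a sparse vector stored as a dict."""
    return math.sqrt(sum(w * w for w in u.values()))


def interpretation_vector(category_vectors, weight):
    """Sum the category vectors, normalise the sum, and scale it by g(i)."""
    total = {}
    for c in category_vectors:
        for dim, w in c.items():
            total[dim] = total.get(dim, 0.0) + w
    length = norm(total)
    if length == 0.0:
        return {}
    return {dim: weight * w / length for dim, w in total.items()}


# Hypothetical category vectors for C(d:Bar_(music)).
musical_notation = {"c:Musical_notation": 1.0}
rhythm = {"c:Rhythm": 1.0}

# Assumed weight g(i) = 0.5, e.g. one of two interpretations of 'bar'.
bar_music = interpretation_vector([musical_notation, rhythm], 0.5)
print(bar_music)
```
      </preformat>
      <p>By construction the length of the resulting vector equals the interpretation weight, so ambiguous interpretations contribute less to any sum they take part in.</p>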
      <p>We consider the similarity between two interpretations as
being the cosine similarity between the two associated
vectors. We consider the similarity between two documents as
being the cosine similarity between the weighted sums of
interpretation vectors in each document. We define how we
construct those weights and the vector space for our
interpretation vectors in the following section.</p>
    </sec>
    <sec id="sec-10">
      <title>AUTOMATED TAGGING OF SPEECH AUDIO</title>
      <p>In the following, we propose a method to use the audio in
order to automatically assign tags to programmes within the
archive, with those tags being drawn from the Linked Data
cloud. We start by transcribing the audio and identifying terms
within the transcripts that could correspond to potential
tags. We then build an eTVSM-based model enabling us to
disambiguate those terms and rank the corresponding tags.
A depiction of the workflow of our automated tagging system
is available in Figure 1.</p>
    </sec>
    <sec id="sec-11">
      <title>Automatic transcription</title>
      <p>
        After investigating the various open-source options for
multiple speaker automated speech recognition [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] in the
context of broadcast news, we settled on the open source
CMU Sphinx-3 software, with the HUB4 acoustic model [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]
and a language model extracted from the GigaWord corpus<sup>4</sup>.
The full set of parameters used by our system is available
in Section 8, and was chosen for both speech recognition
accuracy (minimal word error rate, or WER) and
processing speed (how much faster than real time the transcribing
process is).
      </p>
      <p>The results of this speech recognition step are very noisy.
The WER in those transcripts varies a lot from programme
to programme, depending on the year the programme was
recorded in, the accents of the different speakers in the
programme, and the background noise in the programme. The
average WER for two programmes, from 1981 and 2011
respectively, is 47%. However, the WER can go up
to 90% on radio dramas that have lots of background noise
and different speakers. A full study of the WER obtained
on the World Service archive remains to be done.</p>
      <p>
        The WER reported by the THISL system and their
ABBOT speech recognition component [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is 36.6%. The
difference in WER is due to two factors. Firstly, the dataset
on which the THISL system works does not span several
decades, hence there is less disparity in terms of accents and
topics. The THISL dataset also holds only news
programmes, which makes it less heterogeneous in terms of
programme genres than the World Service archive. Secondly, a
specific acoustic model and language model were trained for
this particular dataset within THISL, i.e. news output from
1998 and 1999. We use an off-the-shelf recogniser (CMU
Sphinx-3), acoustic model (HUB4) and language model
(GigaWord).
      </p>
      <p><sup>4</sup> See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05, last accessed November 2011.</p>
      <p>In the following, we try to mitigate the noisiness of those
transcriptions in order to derive an accurate list of tags to
be applied to the programme.</p>
      <p>
        We start by generating a list of web identifiers used by
BBC editors to tag programmes. Those web identifiers
identify people, places, subjects and organisations within
DBpedia. We dereference each of those identifiers and
get its label from its rdfs:label property. We strip
out any disambiguation string from the label, and apply the
Porter Stemmer algorithm [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] to it in order to get to a
corresponding term. This defines our set of terms T.
      </p>
      <p>We consider the set of all DBpedia web identifiers as our
set of interpretations I.</p>
      <p>We store, for each stemmed label, the set of web
identifiers it could correspond to. This gives us our term to
interpretations assignment I(t).</p>
      <p>We define the interpretation weight as follows:</p>
      <p><disp-formula><tex-math>g(i) = \frac{1}{\left| \{ j : j \in I(t),\ t \in T(i) \} \right|}</tex-math></disp-formula></p>
      <p>where T(i) is the set of terms corresponding to an
interpretation i. The weight of an interpretation is thus inversely
proportional to the number of possible interpretations of the
corresponding terms.</p>
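      <p>The interpretation weight can be sketched as below, assuming the term to interpretations assignment I(t) is held as a dictionary from stemmed terms to sets of DBpedia identifiers. The dictionary contents and the function name are hypothetical and only serve the example.</p>
      <preformat>
```python
def interpretation_weight(i, term_to_interps):
    """g(i): one over the number of interpretations competing with i."""
    # T(i): the terms that can be interpreted as i.
    terms = [t for t, interps in term_to_interps.items() if i in interps]
    # All interpretations j reachable from any of those terms.
    candidates = set()
    for t in terms:
        candidates |= term_to_interps[t]
    return 1.0 / len(candidates) if candidates else 0.0


# Hypothetical term to interpretations assignment I(t).
term_to_interps = {
    "bar": {"d:Bar_(music)", "d:Bar_(unit)"},
    "tehran": {"d:Tehran"},
}

print(interpretation_weight("d:Bar_(music)", term_to_interps))  # ambiguous term
print(interpretation_weight("d:Tehran", term_to_interps))       # unambiguous term
```
      </preformat>
      <p>An unambiguous term yields a weight of 1, while each of the two readings of `bar' gets a weight of one half.</p>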
      <p>Categories</p>
      <p>For a given interpretation i, we construct C(i) using the
following SPARQL query:</p>
      <preformat>SELECT ?category WHERE {
  {
    &lt;i&gt; &lt;http://purl.org/dc/terms/subject&gt; ?category
  } UNION {
    &lt;i&gt; ?p ?o .
    ?o &lt;http://purl.org/dc/terms/subject&gt; ?category
  }
}</preformat>
      <p>We include the categories of neighbouring resources to
increase possible overlap with other resources mentioned in
the same programme. Our evaluation in Section 5 shows
that such an expansion gives the best results.</p>
      <p>Vector space model for SKOS categories</p>
      <p>We now consider the subject classification in DBpedia
derived from Wikipedia categories and encoded as a SKOS
model. We create a vector space in which all items in that
categorisation system will have a representation, which
defines our eTVSM.</p>
      <p>There are many options for constructing such a vector
space. We focus on the one that gave the best results, and
provide some evaluation results for a few alternatives in
Section 5.</p>
      <p>We consider the hierarchy induced by the skos:broader
property in the DBpedia SKOS model. The set of all items
in that hierarchy is our set of categories C. We consider a
vector space where each dimension (c<sub>1</sub>, ..., c<sub>n</sub>) corresponds
to one of the n elements of C.</p>
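      <p>For each category c ∈ C, we consider the set of its ancestors
P(c, k) ∈ <inline-formula><tex-math>\mathcal{P}(C)</tex-math></inline-formula> at a level k. We then construct a vector c⃗ as
follows, where α is a decay constant and β is the maximum
ancestor level considered:</p>
      <p><disp-formula><tex-math>\vec{t} = \left( \sum_{k=0}^{\beta} \sum_{c_1 \in P(c,k)} \alpha^{k}, \ \ldots, \ \sum_{k=0}^{\beta} \sum_{c_n \in P(c,k)} \alpha^{k} \right), \qquad \vec{c} = \frac{\vec{t}}{\|\vec{t}\|}</tex-math></disp-formula></p>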
      <p>Each category vector will be non-zero on the dimensions
corresponding to its ancestors. Two categories that do not
share any ancestor will have a cosine similarity of zero. The
further away a common ancestor between two categories is,
the lower the cosine similarity between those two categories
will be. The constant α is an exponential decay factor, which can
be used to influence how much importance we attach to
ancestors that are high in the category hierarchy. The
constant β can be used to limit the level of ancestors we want
to consider. Very generic categories won't be very useful for
describing possible interpretations and discriminating
between them.</p>
      <p>For example, if we consider the SKOS hierarchy depicted
in Figure 2, a value of α of 0.5 and a value of β set to more
than 2, we get the vectors in Table 1. We give a few of their
pairwise cosine similarities in Table 2.</p>
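      <p>The construction of category vectors can be sketched as follows. This is a simplified illustration and not the rdfsim implementation. The hierarchy below is hypothetical: it was chosen to be consistent with the C1 components listed in Table 1, but Figure 2 itself may differ. The decay and maximum level parameters are named alpha and beta for the purpose of the example.</p>
      <preformat>
```python
import math


def category_vector(category, broader, alpha, beta):
    """Each path to an ancestor at level k adds alpha**k on that ancestor's dimension."""
    raw = {}
    level = [category]  # level 0 is the category itself
    for k in range(beta + 1):
        for ancestor in level:
            raw[ancestor] = raw.get(ancestor, 0.0) + alpha ** k
        # Move one level up; an ancestor reached through two paths appears twice.
        level = [parent for node in level for parent in broader.get(node, [])]
        if not level:
            break
    length = math.sqrt(sum(w * w for w in raw.values()))
    return {c: w / length for c, w in raw.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(dim, 0.0) for dim, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)


# Hypothetical skos:broader links (child -> parents), not taken from Figure 2.
broader = {
    "C2": ["C1"], "C3": ["C1"],
    "C4": ["C2"], "C5": ["C2", "C3"], "C6": ["C3"],
}

c5 = category_vector("C5", broader, alpha=0.5, beta=3)
print(c5["C1"])  # equals 0.5 / sqrt(1.75) for this hierarchy
c2 = category_vector("C2", broader, alpha=0.5, beta=3)
c3 = category_vector("C3", broader, alpha=0.5, beta=3)
print(cosine(c2, c3))  # siblings only share their parent C1
```
      </preformat>
      <p>Bounding the walk by beta also guarantees termination if the category graph contains cycles.</p>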
      <p>We now have a vector space in which we can assign each
category c to a vector c⃗. An open-source implementation
of such a vector space, applicable to any hierarchy encoded
as RDF, is available online<sup>5</sup>.</p>
    </sec>
    <sec id="sec-12">
      <title>Using the eTVSM for automated tagging</title>
      <p>Now that our eTVSM model is defined, we use it to
identify potentially relevant terms, disambiguate them and
rank them, in order to identify the most relevant tags to
apply to each programme.</p>
      <p>We start by looking for terms belonging to T in the
automated transcripts, after applying the same Porter Stemmer
algorithm to them. The output of this process is a list of
candidate terms with time-stamps and a list of possible
interpretations for those terms, captured as a list of DBpedia
web identifiers.</p>
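      <p>For each programme p in our corpus P, we derive a `main
topic' vector t⃗<sub>p</sub> as the weighted sum of the vectors of all
possible interpretations of all terms found in p:</p>
      <p><disp-formula><tex-math>\vec{t}_p = \sum_{t \in p} \sum_{i \in I(t)} w_{p,i} \, \vec{i}</tex-math></disp-formula></p>
      <p><sup>5</sup> See https://github.com/bbcrd/rdfsim.</p>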
      <p>Table 1: Category vectors C⃗1 to C⃗6 for the SKOS hierarchy of Figure 2, over the dimensions C1 to C6. Components on dimension C1: C⃗1 = 1; C⃗2 = 0.5/√1.25; C⃗3 = 0.5/√1.25; C⃗4 = 0.25/√1.3125; C⃗5 = 0.5/√1.75; C⃗6 = 0.25/√1.3125 (components on the other dimensions omitted).</p>
      <p>Here, <inline-formula><tex-math>w_{p,i}</tex-math></inline-formula> is the weight assigned to the interpretation i in the
programme p. We set it to the term frequency of the terms
associated with i in the automated transcript of that
programme.</p>
      <p>Wrong interpretations of specific terms will account for
very little in the resulting vector, while web identifiers
related to the main topics of the programme will overlap
and add up.</p>
      <p>We use this vector for disambiguation. For a given term
t, we choose the interpretation i ∈ I(t) which maximises the
cosine similarity between ⃗i and t⃗p.</p>
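      <p>Then, we use the following <inline-formula><tex-math>r_{p,i}</tex-math></inline-formula> value to rank the different
interpretations i according to how relevant they are to
describe a particular programme p:</p>
      <p><disp-formula><tex-math>r_{p,i} = w_{p,i} \cdot \log\left( \frac{|P|}{|\{ p : t \in p \}|} \right) \cdot \frac{\vec{t}_p \cdot \vec{i}}{\|\vec{t}_p\| \, \|\vec{i}\|}</tex-math></disp-formula></p>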
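      <p>The disambiguation and ranking steps can be sketched as follows. The sketch assumes that the main topic vector is the weighted sum of all candidate interpretation vectors, and uses hypothetical toy vectors, term frequencies and document frequencies; all function and variable names are illustrative only.</p>
      <preformat>
```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(dim, 0.0) for dim, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def main_topic_vector(candidates, interp_vectors, weights):
    """Weighted sum of the vectors of every interpretation of every term."""
    topic = {}
    for interps in candidates.values():
        for i in interps:
            for dim, w in interp_vectors[i].items():
                topic[dim] = topic.get(dim, 0.0) + weights[i] * w
    return topic


def rank_tags(candidates, interp_vectors, weights, doc_freq, n_programmes):
    """Pick one interpretation per term, then rank by TF * IDF * cosine."""
    topic = main_topic_vector(candidates, interp_vectors, weights)
    ranked = []
    for term, interps in candidates.items():
        # Disambiguation: keep the interpretation closest to the main topic.
        best = max(interps, key=lambda i: cosine(interp_vectors[i], topic))
        idf = math.log(n_programmes / doc_freq[term])
        score = weights[best] * idf * cosine(interp_vectors[best], topic)
        ranked.append((best, score))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical programme: 'bar' occurs 3 times, 'rhythm' twice.
candidates = {"bar": ["d:Bar_(music)", "d:Bar_(unit)"], "rhythm": ["d:Rhythm"]}
interp_vectors = {
    "d:Bar_(music)": {"music": 1.0},
    "d:Bar_(unit)": {"physics": 1.0},
    "d:Rhythm": {"music": 1.0},
}
weights = {"d:Bar_(music)": 3, "d:Bar_(unit)": 3, "d:Rhythm": 2}
doc_freq = {"bar": 10, "rhythm": 5}  # programmes containing each term

ranked = rank_tags(candidates, interp_vectors, weights, doc_freq, n_programmes=100)
print(ranked)
```
      </preformat>
      <p>In this toy programme the musical reading of `bar' is selected because it overlaps with d:Rhythm in the main topic vector, while the unit reading stands alone.</p>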
      <p>This corresponds to the TF-IDF score, weighted by the
cosine similarity of the chosen interpretation to the main
topic vector.</p>
      <p>We end up with a ranked list of DBpedia web identifiers
for each programme. Some examples of the top three tags
and their associated scores are given in Table 3, for different
programmes.</p>
    </sec>
    <sec id="sec-13">
      <title>EVALUATION</title>
      <p>In this section, we evaluate the above algorithm for
automated tagging of speech audio.</p>
      <sec id="sec-13-1">
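        <table-wrap>
          <label>Table 3</label>
          <caption><p>Top three automatically extracted tags for three programmes (scores not available in this copy)</p></caption>
          <table>
            <thead>
              <tr><th>Programme</th><th>Tag</th></tr>
            </thead>
            <tbody>
              <tr><td>Programme 1</td><td>d:Benjamin_Britten</td></tr>
              <tr><td>Programme 1</td><td>d:Music</td></tr>
              <tr><td>Programme 1</td><td>d:Gustav_Holst</td></tr>
              <tr><td>Programme 2</td><td>d:Revolution</td></tr>
              <tr><td>Programme 2</td><td>d:Tehran</td></tr>
              <tr><td>Programme 2</td><td>d:Ayatollah</td></tr>
              <tr><td>Programme 3</td><td>d:Hepatitis</td></tr>
              <tr><td>Programme 3</td><td>d:Vaccine</td></tr>
              <tr><td>Programme 3</td><td>d:Medical_research</td></tr>
            </tbody>
          </table>
        </table-wrap>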
        <p>
          We want to compare our automatically extracted tags
with tags applied by professional editors. Such tags
are made available through the bbc.co.uk/programmes
API [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. We apply the following automated interlinking
heuristic to find equivalences between programmes within
bbc.co.uk/programmes and the World Service radio archive.
If two programmes share the same brand name (e.g. `From
Our Own Correspondent') and the same broadcast date
(e.g. 2011-05-11), we assume they are identical.
        </p>
        <p>As a mapping heuristic for programme data, this works
more accurately than matching on episode titles, as they
differ a lot from one database to the other. Brand
names will usually be the same across databases. We
restrict the mapping to programmes that have tags within
bbc.co.uk/programmes.</p>
        <p>This results in a set of 132 equivalences between
programmes in the World Service radio archive and editorially
tagged programmes within bbc.co.uk/programmes.</p>
        <p>In that dataset, the average number of editorial tags per
programme is 4.92, and 477 distinct tags are used. The
editorially applied tags are generally of good quality, covering
all topics a programme covers. A distribution of the
editorially applied tags is available in Figure 3. This distribution
exhibits a very long tail, as 377 tags are used only once.
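</p>
        <p>We score the automatically extracted tags of a programme against its N editorial tags with the TopN measure, where k<sub>i</sub> is the position of the i-th editorial tag in the list of automatically extracted tags (a tag that does not appear contributes 0) and c is a constant between 0 and 1:</p>
        <p><disp-formula><tex-math>\mathrm{TopN} = \frac{\sum_{i=1}^{N} c^{k_i}}{\sum_{i=1}^{N} c^{i}}</tex-math></disp-formula></p>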
        <p>A score of 1 will be achieved if the tags applied by editors
are the top tags extracted by our algorithm. A score of 0 will
be achieved if none of the tags applied by editors appear in
the tags extracted by our algorithm. We choose a value of 0.8
for our constant c, which means that a tag will contribute
around 0.1 to the overall score before normalisation if it
appears at the tenth position.</p>
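        <p>A possible reading of this metric is sketched below. The exact formula is an assumption inferred from the properties stated above: each editorial tag found at position k in the automatic list contributes c to the power k, and the sum is normalised by the best achievable sum for that number of editorial tags. The function name and the toy tag lists are hypothetical.</p>
        <preformat>
```python
def top_n(editorial_tags, automatic_tags, c=0.8):
    """Assumed TopN score: decayed credit per matched tag, normalised to [0, 1]."""
    n = len(editorial_tags)
    if n == 0:
        return 0.0
    # 1-based position of each automatically extracted tag (assumed unique).
    position = {tag: rank for rank, tag in enumerate(automatic_tags, start=1)}
    gained = sum(c ** position[tag] for tag in editorial_tags if tag in position)
    # Best case: the n editorial tags occupy positions 1 to n.
    best = sum(c ** i for i in range(1, n + 1))
    return gained / best


# Hypothetical tag lists.
print(top_n(["a", "b"], ["a", "b", "c"]))  # editorial tags are the top tags
print(top_n(["a"], ["x", "y"]))            # no editorial tag was extracted
print(0.8 ** 10)                           # contribution of a tag at position ten
```
        </preformat>
        <p>Under this reading the score does not depend on the order in which editors listed their tags, only on where those tags appear in the automatic ranking.</p>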
        <p>We choose this evaluation metric as it captures best the
intent of our algorithm. We want editors to skim through the
list of automatically extracted tags, add and/or delete from
them, and approve them. Therefore, we want the tags most
likely to be approved at the top of the list of automatically
extracted tags. Precision and recall would not appropriately
capture that intended use.</p>
      </sec>
    </sec>
    <sec id="sec-14">
      <title>Evaluation results and discussion</title>
      <p>On our evaluation dataset, we get the average TopN scores
in Figure 4. We got our best result (TopN = 0.209) for α = 0.9
and β = 10. We show in Table 4 an example of a good and
a bad result, with their associated TopN scores.</p>
      <p>In Table 5, we also give results for a few variations of our
algorithm, for the values of α and β that maximise the score
of our tagger when they apply:
● No SKOS expansion: when not expanding the
categories associated with a DBpedia web identifier by
following forward links, the results obtained were slightly
lower;
● Double SKOS expansion: the best results we had
were obtained by expanding the SKOS categories
associated with a DBpedia web identifier using both
forward and backward links. However, the average
number of categories per DBpedia web identifier made the
algorithm run very slowly. We decided to compromise
on the quality of the results to get our algorithm
working in a reasonable time;
● Principal Component Analysis (PCA): we construct
a vector space where each dimension corresponds to a
category in the DBpedia SKOS hierarchy, and where
each DBpedia web identifier has a corresponding
vector, capturing the adjacency of that web identifier to
SKOS categories. We use PCA to reduce the
dimensionality of that space, and derive similarities between
interpretations from cosine similarities in that reduced
space. This version of the algorithm scored lower than
the approach described above, but had the advantage
of being faster, as the resulting space was of much lower
dimensionality.</p>
      <sec id="sec-14-1">
        <table-wrap>
          <label>Table 4</label>
          <caption><p>Example of a good and a bad result, with their associated TopN scores</p></caption>
          <table>
            <thead>
              <tr><th>Programme</th><th>Editorial tags</th><th>Automatic tags</th></tr>
            </thead>
            <tbody>
              <tr><td>Programme 1, TopN = 0.242</td><td>d:Crime_fiction</td><td>d:DNA</td></tr>
              <tr><td></td><td>d:DNA</td><td>d:Double_helix</td></tr>
              <tr><td></td><td>d:Double_helix</td><td>d:Francis_Crick</td></tr>
              <tr><td>Programme 2, TopN = 0</td><td>d:BP</td><td>d:Methane</td></tr>
              <tr><td></td><td>d:Climate_change</td><td>d:Water</td></tr>
              <tr><td></td><td>d:Greenhouse_gas</td><td>d:Natural_gas</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>In Table 6, we apply the same evaluation to a baseline
tagger picking tags at random and two third-party services.</p>
        <p>
          The first one is DBpedia Spotlight [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. We submitted the
output of the transcription for two-minute chunks of
programmes to the DBpedia Spotlight API. We then summed
the scores of the entities returned by the API across the
length of the programme. Finally, we applied the same
inverse document frequency step as in our algorithm, in order
to normalise the DBpedia Spotlight results across the entire
corpus. It appears that DBpedia Spotlight does not work
well with the noisy text output by an automated
transcription process. In particular, the disambiguation process in
DBpedia Spotlight relies on the text surrounding a
particular term being relatively clean. The transcripts being very
noisy, that process mostly returns the wrong interpretations.
It also appears that DBpedia Spotlight relies heavily on
capitalisation, but capitalisation is not available in the
automated transcripts. It is also important to note that
DBpedia Spotlight tackles a different problem. It extracts entities
from text but does not try to describe an entire document
using a few selected entities.
        </p>
        <p>It appears that using the structure of DBpedia itself for
disambiguation gives satisfactory results: deriving a model of
a main programme topic from all possible interpretations of
all relevant terms, and picking the interpretations that are
closest to that main topic. Mis-interpretations will account
for very little in that main topic vector, as most of them will
be very dissimilar to each other.</p>
        <p>We tried a number of commercial third-party concept
tagging APIs, and the result of the one that scored the best in
our evaluation is also shown in Table 6. We applied the same
methodology as for DBpedia Spotlight, so that this
third-party service can also benefit from information about the
whole corpus. This third-party service performs almost as
well as our algorithm. However, no information is publicly
available on how that service works.</p>
      </sec>
    </sec>
    <sec id="sec-15">
      <title>CONCLUSION AND FUTURE WORK</title>
      <p>In this paper, we described an automated interlinking
system for radio programmes with web identifiers from the
Linked Data cloud. We use an Enhanced Topic-based Vector Space
Model to disambiguate and rank candidate terms identified
within automated transcripts. We evaluated this system
against tags manually applied by editors. The results,
although by no means perfect, are good enough to efficiently
bootstrap a tagging process. As the resulting tags are Linked
Data web identifiers, isolated archives can effectively be
interlinked with other datasets in the Linked Data Web.</p>
      <p>
        We describe in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] the process of applying this framework
to the entire World Service archive, and an application
using automatically extracted tags to aid discovery of archive
content.
      </p>
      <p>
        Future work includes creating an editorial interface to
enable editors and the public to edit and approve the list of
automatically derived tags. We also want to incorporate
more data (synopsis, broadcast dates, etc.) in the
automated tagging process when this data is available. We
could also enhance our results by considering more textual
representations for DBpedia identifiers than their labels,
using methodologies similar to those in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. We also want to
improve the results of the automated speech recognition step
by creating an acoustic model for British English and a
language model built from programme transcripts. We have
access to a large pronunciation database maintained within
the BBC, holding about 50 years' worth of topical
entities with their associated BBC pronunciations, which might be
useful for constructing a better pronunciation dictionary. We
also want to study the impact of the noisiness of the
transcripts on the results of our algorithm.
      </p>
      <p>The tagging process outputs tags with a time-stamp. We
are currently investigating using these time-stamped tags as
a basis for topic segmentation. Each tag has a position in
the vector space constructed above, and we can track how
the average position in that space evolves over time, giving
an idea of when the programme changes topic. Such a
segmentation could also be fed back into the topic
model described in this paper, as the topics will be more
consistent within each of these segments.</p>
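      <p>One way such a segmentation could be sketched, under assumptions that are ours rather than a description of the investigated method, is to compare the average tag position just before and just after each tag and to report a boundary where the two averages diverge. The window size, the threshold and the toy tags below are hypothetical.</p>
      <preformat>
```python
import math


def centroid(vectors):
    """Average a non-empty list of sparse vectors (dimension -> value)."""
    total = {}
    for vec in vectors:
        for dim, value in vec.items():
            total[dim] = total.get(dim, 0.0) + value
    return {dim: value / len(vectors) for dim, value in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(dim, 0.0) for dim, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_boundaries(tags, window=2, threshold=0.3):
    """tags is a list of (seconds, sparse vector) pairs sorted by time.

    For each position, compares the centroid of the `window` tags before it
    with the centroid of the `window` tags starting at it, and reports the
    time-stamp wherever their similarity falls below `threshold`.
    """
    boundaries = []
    for i in range(window, len(tags) - window + 1):
        before = centroid([vec for _, vec in tags[i - window:i]])
        after = centroid([vec for _, vec in tags[i:i + window]])
        if threshold > cosine(before, after):
            boundaries.append(tags[i][0])
    return boundaries


# Hypothetical time-stamped tags: a programme moving from politics to sport.
TAGS = [
    (10.0, {"Politics": 1.0}),
    (45.0, {"Politics": 1.0, "Elections": 1.0}),
    (80.0, {"Politics": 1.0}),
    (300.0, {"Football": 1.0}),
    (340.0, {"Football": 1.0, "Sport": 1.0}),
    (390.0, {"Sport": 1.0}),
]

BOUNDARIES = topic_boundaries(TAGS)
```
      </preformat>
      <p>With these toy tags a single boundary is reported at the first sport-related tag. On real transcripts the window and threshold would need tuning, since noisy tags can produce spurious drops in similarity.</p>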
      <p>Rather than relying on a SKOS hierarchy, it would also be
interesting to find a more broadly applicable way of
projecting Linked Data into a vector space, based on the adjacency
matrix of the Linked Data graph considered. The
PCA-based approach mentioned in the evaluation section would
be a good starting point, but would need to be made more
robust.</p>
    </sec>
    <sec id="sec-16">
      <title>ACKNOWLEDGEMENTS</title>
      <p>The research for this paper was conducted as part of the
Automatic Broadcast Content Interlinking Project
(ABC-IP). ABC-IP is a collaborative research and development
initiative between the British Broadcasting Corporation and
MetaBroadcast Ltd, supported with grant funding from the
UK Technology Strategy Board under its 'Metadata:
increasing the value of digital content (mainstream projects)'
competition from September 2010.</p>
      <p>ANNEX: SPHINX-3 PARAMETERS</p>
      <preformat>-samprate 28000
-nfft 1024
-beam 1e-60
-wbeam 1e-40
-ci_pbeam 1e-8
-subvqbeam 1e-2
-maxhmmpf 2000
-maxcdsenpf 1000
-maxwpf 8
-ds 2</preformat>
    </sec>
    <sec id="sec-17">
      <title>ANNEX: NAMESPACES</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Dave</given-names>
            <surname>Abberley</surname>
          </string-name>
          , David Kirby,
          <string-name>
            <given-names>Steve</given-names>
            <surname>Renals</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tony</given-names>
            <surname>Robinson</surname>
          </string-name>
          .
          <article-title>The THISL broadcast news retrieval system</article-title>
          .
          <source>In Proc. ESCA Workshop on Accessing Information In Spoken Audio</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Sören</given-names>
            <surname>Auer</surname>
          </string-name>
          , Christian Bizer, Jens Lehmann, Georgi Kobilarov, Richard Cyganiak, and Zachary Ives.
          <article-title>DBpedia: A nucleus for a web of open data</article-title>
          .
          <source>In Proceedings of the International Semantic Web Conference</source>
          , Busan, Korea, November 11–15,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Adam</given-names>
            <surname>Berenzweig</surname>
          </string-name>
          , Beth Logan, Daniel P. W. Ellis, and
          <string-name>
            <given-names>Brian</given-names>
            <surname>Whitman</surname>
          </string-name>
          .
          <article-title>A large-scale evaluation of acoustic and subjective music-similarity measures</article-title>
          .
          <source>Computer Music Journal</source>
          ,
          <volume>28</volume>
          (
          <issue>2</issue>
          ):
          <fpage>63</fpage>
          –
          <lpage>76</lpage>
          , Summer
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>David M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Michael I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Latent Dirichlet allocation</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <volume>3</volume>
          (
          <issue>3</issue>
          ):
          <fpage>993</fpage>
          –
          <lpage>1022</lpage>
          , March
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Freddy Y. Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          .
          <article-title>Advances in domain independent linear text segmentation</article-title>
          .
          <source>Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Sam</given-names>
            <surname>Davies</surname>
          </string-name>
          , Penelope Allen,
          <string-name>
            <given-names>Mark</given-names>
            <surname>Mann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Cox</surname>
          </string-name>
          .
          <article-title>Musical moods: A mass participation experiment for affective classification of music</article-title>
          .
          <source>In Proceedings of the 12th International Society for Music Information Retrieval Conference</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Mike</given-names>
            <surname>Dowman</surname>
          </string-name>
          , Valentin Tablan, Hamish Cunningham, and
          <string-name>
            <given-names>Borislav</given-names>
            <surname>Popov</surname>
          </string-name>
          .
          <article-title>Web-assisted annotation, semantic indexing and search of television and radio news</article-title>
          .
          <source>In WWW '05 Proceedings of the 14th international conference on World Wide Web</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Georgi</given-names>
            <surname>Kobilarov</surname>
          </string-name>
          , Tom Scott, Yves Raimond, Silver Oliver, Chris Sizemore, Michael Smethurst, Chris Bizer, and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Media meets semantic web - how the BBC uses DBpedia and linked data to make connections</article-title>
          .
          <source>In Proceedings of the European Semantic Web Conference In-Use track</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kuropka</surname>
          </string-name>
          .
          <article-title>Modelle zur Repräsentation natürlichsprachlicher Dokumente - Information-Filtering und -Retrieval mit relationalen Datenbanken</article-title>
          . Logos Verlag,
          <year>2004</year>
          . ISBN: 3-8325-0514-8.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>John</given-names>
            <surname>Makhoul</surname>
          </string-name>
          , Francis Kubala, Timothy Leek, Daben Liu, Long Nguyen, Richard Schwartz, and
          <string-name>
            <given-names>Amit</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>Speech and language technologies for audio indexing and retrieval</article-title>
          .
          <source>Proceedings of the IEEE</source>
          ,
          <volume>88</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1338</fpage>
          –
          <lpage>1353</lpage>
          , August
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Kornél</given-names>
            <surname>Markó</surname>
          </string-name>
          , Udo Hahn, Stefan Schulz, and
          <string-name>
            <given-names>Percy</given-names>
            <surname>Nohama</surname>
          </string-name>
          .
          <article-title>Interlingual indexing across different languages</article-title>
          .
          <source>In RIAO 2004 – Conference Proceedings: Coupling Approaches, Coupling Media and Coupling Languages for Information Retrieval</source>
          , pages
          <fpage>82</fpage>
          –
          <lpage>99</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Olena</given-names>
            <surname>Medelyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ian H.</given-names>
            <surname>Witten</surname>
          </string-name>
          , and David Milne.
          <article-title>Topic indexing with Wikipedia</article-title>
          .
          <source>Proc. of Wikipedia and AI workshop</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Pablo N.</given-names>
            <surname>Mendes</surname>
          </string-name>
          , Max Jakob,
          <string-name>
            <given-names>Andrés</given-names>
            <surname>García-Silva</surname>
          </string-name>
          , and Christian Bizer
          .
          <article-title>DBpedia Spotlight: Shedding light on the web of documents</article-title>
          .
          <source>In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Alistair</given-names>
            <surname>Miles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Matthews</surname>
          </string-name>
          , M. Wilson, and
          <string-name>
            <given-names>D.</given-names>
            <surname>Brickley</surname>
          </string-name>
          .
          <article-title>SKOS core: Simple knowledge organisation for the web</article-title>
          .
          <source>In Proceedings of the International Conference on Dublin Core and Metadata Applications</source>
          (DC-2005), pages
          <fpage>5</fpage>
          –
          <lpage>13</lpage>
          , Madrid,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Gilad</given-names>
            <surname>Mishne</surname>
          </string-name>
          .
          <article-title>Autotag: a collaborative approach to automated tag assignment for weblog posts</article-title>
          .
          <source>In WWW '06: Proceedings of the 15th international conference on World Wide Web</source>
          , pages
          <fpage>953</fpage>
          –
          <lpage>954</lpage>
          , New York, NY, USA,
          <year>2006</year>
          . ACM Press. Paper presented at the poster track
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Nguyen</surname>
          </string-name>
          .
          <article-title>Techware: Speech recognition software and resources on the web</article-title>
          .
          <source>IEEE Signal Processing Magazine</source>
          , pages
          <fpage>102</fpage>
          –
          <lpage>108</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J. Scott</given-names>
            <surname>Olsson</surname>
          </string-name>
          and
          <string-name>
            <given-names>Douglas W.</given-names>
            <surname>Oard</surname>
          </string-name>
          .
          <article-title>Improving text classification for oral history archives with temporal domain knowledge</article-title>
          .
          <source>In SIGIR'07</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>G.</given-names>
            <surname>Paaß</surname>
          </string-name>
          , E. Leopold,
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kindermann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Eickeler</surname>
          </string-name>
          .
          <article-title>SVM classification using sequences of phonemes and syllables</article-title>
          .
          <source>In Principles of Data Mining and Knowledge Discovery</source>
          , pages
          <fpage>373</fpage>
          –
          <lpage>384</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Artem</given-names>
            <surname>Polyvyanyy</surname>
          </string-name>
          .
          <article-title>Evaluation of a novel information retrieval model: eTVSM</article-title>
          .
          <source>Master's thesis</source>
          , Hasso Plattner Institut,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Martin F.</given-names>
            <surname>Porter</surname>
          </string-name>
          .
          <article-title>An algorithm for suffix stripping</article-title>
          .
          <source>Program</source>
          ,
          <volume>14</volume>
          (
          <issue>3</issue>
          ):
          <fpage>130</fpage>
          –
          <lpage>137</lpage>
          ,
          <year>1980</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Yves</given-names>
            <surname>Raimond</surname>
          </string-name>
          , Chris Lowis, and
          <string-name>
            <given-names>Jonathan</given-names>
            <surname>Tweed</surname>
          </string-name>
          .
          <article-title>Automated semantic tagging of speech audio</article-title>
          .
          <source>In Proceedings of the WWW'12 Demo Track</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Yves</given-names>
            <surname>Raimond</surname>
          </string-name>
          , Tom Scott, Silver Oliver, Patrick Sinclair, and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Smethurst</surname>
          </string-name>
          .
          <article-title>Use of Semantic Web technologies on the BBC Web Sites</article-title>
          . In
          <source>Linking Enterprise Data</source>
          , page 291. Springer, 1st edition
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Richard C.</given-names>
            <surname>Rose</surname>
          </string-name>
          , Eric I. Chang
          , and Richard P. Lippmann.
          <article-title>Techniques for information retrieval from voice messages</article-title>
          .
          <source>In ICASSP'91</source>
          , pages
          <fpage>317</fpage>
          –
          <lpage>320</lpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Kristie</given-names>
            <surname>Seymore</surname>
          </string-name>
          , Stanley Chen,
          <string-name>
            <given-names>Sam-Joo</given-names>
            <surname>Doh</surname>
          </string-name>
          , Maxine Eskenazi, Evandro Gouvea, Bhiksha Raj, Mosur Ravishankar, Ronald Rosenfeld, Matthew Siegler, Richard Stern, and
          <string-name>
            <given-names>Eric</given-names>
            <surname>Thayer</surname>
          </string-name>
          .
          <article-title>The 1997 CMU Sphinx-3 English broadcast news transcription system</article-title>
          .
          <source>In Proceedings of the DARPA Speech Recognition Workshop</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Fabian M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          , Gjergji Kasneci, and
          <string-name>
            <given-names>Gerhard</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>Yago - a core of semantic knowledge</article-title>
          .
          <source>In 16th international World Wide Web conference</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Ian H.</given-names>
            <surname>Witten</surname>
          </string-name>
          , Gordon W. Paynter, Eibe Frank, Carl Gutwin, and
          <string-name>
            <given-names>Craig G.</given-names>
            <surname>Nevill-Manning</surname>
          </string-name>
          .
          <article-title>Kea: practical automatic keyphrase extraction</article-title>
          .
          <source>In Proceedings of the fourth ACM conference on Digital libraries, DL '99</source>
          , pages
          <fpage>254</fpage>
          –
          <lpage>255</lpage>
          , New York, NY, USA,
          <year>1999</year>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>