<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>WhatTheySaid: Enriching UK Parliament Debates with Semantic Web</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Electronics and Computer Science, University of Southampton</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>To improve the transparency of politics, the UK Parliament debate archives have been published online for a long time. However, there is still no efficient way to analyse the debate data in depth. WhatTheySaid is an initiative to solve this problem by applying natural language processing and Semantic Web technologies to enrich the UK Parliament debate archives and publish them as linked data. It also provides various data visualisations that let users compare debates across years.</p>
      </abstract>
      <kwd-group>
        <kwd>linked data</kwd>
        <kwd>parliamentary debate</kwd>
        <kwd>semantic web</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>users can easily spot the statements that contradict each other; (R3)
Based on R2, link each debate to a fragment of the debate video archive, so that
users can watch the video fragment as proof of the statement; (R4) Analyse
the speeches of a particular MP and see how the sentiment changes over time.</p>
      <p>To demonstrate the implementation of the requirements above, we have taken the
UK House of Commons debate data for 2013 from TheyWorkForYou as the sample
dataset. The following sections walk through the system.</p>
    </sec>
    <sec id="sec-2">
      <title>Semantic Model of UK Parliament Debate</title>
      <p>The WTS ontology (http://www.whattheysaid.org.uk/ontology/v1/whatheysaid.owl) models the structure of UK Parliament debates and the agents involved.
This ontology reuses existing vocabularies such as FOAF (http://www.foaf-project.org) and the Ontology for Media
Resources (http://www.w3.org/TR/mediaont-10/). When designing this ontology, we first referred to the data
structure of TheyWorkForYou, where one debate is identified by a Heading and
a Heading contains one or more Speeches. We also added several attributes
to Speech, such as sentiment score, primary topic, summary text and related
media fragment, in order to store the data required to implement R2, R3 and R4
in Section 1.</p>
      <p>Each speech made by a speaker is allocated a sentiment score between 1.0
(positive) and -1.0 (negative). For speeches with more than 1000 characters, we also
carry out topic detection and text summarisation using AlchemyAPI (http://www.alchemyapi.com/).</p>
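      <p>The Heading and Speech structure described above, together with the enrichment attributes and the 1000-character rule, can be sketched as a small in-memory data model. This is only an illustration: the class and field names below are assumptions chosen for readability and are not the property names of the WTS ontology, and no call to AlchemyAPI is made.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass, field
from typing import List, Optional

# Threshold taken from the text: speeches longer than this many characters
# are also sent for topic detection and summarisation.
LONG_SPEECH_CHARS = 1000


@dataclass
class Speech:
    """One speech within a debate. Field names are illustrative assumptions."""
    speaker: str
    text: str
    sentiment_score: Optional[float] = None  # expected range: -1.0 to 1.0
    primary_topic: Optional[str] = None
    summary_text: Optional[str] = None
    media_fragment: Optional[str] = None     # URI of the related video fragment

    def __post_init__(self) -> None:
        # Reject scores outside the range stated in the text.
        score = self.sentiment_score
        if score is not None and not -1.0 <= score <= 1.0:
            raise ValueError("sentiment_score must lie in [-1.0, 1.0]")

    def needs_topic_and_summary(self) -> bool:
        """True when the speech has more than 1000 characters."""
        return len(self.text) > LONG_SPEECH_CHARS


@dataclass
class Heading:
    """A debate, identified by a heading and holding one or more speeches."""
    title: str
    speeches: List[Speech] = field(default_factory=list)

    def full_text(self) -> str:
        """Merge the plain text of all speeches into one debate document."""
        return " ".join(speech.text for speech in self.speeches)
```
]]></preformat>
      <p>In this sketch, <monospace>full_text()</monospace> produces the merged debate document that a similarity computation could take as input, and <monospace>needs_topic_and_summary()</monospace> decides whether the extra enrichment steps apply.</p>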
      <p>
        To link the debates to each other, we apply the TF-IDF [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] algorithm to calculate
the similarity score between every pair of debates. We first merge the plain text
of all the speeches in a debate into one big debate document d. Then, given a
debate document collection D with d ∈ D and a word w, we calculate the weighting
W_d of w in each document d:
      </p>
      <p>
        <disp-formula id="eq1"><label>(1)</label><tex-math><![CDATA[W_d = f_{w,d} \log\left(\frac{|D|}{f_{w,D}}\right)]]></tex-math></disp-formula>
where f_{w,d} is the number of times w appears in d, |D| is the size of the corpus,
and f_{w,D} is the number of documents in D in which w appears [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In information
retrieval, the Vector Space Model (VSM) represents each document in a
collection as a point in a space, and semantic similarity depends
on the distance between the related points [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Once W_d has been calculated for each
document, we use cosine similarity (http://en.wikipedia.org/wiki/Cosine_similarity) in the vector space to obtain the
similarity score between any two debate documents. In the user interface, every
time a debate document is viewed, we list the ten debates that are most similar
to it, so that users can easily navigate through similar debates.
      </p>
      <p>For named entity recognition, we use DBpedia Spotlight (https://github.com/dbpedia-spotlight) to extract named
entities and interlink those concepts with the speeches in which they are mentioned.
All the enrichment information is saved in a triple store implemented with
rdfstore-js (https://github.com/antoniogarrote/rdfstore-js), which also exposes a SPARQL endpoint for data querying and
visualisation. For the whole of 2013, we collected 68,968 speeches
and recognised more than 400K named entities (including duplicates).
Using the model defined in Figure 1, we generated more than 1.2 million
triples.</p>
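      <p>The debate-linking step described earlier in this section (the weighting in Equation (1), followed by cosine similarity and a top-ten ranking) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the system's actual code: tokenisation is a plain lowercase whitespace split, the function names are hypothetical, and each debate is assumed to be available as one merged plain-text document.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def tfidf_vectors(docs: Dict[str, str]) -> Dict[str, Vector]:
    """Weight each word w in each document d as f_{w,d} * log(|D| / f_{w,D})."""
    # Assumption: naive lowercase whitespace tokenisation.
    tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    doc_freq: Counter = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))  # count each word once per document
    n_docs = len(docs)
    vectors: Dict[str, Vector] = {}
    for doc_id, words in tokens.items():
        term_freq = Counter(words)
        vectors[doc_id] = {
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        }
    return vectors


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(doc_id: str, vectors: Dict[str, Vector], k: int = 10) -> List[Tuple[str, float]]:
    """Return the k other documents with the highest similarity to doc_id."""
    scores = [
        (other, cosine(vectors[doc_id], vec))
        for other, vec in vectors.items()
        if other != doc_id
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]
```
]]></preformat>
      <p>Note that a word appearing in every document receives a weight of zero under Equation (1), so it does not contribute to the similarity score. Comparing every pair of debates is quadratic in the number of debates, which is one reason the scores might be precomputed rather than calculated on each page view.</p>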
      <p>
        We visualise the enriched debate data in various ways. First, we use a
heat map and a line chart to visualise the sentiment scores of each MP's speeches
on a yearly (see Figure 3(a)) and a monthly basis, respectively. We also provide
a timeline visualisation (Figure 3(b)) of the statements on different topics made
by a given MP. To implement R3, we have referred to previous work [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
and designed a replay page with the transcript and named entities aligned with
the fragments of the debate video. The full demo is available online and the
RDF dataset is published for download. We plan to extend the
application with debates from earlier years, so that debates across years can be
interlinked and enriched for analysis.
      </p>
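      <p>The yearly and monthly sentiment views mentioned above need per-MP averages of the speech-level scores. The following is a minimal sketch of one way to compute them, not a description of the deployed system: the record layout (MP name, speech date, sentiment score) and the function name are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict
from datetime import date
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (MP name, date of speech, sentiment score).
SpeechRecord = Tuple[str, date, float]


def mean_sentiment(records: Iterable[SpeechRecord], monthly: bool = True) -> Dict[tuple, float]:
    """Average sentiment per MP, grouped by (MP, year, month) or by (MP, year)."""
    buckets: Dict[tuple, List[float]] = defaultdict(list)
    for mp, day, score in records:
        key = (mp, day.year, day.month) if monthly else (mp, day.year)
        buckets[key].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}
```
]]></preformat>
      <p>The monthly result maps naturally onto a line chart with one point per month, while the yearly result gives one cell per MP and year for a heat map.</p>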
    </sec>
    <sec id="sec-3">
      <title>Acknowledgement</title>
      <p>This mini-project is funded by the EPSRC Semantic Media Network. We would also
like to thank Yves Raimond from the BBC and Sebastian Riedel from UCL
for their support of this mini-project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Juric</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hollink</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Houben</surname>
            ,
            <given-names>G.J.:</given-names>
          </string-name>
          <article-title>Bringing parliamentary debates to the semantic web</article-title>
          .
          <source>Detection, Representation, and Exploitation of Events in the Semantic Web</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rizzo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Troncy</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wald</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wills</surname>
            ,
            <given-names>G.:</given-names>
          </string-name>
          <article-title>Creating enriched YouTube media fragments with NERD using timed-text</article-title>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ramos</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Using tf-idf to determine word relevance in document queries</article-title>
          .
          <source>In: Proceedings of the First Instructional Conference on Machine Learning</source>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Turney</surname>
            ,
            <given-names>P.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pantel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , et al.:
          <article-title>From frequency to meaning: Vector space models of semantics</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          <volume>37</volume>
          (
          <issue>1</issue>
          ),
          <fpage>141</fpage>
          –
          <lpage>188</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>