WhatTheySaid: Enriching UK Parliament Debates with Semantic Web

School of Electronics and Computer Science, University of Southampton, UK

To improve the transparency of politics, the UK Parliament debate archives have long been published online. However, there is still no efficient way to analyse the debate data in depth. WhatTheySaid is an initiative to address this problem by applying natural language processing and Semantic Web technologies to enrich the UK Parliament debate archives and publish them as linked data. It also provides various data visualisations that let users compare debates over the years.

Keywords: linked data, parliamentary debate, semantic web

Introduction

users can easily spot the statements that contradict each other; (R3) Based on R2, link the debates to fragments of the debate video archive, so that users can watch the video fragment as proof of the statement; (R4) Analyse the speeches of a particular MP and see how the sentiment changes over time.

To demonstrate the implementation of the requirements above, we have taken the UK House of Commons debate data for 2013 from TheyWorkForYou as the sample dataset. The following sections walk through the system.

Semantic Model of UK Parliament Debate

The WTS ontology (http://www.whattheysaid.org.uk/ontology/v1/whatheysaid.owl) models the structure of UK Parliament debates and the agents involved. It reuses existing vocabularies such as FOAF (http://www.foaf-project.org) and the Ontology for Media Resources (http://www.w3.org/TR/mediaont-10/). When designing this ontology, we first referred to the data structure of TheyWorkForYou, where one debate is identified by a Heading and a Heading contains one or more Speeches. We have also added several attributes to Speech, such as sentiment score, primary topic, summary text and related media fragment, in order to store the data required to implement R2, R3 and R4 in Section 1.

Footnote 7: http://www.alchemyapi.com/
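The Heading and Speech structure described above can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions: the class and field names below are hypothetical Python stand-ins, not the actual class or property names defined in the WTS ontology.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Speech:
    """One speech within a debate. Field names are illustrative only."""
    speaker: str
    text: str
    sentiment_score: Optional[float] = None  # expected range: -1.0 to 1.0
    primary_topic: Optional[str] = None
    summary: Optional[str] = None
    media_fragment: Optional[str] = None      # URI of the related video fragment


@dataclass
class Heading:
    """A debate, identified by its heading and holding one or more speeches."""
    title: str
    speeches: List[Speech] = field(default_factory=list)

    def full_text(self) -> str:
        # Concatenate the text of all speeches in this debate.
        return " ".join(s.text for s in self.speeches)


# Made-up example data.
debate = Heading("Example debate")
debate.speeches.append(
    Speech(speaker="Example MP", text="An example statement.", sentiment_score=0.4)
)
debate.speeches.append(Speech(speaker="Another MP", text="A reply."))
print(len(debate.speeches), debate.full_text())
```

The enrichment attributes are optional in this sketch because they are only filled in after the corresponding analysis step has run.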

Each speech made by a speaker is assigned a sentiment score between -1.0 (negative) and 1.0 (positive). For speeches of more than 1000 characters, we also carry out topic detection and text summarisation using AlchemyAPI.

To link the debates to each other, we apply the TF-IDF algorithm [3] to calculate the similarity score between each pair of debates. We first merge the plain text of all the speeches in a debate into one large debate document $d$. Then, given a debate document collection $D$ with $d \in D$ and a word $w$, we calculate the weight $W_d$ of $w$ in document $d$:

$$W_d = f_{w,d} \cdot \log\left(\frac{|D|}{f_{w,D}}\right) \qquad (1)$$

where $f_{w,d}$ is the number of times $w$ appears in $d$, $|D|$ is the size of the corpus, and $f_{w,D}$ is the number of documents in $D$ in which $w$ appears [3]. In information retrieval, the Vector Space Model (VSM) represents each document in a collection as a point in a space, and the semantic similarity of documents depends on the distance between the corresponding points [4]. Once $W_d$ has been calculated for every word in each document, we use cosine similarity (see footnote 8) over the vector space to obtain the similarity score between any two debate documents. On the user interface, whenever a debate document is viewed, we list the ten debates most similar to it, so that users can easily navigate between similar debates.
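The weighting in Eq. (1) and the cosine ranking can be sketched in a few lines of Python. This is a minimal sketch, not the system's actual implementation: the whitespace tokenizer, the function names and the three toy debate documents are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    # Naive whitespace tokenizer (an assumption; the paper does not specify one).
    return text.lower().split()


def tfidf_vectors(docs: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Return one sparse vector per document, weighted as in Eq. (1)."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq: Counter = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())  # f_{w,D}: documents containing w
    n_docs = len(docs)                  # |D|
    return {
        doc_id: {w: f * math.log(n_docs / doc_freq[w]) for w, f in counts.items()}
        for doc_id, counts in term_counts.items()
    }


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(doc_id: str, vectors: Dict[str, Dict[str, float]],
                 k: int = 10) -> List[Tuple[str, float]]:
    """Rank all other documents by similarity to doc_id and keep the top k."""
    scores = [(other, cosine(vectors[doc_id], vec))
              for other, vec in vectors.items() if other != doc_id]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy debate documents (made up for illustration).
debates = {
    "d1": "housing benefit reform and housing supply",
    "d2": "housing supply and planning reform",
    "d3": "fisheries policy and fishing quotas",
}
vectors = tfidf_vectors(debates)
print(most_similar("d1", vectors))
```

In this toy corpus the word "and" occurs in every document, so its weight is $\log(3/3) = 0$ and it contributes nothing to the similarity, which is the effect the inverse document frequency term is meant to have.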

For named entity recognition, we use DBpedia Spotlight (https://github.com/dbpedia-spotlight) to extract named entities and link those concepts to the speeches in which they are mentioned. All the enrichment information is saved in a triple store implemented with rdfstore-js (https://github.com/antoniogarrote/rdfstore-js), which also exposes a SPARQL endpoint for data querying and visualisation. For the whole of 2013, we collected 68968 speeches and recognised more than 400K named entities (including duplicates). Using the model defined in Figure 1, we generated more than 1.2 million triples.

Footnote 8: http://en.wikipedia.org/wiki/Cosine_similarity
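To give a feel for what the enrichment triples for one speech might look like, the sketch below serialises a sentiment score and a list of recognised entities as N-Triples. The namespace `http://example.org/wts#`, the property names `sentimentScore` and `mentions`, and the speech URI are hypothetical placeholders; the real terms are those defined by the WTS ontology, which this example does not attempt to reproduce.

```python
from typing import List, Tuple

# Hypothetical namespace and property names, used only for illustration;
# the actual terms are defined by the WTS ontology.
WTS = "http://example.org/wts#"
DBPEDIA = "http://dbpedia.org/resource/"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"

Triple = Tuple[str, str, str]


def speech_triples(speech_uri: str, sentiment: float, entities: List[str]) -> List[Triple]:
    """Build N-Triples terms for one enriched speech."""
    triples: List[Triple] = [(
        f"<{speech_uri}>",
        f"<{WTS}sentimentScore>",
        f'"{sentiment}"^^<{XSD_DOUBLE}>',
    )]
    for name in entities:
        # Each recognised entity is linked to the speech that mentions it.
        triples.append((f"<{speech_uri}>", f"<{WTS}mentions>", f"<{DBPEDIA}{name}>"))
    return triples


def to_ntriples(triples: List[Triple]) -> str:
    """Serialise triples one per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


example = speech_triples(
    "http://example.org/speech/1",  # made-up speech URI
    0.25,
    ["National_Health_Service", "House_of_Commons_of_the_United_Kingdom"],
)
print(to_ntriples(example))
```

Keeping duplicates, as the entity count above does, means the same entity URI can appear as the object of many such triples across different speeches.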

We visualise the enriched debate data in various ways. Firstly, we use both a heat map and a line chart to visualise the sentiment scores of speeches for each MP on a yearly (see Figure 3(a)) and monthly basis respectively. We also provide a timeline visualisation (Figure 3(b)) of the statements made by a given MP on different topics. To implement R3, we have built on previous work [2] and designed a replay page in which the transcript and named entities are aligned with the fragments of the debate video. The full demo is available online and the RDF dataset is published for download. We plan to expand the application with debates from earlier years, so that debates across years can be interlinked and enriched for analysis.
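The monthly view of an MP's sentiment needs the per-speech scores to be grouped by month first. The sketch below shows one way to do this, assuming the scores are aggregated with a plain arithmetic mean; the aggregation actually used by the system is not stated here, and the function name and sample data are made up for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean
from typing import Dict, Iterable, Tuple


def monthly_sentiment(speeches: Iterable[Tuple[date, float]]) -> Dict[str, float]:
    """Average sentiment score per calendar month, keyed as 'YYYY-MM'."""
    buckets = defaultdict(list)
    for day, score in speeches:
        buckets[day.strftime("%Y-%m")].append(score)
    # Sort by month key so the result is in chronological order.
    return {month: mean(scores) for month, scores in sorted(buckets.items())}


# Made-up (date, sentiment score) pairs for one MP.
sample = [
    (date(2013, 1, 14), 0.5),
    (date(2013, 1, 28), -0.5),
    (date(2013, 2, 5), 0.25),
]
print(monthly_sentiment(sample))
```

The same grouping with a `"%Y"` key would give the yearly values for the heat map.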

Acknowledgement

This mini-project is funded by the EPSRC Semantic Media Network. We would also like to thank Yves Raimond from the BBC and Sebastian Riedel from UCL for their support of this mini-project.

References

1. Juric, D., Hollink, L., Houben, G.J.: Bringing parliamentary debates to the semantic web. Detection, Representation, and Exploitation of Events in the Semantic Web (2012)

2. Li, Y., Rizzo, G., Troncy, R., Wald, M., Wills, G.: Creating enriched YouTube media fragments with NERD using timed-text (2012)

3. Ramos, J.: Using TF-IDF to determine word relevance in document queries. In: Proceedings of the First Instructional Conference on Machine Learning (2003)

4. Turney, P.D., Pantel, P., et al.: From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 37(1), 141–188 (2010)