PoliMedia: Analysing Media Coverage of Political Debates by Automatically Generated Links to Radio & Newspaper Items

Martijn Kleppe1, Laura Hollink2, Max Kemman1, Damir Juric3, Henri Beunders1, Jaap Blom4, Johan Oomen4, Geert-Jan Houben5

1 Erasmus Universiteit Rotterdam: kleppe@eshcc.eur.nl, kemman@eshcc.eur.nl, beunders@eshcc.eur.nl
2 Vrije Universiteit Amsterdam: l.hollink@vu.nl
3 London Brunel University: damir.juric@fer.hr
4 Nederlands Instituut voor Beeld en Geluid: jblom@beeldengeluid.nl, joomen@beeldengeluid.nl
5 TU Delft: g.j.p.m.houben@tudelft.nl

ABSTRACT
Students and researchers of media and communication sciences study the role of media in our society. They frequently search through media archives to manually select items that cover a certain event. When this is done for large time spans and across media outlets, the task can be challenging and laborious. Therefore, up until now the focus of researchers has been on manual and qualitative analyses of newspaper coverage. PoliMedia aims to stimulate and facilitate large-scale, cross-media analysis of the coverage of political events. We focus on the meetings of the Dutch parliament, and provide automatically generated links between the transcripts of those meetings, newspaper articles (including their original layout on the page) and radio bulletins. Via the portal at www.polimedia.nl researchers can search through the debates and find related media coverage in two media outlets, facilitating a more efficient search process and qualitative analyses of the media coverage. Furthermore, the generated links are available via a SPARQL endpoint at data.polimedia.nl, allowing quantitative analyses with complex, structured queries that are not covered by the search functionality of the portal, thus challenging the student to cross academic borders and enter fields that have previously been neglected.

Keywords
Parliamentary debates, linking, mediatisation, linked data, media coverage, newspapers, radio

1. INTRODUCTION
Analysing media coverage of political debates across several types of media outlets is a challenging task for academic students and researchers. Up until now, the focus of students has been on manual and qualitative research, since newspaper articles have only been available in analogue format. Other media types such as radio bulletins have been neglected even more, since these were hardly available to students. In recent years, archives of major Dutch newspapers, the transcripts of the Dutch parliament, and radio bulletins have been digitised and made available as open datasets. This is an enormous advantage, as the material can now be accessed from the Web. However, since the available data is very large, another challenge arises: it is a cumbersome and challenging task for students to analyse media items from different datasets both qualitatively and quantitatively. Therefore, we automatically generated links between the transcripts of the parliament and two media outlets: 1) newspapers, in their original layout, from the historical newspaper archive, and 2) radio bulletins of the Dutch National Press Agency (ANP), both located at the Dutch National Library. These links can be explored via the PoliMedia search user interface (SUI), which is currently online at www.polimedia.nl. The SUI allows students and researchers to search the debates by date and analyse the related media coverage, as well as to search by name of a politician or any keyword and evaluate the debates in which the politician appeared and how he or she was covered in the press.

An innovative aspect of PoliMedia is that the media coverage is incorporated in its original form (figure 4), enabling analyses of both the mark-up of news articles and the photos in newspapers, and so allowing further qualitative analyses of the media coverage. As a result, the big advantage of the PoliMedia system is that it allows students and researchers to make cross-media comparisons in a straightforward way, both quantitatively and qualitatively. Earlier they had to search each archive separately by hand, using the archive's proprietary metadata, and decide whether or not a media item covers a certain (political) event. The focus of the assignments in the curriculum was therefore on qualitative analysis only. Working with the PoliMedia portal gives students and researchers hands-on experience with a quantitative approach to their field of study. In addition, it provides them with substantive insights into how media coverage varies over a large number of political events. We believe that this type of insight is best learned through interaction with the data, rather than, for example, literature study. With the PoliMedia approach researchers can go to one website where they have access to all sources in a standardized format. While students and researchers previously mainly used newspaper articles, the PoliMedia system now allows them to make cross-media analyses in a more efficient way. Furthermore, we made the automatically generated links available through a SPARQL endpoint at data.polimedia.nl, allowing quantitative analysis of, for example, the number of links per year and decade or the number of links per political party, enabling students to research the mediatisation of Dutch politics in an efficient manner.

2. RELATED WORK
The mediatisation of political debates has been the focal point of a growing field of disciplines, such as television researchers [1], communication scientists who are interested in doing discourse analysis or linguistics [2] and psychologists researching the self-mediation of public persons [3]. However, due to the lack of available data on the mediation of the debates on radio and television, the focus up until now has been on newspapers. Since the introduction of digital sources that do include radio newsreels, television newscasts and current affairs programs, researchers should now be able to make cross-media comparisons between the different types of media outlets. To make these large digital sources more accessible and more connected to each other, we build upon a set of guidelines and techniques to represent, link and publish data on the Linked Data Web [10] using so-called semantic web technology. In the domain of cultural heritage, the MultimediaN E-Culture project [11] has shown that through explicit representation of links between and within collections, cross-collection search becomes possible. Khrouf and Troncy [12] demonstrate how various online sources of event information, containing both media and descriptions of events, can be merged using Linked Data. Noy et al. [13] describe how they represent and link hundreds of biomedical terminologies. In PoliMedia, we apply semantic web technology to connect various media datasets with a political event dataset. To find links between datasets that are so different in nature, we have developed a linking algorithm that includes named entity recognition and topic detection. For the latter, we have used an off-the-shelf tool called Mallet [14].

3. SYSTEM DESCRIPTION
The problem PoliMedia aims to resolve is the difficulty of searching a multitude of archives for cross-media analyses. We approached this problem from two perspectives: 1) the user perspective, and 2) the data perspective.

The user perspective
The targeted user groups are primarily students and researchers of History, Communication and Media, Media Studies and Sociology of Culture, Media and the Arts. However, the PoliMedia portal is valuable for a much wider range of Humanities and Social Sciences students and researchers who, for example, analyse the representation of politicians in the media or the discussion of recurring themes. We also expect the system to be useful for several other disciplines, such as communication students who are interested in discourse analysis or linguistic aspects of media and political debates, psychologists researching the self-mediation of public persons, and even economists, who nowadays pay more attention to the way politicians talk about the current economic crises. Furthermore, since all the links are available at data.polimedia.nl, this data can also be used by students and researchers of computer science or related fields who are interested in data analysis and visualization.

The development of the user interface of PoliMedia was based on a requirements study with five scholars/lecturers in history and political communication. The main use case appeared to be identifying politicians or debates of interest, and finding their representation in the media for qualitative analyses. This use case and its requirements were discussed with a UI designer, which led to the design of a faceted search user interface (SUI) as depicted in figure 2. Facets allow the user to refine search results; they support the searcher by presenting an overview of the structure of the collection, and provide a transition between browsing and search strategies [6].

The SUI consists of three main levels:
1) the landing page where researchers can enter search terms (figure 1),
2) the results page (figure 2) with the search results, facets for refinements and a search bar for new queries, and
3) the debate page (figure 3) which shows a complete debate and the linked media items. When the user clicks on a media item, the item is opened in a new screen in its original layout (figure 4).

We evaluated a preliminary version of the interface by means of an eye tracking study [7]. This study showed that the faceted SUI enabled users to perform both known-item searches and exploratory searches to analyse a topic over time. However, navigating the debates themselves proved to be rather difficult; as debates can be dozens of pages long, it was hard for users to gain an overview of the debate. To address this issue, the faceted search that was already available on the search results page (figure 2) was also introduced on the debate page (figure 3) in the final version of the interface.

Fig. 1. Screenshot of the PoliMedia home page.

Fig. 2. Screenshot of the PoliMedia search results page.

Fig. 3. Screenshot of the PoliMedia debate page.

Fig. 4. Screenshot of an example newspaper in original layout, containing an article about a parliamentary debate.

The data perspective
In order to allow users to perform cross-media analysis in a single system, PoliMedia combines three data sources: parliamentary debates, a newspaper archive and a radio bulletin archive. The collection of Dutch parliamentary debates, the so-called Handelingen der Staten-Generaal, is published by the government in the form of complete transcripts of the speeches of politicians in parliamentary debates. For the period 1945-1995, the transcripts of all 9,294 debates that were held are published in unstructured TXT and PDF format at http://www.statengeneraaldigitaal.nl. The project "War in Parliament" has transformed them to a fine-grained XML structure [4]. We build upon War in Parliament and translate their XML to RDF. To store, query and link the debate data, we have created a semantic model in RDF which is a specialization of the more widely applicable Simple Event Model (SEM) [9]. SEM is a model that aims to represent events on the Web and explicate complicated semantic relations between people, places, actions and objects: not only who did what, when and where, but also the roles each actor played, the time during which this role is valid, and the authority according to which this role is assigned. Our domain-specific specialization of SEM enables us to express information associated with the debates such as topics, actors, debate structure, and links to media. To increase re-usability of the data, we use Dublin Core properties where appropriate, for example to denote dates, titles and publishers of debates. Figure 5 shows the RDF model. For brevity, we have left out the representation of speakers (i.e. politicians and their party). For a detailed description of the design decisions of the model, we refer to [5] and [8].

Fig. 5. RDF model to represent parliamentary debates and links to media. [Diagram not reproduced. It shows example instance data for the debate nl.proc.sgd.d.194519460000002 using the classes Debate, PartOfDebate, DebateContext and Speech, and the properties rdf:type, dc:date, dc:language, dc:publisher, dc:id, dc:source, hasPart, hasText, hasSpokenText, hasSubsequentPartOfDebate, hasSubsequentSpeech, sem:hasActor and coveredIn.]

Data Usage
The newspaper archive as well as the radio bulletin archive resides at the Dutch Royal Library. To determine links between debates and the media items in these archives, we query the full text as well as the metadata through the OAI protocol. For copyright reasons, the dataset used in the PoliMedia portal does not contain the media items themselves or their metadata; only the URIs of the items in their original archives are included. From the portal, a user can click a hyperlink to the Royal Library site to view the requested media item. At the moment, the datasets are static; they contain the debate transcripts and links to media archives of the period 1945-1995. In the future, we plan to include up-to-date data in the form of the latest debate transcripts and news articles and bulletins.

The basis of PoliMedia lies in the transcripts of the Dutch parliament from 1814-1995, containing circa 2.5 million pages of debates with speeches that have been OCR'd and thus allow for full-text search. The transcripts have been converted to structured data in XML form in previous research [4]. For each speech (i.e. a fragment from a single speaker in a debate), we extract information to represent this speech: the speaker, the date, important terms (i.e. named entities) from its content and important terms from the description of the debate in which the speech is held. This information is then combined to create a query with which we search the archives of the newspapers and radio bulletins. Media items that correspond to this query are retrieved, after which a link is created between the speech and the media item [5]. The links, as well as the parliamentary debates, are represented as RDF [8]. These links are available at data.polimedia.nl as an open dataset for future researchers.

Performance
We created a stable system by using SPARQL to fetch the relevant debate data from an OWLIM repository that hosts the PoliMedia dataset. To ensure reasonable response times, the server hosting the repository has been upgraded from 8GB to 16GB of memory. Because of OWLIM's limited capabilities with respect to full-text and faceted search, a separate SOLR index has been created. SOLR was chosen because of its widespread use and reputation as a high-performing search index with capabilities for faceted search and many other optimization options, such as language-specific options to ensure better results for Dutch. The accuracy of our linking approach was evaluated via a manual assessment of a sample of 150 links to newspaper articles. We found that the precision of the algorithm was good, at around 80%, with an acceptable recall of 62% [5].

Legal & Privacy
The PoliMedia portal does not involve or store any user-specific data. Since it is a web portal, visited URLs may be stored locally by a user's own browser. Clicks on hyperlinks to media items that reside at the servers of the National Library of the Netherlands may be logged by the library. The original debate data as provided by the Dutch government has a CC0 licence. The copyrights of the newspaper articles and radio bulletins are with the original publishers/broadcasters. This material may be used "for private use or a user's own study."

4. DISCUSSION
PoliMedia successfully created automatic links between the transcriptions of parliamentary debates and newspaper articles and radio bulletins, demonstrating how very different datasets can be connected. In the near future, we intend to study the generalizability of our linking approach for linking other datasets, such as online (social) media and proceedings of other official meetings. We have already tried to link the debates with television programmes located at the Netherlands Institute for Sound and Vision, but have not been able to do this. There can be several reasons for the lack of these links: the size of the available television dataset, the lack of full-text search in AV archives, or the suitability of the linking algorithm. We expect that the metadata contained insufficient information to be linked to, even though the television programmes did contain coverage of the relevant debates. We hypothesize that linking to audio-visual sources requires other techniques for opening up AV archives, such as the inclusion of time-based metadata (e.g. subtitles) or the use of speech and image recognition. These techniques give more information about the content of the programmes than is described in the existing metadata. We are currently working on a follow-up project of PoliMedia in which we aim to link the transcripts of the European Parliament to television programmes whose metadata has been enriched with subtitles and speech recognition, to further explore the possibilities of linking to television programmes.

5. CONCLUSIONS
The PoliMedia search user interface shows the potential for students of linking the transcripts of political debates to different media outlets, allowing cross-media analysis of both newspapers and radio items. We have not yet succeeded in linking to television programmes, but envision that this will be possible in future research projects that can build upon the knowledge and insights we gained through the development of the PoliMedia project.

ACKNOWLEDGMENTS
The PoliMedia project was financed by CLARIN-NL and carried out by an interdisciplinary research team, consisting of computer scientists at the TU Delft and VU Amsterdam, information scientists and historians at the Erasmus University Rotterdam and programmers at the Netherlands Institute for Sound and Vision. We are grateful for the support of the National Library in providing the data of both the transcripts of the Dutch parliament and the newspapers and radio bulletins.

REFERENCES
[1] Bignell, J., Fickers, A. (eds.) (2008). A European Television History. Wiley Blackwell: Malden MA / Oxford.
[2] Van Santen, R. A., Van Aelst, P., & Helfer, L. (2013). When politics becomes news: an analysis of parliamentary questions and press coverage in three West-European countries. Acta Politica (15 November 2013).
[3] Corner & Pels (2003). Media and the restyling of politics (London).
[4] Gielissen, T., & Marx, M. (2009). Exemelification of parliamentary debates. Proceedings of the 9th Dutch-Belgian Workshop on Information Retrieval (DIR 2009) (pp. 19-25).
[5] Juric, D., Hollink, L., & Houben, G. (2013). Discovering links between political debates and media. The 13th International Conference on Web Engineering (ICWE'13). Aalborg, Denmark.
[6] Kules, B., Capra, R., Banta, M., & Sierra, T. (2009). What do exploratory searchers look at in a faceted search interface? Proceedings of the 2009 joint international conference on Digital libraries - JCDL '09, 313. doi:10.1145/1555400.1555452
[7] Kemman, M., Kleppe, M., & Maarseveen, J. (2013). Eye Tracking the Use of a Collapsible Facets Panel in a Search Interface. In Research and Advanced Technologies for Digital Libraries: 17th International Conference on Theory and Practice of Digital Libraries (pp. 401-404). Valletta: Springer Berlin Heidelberg.
[8] Juric, D., Hollink, L., & Houben, G. (2012). Bringing parliamentary debates to the Semantic Web. DeRiVE workshop on Detection, Representation, and Exploitation of Events in the Semantic Web.
[9] Van Hage, W. R., Malaisé, V., Segers, R., Hollink, L., & Schreiber, G. (2011). Design and use of the Simple Event Model (SEM). Web Semantics: Science, Services and Agents on the World Wide Web, 9(2), 128-136.
[10] Heath, Tom, and Christian Bizer (2011). "Linked data: Evolving the web into a global data space." Synthesis lectures on the semantic web: theory and technology 1.1: 1-136.
[11] Schreiber, A.T., Amin, A., Aroyo, L.M., Assem, M.F.J. van, Boer, V. de, Hardman, L., Hildebrand, M., Omelayenko, B., Ossenbruggen, J.R., Tordai, A., Wielemaker, J. & Wielinga, B.J. (2008). Semantic annotation and search of cultural-heritage collections: The MultimediaN E-Culture demonstrator. Journal of Web Semantics, 6(4), 243-249.
[12] Khrouf, H., and R. Troncy (2012). "EventMedia: A LOD dataset of events illustrated with media." Semantic Web journal, Special Issue on Linked Dataset descriptions.
[13] Noy, N. F., Shah, N. H., Whetzel, P. L., Dai, B., Dorf, M., Griffith, N., Jonquet, C., Rubin, D.L., Storey, M.A., Chute, C.G., & Musen, M. A. (2009). BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic acids research, 37(suppl 2), W170-W173.
[14] McCallum, Andrew Kachites (2002). MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu.
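
As an illustration of the linking step described under Data Usage (a query is built per speech from the speaker, the date, named entities and debate terms, and matching media items are linked), the following Python sketch may help readers who want to experiment with the idea. It is a simplification under stated assumptions and not the algorithm of [5]: the names Speech, MediaItem, build_query, link_speech and precision_recall, the thresholds window_days and min_hits, the example URIs and the plain substring matching are all hypothetical stand-ins for the named entity recognition, topic detection and archive search used in the project.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Speech:
    speech_id: str
    speaker: str
    day: date
    entities: frozenset      # named entities found in the speech text
    debate_terms: frozenset  # important terms from the debate description


@dataclass(frozen=True)
class MediaItem:
    uri: str
    day: date
    text: str


def build_query(speech):
    """Combine speaker, named entities and debate terms into one term set."""
    terms = {speech.speaker} | speech.entities | speech.debate_terms
    return {term.lower() for term in terms}


def link_speech(speech, archive, window_days=7, min_hits=2):
    """Return URIs of media items that match the query for this speech.

    window_days and min_hits are illustrative thresholds, not values
    taken from the PoliMedia project.
    """
    query = build_query(speech)
    last_day = speech.day + timedelta(days=window_days)
    links = []
    for item in archive:
        if not (speech.day <= item.day <= last_day):
            continue  # only consider coverage shortly after the speech
        text = item.text.lower()
        hits = sum(1 for term in query if term in text)
        # Require the speaker's name plus at least min_hits query terms in total.
        if speech.speaker.lower() in text and hits >= min_hits:
            links.append(item.uri)
    return links


def precision_recall(found, relevant):
    """Precision and recall of a set of found links against a judged set."""
    found, relevant = set(found), set(relevant)
    true_positives = len(found & relevant)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example data (not taken from the PoliMedia dataset).
speech = Speech("s1", "Jansen", date(1972, 3, 1),
                frozenset({"Rotterdam"}), frozenset({"begroting"}))
archive = [
    MediaItem("urn:example:news-1", date(1972, 3, 2),
              "Kamerlid Jansen sprak over de begroting"),
    MediaItem("urn:example:news-2", date(1972, 3, 2),
              "Jansen wint schaaktoernooi"),
    MediaItem("urn:example:radio-3", date(1972, 5, 1),
              "Jansen over de begroting in Rotterdam"),
]
print(link_speech(speech, archive))  # ['urn:example:news-1']
```

Only the first item is linked: the second mentions the speaker but no other query term, and the third falls outside the date window. The precision_recall helper mirrors the kind of manual assessment reported under Performance, where found links are compared against links judged relevant by hand.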