PoliticalMashup Ngramviewer
Tracking who said what and when in parliament

Bart de Goede, Dispectu, University of Amsterdam, bart@dispectu.com
Justin van Wees, Dispectu, University of Amsterdam, justin@dispectu.com
Maarten Marx, PoliticalMashup, University of Amsterdam, maartenmarx@uva.nl

ABSTRACT

The PoliticalMashup Ngramviewer is an application that allows a user to visualise the use of terms and phrases in the “Tweede Kamer” (the Dutch parliament). Inspired by the Google Books Ngramviewer¹, the PoliticalMashup Ngramviewer additionally allows for faceting on politicians and parties, providing more detailed insight into the use of certain terms and phrases by politicians and parties with different points of view.

Categories and Subject Descriptors

H.4 [Information Systems Applications]: Miscellaneous

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DIR 2013, April 26, 2013, Delft, The Netherlands.
Copyright remains with the authors and/or original copyright holders.

1. INTRODUCTION

The Google Books Ngramviewer [2] allows a user to query for phrases consisting of up to 5 terms. The application visualises the relative occurrence of these phrases over time in a corpus of digitised books written in a specific language.

Inspired by the Google Books Ngramviewer, the PoliticalMashup Ngramviewer² allows the user to query phrases consisting of up to 7 terms spoken in the Dutch parliament between 1815 and 2012, and to visualise the occurrence of those phrases over time. Additionally, the PoliticalMashup Ngramviewer allows the user to facet on politicians and parties, allowing for comparison of the use of phrases through time by parties with different ideologies.

In this demonstration paper we describe the data used in this application, the approach taken with regard to analysing and indexing that data, and examples of how the application could be used in research on agenda setting and linguistics.

2. NGRAMVIEWER

2.1 Data

The PoliticalMashup project [1] aims to make large quantities of political data, such as the proceedings of the Dutch parliament, available and searchable. In addition, a goal of the project is to combine (or mash up) political data from different sources, in order to provide for semantic search, such as queries for events or persons.

This Ngramviewer is an example of why linking raw text to entities such as persons or parties can be useful: for each word ever uttered in the Dutch parliament, we know who said it, when it was said, to which party that person belonged at that time, and which role that person had at that point in the debate. Linking text to speakers is what enables faceting on persons and parties.

The data this application uses originates from three sources: Staten-Generaal Digitaal³, Officiële Bekendmakingen⁴ and Parlementair Documentatiecentrum Leiden⁵. PoliticalMashup collected, analysed and transformed data from these sources, determining which speaker said what when, and to which party that speaker belonged at the time. This dataset is freely available via DANS EASY⁶.

2.2 Indexing

The PoliticalMashup Ngramviewer is built on top of an Apache Lucene⁷ index. We defined a document as all words spoken by a specific politician on a particular day. This allows for comparison of term frequencies per person, per day, which can be aggregated to words spoken by all members of a particular party in a particular time period (week, month, year, etcetera).

We used standard tokenisation and analysis on these documents: lowercasing, character folding and removal of punctuation, but keeping stopwords, in order to facilitate search on phrases containing common words such as articles or determiners. Additionally, we constructed word n-grams (1 ≤ n ≤ 7), respecting sentence boundaries.

The index contains data from 4 April 1815 to 9 September 2012, with 326,315 documents (where a document is all the text one person said on one day), 18,572 days for which there are documents, and 3,085 politicians in total, who are members of 119 parties or the government. Table 1 shows the distribution of n-grams in the corpus. The second column shows the distribution of n-grams that occur more than once in the corpus; removing hapaxes reduces the vocabulary size by one order of magnitude. The large number of hapaxes is partly due to OCR errors (all proceedings predating 1995 are scans of paper archives).

n-gram     unique terms     without hapaxes
1-grams        2,773,826             992,291
2-grams       38,811,679          12,852,501
3-grams      170,314,738          38,648,440
4-grams      358,360,166          48,621,948
5-grams      498,848,849          36,838,184
6-grams      573,197,917          22,737,318
7-grams      606,867,133          13,655,460
total      2,249,174,308         174,346,142

Table 1: Distribution of unique n-grams in the Ngramviewer corpus for all terms, and with all hapaxes (terms that occur only once in the corpus) removed.

2.3 Architecture

We constructed an inverted index in Lucene, storing the document frequency for each n-gram, and the term frequency for each document that n-gram occurs in. Additionally, each document has attributes, such as the date the terms of that document were spoken, and identifiers that resolve to politicians and parties⁸.

At query time, these identifiers are used to obtain information on persons and parties, which is subsequently cached in a Redis key-value store. This Redis store is also used to cache query results and keep track of popular queries. Also, date frequencies are aggregated to frequencies per year at query time.

2.4 Examples

“Het kan niet zo zijn dat”⁹ is a popular phrase used by (Dutch) politicians, lending their statement a more urgent feeling and (unconsciously) trying to manipulate their audience, while the speaker is merely voicing an opinion. Figure 1 shows the rapid increase in use since the eighties, and illustrates the use of the Ngramviewer for linguistic research.

Figure 1: The PoliticalMashup Ngramviewer interface showing results for “het kan niet zo zijn dat”, with facets on PvdA and VVD, illustrating the rise of the phrase since the eighties.

“Henk en Ingrid” are a fictional couple, conceived by the Dutch politician Geert Wilders¹⁰, representing the average Dutch family. Figure 2 shows how Wilders’ party introduced the phrase in 2008, after which it was left unused until 2010, when other parties picked up the phrase as well. This example shows the use of the Ngramviewer for research on agenda setting.

Figure 2: The PoliticalMashup Ngramviewer interface showing results for “Henk en Ingrid”, with facets on parties, showing the introduction of the term in 2008, no use in 2009, and that the term is picked up by other parties in 2010.

3. DEMONSTRATION

The demonstration will show how the PoliticalMashup Ngramviewer can be used, displaying a graph of how often the entered phrases occur over time in the proceedings of the Dutch parliament. It will also demonstrate faceting on politicians and parties, showing the occurrence of the entered phrases over time for specific politicians and parties.

4. ACKNOWLEDGMENTS

This research was supported by the Netherlands Organization for Scientific Research (NWO) under project number 380-52-005 (PoliticalMashup).

5. REFERENCES

[1] M. Marx. Politicalmashup. Retrieved March, 2013 from http://politicalmashup.nl/over-political-mashup/.
[2] J.-B. Michel, Y. K. Shen, A. P. Aiden, A. Veres, M. K. Gray, J. P. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, et al. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176–182, 2011.

1 http://books.google.com/ngrams
2 http://ngram.politicalmashup.nl
3 Project of the Koninklijke Bibliotheek (http://kb.nl/en/), digitising all Dutch parliamentary proceedings between 1814 and 1995 (http://statengeneraaldigitaal.nl/overdezesite).
4 Portal of the Dutch government, providing a search interface to all governmental proclamations, including parliamentary proceedings since 1995 (https://zoek.officielebekendmakingen.nl/).
5 Biographical information on politicians and parties (http://www.parlement.com/).
6 http://www.persistent-identifier.nl/urn:nbn:nl:ui:13-k2g8-5h
7 http://lucene.apache.org/core/
8 PoliticalMashup maintains a resolver that maps identifiers to persons, parties and proceedings.
9 In English: “It is unacceptable that . . . ”
10 http://en.wikipedia.org/wiki/Geert_wilders
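
The analysis described in Section 2.2 (lowercasing, character folding, removal of punctuation while keeping stopwords, and word n-grams with 1 ≤ n ≤ 7 that respect sentence boundaries) and the per-year aggregation mentioned in Section 2.3 can be illustrated with a minimal Python sketch. This is not the Lucene analysis chain used in the application: the function names (fold, sentences, tokens, ngrams, per_year), the regular-expression sentence splitter and the ISO date strings are all assumptions made for illustration.

```python
import re
import unicodedata
from collections import Counter

MAX_N = 7  # the paper indexes word n-grams with 1 <= n <= 7


def fold(text):
    """Character folding: strip diacritics (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def sentences(text):
    """Naive sentence split on '.', '!' and '?' (an assumption)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def tokens(sentence):
    """Lowercase and drop punctuation; stopwords are kept."""
    return re.findall(r"\w+", fold(sentence).lower())


def ngrams(text, max_n=MAX_N):
    """Count word n-grams that never cross a sentence boundary."""
    counts = Counter()
    for sentence in sentences(text):
        words = tokens(sentence)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts


def per_year(daily_counts):
    """Aggregate {ISO date string: frequency} into {year: frequency}."""
    yearly = Counter()
    for date, freq in daily_counts.items():
        yearly[int(date[:4])] += freq
    return dict(yearly)
```

For the input "Het kan niet. Zo zijn dat!" the sketch counts "het kan niet" and "zo zijn dat" but not "niet zo", because that bigram would span two sentences. Keeping stopwords is what makes a phrase such as "het kan niet zo zijn dat" searchable at all, since it consists almost entirely of common words.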