=Paper= {{Paper |id=None |storemode=property |title=PoliticalMashup Ngramviewer |pdfUrl=https://ceur-ws.org/Vol-986/paper_5.pdf |volume=Vol-986 |dblpUrl=https://dblp.org/rec/conf/dir/GoedeWM13 }} ==PoliticalMashup Ngramviewer== https://ceur-ws.org/Vol-986/paper_5.pdf
PoliticalMashup Ngramviewer
Tracking who said what and when in parliament

Bart de Goede, Dispectu, University of Amsterdam, bart@dispectu.com
Justin van Wees, Dispectu, University of Amsterdam, justin@dispectu.com
Maarten Marx, PoliticalMashup, University of Amsterdam, maartenmarx@uva.nl


ABSTRACT
The PoliticalMashup Ngramviewer is an application that allows a user to visualise the use of terms and phrases in the “Tweede Kamer” (the Dutch parliament). Inspired by the Google Books Ngramviewer¹, the PoliticalMashup Ngramviewer additionally allows for faceting on politicians and parties, providing a more detailed insight into the use of certain terms and phrases by politicians and parties with different points of view.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous

n-gram      unique terms    without hapaxes
1-grams        2,773,826            992,291
2-grams       38,811,679         12,852,501
3-grams      170,314,738         38,648,440
4-grams      358,360,166         48,621,948
5-grams      498,848,849         36,838,184
6-grams      573,197,917         22,737,318
7-grams      606,867,133         13,655,460
total      2,249,174,308        174,346,142

Table 1: Distribution of unique n-grams in the Ngramviewer corpus for all terms, and with all hapaxes (terms that occur only once in the corpus) removed.

1. INTRODUCTION
   The Google Books Ngramviewer [2] allows a user to query for phrases consisting of up to 5 terms. The application visualises the relative occurrence of these phrases over time in a corpus of digitised books written in a specific language.
   Inspired by the Google Books Ngramviewer, the PoliticalMashup Ngramviewer² allows the user to query phrases consisting of up to 7 terms spoken in the Dutch parliament between 1815 and 2012, and visualise the occurrence of those phrases over time. Additionally, the PoliticalMashup Ngramviewer allows the user to facet on politicians and parties, allowing for comparison of the use of phrases through time by parties with different ideologies.
   In this demonstration paper we describe the data used in this application, the approach taken with regard to analysing and indexing that data, and examples of how the application could be used in research on agenda setting and linguistics.

2. NGRAMVIEWER

2.1 Data
   The PoliticalMashup project [1] aims to make large quantities of political data, such as the proceedings of the Dutch parliament, available and searchable. In addition, a goal of the project is to combine (or mash up) political data from different sources, in order to enable semantic search, such as queries for events or persons.
   This Ngramviewer is an example of why linking raw text to entities such as persons or parties can be useful: for each word ever uttered in the Dutch parliament, we know who said it, when it was said, to which party that person belonged at that time, and which role that person had at that point in the debate. Linking text to speakers is what enables faceting on persons and parties.
   The data this application uses originates from three sources: Staten-Generaal Digitaal³, Officiële Bekendmakingen⁴ and Parlementair Documentatiecentrum Leiden⁵. PoliticalMashup collected, analysed and transformed data from these sources, determining which speaker said what when, and to which party that speaker belonged at the time. This dataset is freely available via DANS EASY⁶.

¹ http://books.google.com/ngrams
² http://ngram.politicalmashup.nl
³ Project of the Koninklijke Bibliotheek (http://kb.nl/en/), digitising all Dutch parliamentary proceedings between 1814 and 1995 (http://statengeneraaldigitaal.nl/overdezesite).
⁴ Portal of the Dutch government, providing a search interface to all governmental proclamations, including parliamentary proceedings since 1995 (https://zoek.officielebekendmakingen.nl/).
⁵ Biographical information on politicians and parties (http://www.parlement.com/).
⁶ http://www.persistent-identifier.nl/urn:nbn:nl:ui:13-k2g8-5h

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DIR 2013, April 26, 2013, Delft, The Netherlands.
Copyright remains with the authors and/or original copyright holders.

Figure 1: The PoliticalMashup Ngramviewer interface showing results for “het kan niet zo zijn dat”, with facets on PvdA and VVD, illustrating the rise of the phrase since the eighties.

Figure 2: The PoliticalMashup Ngramviewer interface showing results for “Henk en Ingrid”, with facets on parties, showing the introduction of the term in 2008, no use in 2009, and that the term is picked up by other parties in 2010.


2.2 Indexing
   The PoliticalMashup Ngramviewer is built on top of an Apache Lucene⁷ index. We define a document as all words spoken by a specific politician on a particular day. This allows for comparison of term frequencies per person, per day, which can be aggregated to words spoken by all members of a particular party in a particular time period (week, month, year, etcetera).
   We used standard tokenisation and analysis on these documents: lowercasing, character folding and removal of punctuation, but keeping stopwords, in order to facilitate search on phrases containing common words such as articles or determiners. Additionally, we constructed word n-grams (1 ≤ n ≤ 7), respecting sentence boundaries.
   The index contains data from 4 April 1815 to 9 September 2012, with 326,315 documents (where a document is all the text one person said on one day), covering 18,572 days and 3,085 politicians who are members of 119 parties or the government. Table 1 shows the distribution of n-grams in the corpus. The “without hapaxes” column shows the distribution of n-grams that occur more than once in the corpus; removing hapaxes reduces the vocabulary from about 2.2 billion to about 174 million unique n-grams, a reduction of one order of magnitude. The large number of hapaxes is partly due to OCR errors (all proceedings predating 1995 are scans of paper archives).

2.3 Architecture
   We constructed an inverted index in Lucene, storing the document frequency for each n-gram, and the term frequency for each document that n-gram occurs in.
   Additionally, each document has attributes, such as the date the terms of that document were spoken, and identifiers that resolve to politicians and parties⁸.
   At query time, these identifiers are used to obtain information on persons and parties, which is subsequently cached in a Redis key-value store. This Redis store is also used to cache query results and keep track of popular queries. Also, date frequencies are aggregated to frequencies per year at query time.

2.4 Examples
   “Het kan niet zo zijn dat”⁹ is a popular phrase used by (Dutch) politicians. It lends their statement a more urgent feeling and (unconsciously) tries to manipulate their audience, while the speaker is merely voicing an opinion. Figure 1 shows the rapid increase in use since the eighties, illustrating the use of the Ngramviewer for linguistic research.
   “Henk en Ingrid” are a fictional couple, conceived by the Dutch politician Geert Wilders¹⁰, representing the average Dutch family. Figure 2 shows how Wilders’ party introduced the phrase in 2008, after which it was left unused until 2010, when other parties picked up the phrase as well. This example illustrates the use of the Ngramviewer for research on agenda setting.

3. DEMONSTRATION
   The demonstration will show how the PoliticalMashup Ngramviewer can be used, displaying a graph of how often the entered phrases occur over time in the proceedings of the Dutch parliament. Also, it will demonstrate faceting on politicians and parties, showing the occurrence of the entered phrases over time for specific politicians and parties.

4. ACKNOWLEDGMENTS
   This research was supported by the Netherlands Organization for Scientific Research (NWO) under project number 380-52-005 (PoliticalMashup).

5. REFERENCES
[1] M. Marx. Politicalmashup. Retrieved March, 2013 from http://politicalmashup.nl/over-political-mashup/.
[2] J.-B. Michel, Y. K. Shen, A. P. Aiden, A. Veres, M. K. Gray, J. P. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, et al. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176–182, 2011.

⁷ http://lucene.apache.org/core/
⁸ PoliticalMashup maintains a resolver that maps identifiers to persons, parties and proceedings.
⁹ In English: “It is unacceptable that . . . ”
¹⁰ http://en.wikipedia.org/wiki/Geert_wilders
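
As an illustration of the indexing and query-time aggregation described in Sections 2.2 and 2.3, the following is a minimal Python sketch. It is not the authors' Lucene implementation: the in-memory dictionary stands in for the inverted index, the sentence splitter is a naive assumption, and the function names (`normalise`, `ngrams`, `build_index`, `per_year`), speaker identifiers, party names and speeches are all hypothetical. It shows lowercasing, character folding and punctuation removal with stopwords kept, word n-grams (1 ≤ n ≤ 7) that do not cross sentence boundaries, documents keyed by speaker and day, and aggregation of daily frequencies to years with an optional party facet.

```python
import re
import unicodedata
from collections import Counter, defaultdict
from datetime import date

MAX_N = 7  # the paper indexes word n-grams with 1 <= n <= 7


def normalise(sentence):
    """Lowercase, fold accented characters and strip punctuation."""
    folded = unicodedata.normalize("NFKD", sentence.lower())
    folded = "".join(ch for ch in folded if not unicodedata.combining(ch))
    # Punctuation becomes whitespace; stopwords are deliberately kept.
    return re.sub(r"[^\w\s]", " ", folded).split()


def ngrams(text, max_n=MAX_N):
    """Yield word n-grams that never cross a sentence boundary."""
    # Naive sentence splitter (assumption): break after '.', '!' or '?'.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = normalise(sentence)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                yield " ".join(tokens[i:i + n])


def build_index(speeches):
    """Map each n-gram to {(speaker, party, day): term frequency}.

    All text by one speaker on one day ends up under the same key,
    mirroring the paper's definition of a document.
    """
    index = defaultdict(Counter)
    for speaker, party, day, text in speeches:
        for gram in ngrams(text):
            index[gram][(speaker, party, day)] += 1
    return index


def per_year(index, phrase, party=None):
    """Aggregate daily frequencies of a phrase to years, optionally per party."""
    totals = Counter()
    key = " ".join(normalise(phrase))
    for (_speaker, doc_party, day), tf in index.get(key, {}).items():
        if party is None or doc_party == party:
            totals[day.year] += tf
    return dict(totals)


# Hypothetical speeches: (speaker id, party, day, text).
speeches = [
    ("spreker-a", "PartijX", date(2008, 5, 1),
     "Henk en Ingrid betalen. Het kan niet zo zijn dat!"),
    ("spreker-b", "PartijY", date(2010, 3, 2),
     "Ook Henk en Ingrid merken dit."),
    ("spreker-a", "PartijX", date(2010, 6, 9),
     "Henk en Ingrid, Henk en Ingrid."),
]

index = build_index(speeches)
print(per_year(index, "Henk en Ingrid"))                   # all parties
print(per_year(index, "Henk en Ingrid", party="PartijY"))  # one facet
```

In this sketch the document frequency of an n-gram is simply `len(index[gram])`, and the per-document term frequencies are the values of the inner counter. A production system would typically also cache the result of `per_year` per query, which is the role the paper assigns to Redis.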