=Paper=
{{Paper
|id=None
|storemode=property
|title=Towards Visual Overviews for Open Government Data
|pdfUrl=https://ceur-ws.org/Vol-1210/datawiz2014_04.pdf
|volume=Vol-1210
|dblpUrl=https://dblp.org/rec/conf/ht/GravesB14
}}
==Towards Visual Overviews for Open Government Data==
<pdf width="1500px">https://ceur-ws.org/Vol-1210/datawiz2014_04.pdf</pdf>
<pre>
     Towards Visual Overviews for Open Government Data

                          Alvaro Graves                                    Javier Bustos-Jiménez
                           Inria Chile                                      NIC Chile Research Labs
                   Av. Apoquindo 2827, piso 12                               Blanco Encalada 1975
                         Santiago, Chile                                        Santiago, Chile
                    alvaro.graves@inria.cl                                    jbustos@niclabs.cl


ABSTRACT
The rise of Open Data initiatives has led to the publication of many datasets
from different organizations and governments. These datasets cover a wide range
of knowledge domains, from budget to education to health care. However, not all
datasets have the quality, granularity or type of information that is relevant
to each user. Moreover, in many cases the description or metadata does not
clearly specify the content of a dataset, which hinders the exploration of
datasets by stakeholders. In this paper we propose the use of dashboards and
visualizations as a way to preview the content of datasets for easier
exploration. The use of visualizations can provide a rapid way to select or
discard datasets based on their content, reducing the number of datasets that a
user may need to look at in order to get what she needs.

Categories and Subject Descriptors
D.2.2 [Software Engineering]: Design Tools and Techniques - User interfaces;
H.5.1 [Information Interfaces and Presentation]: Multimedia Information
Systems - Methodology; I.7.4 [Document and Text Processing]: Electronic
Publishing

General Terms
Documentation, Human Factors, Open Government Data

Keywords
Open Government Data, Open Data, Preview, Data Visualization

1.   INTRODUCTION
Over one million datasets [16] are currently available in different portals
across the globe. Although the data is publicly available, its organization and
structure are not necessarily clear to all stakeholders. For example, at the
time of this writing the search for “child obesity” in Data.gov and Data.gov.uk
(the two largest Open Government Data portals) gives different results, as can
be seen in Figure 1: In Data.gov, only one dataset is available (in several
formats). This dataset is described as “federal”; however, a closer look shows
that the data is related to the state of New York only. In the case of
Data.gov.uk, 16 results provide information related to child and obesity in PDF
and Excel formats. Beyond these differences, it is not clear to a researcher or
developer whether these datasets are relevant to her needs; having a title and a
description is useful, but does not clarify exactly what type of information,
granularity and quality of data is available.

Figure 1: Results provided by searching for child obesity in Data.gov (upper
image) and Data.gov.uk (lower image)

For example, it is not clear what specific data is contained in a dataset, what
structure is used or the scope of this dataset. As mentioned earlier, in the
case of the US dataset about child obesity, it is labeled as “federal”; however,
the data describes only information about New York State. It is likely that
other manually curated tags and descriptions may not be precise in terms of the
content or scope of the datasets published. Thus, the question in this and many
other cases is: how can stakeholders know in advance what’s in a dataset before
downloading it? We propose the use of dashboards and visualizations to describe
and preview the content of datasets; these visual representations will help
stakeholders decide whether a dataset is useful for them or not.

This paper is structured as follows: Section 2 describes related work found in
the literature and state-of-the-art technology. Section 3 discusses different
pieces of information that can be used to create visual overviews from some of
the more common file formats used to publish Open Government Data. In Section 4
we show a prototype developed as an example of what can be done to create visual
overviews of datasets using the information discussed previously. Section 5
presents the future challenges of our research and we discuss our conclusions in
Section 6.

2.    RELATED WORK
The problem of good data visualizations has been studied for many years [24]. In
terms of data exploration and visualization, Shneiderman [22] summarizes the
Visual Information Seeking Mantra as "Overview first, zoom and filter, then
details-on-demand"; humans need to get the “big picture” of a dataset first in
order to decide where to explore next. Thus, a visual overview of a dataset can
be useful for researchers and journalists to know “what’s in there” before
taking further action.

One of the seminal works in dataset preview was made by Doan et al. in 1999.
They studied the effects of visual previews of queries for NASA’s EOSDIS
datasets [19], concluding that the main advantages of these visual strategies
were:

     • “eliminate zero-hit queries,
     • reduces network activity and browsing effort by preventing the retrieval
       of undesired datasets,
     • represents statistical information of database visually to aid
       comprehension and exploration,
     • support dynamic queries, which aids users to discover dataset patterns
       and exceptions, and
     • (they are) suitable to novice, intermittent, or expert users”.

A generalization of query previews is presented in the work of Tanin et al.
[23], complementing the work of Doan et al. with bar charts in order to show
data distribution.

At the beginning of this century, similar conclusions were reached by Greene et
al. in their study about how previews and overviews allow users to rapidly
discriminate useful information from that which is not of interest [10],
applying their findings in the interfaces provided by the Digital Library of
Congress and concluding that “previews should be available at a high level
within a site so users get a taste of what is to come early in their visit”.

Nowadays, the principles behind the above works seem to be suitable for open
data publication, as has been reported for web searching by Dörk et al. [7],
who studied the performance and benefits of a new approach called visual
exploration for information seeking on the Web (Figure 2).

Figure 2: Visual exploration interface proposed by Dörk et al. [7], which
includes data collection choosers, visualization widgets, text query box and
the current set of results.

From the perspective of Open Government Data, visualizations are valuable and
useful artifacts for users [12]; visualizations can provide feedback and help
in the decision-making process related to public policies. A survey [9] showed
that many stakeholders found that users were interested in interacting with
data via the use of visualizations. Hence, there is reasonable evidence to
support our hypothesis that preview visualizations can be a useful tool for
Open Government Data stakeholders.

3.   CONTENT FOR VISUAL OVERVIEWS
Different formats provide different support for data, metadata, annotations and
other extra information that can be helpful for users to identify datasets that
are valuable for them. In order to understand which formats are more often used
to publish Open Government Data, we looked at Data.gov and Data.gov.uk, two of
the largest government data portals. We took the most popular formats reported
by these portals and we found that most datasets are published in HTML,
followed by XML, ZIP, CSV, PDF and JSON, as can be seen in Figure 3. It is
important to note that in many cases a dataset is published in multiple
formats, so these counts do not correspond to the number of datasets available.

Figure 3: Number of datasets available in data.gov and data.gov.uk by format.

It is reasonable then to focus our efforts on the most common formats in order
to cover an important number of datasets with our study. For this work, we did
not consider ZIP files as part of the list of datasets to study, due to the
fact that ZIP files are actually archives containing other files, such as CSV.
Hence, for this study a ZIP file can be considered only as an “extra layer” of
communication, and not a file format that we should study.

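
To illustrate why a ZIP file acts only as an extra layer, the following sketch
lists the formats of the files packed inside an archive. It is a minimal Python
example using the standard zipfile module; the function name
formats_inside_zip and the choice to classify members by file extension are
assumptions of this sketch, not part of the study.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def formats_inside_zip(zip_bytes):
    """Count the file extensions of the members packed in a ZIP archive.

    Hypothetical helper: classifying members by extension is an assumption
    of this sketch, not a method described in the paper.
    """
    counts = Counter()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.infolist():
            if member.is_dir():
                continue  # directories carry no data of their own
            extension = PurePosixPath(member.filename).suffix.lower().lstrip(".")
            counts[extension or "(none)"] += 1
    return counts


if __name__ == "__main__":
    # Build a small archive in memory so the example runs without any files.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("data/obesity.csv", "county,rate\nA,0.1\n")
        archive.writestr("data/notes.pdf", "placeholder")
    print(formats_inside_zip(buffer.getvalue()))
```

Under these assumptions, an archive is described by the formats of its members,
which could then be handled with the format-specific techniques discussed in
the rest of this section.
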
3.1    Data, metadata and annotations
We identify three different sources of information in a dataset that can be
used to create visual overviews: data, metadata and annotations. We distinguish
metadata from annotations in that the former is aimed at providing
machine-processable data about the dataset (e.g., creation date, author of the
dataset), while the latter is more focused on explaining to a human reader
certain aspects of the data (e.g., what a field means or how the data was
collected). As mentioned before, different data formats provide different
levels of support for data and metadata; thus, extracting data, metadata and
annotations from different file formats presents different challenges.

3.2    HTML
HTML is a markup language aimed at writing “scientific documents, although its
general design and adaptations over the years have enabled it to be used to
describe a number of other types of documents” [25]. While not a data format
per se, it has been widely used to publish data in a way that is easy for
humans to consume via a web browser. There are multiple sources of data,
metadata and annotations that we can represent visually.

   • Data: Representing data in HTML can be done in multiple ways, from HTML
     tables to full web applications. In the most basic case, data can be
     presented as a list or a table, structured using the <ul>, <ol> or <table>
     elements. The process of extracting data from HTML documents is known as
     Web Scraping and there are many tools to do so [17][1]. This data can feed
     visual overviews to give insights about the actual content of the dataset.

   • Metadata: HTML provides a mechanism to store metadata, by using the <meta>
     element. In the case of well-formed HTML tables, the header of these
     tables contains valuable metadata as well; the <th> element of a table
     describes the name of each column, something that will tell a user whether
     the dataset is useful for her purposes or not. These metadata elements can
     be extracted with web scraping techniques as well and used to give more
     insight about the structure of the data as well as more information about
     its provenance.

   • Annotations: HTML supports comments in the code between the <!-- and -->
     string sequences. These annotations can be used to extract information
     about the document and the data described in it as well. For example, it
     is possible to obtain the most relevant words in the comments and
     visualize them using a word cloud. It is important to note that in the
     case of HTML, many annotations might be related to the JavaScript code
     used in the document; a smart heuristic could discard potentially
     confusing annotations of this type.

3.3    XML
The Extensible Markup Language [4] is a language focused on structuring data
for the Web, by providing a set of rules on how to encode such data. XML
defines a tree-like structure where each node is a user-defined tag which may
have content and attributes, as can be seen in Figure 4. There are several
entities that can be extracted from a valid XML document to be used in a visual
overview.

    <persons>
     <person>
      <!-- this is a comment -->
      <name>John</name>
      <lastname>Doe</lastname>
      <language iso="EN">English</language>
     </person>
    </persons>

Figure 4: Example of an XML document.

   • Data: It is possible to check for common words, numbers or phrases that
     occur in the content of XML tags. One way to do so is by using XPath [5],
     a query language aimed at extracting data from XML documents. Similar to
     the case of HTML, the data can be used to inform the user about the actual
     content of the dataset.

   • Metadata: There are at least two sources of information that can be used
     for a visual overview. First, the words used as tags and attributes are
     descriptive of the type of content in a dataset. For example, in Figure 4
     the words person, name and lastname give good insight into what the data
     is about. Pre-processing the XML schema with Natural Language Processing
     techniques (e.g., term frequency [20] or entity extraction [18]) can
     provide better insight into what type of information is contained in the
     dataset. Second, the structure in which the data is organized is valuable
     in itself to understand the dataset; identifying the most common patterns
     in an XML structure and representing them visually could give users
     insight into what type of data is available, without the need to download
     the dataset.

   • Annotations: XML allows comments in a similar way as in HTML (see
     Figure 4). XML Schema [14] also provides a series of non-mandatory
     mechanisms to annotate XML documents, by using the xsd:annotation tag.
     Applying NLP techniques as described above could help identify key
     entities related to this dataset (e.g., countries, contributors,
     organizations).

3.4   CSV
Comma-separated values is a loosely used term to define plain text files
structured as tables, using separators (usually a comma, but semicolon and the
tab character are not uncommon). CSV files are popular due to their simplicity
in terms of the readability and processing of the data, by both humans and
computers. In many cases, CSV files are the result of exporting spreadsheet
files (such as Microsoft Excel) into text. In many cases it is possible to
observe headers that define the columns of a CSV file.

   • Data: Since a CSV file is basically a table, it is possible to extract the
     most common terms found in the cells and display them as a bar chart or a
     word cloud
     or another way to present it as a visual overview of the dataset. There
     are tools and libraries for virtually any programming language to read and
     extract data from CSV files.

   • Metadata: Due to its simplicity, little metadata can be found in a CSV
     file. However, as mentioned before, in many cases CSV files contain
     headers that can be used to identify the topics described in the dataset.

   • Annotations: CSV does not support annotations; however, in many cases the
     direct translation from a spreadsheet, such as Microsoft Excel, carries
     over the title and other comments available in it (see Figure 5 as an
     example). These annotations break the table structure of the CSV file and
     make it difficult for programs to read. Still, these annotations can
     provide useful information about the content of the file. A heuristic to
     obtain such annotations could be the following: read each line of a CSV
     file and consider it an annotation, until the header is found.

Figure 5: Example of a spreadsheet version of a dataset. The CSV version does
not respect the table structure, due to the titles and headers that are
exported along with the rest of the data.

3.5    PDF
The use of PDF files to publish data is a common practice among practically all
governments and organizations, although it is widely discouraged and criticized
[15][8]. One of the main reasons is that PDF is a document format, not a data
format. In this sense, PDF does not comply with the Open Government Data
principle [11] that states that data should be in a machine-processable format.
Still, many efforts like Tabula [2] have been developed to extract data from
PDF files.

   • Data: As mentioned before, in the best of cases PDF files contain data
     tables that can be extracted semi-automatically to generate
     visualizations, similar to the case of CSV files.

   • Metadata: Although PDF supports metadata and embeddable raw data [13],
     common tools for creating PDFs do not include metadata beyond some basic
     authorship information. It is not clear what type of metadata may be
     available in the general case to use for a visual overview.

   • Annotations: Similar to the case of XML and HTML documents, annotations in
     PDF can be used to identify relevant terms that can be later used to
     create a visual overview.

3.6     JSON
The JavaScript Object Notation (JSON) is an open standard format that has
gained popularity, especially in the Web development community, due to its
simplicity of consumption by humans and machines alike. JSON provides a
mechanism to transmit objects that can be used to communicate different types
of variables. Many see JSON as a simpler, easier-to-use alternative to XML [6].
An example of a JSON document can be seen in Figure 6.

    {
      "persons": [
        {
          "name": "John",
          "lastname": "Doe",
          "language": {
            "value": "English",
            "iso": "EN"
          }
        }
      ]
    }

Figure 6: A possible JSON representation of the data shown in Figure 4 as XML.

Similar to XML, JSON provides a tree-like structure, but supports different
data types, arrays and other objects as well. Thus, it is possible to extract
similar information as in the case of XML to later be visualized.

   • Data: The values in a JSON document can be used to obtain the most
     significant words or phrases that can be used later to create a
     visualization.

   • Metadata: Collecting the words used as keys can give insights into what
     type of data is presented in the document. Also, the tree structure could
     be used to identify how the data is modeled.

   • Annotations: JSON does not provide a way to annotate or comment documents.

4.     PROTOTYPE
As a way to test our ideas, we developed a demo tool that creates a visual
overview of a dataset. This visual overview consists of a sample of the data
and a dashboard based on the information extracted from a dataset. For
simplicity, our prototype only works with CSV files, but the principles shown
are the same for the other file formats described in Section 3. We implemented
this demo using JavaScript and the D3.js library [3]. The prototype is
available at https://github.com/niclabs/visual-overview as open source
software.

4.1     Rationale
The prototype presents three different levels of detail of the data contained
in a dataset. First, we considered it useful to give the user a sample of the
data, so she can get an idea of what it looks like as a table. To do so, we
included the first three rows of the dataset.

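
The data sample described above, together with the CSV heuristics from Section
3.4, can be sketched as follows. This is a minimal illustration in Python
rather than the JavaScript and D3.js code of the prototype, so the function
names (read_rows, find_header_index, build_overview) and the header test are
assumptions made for this example: a row is treated as the header when it is as
wide as the widest row and has no empty cells, and every non-empty row before
it is kept as an annotation. The sketch returns the annotations, the header,
the first three data rows and the most common values of each column.

```python
import csv
import io
from collections import Counter


def read_rows(csv_text):
    """Parse CSV text into rows, skipping completely empty lines."""
    return [row for row in csv.reader(io.StringIO(csv_text)) if row]


def find_header_index(rows):
    """Return the position of the header row.

    Assumption of this sketch (the paper does not define the test): the
    header is the first row that is as wide as the widest row and has no
    empty cells. Falls back to the first row if no such row exists.
    """
    width = max(len(row) for row in rows)
    for index, row in enumerate(rows):
        if len(row) == width and all(cell.strip() for cell in row):
            return index
    return 0


def build_overview(csv_text, sample_size=3, top_n=5):
    """Collect annotations, header, a data sample and common values per column."""
    rows = read_rows(csv_text)
    if not rows:
        return {"annotations": [], "header": [], "sample": [], "common_values": {}}
    header_index = find_header_index(rows)
    # Every line read before the header is treated as an annotation.
    joined = (
        " ".join(cell.strip() for cell in row if cell.strip())
        for row in rows[:header_index]
    )
    annotations = [text for text in joined if text]
    header = [name.strip() for name in rows[header_index]]
    data = rows[header_index + 1:]
    common_values = {}
    for position, name in enumerate(header):
        # Keep only the non-empty cells of this column.
        values = [
            row[position].strip()
            for row in data
            if position < len(row) and row[position].strip()
        ]
        common_values[name] = Counter(values).most_common(top_n)
    return {
        "annotations": annotations,
        "header": header,
        "sample": data[:sample_size],
        "common_values": common_values,
    }
```

For a file whose first lines hold a title exported from a spreadsheet,
build_overview(text)["sample"] would hold the three rows that follow the
header, and the per-column counts are the kind of input that a bar chart or
word cloud needs. The header test is deliberately simple and may misclassify
files with empty header cells.
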
Second, in our experience most CSV files describe data properties in terms of
columns (in contrast to rows); a CSV column usually contains values related to
a specific dimension (e.g., age, latitude, name). Thus, one reasonable approach
is to create visualizations for each column. As a way to provide a visual
representation of the values in each column, we used word clouds [21]; in this
way, we present the most common values in each column to the user in a way that
is easy to consume without any technical background.

Finally, in many cases it is important to provide more information about the
distribution of values to answer questions such as: Is the data normally
distributed? Does it follow a long tail? Are all the values equally likely?
Although the word cloud provides some insight in this respect, we think a
clearer representation was needed. Thus, a histogram of the values in each
column is provided. This histogram facilitates the understanding of how the
data is distributed and what the most and least common values are.

It is important to note that as a prototype, there are many issues with this
software. For example, a more sophisticated approach would consider the type of
data (e.g., generic numbers, strings, geographical coordinates, times and
dates) and use different visual strategies that are more suitable for each
case. The variety of the available values may also affect what visual strategy
could be used; for example, for the columns sex and count the use of word
clouds is not necessarily the best strategy.

4.2    Use of the prototype
After a user has entered the URL of a dataset, the prototype will analyze the
data in order to extract the most common terms. Although our prototype
processes the data live, it is possible to imagine more sophisticated
mechanisms that deal with larger datasets, such as offline or batch processing.
As mentioned earlier, our prototype provides several visualizations for the CSV
file, including a data sample, and a word cloud and a histogram for each
column. A screenshot of our prototype can be seen in Figure 7.

Figure 7: Screenshot of our prototype. A user can indicate a CSV file available
and the system will render several statistics related to the values present in
the data, as well as the headers available.

5.    FUTURE WORK
Our hypothesis is that these visualizations can facilitate the process of
deciding whether a dataset is useful for a person or not. Thus, we propose to
perform a user study to evaluate how easy or hard it is for a user to find
valuable information in the presence or absence of visual overviews. Also, the
effectiveness of visual overviews may depend on the type of visualizations that
are displayed in different scenarios. Further research is necessary in this
regard.

From this prototype, we can also take several paths. We plan to include support
for other data formats, as described in Section 3. Having a web-based service
available to preview and give insights about a dataset can be a valuable
tool for journalists, activists and Open Government Data
researchers in general. Another option is to promote the use of tools similar
to our prototype to be part of government data portals by default. Most
government organizations already provide a series of tags to help people
identify and understand what each dataset is about. Adding a visual overview
will help them in that effort. Finally, a smarter set of heuristics could be
included in our prototype to provide more suitable visual representations,
based on the type of data available in each dataset. Also, annotations in
datasets could be used to highlight certain visualizations over others.

6.   CONCLUSIONS
In this paper we have proposed the use of visualizations to preview and give
insights about datasets that can be useful and valuable to many stakeholders.
We showed that for most of the more common file formats used to publish Open
Government Data, it is possible to extract valuable information that can be
later used to create visual overviews. We also showed how these visual
overviews can be created using a prototype developed by the authors that
presents a dashboard of visualizations based on the information obtained from a
dataset. Finally, we discussed the different paths this work can take in the
future.

7.   REFERENCES
 [1] ScraperWiki, http://scraperwiki.com/, 2011.
 [2] “Tabula,” http://tabula.nerdpower.org/, 2013.
 [3] M. Bostock, V. Ogievetsky, and J. Heer, “D3 Data-Driven Documents,” IEEE
     Trans. Vis. Comput. Graphics, vol. 17, no. 12, pp. 2301–2309, Dec. 2011.
 [4] T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau,
     “Extensible markup language (XML),” World Wide Web Consortium
     Recommendation REC-xml-19980210.
     http://www.w3.org/TR/1998/REC-xml-19980210, 1998.
 [5] J. Clark, S. DeRose et al., “XML path language (XPath) version 1.0,” 1999.
 [6] D. Crockford, “JSON: The fat-free alternative to XML,” in Proc. of XML,
     vol. 2006, 2006.
 [7] M. Dörk, C. Williamson, and S. Carpendale, “Navigating tomorrow’s web:
     From searching and browsing to visual exploration,” ACM Transactions on
     the Web (TWEB), vol. 6, no. 3, p. 13, 2012.
 [8] M. Fioretti, “Open data: Emerging trends, issues and best practices,”
     Laboratory of, 2011.
 [9] A. Graves and J. Hendler, “Visualization tools for open government data,”
     in Proceedings of the 14th Annual International Conference on Digital
     Government Research. ACM, 2013, pp. 136–145.
[10] S. Greene, G. Marchionini, C. Plaisant, and B. Shneiderman, “Previews and
     overviews in digital libraries: Designing surrogates to support visual
     information seeking,” Journal of the American Society for Information
     Science, vol. 51, no. 4, pp. 380–393, 2000.
[11] O. G. W. Group et al., “Principles of open government data,” in Workshop
     held in Sebastopol, CA, USA. http://www.opengovdata.org/home/8principles,
     8.
[12] J. Hoxha and A. Brahaj, “Open government data on the web: A semantic
     approach,” in Emerging Intelligent Data and Web Technologies (EIDWT), 2011
     International Conference on. IEEE, 2011, pp. 107–113.
[13] J. C. King, “Role of PDF and Open Data,” in Open Data on the Web, Campus
     London, Shoreditch, 2013.
[14] A. Malhotra and P. Biron, “XML schema part 2: Datatypes,” World Wide Web
     Consortium Recommendation REC-xmlschema-2-20041028, 2004.
[15] N. Manning, “Bad metrics and PDF graveyards: why development needs open
     data,” 2013. http://www.theguardian.com/
     global-development-professionals-network/2013/oct/21/
     development-open-data-action.
[16] C. Peng. (2012, Aug.). int. open government data search data analytics.
     linking open government data. [Online]. Available:
     http://logd.tw.rpi.edu/iogds data analytics [Retrieved Nov. 24, 2013]
[17] R. B. Penman, T. Baldwin, and D. Martinez, “Web scraping made simple with
     sitescraper,” 2009.
[18] M. Pennacchiotti and P. Pantel, “Entity extraction via ensemble
     semantics,” in Proceedings of the 2009 Conference on Empirical Methods in
     Natural Language Processing: Volume 1-Volume 1. Association for
     Computational Linguistics, 2009, pp. 238–247.
[19] C. Plaisant, B. Shneiderman, K. Doan, and T. Bruns, “Interface and data
     architecture for query preview in networked information systems,” ACM
     Transactions on Information Systems (TOIS), vol. 17, no. 3, pp. 320–341,
     1999.
[20] G. Salton and M. J. McGill, “Introduction to modern information
     retrieval,” 1983.
[21] C. Seifert, B. Kump, W. Kienreich, G. Granitzer, and M. Granitzer, “On the
     beauty and usability of tag clouds,” in Information Visualisation, 2008.
     IV’08. 12th International Conference. IEEE, 2008, pp. 17–25.
[22] B. Shneiderman, “The eyes have it: A task by data type taxonomy for
     information visualizations,” in Visual Languages, 1996. Proceedings., IEEE
     Symposium on. IEEE, 1996, pp. 336–343.
[23] E. Tanin, C. Plaisant, and B. Shneiderman, “Broadening access to large
     online databases by generalizing query previews,” The craft of information
     visualization: readings and reflections, p. 31, 2003.
[24] E. R. Tufte and P. Graves-Morris, The visual display of quantitative
     information. Graphics press Cheshire, CT, 1983, vol. 2.
[25] W3C, “HTML5, A vocabulary and associated APIs for HTML and XHTML,”
     http://www.w3.org/TR/html5/introduction.html, 2014.

</pre>