=Paper= {{Paper |id=None |storemode=property |title=Towards Visual Overviews for Open Government Data |pdfUrl=https://ceur-ws.org/Vol-1210/datawiz2014_04.pdf |volume=Vol-1210 |dblpUrl=https://dblp.org/rec/conf/ht/GravesB14 }} ==Towards Visual Overviews for Open Government Data== https://ceur-ws.org/Vol-1210/datawiz2014_04.pdf
     Towards Visual Overviews for Open Government Data

                          Alvaro Graves                                    Javier Bustos-Jiménez
                           Inria Chile                                      NIC Chile Research Labs
                   Av. Apoquindo 2827, piso 12                               Blanco Encalada 1975
                         Santiago, Chile                                        Santiago, Chile
                    alvaro.graves@inria.cl                                    jbustos@niclabs.cl


ABSTRACT
The rise of Open Data initiatives has led to the publication of many
datasets from different organizations and governments. These datasets
cover a wide range of knowledge domains, from budget to education to
health care. However, not all datasets have the quality, granularity or
type of information that is relevant to each user. Moreover, in many
cases the description or metadata does not clearly specify the content
of a dataset, which hinders the exploration of datasets by stakeholders.
In this paper we propose the use of dashboards and visualizations as a
way to preview the content of datasets for easier exploration. The use
of visualizations can provide a rapid way to select or discard datasets
based on their content, reducing the number of datasets that a user may
need to look at in order to get what she needs.

Figure 1: Results provided by searching for child obesity in Data.gov
(upper image) and Data.gov.uk (lower image)

Categories and Subject Descriptors
D.2.2 [Software Engineering]: Design Tools and Techniques—User
interfaces; H.5.1 [Information Interfaces and Presentation]: Multimedia
Information Systems—Methodology; I.7.4 [Document and Text Processing]:
Electronic Publishing

General Terms
Documentation, Human Factors, Open Government Data

Keywords
Open Government Data, Open Data, Preview, Data Visualization

1.   INTRODUCTION
Over one million datasets [16] are currently available in different
portals across the globe. Although the data is publicly available, its
organization and structure are not necessarily clear to all
stakeholders. For example, at the time of this writing, a search for
“child obesity” in Data.gov and Data.gov.uk (the two largest Open
Government Data portals) gives different results, as can be seen in
Figure 1. In Data.gov, only one dataset is available (in several
formats). This dataset is described as “federal”; however, a closer look
shows that the data is related to the state of New York only. In the
case of Data.gov.uk, 16 results provide information related to child and
obesity in PDF and Excel formats. Beyond these differences, it is not
clear to a researcher or developer whether these datasets are relevant
to her needs; having a title and a description is useful, but does not
clarify exactly what type of information, granularity and quality of
data is available.

For example, it is not clear what specific data is contained in a
dataset, what structure is used or what the scope of the dataset is. As
mentioned earlier, the US dataset about child obesity is labeled as
“federal”, yet the data describes only New York State; it is likely that
other manually curated tags and descriptions are also imprecise about
the content or scope of the datasets published. Thus, the question in
this and many other cases is: how can stakeholders know in advance
what’s in a dataset before downloading it? We propose the use of
dashboards and visualizations to describe and preview the content of
datasets; these visual representations will help stakeholders decide
whether a dataset is useful for them or not.

This paper is structured as follows: Section 2 describes related work
found in the literature and state-of-the-art technology. Section 3
discusses different pieces of information that can be used to create
visual overviews from some of the more common file formats used to
publish Open Government Data. In Section 4 we show a prototype developed
as an example of what can be done to create visual overviews of datasets
using the information discussed previously. Section 5 presents the
future challenges of our research and we discuss our conclusions in
Section 6.

2.    RELATED WORK
The problem of producing good data visualizations has been studied for
many years [24]. In terms of data exploration and visualization,
Shneiderman [22] summarizes the Visual Information Seeking Mantra as
Overview first, zoom and filter, then details-on-demand; humans need to
get the “big picture” of a dataset first in order to decide where to
explore next. Thus, a visual overview of a dataset can be useful for
researchers and journalists to know “what’s in there” before taking
further action.

One of the seminal works on dataset previews is that of Doan et al. in
1999. They studied the effects of visual previews of queries for NASA’s
EOSDIS datasets [19], concluding that the main advantages of these
visual strategies were:

     • “eliminate zero-hit queries,
     • reduces network activity and browsing effort by preventing
       the retrieval of undesired datasets,
     • represents statistical information of database visually
       to aid comprehension and exploration,
     • support dynamic queries, which aids users to discover
       dataset patterns and exceptions, and
     • (they are) suitable to novice, intermittent, or expert
       users”.

A generalization of query previews is presented in the work of Tanin et
al. [23], complementing the work of Doan et al. with bar charts in order
to show data distribution.

At the beginning of this century, similar conclusions were reached by
Green et al. in their study of how previews and overviews allow users to
rapidly discriminate useful information from information that is not of
interest [10]. They applied their findings to the interfaces provided by
the Digital Library of Congress, concluding that “previews should be
available at a high level within a site so users get a taste of what is
to come early in their visit”.

Nowadays, the principles behind the above works seem to be suitable for
open data publication, as has been reported for web searching by Dörk et
al. [7], who studied the performance and benefits of a new approach
called visual exploration for information seeking on the Web (Figure 2).

Figure 2: Visual exploration interface proposed by Dörk et al. [7],
which includes data collection choosers, visualization widgets, text
query box and the current set of results.

From the perspective of Open Government Data, visualizations are
valuable and useful artifacts for users [12]; visualizations can provide
feedback and help in the decision-making process related to public
policies. A survey [9] showed that many stakeholders found that users
were interested in interacting with data via the use of visualizations.
Hence, there is reasonable evidence to support our hypothesis that
preview visualizations can be a useful tool for Open Government Data
stakeholders.

3.   CONTENT FOR VISUAL OVERVIEWS
Different formats provide different support for data, metadata,
annotations and other extra information that can help users identify
datasets that are valuable for them. In order to understand which
formats are most often used to publish Open Government Data, we looked
at Data.gov and Data.gov.uk, two of the largest government data portals.
We took the most popular formats reported by these portals and found
that most datasets are published in HTML, followed by XML, ZIP, CSV, PDF
and JSON, as can be seen in Figure 3. It is important to note that in
many cases a dataset is published in multiple formats, so these numbers
do not correspond to the number of datasets available.

Figure 3: Number of datasets available in data.gov and data.gov.uk by
format.

It is reasonable then to focus our efforts on the most common formats in
order to cover a significant number of datasets with our study. For this
work, we do not consider ZIP files as part of the list of datasets to
study, because ZIP files are actually archives containing other files,
such as CSV. Hence, for this study a ZIP file can be considered only as
an “extra layer” of communication, and not a file format that we should
study.
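
As a minimal sketch of treating a ZIP file as an “extra layer”, the
following Python code looks inside an archive and tallies the formats of
the files it contains, so that the inner files (e.g., CSV) can be studied
instead of the archive itself. The function names inner_formats and
make_sample_zip, the sample file names and the “unknown” label are
assumptions made for illustration only; they are not part of the
prototype described in this paper.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def inner_formats(zip_bytes: bytes) -> Counter:
    """Tally the file extensions of the members of a ZIP archive."""
    counts = Counter()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data of their own
            # Normalize the extension: lower-case, without the leading dot.
            suffix = PurePosixPath(info.filename).suffix.lower().lstrip(".")
            counts[suffix or "unknown"] += 1
    return counts


def make_sample_zip() -> bytes:
    """Build a small in-memory archive (hypothetical content) for the demo."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("sample/first.csv", "a,b\n1,2\n")
        archive.writestr("sample/second.CSV", "a,b\n3,4\n")
        archive.writestr("README", "Hypothetical sample archive.\n")
    return buffer.getvalue()


if __name__ == "__main__":
    # Print the tally of inner formats for the sample archive.
    print(inner_formats(make_sample_zip()))
```

Reading the archive from bytes rather than from a path keeps the sketch
independent of how the file was obtained; a portal download could be
passed in unchanged.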
                                                                    
3.1    Data, metadata and annotations
We identify three different sources of information in a dataset that can
be used to create visual overviews: data, metadata and annotations. We
distinguish metadata from annotations in that the former aims to provide
machine-processable data about the dataset (e.g., creation date, author
of the dataset), while the latter focuses on explaining certain aspects
of the data to a human reader (e.g., what a field means or how the data
was collected). As mentioned before, different data formats provide
different levels of support for data and metadata; thus, extracting
data, metadata and annotations from different file formats presents
different challenges.

3.2    HTML
HTML is a markup language aimed at writing “scientific documents,
although its general design and adaptations over the years have enabled
it to be used to describe a number of other types of documents” [25].
While not a data format per se, it has been widely used to publish data
in a way that is easy for humans to consume via a web browser. There are
multiple sources of data, metadata and annotations that we can represent
visually.

   • Data: Representing data in HTML can be done in multiple ways,
     from HTML tables to full web applications. In the most basic
     case, data can be presented as a list or a table, structured
     using the

Figure 4: Example of an XML document (element values: John, Doe,
English).

content and attributes, as can be seen in Figure 4. There are several
entities that can be extracted from a valid XML document to be used in a
visual overview.

   • Data: It is possible to check for common words, numbers or
     phrases that occur in the content of XML tags. One way to do so
     is by using XPath [5], a query language designed to extract data
     from XML documents. Similar to the case of HTML, the data can be
     used to inform the user about the actual content of the dataset.
   • Metadata: There are at least two sources of information that
     can be used for a visual overview. First, the words used as tags
     and attributes are descriptive of the type of content in a
     dataset. For example, in Figure 4, the words person, name and
     lastname