=Paper=
{{Paper
|id=None
|storemode=property
|title=Towards Visual Overviews for Open Government Data
|pdfUrl=https://ceur-ws.org/Vol-1210/datawiz2014_04.pdf
|volume=Vol-1210
|dblpUrl=https://dblp.org/rec/conf/ht/GravesB14
}}
==Towards Visual Overviews for Open Government Data==
Alvaro Graves, Inria Chile, Av. Apoquindo 2827, piso 12, Santiago, Chile, alvaro.graves@inria.cl

Javier Bustos-Jiménez, NIC Chile Research Labs, Blanco Encalada 1975, Santiago, Chile, jbustos@niclabs.cl

===ABSTRACT===
The rise of Open Data initiatives has led to the publication of many datasets from different organizations and governments. These datasets cover a wide range of knowledge domains, from budget to education to health care. However, not all datasets have the quality, granularity or type of information that is relevant to each user. Moreover, in many cases the description or metadata does not clearly specify the content of a dataset, which makes it harder for stakeholders to explore datasets. In this paper we propose the use of dashboards and visualizations as a way to preview the content of datasets for easier exploration. Visualizations can provide a rapid way to select or discard datasets based on their content, reducing the number of datasets that a user may need to look at in order to get what she needs.

===Categories and Subject Descriptors===
D.2.2 [Software Engineering]: Design Tools and Techniques—User interfaces; H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems—Methodology; I.7.4 [Document and Text Processing]: Electronic Publishing

===General Terms===
Documentation, Human Factors, Open Government Data

===Keywords===
Open Government Data, Open Data, Preview, Data Visualization

===1. INTRODUCTION===
Over one million datasets [16] are currently available in different portals across the globe. Although the data is publicly available, its organization and structure are not necessarily clear to all stakeholders. For example, at the time of this writing, a search for “child obesity” in Data.gov and Data.gov.uk (the two largest Open Government Data portals) gives different results, as can be seen in Figure 1. In the case of Data.gov, only one dataset is available (in several formats). This dataset is described as “federal”; however, a closer look shows that the data relates to the state of New York only. In the case of Data.gov.uk, 16 results provide information related to child and obesity in PDF and Excel formats. Beyond these differences, it is not clear to a researcher or developer whether these datasets are relevant to her needs; having a title and a description is useful, but does not clarify exactly what type of information is available, or at what granularity and quality.

Figure 1: Results provided by searching for child obesity in Data.gov (upper image) and Data.gov.uk (lower image).

For example, it is not clear what specific data is contained in a dataset, what structure is used or what the scope of the dataset is. As mentioned earlier, the US dataset about child obesity is labeled as “federal”, yet the data describes only New York State; it is likely that other manually curated tags and descriptions are also imprecise about the content or scope of the datasets published. Thus, the question in this and many other cases is: how can stakeholders know in advance what is in a dataset before downloading it? We propose the use of dashboards and visualizations to describe and preview the content of datasets; these visual representations will help stakeholders decide whether a dataset is useful for them or not.

This paper is structured as follows: Section 2 describes related work found in the literature and state-of-the-art technology. Section 3 discusses different pieces of information that can be used to create visual overviews from some of the more common file formats used to publish Open Government Data.
In Section 4 we show a prototype developed as an example of how visual overviews of datasets can be created using the information discussed previously. Section 5 presents the future challenges of our research, and we discuss our conclusions in Section 6.

===2. RELATED WORK===
The problem of good data visualizations has been studied for many years [24]. In terms of data exploration and visualization, Shneiderman [22] summarizes the Visual Information Seeking Mantra as “Overview first, zoom and filter, then details-on-demand”; humans need to get the “big picture” of a dataset first in order to decide where to explore next. Thus, a visual overview of a dataset can be useful for researchers and journalists to know “what's in there” before taking further action.

One of the seminal works in dataset preview was made by Doan et al. in 1999. They studied the effects of visual previews of queries for NASA's EOSDIS datasets [19], concluding that the main advantages of these visual strategies were:
• “eliminate zero-hit queries,
• reduces network activity and browsing effort by preventing the retrieval of undesired datasets,
• represents statistical information of database visually to aid comprehension and exploration,
• support dynamic queries, which aids users to discover dataset patterns and exceptions, and
• (they are) suitable to novice, intermittent, or expert users”.

A generalization of query previews is presented in the work of Tanin et al. [23], which complements the work of Doan et al. with bar charts in order to show data distribution.

At the beginning of this century, similar conclusions were reached by Greene et al. in their study of how previews and overviews allow users to rapidly discriminate useful information from information that is not of interest [10]. They applied their findings to the interfaces provided by the Digital Library of Congress and concluded that “previews should be available at a high level within a site so users get a taste of what is to come early in their visit”.

Nowadays, the principles behind the works above seem to be suitable for open data publication, as has been reported for web searching by Dörk et al. [7], who studied the performance and benefits of a new approach called visual exploration for information seeking on the Web (Figure 2).

Figure 2: Visual exploration interface proposed by Dörk et al. [7], which includes data collection choosers, visualization widgets, text query box and the current set of results.

From the perspective of Open Government Data, visualizations are valuable and useful artifacts for users [12]; visualizations can provide feedback and help in the decision-making process related to public policies. A survey [9] showed that many stakeholders found that users were interested in interacting with data via visualizations. Hence, there is reasonable evidence to support our hypothesis that preview visualizations can be a useful tool for Open Government Data stakeholders.

===3. CONTENT FOR VISUAL OVERVIEWS===
Different formats provide different support for data, metadata, annotations and other extra information that can help users identify datasets that are valuable for them. In order to understand which formats are most often used to publish Open Government Data, we looked at Data.gov and Data.gov.uk, two of the largest government data portals. We took the most popular formats reported by these portals and found that most datasets are published in HTML, followed by XML, ZIP, CSV, PDF and JSON, as can be seen in Figure 3. It is important to note that in many cases a dataset is published in multiple formats, so these numbers do not add up to the number of datasets available.

Figure 3: Number of datasets available in data.gov and data.gov.uk by format.

It is reasonable then to focus our efforts on the most common formats in order to cover a significant number of datasets with our study. For this work, we did not consider ZIP files as part of the list of formats to study, because ZIP files are actually archives containing other files, such as CSV. Hence, for this study a ZIP file can be considered only as an “extra layer” of communication, and not a file format that we should study.

====3.1 Data, metadata and annotations====
We identify three different sources of information in a dataset that can be used to create visual overviews: data, metadata and annotations. We distinguish metadata from annotations in that the former is aimed at providing machine-processable data about the dataset (e.g., creation date, author of the dataset), while the latter is focused on explaining certain aspects of the data to a human reader (e.g., what a field means or how the data was collected). As mentioned before, different data formats provide different levels of support for data and metadata; thus, extracting data, metadata and annotations from different file formats presents different challenges.

====3.2 HTML====
HTML is a markup language designed to write “scientific documents, although its general design and adaptations over the years have enabled it to be used to describe a number of other types of documents” [25]. While not a data format per se, it has been widely used to publish data in a way that is easy for humans to consume via a web browser. There are multiple sources of data, metadata and annotations that we can represent visually.
• Data: Representing data in HTML can be done in multiple ways, from HTML tables to full web applications. In the most basic case, data can be presented as a list or a table, structured using the [...]
• Metadata: [...] element on a table will describe the name of each column, something that will tell a user if the dataset is useful for her purposes or not. These metadata elements can be extracted with web scraping techniques as well and used to give more insight about the structure of the data, as well as more information about its provenance.
• Annotations: HTML supports comments in the code between the <!-- and --> string sequences. These annotations can also be used to extract information about the document and the data described in it. For example, it is possible to obtain the most relevant words in the comments and visualize them using a word cloud. It is important to note that in the case of HTML, many annotations might be related to the JavaScript code used in the document; a smart heuristic could discard potentially confusing annotations of this type.

====3.3 XML====
The Extensible Markup Language [4] is a language focused on structuring data for the Web, by providing a set of rules on how to encode such data. XML defines a tree-like structure where each node is a user-defined tag which may have content and attributes, as can be seen in Figure 4. There are several entities that can be extracted from a valid XML document to be used in a visual overview.

Figure 4: Example of an XML document.

• Data: It is possible to check for common words, numbers or phrases that occur in the content of XML tags. One way to do so is by using XPath [5], a query language aimed at extracting data from XML documents. Similar to the case of HTML, the data can be used to inform the user about the actual content of the dataset.
• Metadata: There are at least two sources of information that can be used for a visual overview. First, the words used as tags and attributes are descriptive of the type of content found in a dataset. For example, in Figure 4, the words person, name and lastname [...] dataset.
• Annotations: XML allows comments in a similar way as HTML (see Figure 4). XML Schema [14] also provides a series of non-mandatory mechanisms to annotate XML documents, by using the xsd:annotation tag. Applying NLP techniques as described above could help identify key entities related to this dataset (e.g., countries, contributors, organizations).

====3.4 CSV====
Comma-separated values is a loosely used term for plain text files structured as tables, using separators (usually a comma, but the semicolon and the tab character are not uncommon). CSV files are popular due to their simplicity in terms of readability and processing of the data, by both humans and computers. In many cases, CSV files are the result of exporting spreadsheet files (such as Microsoft Excel) into text. In many cases it is possible to observe headers that define the columns of a CSV file.
• Data: Since a CSV file is basically a table, it is possible to extract the most common terms found in the cells and display them as a bar chart, a word cloud or some other visual overview of the dataset. There are tools and libraries for virtually any programming language to read and extract data from CSV files.
• Metadata: Due to its simplicity, little metadata can be found in a CSV file. However, as mentioned before, in many cases CSV files contain headers that can be used to identify the topics described in the dataset.
• Annotations: CSV does not support annotations; however, in many cases the direct translation from a spreadsheet, such as Microsoft Excel, carries over the title and other comments available in it (see Figure 5 as an example). These annotations break the table structure of the CSV file and make it difficult for programs to read. Still, these annotations can provide useful information about the content of the file. A heuristic to obtain such annotations could be the following: read each line of a CSV file and consider it an annotation, until the header is found.

Figure 5: Example of a spreadsheet version of a dataset. The CSV version does not respect the table structure, due to the titles and headers that are exported along with the rest of the data.

====3.5 PDF====
The use of PDF files to publish data is a common practice among practically all governments and organizations, although it is widely discouraged and criticized [15][8]. One of the main reasons is that PDF is a document format, not a data format. In this sense, PDF does not comply with the Open Government Data principle [11] that states that data should be in a machine-processable format. Still, many efforts like Tabula [2] have been developed to extract data from PDF files.
• Data: As mentioned before, in the best of cases PDF files contain data tables that can be extracted semi-automatically to generate visualizations, similar to the case of CSV files.
• Metadata: Although PDF supports metadata and embeddable raw data [13], common tools for creating PDFs do not include metadata beyond some basic authorship information. It is not clear what type of metadata may be available in the general case to use for a visual overview.
• Annotations: Similar to the case of XML and HTML documents, annotations in PDF can be used to identify relevant terms that can later be used to create a visual overview.

====3.6 JSON====
The JavaScript Object Notation (JSON) is an open standard format that has gained popularity, especially in the Web development community, because it is simple to consume for humans and machines alike. JSON provides a mechanism to transmit objects that can be used to communicate different types of variables. Many see JSON as a simpler, easier-to-use alternative to XML [6]. An example of a JSON document can be seen in Figure 6.

 {
   "persons": [
     {
       "name": "John",
       "lastname": "Doe",
       "language": {
         "value": "English",
         "iso": "EN"
       }
     }
   ]
 }

Figure 6: A possible JSON representation of the data shown in Figure 4 as XML.

Similar to XML, JSON provides a tree-like structure, but it also supports different data types, arrays and nested objects. Thus, it is possible to extract similar information as in the case of XML for later visualization.
• Data: The values in a JSON document can be used to obtain the most significant words or phrases, which can later be used to create a visualization.
• Metadata: Collecting the words used as keys can give insights into what type of data is presented in the document. Also, the tree structure could be used to identify how the data is modeled.
• Annotations: JSON does not provide a way to annotate or comment documents.

===4. PROTOTYPE===
As a way to test our ideas, we developed a demo tool that creates a visual overview of a dataset. This visual overview consists of a sample of the data and a dashboard based on the information extracted from the dataset. For simplicity, our prototype only works with CSV files, but the principles shown are the same for the other file formats described in Section 3. We implemented this demo using JavaScript and the D3.js library [3]. The prototype is available at https://github.com/niclabs/visual-overview as open source software.

====4.1 Rationale====
The prototype presents three different levels of detail of the data contained in a dataset. First, we considered it useful to give the user a sample of the data, so she can get an idea of what it looks like as a table. To do so, we included the first three rows of the dataset.

Second, in our experience most CSV files describe data properties in terms of columns (in contrast to rows); a CSV column usually contains values related to a specific dimension (e.g., age, latitude, name). Thus, one reasonable approach is to create visualizations for each column. As a way to provide a visual representation of the values in each column, we used word clouds [21]; in this way, we present the most common values in each column to the user in a way that is easy to consume without any technical background.

Finally, in many cases it is important to provide more information about the distribution of values to answer questions such as: Is the data normally distributed? Does it follow a long tail? Are all the values equally likely? Although the word cloud provides some insight in this respect, we think a clearer representation is needed. Thus, a histogram of the values in each column is provided. This histogram facilitates the understanding of how the data is distributed and what the most and least common values are.

It is important to note that, as a prototype, this software has many limitations. For example, a more sophisticated approach would consider the type of data (i.e., generic numbers, strings, geographical coordinates, time and dates) and use different visual strategies that are more suitable for each case. The variety of the available values may also affect which visual strategy could be used; for example, for the columns sex and count the use of word clouds is not necessarily the best strategy.

Figure 7: Screenshot of our prototype. A user can indicate an available CSV file and the system will render several statistics related to the values present in the data, as well as the headers available.

====4.2 Use of the prototype====
After a user has entered the URL of a dataset, the prototype analyzes the data in order to extract the most common terms. Although our prototype processes the data live, it is possible to imagine more sophisticated mechanisms that deal with larger datasets, such as offline or batch processing. As mentioned earlier, our prototype provides several visualizations for the CSV file, including a data sample, plus a word cloud and a histogram for each column. A screenshot of our prototype can be seen in Figure 7.

===5. FUTURE WORK===
Our hypothesis is that these visualizations can facilitate the process of deciding whether a dataset is useful for a person or not. Thus, we propose to perform a user study to evaluate how easy or hard it is for a user to find valuable information in the presence or absence of visual overviews. The effectiveness of visual overviews may also depend on the type of visualizations that are displayed in different scenarios. Further research is necessary in this regard.

From this prototype, we can also take several paths. We plan to include support for other data formats, as described in Section 3. Having a web-based service available to preview and give insights about a dataset can be a valuable tool for journalists, activists and Open Government Data researchers in general. Another option is to promote tools similar to our prototype as a default part of government data portals. Most government organizations already provide a series of tags to help people identify and understand what each dataset is about. Adding a visual overview will help them in that effort. Finally, a smarter set of heuristics could be included in our prototype to provide more suitable visual representations, based on the type of data available in each dataset. Also, annotations in datasets could be used to highlight certain visualizations over others.

===6. CONCLUSIONS===
In this paper we have proposed the use of visualizations to preview and give insights about datasets, which can be useful and valuable to many stakeholders. We showed that for most of the more common file formats used to publish Open Government Data, it is possible to extract valuable information that can later be used to create visual overviews. We also showed how these visual overviews can be created, using a prototype developed by the authors that presents a dashboard of visualizations based on the information obtained from a dataset. Finally, we discussed the different paths this work can take in the future.

===7. REFERENCES===
[1] ScraperWiki, 2011. http://scraperwiki.com/.
[2] “Tabula,” http://tabula.nerdpower.org/, 2013.
[3] M. Bostock, V. Ogievetsky, and J. Heer, “D3 Data-Driven Documents,” IEEE Trans. Vis. Comput. Graphics, vol. 17, no. 12, pp. 2301–2309, Dec. 2011.
[4] T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau, “Extensible markup language (XML),” World Wide Web Consortium Recommendation REC-xml-19980210. http://www.w3.org/TR/1998/REC-xml-19980210, 1998.
[5] J. Clark, S. DeRose et al., “XML path language (XPath) version 1.0,” 1999.
[6] D. Crockford, “JSON: The fat-free alternative to XML,” in Proc. of XML, vol. 2006, 2006.
[7] M. Dörk, C. Williamson, and S. Carpendale, “Navigating tomorrow's web: From searching and browsing to visual exploration,” ACM Transactions on the Web (TWEB), vol. 6, no. 3, p. 13, 2012.
[8] M. Fioretti, “Open data: Emerging trends, issues and best practices,” Laboratory of, 2011.
[9] A. Graves and J. Hendler, “Visualization tools for open government data,” in Proceedings of the 14th Annual International Conference on Digital Government Research. ACM, 2013, pp. 136–145.
[10] S. Greene, G. Marchionini, C. Plaisant, and B. Shneiderman, “Previews and overviews in digital libraries: Designing surrogates to support visual information seeking,” Journal of the American Society for Information Science, vol. 51, no. 4, pp. 380–393, 2000.
[11] O. G. W. Group et al., “Principles of open government data,” in Workshop held in Sebastopol, CA, USA. http://www.opengovdata.org/home/8principles, 8.
[12] J. Hoxha and A. Brahaj, “Open government data on the web: A semantic approach,” in Emerging Intelligent Data and Web Technologies (EIDWT), 2011 International Conference on. IEEE, 2011, pp. 107–113.
[13] J. C. King, “Role of PDF and Open Data,” in Open Data on the Web, Campus London, Shoreditch, 2013.
[14] A. Malhotra and P. Biron, “XML schema part 2: Datatypes,” World Wide Web Consortium Recommendation REC-xmlschema-2-20041028, 2004.
[15] N. Manning. (2013) Bad metrics and PDF graveyards: why development needs open data. http://www.theguardian.com/global-development-professionals-network/2013/oct/21/development-open-data-action.
[16] C. Peng. (2012, Aug.). int. open government data search data analytics. linking open government data. [Online]. Available: http://logd.tw.rpi.edu/iogds data analytics [Retrieved Nov. 24, 2013]
[17] R. B. Penman, T. Baldwin, and D. Martinez, “Web scraping made simple with sitescraper,” 2009.
[18] M. Pennacchiotti and P. Pantel, “Entity extraction via ensemble semantics,” in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1. Association for Computational Linguistics, 2009, pp. 238–247.
[19] C. Plaisant, B. Shneiderman, K. Doan, and T. Bruns, “Interface and data architecture for query preview in networked information systems,” ACM Transactions on Information Systems (TOIS), vol. 17, no. 3, pp. 320–341, 1999.
[20] G. Salton and M. J. McGill, “Introduction to modern information retrieval,” 1983.
[21] C. Seifert, B. Kump, W. Kienreich, G. Granitzer, and M. Granitzer, “On the beauty and usability of tag clouds,” in Information Visualisation, 2008. IV'08. 12th International Conference. IEEE, 2008, pp. 17–25.
[22] B. Shneiderman, “The eyes have it: A task by data type taxonomy for information visualizations,” in Visual Languages, 1996. Proceedings., IEEE Symposium on. IEEE, 1996, pp. 336–343.
[23] E. Tanin, C. Plaisant, and B. Shneiderman, “Broadening access to large online databases by generalizing query previews,” The craft of information visualization: readings and reflections, p. 31, 2003.
[24] E. R. Tufte and P. Graves-Morris, The visual display of quantitative information. Graphics Press, Cheshire, CT, 1983, vol. 2.
[25] W3C, “HTML5, A vocabulary and associated APIs for HTML and XHTML,” http://www.w3.org/TR/html5/introduction.html, 2014.
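The CSV annotation heuristic of Section 3.4 (treat every line as an annotation until the header is found) and the per-column value counts that feed the word clouds and histograms of Section 4.1 could be sketched as follows. This is a minimal illustration in plain JavaScript, not the code of the prototype. The function names `splitAnnotations` and `columnFrequencies` are hypothetical, the paper does not say how the header is recognized (the rule used here, "first line with as many non-empty fields as the widest line", is an assumption), and the naive split does not handle quoted fields.

```javascript
// Naive field splitter. Assumption: no quoted fields that contain the separator.
function splitLine(line, sep = ",") {
  return line.split(sep).map((field) => field.trim());
}

// Assumed header rule (not from the paper): the header is the first line whose
// number of non-empty fields equals the largest such number in the file.
// Every line before it is treated as an annotation.
function splitAnnotations(text, sep = ",") {
  const lines = text.split(/\r?\n/).filter((line) => line.trim() !== "");
  const fields = lines.map((line) => splitLine(line, sep));
  const filled = fields.map((f) => f.filter((x) => x !== "").length);
  const width = Math.max(0, ...filled);
  const headerIndex = filled.indexOf(width);
  if (headerIndex === -1) {
    return { annotations: [], header: [], rows: [] };
  }
  return {
    // Keep only the text of each annotation line, dropping empty cells.
    annotations: fields
      .slice(0, headerIndex)
      .map((f) => f.filter((x) => x !== "").join(" ")),
    header: fields[headerIndex],
    rows: fields.slice(headerIndex + 1),
  };
}

// For each column, count how often each value occurs, most common first.
// The result can drive a word cloud or a histogram per column.
function columnFrequencies(header, rows) {
  const result = {};
  header.forEach((name, i) => {
    const counts = new Map();
    for (const row of rows) {
      const value = row[i];
      if (value === undefined || value === "") continue;
      counts.set(value, (counts.get(value) || 0) + 1);
    }
    result[name] = [...counts.entries()].sort((a, b) => b[1] - a[1]);
  });
  return result;
}

// Usage with a made-up spreadsheet export that carries two title lines.
const sample = [
  "Child obesity survey,,",
  "Source: health department,,",
  "county,sex,count",
  "Albany,F,12",
  "Albany,M,15",
  "Bronx,F,12",
].join("\n");

const parsedSample = splitAnnotations(sample);
console.log(parsedSample.annotations);
console.log(columnFrequencies(parsedSample.header, parsedSample.rows));
```

A real implementation would likely use a proper CSV parser and a more careful header test, since a data row can be as wide as the header and a title line can contain commas.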
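Section 3.6 suggests using JSON keys as metadata, values as data, and the tree structure as a hint of how the data is modeled. A minimal sketch of that idea follows, again in plain JavaScript and under stated assumptions: the function name `collectKeysAndValues`, the `[]` marker for array levels in a path, and the choice to count every leaf value as a string are all illustrative and do not come from the paper or the prototype.

```javascript
// Walk a parsed JSON document and collect:
//   keys   - how often each key name occurs (metadata),
//   values - how often each leaf value occurs (data),
//   paths  - the distinct key paths, a rough picture of the tree structure.
// Assumption: arrays are marked with "[]" in a path; null leaves are skipped.
function collectKeysAndValues(
  node,
  path = "",
  out = { keys: new Map(), values: new Map(), paths: new Set() }
) {
  if (node === null) {
    return out;
  }
  if (Array.isArray(node)) {
    for (const item of node) {
      collectKeysAndValues(item, path + "[]", out);
    }
  } else if (typeof node === "object") {
    for (const [key, child] of Object.entries(node)) {
      out.keys.set(key, (out.keys.get(key) || 0) + 1);
      const childPath = path ? `${path}.${key}` : key;
      out.paths.add(childPath);
      collectKeysAndValues(child, childPath, out);
    }
  } else {
    const value = String(node);
    out.values.set(value, (out.values.get(value) || 0) + 1);
  }
  return out;
}

// Usage with the document of Figure 6.
const figure6 = JSON.parse(
  '{"persons":[{"name":"John","lastname":"Doe",' +
    '"language":{"value":"English","iso":"EN"}}]}'
);
const overview = collectKeysAndValues(figure6);
console.log([...overview.keys.keys()]);
console.log([...overview.paths]);
console.log([...overview.values.keys()]);
```

The key counts and value counts have the same shape as word frequencies, so they could be passed to the same word cloud or histogram used for CSV columns.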