How can we figure out what is inside thousands of spreadsheets? Thomas Levine _@thomaslevine.com ABSTRACT bonds. In situations like this, I usually need to know the We have enough data today that we it may not be realistic name of the dataset or to ask around until I find the name. to understand all of them. In hopes of vaguely understand- I wanted a faster and more systematic approach to this. ing these data, I have been developing methods for exploring the contents of large collections of weakly structured spread- 2. TYPICAL APPROACHES TO EXPLOR- sheets. We can get some feel for the contents of these col- lections by assembling metadata about many spreadsheets ING THE CONTENTS OF SPREADSHEETS and run otherwise typical analyses on the data-about-data; Before we discuss my spreadsheet exploration methods, let’s this gives us some understanding patterns in data publish- discuss some more ordinary methods that I see in common ing and a crude understanding of the contents. I have also use today. developed spreadsheet-specific search tools that try to find related spreadsheets based on similarities in implicit schema. 2.1 Look at every spreadsheet By running crude statistics across many disparate datasets, As a baseline, one approach is to look manually at every we can learn a lot about unweildy collections of poorly struc- cell in many spreadsheets. This takes a long time, but it is tured data. feasible in some situations. Keywords 2.2 Use standard metaformats data management, spreadsheets, open data, search Many groups develop domain-specific metaformats for ex- pressing a very specific sort of data. For example, JSON API is a metaformat for expressing the response of a database 1. INTRODUCTION query on the web [4], Data Packages is a metaformat for These days, we have more data than we know what to do expressing metadata about a dataset [17], and KML is a with. And by ”data”, we often mean unclean, poorly doc- metaformat for expressing annotations of geographic maps umented spreadsheets. I started wondering what was in all [19]. of these spreadsheets. Addressing my curiosity turned out to be quite difficult, so I’ve found up developing various ap- Agreement on format and metaformat makes it faster and proaches to understanding the contents of large collections easier to inspect individual files. On the other hand, it does of weakly structured spreadsheets. not alleviate the need to acquire lots of different files and to at least glance at them. We spend less time manually My initial curiosity stemmed from the release of thousands of inspecting each dataset, but we must still manually inspect spreadsheets in government open data initiatives. I wanted lots of dataset. to know what they had released so that I may find interesting things in it. The same sort of thing happens when data publishers pro- vide graphs of each individual dataset. When we provide More practically, I often am looking for data from multi- some graphs of a dataset rather than simply the standard ple sources that I can connect in relation to a particular data file, we are trying to make it easier for people to un- topic. For example, in a project I had data about cash derstand that particular dataset, rather than trying to focus flows through the United States treasury and wanted to join them on a particular subset of datasets. them to data about the daily interest rates for United States 2.3 Provide good metadata Data may be easier to find if we catalog our data well and adhere to certain data quality standards. With this reason- ing, many ”open data” guidelines provide direction as to how a person or organization with lots of datasets might allow other people to use them [16, 1, 18, 13, 15]. At a basic level, these guidelines suggest that data should be available on the internet and under a free license; at the other end of the spectrum, guidelines suggest that data be 4. CRUDE STATISTICS ABOUT SPREAD- in standard formats accompanied with particular metadata. SHEETS My first approach was involved running rather crude anal- Datasets can be a joy to work with when these data quality yses on this interesting dataset-about-datasets that I had guidelines are followed, but this requires much upfront work assembled. by the publishers of the data. 4.1 How many datasets 2.4 Asking people I started out by simply counting how many datasets each In practice, I find that people learn what’s in a spreadsheet catalog website had. through word of mouth, even if the data are already pub- lished on the internet in standard formats with good meta- The smaller sites had just a few spreadsheets, and the larger data. sites had thousands. Amanda Hickman teaches journalism and keeps a list of data sources for her students [3]. 4.2 Meaninglessness of the count of datasets Many organizations report this count of datasets that they There entire conferences about the contents of newly re- publish, and this number turns out to be nearly useless. As leased datasets, such as the annual meeting of the Associa- illustration of this, let’s consider a specific group of spread- tion of Public Data Users [14]. sheets. Here are the titles of a few spreadsheets in New York City’s open data catalog. The Open Knowledge Foundation [16] and Code for America [2] even conducted data censuses to determine which govern- ments were releasing what data publically on the internet. • Math Test Results 2006-2012 - Citywide - Gender In each case, volunteers searched the internet and talked to • Math Test Results 2006-2012 - Citywide - Ethnicity government employees in order to determine whether each • English Language Arts (ELA) Test Results 2006-2012 dataset was available and to collect certain information about - Citywide - SWD each dataset. • English Language Arts (ELA) Test Results 2006-2012 - Citywide - ELL 3. ACQUIRING LOTS OF SPREADSHEETS In order to explore methods for examining thousands of • Math Test Results 2006-2012 - Citywide - SWD spreadsheets, I needed to find spreadsheets that I could ex- • English Language Arts (ELA) Test Results 2006-2012 plore. - Citywide - All Students • Math Test Results 2006-2012 - Citywide - ELL Many governments and other large organizations publish spreadsheets on data catalog websites. Data catalogs make • English Language Arts (ELA) Test Results 2006-2012 it kind of easy to get a bunch of spreadsheets all together. - Citywide - Gender The basic approach is this. • Math Test Results 2006-2012 - Citywide - All Students • English Language Arts (ELA) Test Results 2006-2012 - Citywide - Ethnicity 1. Download a list of all of the dataset identifiers that are present in the data catalog. 2. Download the metadata document about each dataset. These spreadsheets all had the same column names; they 3. Download data files about each dataset. were ”grade”, ”year”, ”demographic”, ”number tested”, ”mean scale score”, ”num level 1”, ”pct level 1”, ”num level 2”, ”pct level 2”, ”num level 3”, ”pct level 3”, ”num level 4”, I’ve implemented this for the following data catalog soft- ”pct level 4”, ”num level 3 and 4”, and ”pct level 3 and 4”. wares. These ”datasets” can all be thought of as subsets of the same single dataset of test scores. • Socrata Open Data Portal • Common Knowledge Archive Network (CKAN) If I just take different subsets of a single spreadsheet (and optionally pivot/reshape the subsets), I can easily expand • OpenDataSoft one spreadsheet into over 9000. This is why the dataset count figure is near useless. This allows me to get all of the data from most of the open data catalogs I know about. 4.3 Size of the datasets I can also look at how big they are. It turns out that most After I’ve downloaded spreadsheets and their metadata, I of them are pretty small. often assemble them into a spreadsheet about spreadsheets [6]. In this super-spreadsheet, each record corresponds to a full sub-spreadsheet; you could say that I am collecting • Only 25% of datasets had more than 100 rows. features or statistics about each spreadsheet. • Only 12% of datasets had more than 1,000 rows. Figure 1: How many datasets (spreadsheets) each data catalog had • Only 5% of datasets had more than 10,000 rows. Francicso government website. As another example, most of the datasets in the catalog of the state of Missouri are traffic surveys, saved as PDF files [5]. Regardless of the format of these datasets, you can think of them as spreadsheets without code, where columns are 5.2 Licensing variables and rows are records. Many of the data publishing guidelines indicate that datasets should be freely licensed. All of the data catalog websites 5. MEASURING HOW WELL DIFFERENT that I looked at include a metadata field for the license of the SPREADSHEETS FOLLOW DATA PUB- dataset, and I looked at the contents of that field. I found that most datasets had no license [10], and this is thought LISHING GUIDELINES to be detrimental to their ability to be shared and reused Having gotten some feel for the contents of these various [16, 1, 18, 13, 15]. data catalogs, I started running some less arbitrary statis- tics. As discussed in section 2.3, many groups have written guidelines as to how data should be published [16, 1, 18, 13, 5.3 Liveliness of links 15]. I started coming up with measures of adherence to these One common guideline is that data be available on the in- guidelines and running them across all of these datasets. ternet. If a dataset shows up in one of these catalogs, you might think that it is on the internet. It turns out that the links to these datasets often do not work. 5.1 File format File format of datasets can tell us quite a lot about the data. I tried downloading the full data file for each dataset refenced I looked at the MIME types of the full data files for each in any of these catalogs and recorded any errors I received dataset on catalogs running Socrata software and compared [9, 12]. I found most links to be working and noticed some them between data catalogs [11]. common reasons why links didn’t work. If datasets are represented as tables inside the Socrata soft- ware, they are available in many formats. If they are up- • Many link URLs were in fact local file paths or links loaded in formats not recognized by Socrata, they are only within an intranet. available in their original format. • Many link ”URLs” were badly formed or were not URLs at all. I looked at a few data catalogs for which many datasets were presented in their original format. In some cases, the • Some servers did not have SSL configured properly. file formats can point out when a group of related files is • Some servers took a very long time to respond. added at once. For example, the results indicate that the city of San Francisco in 2012 added a bunch of shapefile format datasets to its open data catalog from another San I also discovered that one of the sites with very alive links, https://data.gov.uk, had a ”Broken links” tool for identi- fying these broken links. 6. SEARCHING FOR SPREADSHEETS While assessing the adherence to various data publishing guidelines, I kept noticing that it’s very hard to find spread- sheets that are relevant to a particular analysis unless you already know that the spreadsheet exists. Major search engines focus on HTML format web pages, and spreadsheet files are often not indexed at all. The var- ious data catalog software programs discussed in section 3 include a search feature, but this feature only works within the particular website. For example, I have to go to the Dutch government’s data catalog website in order to search for Dutch data. To summarize my thoughts about the common means of searching through spreadsheets, I see two main issues. The first issue is that the search is localized to datasets that are published or otherwise managed by a particular entity; it’s hard to search for spreadsheets without first identifying a specific publisher or repository. The second issue is that the search method is quite naive; these websites are usually Figure 2: The search engine for words takes words running crude keyword searches. as input and emits words as output Having articulated these difficulties in searching for spread- sheets, I started trying to address them. 6.1 Searching across publishers When I’m looking for spreadsheets, the publishing organiza- tion is unlikely to be my main concern. For example, if I’m interested in data about the composition of different pesti- cides, but I don’t really care whether the data were collected by this city government or by that country government. To address this issue, I made a a disgustingly simple site that forwards your search query to 100 other websites and returns the results to you in a single page [7]. Lots of people use it, and this says something about the inconvenience of having separate search bars for separate websites. 6.2 Spreadsheets-specific search algorithms The other issue is that our search algorithms don’t take ad- vantage of all of the structure that is encoded in a spread- sheet. I started to address this issue by pulling schema- related features out of the spreadsheets (section 4.2). 6.3 Spreadsheets as input to a search Taking this further, I’ve been thinking about what it would mean to have a search engine for spreadsheets. When we search for ordinary written documents, we send words into a search engine and get pages of words back. Figure 3: The search engine for spreadsheets takes spreadsheets as input and emits spreadsheets as out- What if we could search for spreadsheets by sending spread- put sheets into a search engine and getting spreadsheets back? The order of the results would be determined by various spe- cialized statistics; just as we use PageRank to find relevant hypertext documents, we can develop other statistics that help us find relevant spreadsheets. be helpful to those who are tasked with cataloging and main- taining a diverse array of datasets. Data quality statistics can provide a quick and timely summary of the issues with different datasets and allow for a more targeted approach in the maintenance of a data catalog. New strategies for searching spreadsheets can help us find data that are relevant to a topic within the context of anal- ysis. 9. REFERENCES [1] T. Berners-Lee. Linked data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. Figure 4: Commasearch infers some schema infor- [2] Code for America. U.S. City Open Data Census, 2014. mation about each spreadsheet and looks for other spreadsheets with similar schemas. [3] A. Hickman. Where to Find Data, 2014. [4] S. Klabnik and Y. Katz. Json api: A standard for building apis in json. http://jsonapi.org/. 6.3.1 Schema-based searches [5] T. Levine. License-free data in Missouri’s data portal, I think a lot about rows and columns. When we define tables 2013. in relational databases, we can say reasonably well what each [6] T. Levine. Open data had better be data-driven. column means, based on names and types, and what a row http://thomaslevine.com/!/dataset-as-datapoint, means, based on unique indices. In spreadsheets, we still 2013. have column names, but we don’t get everything else. [7] T. Levine. OpenPrism, 2013. [8] T. Levine. commasearch, 2014. The unique indices tell us quite a lot; they give us an idea [9] T. Levine. Dead links on data catalogs. http: about the observational unit of the table and what other //thomaslevine.com/!/data-catalog-dead-links/, tables we can nicely join or union with that table. 2014. [10] T. Levine. Open data licensing. Commasearch [8] is the present state of my spreadsheet http://thomaslevine.com/!/open-data-licensing/, search tools. To use comma search, you first index a lot 2014. of spreadsheets. Once you have the index, you may search [11] T. Levine. What file formats are on the data portals? by providing a single spreadsheet as input. http://thomaslevine.com/!/socrata-formats/, 2014. In the indexing phase, spreadsheets are examined do find all [12] T. Levine. Zombie links on data catalogs. combinations of columns that act as unique indices, that is, http://thomaslevine.com/!/zombie-links/, 2014. all combinations of fields whose values are not duplicated within the spreadsheet. In the search phase, comma search [13] C. Malamud, T. O’Reilly, G. Elin, M. Sifry, finds all combinations of columns in the input spreadsheet A. Holovaty, D. X. O’Neil, M. Migurski, S. Allen, and then looks for spreadsheets that are uniquely indexed J. Tauberer, L. Lessig, D. Newman, J. Geraci, by these columns. The results are ordered by how much E. Bender, T. Steinberg, D. Moore, D. Shaw, overlap there is between the values of the two spreadsheets. J. Needham, J. Hardi, E. Zuckerman, G. Palmer, J. Taylor, B. Horowitz, Z. Exley, K. Fogel, M. Dale, To say this more colloquially, comma search looks for many- J. L. Hall, M. Hofmann, D. Orban, W. Fitzpatrick, to-one join relationships between disparate datasets. and A. Swartz. 8 principles of open government data. http://www.opengovdata.org/home/8principles, 2007. Open Government Working Group. 7. REVIEW [14] A. of Public Data Users. Association of Public Data I’ve been downloading lots of spreadsheets and doing crude, Users Annual Conference, 2013. silly things with them. I started out by looking at very [15] Open Data Institute. Certificates, 2013. simple things like how big they are. I also tried to quantify other people’s ideas of how good datasets are, like whether [16] Open Knowledge Foundation. Open Data Census, they are freely licensed. In doing this, I have noticed that it’s 2013. pretty hard to search for spreadsheets; I’ve been developing [17] R. Pollock, M. Brett, and M. Keegan. Data packages. approaches for rough detection of implicit schemas and for http://dataprotocols.org/data-packages/, 2013. relating spreadsheets based on these schemas. [18] Sunlight Foundation. Open Data Policy Guidelines, 2014. 8. APPLICATIONS [19] T. Wilson. OgcÂő kml. Technical Report OGC 07-147r2, Open Geospatial Consortium Inc., 2008. A couple of people can share a few spreadsheets without any http://portal.opengeospatial.org/files/ special means, but it gets hard when there are more than a ?artifact_id=27810. couple people sharing more than a few spreadsheets. Statistics about adherence to data publishing guidelines can