How can we figure out what is inside thousands of
                         spreadsheets?

                                                    Thomas Levine
                                                  _@thomaslevine.com


ABSTRACT                                                         bonds. In situations like this, I usually need to know the
We have enough data today that we it may not be realistic        name of the dataset or to ask around until I find the name.
to understand all of them. In hopes of vaguely understand-       I wanted a faster and more systematic approach to this.
ing these data, I have been developing methods for exploring
the contents of large collections of weakly structured spread-   2.    TYPICAL APPROACHES TO EXPLOR-
sheets. We can get some feel for the contents of these col-
lections by assembling metadata about many spreadsheets                ING THE CONTENTS OF SPREADSHEETS
and run otherwise typical analyses on the data-about-data;       Before we discuss my spreadsheet exploration methods, let’s
this gives us some understanding patterns in data publish-       discuss some more ordinary methods that I see in common
ing and a crude understanding of the contents. I have also       use today.
developed spreadsheet-specific search tools that try to find
related spreadsheets based on similarities in implicit schema.   2.1   Look at every spreadsheet
By running crude statistics across many disparate datasets,      As a baseline, one approach is to look manually at every
we can learn a lot about unweildy collections of poorly struc-   cell in many spreadsheets. This takes a long time, but it is
tured data.                                                      feasible in some situations.

Keywords                                                         2.2   Use standard metaformats
data management, spreadsheets, open data, search                 Many groups develop domain-specific metaformats for ex-
                                                                 pressing a very specific sort of data. For example, JSON API
                                                                 is a metaformat for expressing the response of a database
1.   INTRODUCTION                                                query on the web [4], Data Packages is a metaformat for
These days, we have more data than we know what to do            expressing metadata about a dataset [17], and KML is a
with. And by ”data”, we often mean unclean, poorly doc-          metaformat for expressing annotations of geographic maps
umented spreadsheets. I started wondering what was in all        [19].
of these spreadsheets. Addressing my curiosity turned out
to be quite difficult, so I’ve found up developing various ap-   Agreement on format and metaformat makes it faster and
proaches to understanding the contents of large collections      easier to inspect individual files. On the other hand, it does
of weakly structured spreadsheets.                               not alleviate the need to acquire lots of different files and
                                                                 to at least glance at them. We spend less time manually
My initial curiosity stemmed from the release of thousands of    inspecting each dataset, but we must still manually inspect
spreadsheets in government open data initiatives. I wanted       lots of dataset.
to know what they had released so that I may find interesting
things in it.                                                    The same sort of thing happens when data publishers pro-
                                                                 vide graphs of each individual dataset. When we provide
More practically, I often am looking for data from multi-        some graphs of a dataset rather than simply the standard
ple sources that I can connect in relation to a particular       data file, we are trying to make it easier for people to un-
topic. For example, in a project I had data about cash           derstand that particular dataset, rather than trying to focus
flows through the United States treasury and wanted to join      them on a particular subset of datasets.
them to data about the daily interest rates for United States
                                                                 2.3   Provide good metadata
                                                                 Data may be easier to find if we catalog our data well and
                                                                 adhere to certain data quality standards. With this reason-
                                                                 ing, many ”open data” guidelines provide direction as to how
                                                                 a person or organization with lots of datasets might allow
                                                                 other people to use them [16, 1, 18, 13, 15].

                                                                 At a basic level, these guidelines suggest that data should
                                                                 be available on the internet and under a free license; at the
other end of the spectrum, guidelines suggest that data be           4.    CRUDE STATISTICS ABOUT SPREAD-
in standard formats accompanied with particular metadata.                  SHEETS
                                                                     My first approach was involved running rather crude anal-
Datasets can be a joy to work with when these data quality           yses on this interesting dataset-about-datasets that I had
guidelines are followed, but this requires much upfront work         assembled.
by the publishers of the data.
                                                                     4.1     How many datasets
2.4      Asking people                                               I started out by simply counting how many datasets each
In practice, I find that people learn what’s in a spreadsheet        catalog website had.
through word of mouth, even if the data are already pub-
lished on the internet in standard formats with good meta-           The smaller sites had just a few spreadsheets, and the larger
data.                                                                sites had thousands.
Amanda Hickman teaches journalism and keeps a list of data
sources for her students [3].                                        4.2     Meaninglessness of the count of datasets
                                                                     Many organizations report this count of datasets that they
There entire conferences about the contents of newly re-             publish, and this number turns out to be nearly useless. As
leased datasets, such as the annual meeting of the Associa-          illustration of this, let’s consider a specific group of spread-
tion of Public Data Users [14].                                      sheets. Here are the titles of a few spreadsheets in New York
                                                                     City’s open data catalog.
The Open Knowledge Foundation [16] and Code for America
[2] even conducted data censuses to determine which govern-
ments were releasing what data publically on the internet.                • Math Test Results 2006-2012 - Citywide - Gender
In each case, volunteers searched the internet and talked to              • Math Test Results 2006-2012 - Citywide - Ethnicity
government employees in order to determine whether each                   • English Language Arts (ELA) Test Results 2006-2012
dataset was available and to collect certain information about              - Citywide - SWD
each dataset.
                                                                          • English Language Arts (ELA) Test Results 2006-2012
                                                                            - Citywide - ELL
3.     ACQUIRING LOTS OF SPREADSHEETS
In order to explore methods for examining thousands of                    • Math Test Results 2006-2012 - Citywide - SWD
spreadsheets, I needed to find spreadsheets that I could ex-              • English Language Arts (ELA) Test Results 2006-2012
plore.                                                                      - Citywide - All Students
                                                                          • Math Test Results 2006-2012 - Citywide - ELL
Many governments and other large organizations publish
spreadsheets on data catalog websites. Data catalogs make                 • English Language Arts (ELA) Test Results 2006-2012
it kind of easy to get a bunch of spreadsheets all together.                - Citywide - Gender
The basic approach is this.                                               • Math Test Results 2006-2012 - Citywide - All Students
                                                                          • English Language Arts (ELA) Test Results 2006-2012
                                                                            - Citywide - Ethnicity
     1. Download a list of all of the dataset identifiers that are
        present in the data catalog.
     2. Download the metadata document about each dataset.           These spreadsheets all had the same column names; they
     3. Download data files about each dataset.                      were ”grade”, ”year”, ”demographic”, ”number tested”,
                                                                     ”mean scale score”, ”num level 1”, ”pct level 1”, ”num level 2”,
                                                                     ”pct level 2”, ”num level 3”, ”pct level 3”, ”num level 4”,
I’ve implemented this for the following data catalog soft-           ”pct level 4”, ”num level 3 and 4”, and ”pct level 3 and 4”.
wares.
                                                                     These ”datasets” can all be thought of as subsets of the same
                                                                     single dataset of test scores.
     • Socrata Open Data Portal
     • Common Knowledge Archive Network (CKAN)                       If I just take different subsets of a single spreadsheet (and
                                                                     optionally pivot/reshape the subsets), I can easily expand
     • OpenDataSoft                                                  one spreadsheet into over 9000. This is why the dataset
                                                                     count figure is near useless.
This allows me to get all of the data from most of the open
data catalogs I know about.                                          4.3     Size of the datasets
                                                                     I can also look at how big they are. It turns out that most
After I’ve downloaded spreadsheets and their metadata, I             of them are pretty small.
often assemble them into a spreadsheet about spreadsheets
[6]. In this super-spreadsheet, each record corresponds to
a full sub-spreadsheet; you could say that I am collecting                • Only 25% of datasets had more than 100 rows.
features or statistics about each spreadsheet.                            • Only 12% of datasets had more than 1,000 rows.
                         Figure 1: How many datasets (spreadsheets) each data catalog had


     • Only 5% of datasets had more than 10,000 rows.             Francicso government website. As another example, most of
                                                                  the datasets in the catalog of the state of Missouri are traffic
                                                                  surveys, saved as PDF files [5].
Regardless of the format of these datasets, you can think
of them as spreadsheets without code, where columns are           5.2    Licensing
variables and rows are records.
                                                                  Many of the data publishing guidelines indicate that datasets
                                                                  should be freely licensed. All of the data catalog websites
5.    MEASURING HOW WELL DIFFERENT                                that I looked at include a metadata field for the license of the
      SPREADSHEETS FOLLOW DATA PUB-                               dataset, and I looked at the contents of that field. I found
                                                                  that most datasets had no license [10], and this is thought
      LISHING GUIDELINES                                          to be detrimental to their ability to be shared and reused
Having gotten some feel for the contents of these various         [16, 1, 18, 13, 15].
data catalogs, I started running some less arbitrary statis-
tics. As discussed in section 2.3, many groups have written
guidelines as to how data should be published [16, 1, 18, 13,     5.3    Liveliness of links
15]. I started coming up with measures of adherence to these      One common guideline is that data be available on the in-
guidelines and running them across all of these datasets.         ternet. If a dataset shows up in one of these catalogs, you
                                                                  might think that it is on the internet. It turns out that the
                                                                  links to these datasets often do not work.
5.1     File format
File format of datasets can tell us quite a lot about the data.   I tried downloading the full data file for each dataset refenced
I looked at the MIME types of the full data files for each        in any of these catalogs and recorded any errors I received
dataset on catalogs running Socrata software and compared         [9, 12]. I found most links to be working and noticed some
them between data catalogs [11].                                  common reasons why links didn’t work.
If datasets are represented as tables inside the Socrata soft-
ware, they are available in many formats. If they are up-            • Many link URLs were in fact local file paths or links
loaded in formats not recognized by Socrata, they are only             within an intranet.
available in their original format.
                                                                     • Many link ”URLs” were badly formed or were not URLs
                                                                       at all.
I looked at a few data catalogs for which many datasets
were presented in their original format. In some cases, the          • Some servers did not have SSL configured properly.
file formats can point out when a group of related files is          • Some servers took a very long time to respond.
added at once. For example, the results indicate that the
city of San Francisco in 2012 added a bunch of shapefile
format datasets to its open data catalog from another San         I also discovered that one of the sites with very alive links,
https://data.gov.uk, had a ”Broken links” tool for identi-
fying these broken links.

6.    SEARCHING FOR SPREADSHEETS
While assessing the adherence to various data publishing
guidelines, I kept noticing that it’s very hard to find spread-
sheets that are relevant to a particular analysis unless you
already know that the spreadsheet exists.

Major search engines focus on HTML format web pages,
and spreadsheet files are often not indexed at all. The var-
ious data catalog software programs discussed in section 3
include a search feature, but this feature only works within
the particular website. For example, I have to go to the
Dutch government’s data catalog website in order to search
for Dutch data.

To summarize my thoughts about the common means of
searching through spreadsheets, I see two main issues. The
first issue is that the search is localized to datasets that are
published or otherwise managed by a particular entity; it’s
hard to search for spreadsheets without first identifying a
specific publisher or repository. The second issue is that
the search method is quite naive; these websites are usually       Figure 2: The search engine for words takes words
running crude keyword searches.                                    as input and emits words as output

Having articulated these difficulties in searching for spread-
sheets, I started trying to address them.

6.1    Searching across publishers
When I’m looking for spreadsheets, the publishing organiza-
tion is unlikely to be my main concern. For example, if I’m
interested in data about the composition of different pesti-
cides, but I don’t really care whether the data were collected
by this city government or by that country government.

To address this issue, I made a a disgustingly simple site
that forwards your search query to 100 other websites and
returns the results to you in a single page [7]. Lots of people
use it, and this says something about the inconvenience of
having separate search bars for separate websites.

6.2    Spreadsheets-specific search algorithms
The other issue is that our search algorithms don’t take ad-
vantage of all of the structure that is encoded in a spread-
sheet. I started to address this issue by pulling schema-
related features out of the spreadsheets (section 4.2).

6.3    Spreadsheets as input to a search
Taking this further, I’ve been thinking about what it would
mean to have a search engine for spreadsheets.

When we search for ordinary written documents, we send
words into a search engine and get pages of words back.            Figure 3: The search engine for spreadsheets takes
                                                                   spreadsheets as input and emits spreadsheets as out-
What if we could search for spreadsheets by sending spread-        put
sheets into a search engine and getting spreadsheets back?
The order of the results would be determined by various spe-
cialized statistics; just as we use PageRank to find relevant
hypertext documents, we can develop other statistics that
help us find relevant spreadsheets.
                                                                    be helpful to those who are tasked with cataloging and main-
                                                                    taining a diverse array of datasets. Data quality statistics
                                                                    can provide a quick and timely summary of the issues with
                                                                    different datasets and allow for a more targeted approach in
                                                                    the maintenance of a data catalog.

                                                                    New strategies for searching spreadsheets can help us find
                                                                    data that are relevant to a topic within the context of anal-
                                                                    ysis.

                                                                    9.   REFERENCES
                                                                     [1] T. Berners-Lee. Linked data.
                                                                         http://www.w3.org/DesignIssues/LinkedData.html,
                                                                         2006.
Figure 4: Commasearch infers some schema infor-
                                                                     [2] Code for America. U.S. City Open Data Census, 2014.
mation about each spreadsheet and looks for other
spreadsheets with similar schemas.                                   [3] A. Hickman. Where to Find Data, 2014.
                                                                     [4] S. Klabnik and Y. Katz. Json api: A standard for
                                                                         building apis in json. http://jsonapi.org/.
6.3.1    Schema-based searches                                       [5] T. Levine. License-free data in Missouri’s data portal,
I think a lot about rows and columns. When we define tables              2013.
in relational databases, we can say reasonably well what each        [6] T. Levine. Open data had better be data-driven.
column means, based on names and types, and what a row                   http://thomaslevine.com/!/dataset-as-datapoint,
means, based on unique indices. In spreadsheets, we still                2013.
have column names, but we don’t get everything else.                 [7] T. Levine. OpenPrism, 2013.
                                                                     [8] T. Levine. commasearch, 2014.
The unique indices tell us quite a lot; they give us an idea         [9] T. Levine. Dead links on data catalogs. http:
about the observational unit of the table and what other                 //thomaslevine.com/!/data-catalog-dead-links/,
tables we can nicely join or union with that table.                      2014.
                                                                    [10] T. Levine. Open data licensing.
Commasearch [8] is the present state of my spreadsheet                   http://thomaslevine.com/!/open-data-licensing/,
search tools. To use comma search, you first index a lot                 2014.
of spreadsheets. Once you have the index, you may search
                                                                    [11] T. Levine. What file formats are on the data portals?
by providing a single spreadsheet as input.
                                                                         http://thomaslevine.com/!/socrata-formats/,
                                                                         2014.
In the indexing phase, spreadsheets are examined do find all
                                                                    [12] T. Levine. Zombie links on data catalogs.
combinations of columns that act as unique indices, that is,
                                                                         http://thomaslevine.com/!/zombie-links/, 2014.
all combinations of fields whose values are not duplicated
within the spreadsheet. In the search phase, comma search           [13] C. Malamud, T. O’Reilly, G. Elin, M. Sifry,
finds all combinations of columns in the input spreadsheet               A. Holovaty, D. X. O’Neil, M. Migurski, S. Allen,
and then looks for spreadsheets that are uniquely indexed                J. Tauberer, L. Lessig, D. Newman, J. Geraci,
by these columns. The results are ordered by how much                    E. Bender, T. Steinberg, D. Moore, D. Shaw,
overlap there is between the values of the two spreadsheets.             J. Needham, J. Hardi, E. Zuckerman, G. Palmer,
                                                                         J. Taylor, B. Horowitz, Z. Exley, K. Fogel, M. Dale,
To say this more colloquially, comma search looks for many-              J. L. Hall, M. Hofmann, D. Orban, W. Fitzpatrick,
to-one join relationships between disparate datasets.                    and A. Swartz. 8 principles of open government data.
                                                                         http://www.opengovdata.org/home/8principles,
                                                                         2007. Open Government Working Group.
7.   REVIEW                                                         [14] A. of Public Data Users. Association of Public Data
I’ve been downloading lots of spreadsheets and doing crude,              Users Annual Conference, 2013.
silly things with them. I started out by looking at very
                                                                    [15] Open Data Institute. Certificates, 2013.
simple things like how big they are. I also tried to quantify
other people’s ideas of how good datasets are, like whether         [16] Open Knowledge Foundation. Open Data Census,
they are freely licensed. In doing this, I have noticed that it’s        2013.
pretty hard to search for spreadsheets; I’ve been developing        [17] R. Pollock, M. Brett, and M. Keegan. Data packages.
approaches for rough detection of implicit schemas and for               http://dataprotocols.org/data-packages/, 2013.
relating spreadsheets based on these schemas.                       [18] Sunlight Foundation. Open Data Policy Guidelines,
                                                                         2014.
8.   APPLICATIONS                                                   [19] T. Wilson. OgcÂő kml. Technical Report OGC
                                                                         07-147r2, Open Geospatial Consortium Inc., 2008.
A couple of people can share a few spreadsheets without any
                                                                         http://portal.opengeospatial.org/files/
special means, but it gets hard when there are more than a
                                                                         ?artifact_id=27810.
couple people sharing more than a few spreadsheets.

Statistics about adherence to data publishing guidelines can