<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">How can we figure out what is inside thousands of spreadsheets?</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Thomas</forename><surname>Levine</surname></persName>
							<email>_@thomaslevine.com</email>
						</author>
						<title level="a" type="main">How can we figure out what is inside thousands of spreadsheets?</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">5A8BCE2C339C93B12564A06E65B4AC1E</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T17:28+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>data management</term>
					<term>spreadsheets</term>
					<term>open data</term>
					<term>search</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We have enough data today that it may not be realistic to understand all of them. In hopes of vaguely understanding these data, I have been developing methods for exploring the contents of large collections of weakly structured spreadsheets. We can get some feel for the contents of these collections by assembling metadata about many spreadsheets and running otherwise typical analyses on the data-about-data; this gives us some understanding of patterns in data publishing and a crude understanding of the contents. I have also developed spreadsheet-specific search tools that try to find related spreadsheets based on similarities in implicit schema. By running crude statistics across many disparate datasets, we can learn a lot about unwieldy collections of poorly structured data.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>These days, we have more data than we know what to do with. And by "data", we often mean unclean, poorly documented spreadsheets. I started wondering what was in all of these spreadsheets. Addressing my curiosity turned out to be quite difficult, so I've wound up developing various approaches to understanding the contents of large collections of weakly structured spreadsheets.</p><p>My initial curiosity stemmed from the release of thousands of spreadsheets in government open data initiatives. I wanted to know what they had released so that I might find interesting things in it.</p><p>More practically, I am often looking for data from multiple sources that I can connect in relation to a particular topic. For example, in one project I had data about cash flows through the United States treasury and wanted to join them to data about the daily interest rates for United States bonds. In situations like this, I usually need to know the name of the dataset or to ask around until I find the name. I wanted a faster and more systematic approach to this.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">TYPICAL APPROACHES TO EXPLORING THE CONTENTS OF SPREADSHEETS</head><p>Before we discuss my spreadsheet exploration methods, let's discuss some more ordinary methods that I see in common use today.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Look at every spreadsheet</head><p>As a baseline, one approach is to look manually at every cell in many spreadsheets. This takes a long time, but it is feasible in some situations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Use standard metaformats</head><p>Many groups develop domain-specific metaformats for expressing a very specific sort of data. For example, JSON API is a metaformat for expressing the response of a database query on the web <ref type="bibr" target="#b3">[4]</ref>, Data Packages is a metaformat for expressing metadata about a dataset <ref type="bibr" target="#b15">[17]</ref>, and KML is a metaformat for expressing annotations of geographic maps <ref type="bibr" target="#b17">[19]</ref>.</p><p>Agreement on format and metaformat makes it faster and easier to inspect individual files. On the other hand, it does not alleviate the need to acquire lots of different files and to at least glance at them. We spend less time manually inspecting each dataset, but we must still manually inspect lots of datasets.</p><p>The same sort of thing happens when data publishers provide graphs of each individual dataset. When we provide some graphs of a dataset rather than simply the standard data file, we are trying to make it easier for people to understand that particular dataset, rather than trying to focus them on a particular subset of datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Provide good metadata</head><p>Data may be easier to find if we catalog our data well and adhere to certain data quality standards. With this reasoning, many "open data" guidelines provide direction as to how a person or organization with lots of datasets might allow other people to use them <ref type="bibr" target="#b14">[16,</ref><ref type="bibr" target="#b1">1,</ref><ref type="bibr" target="#b16">18,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr">15]</ref>.</p><p>At a basic level, these guidelines suggest that data should be available on the internet and under a free license; at the other end of the spectrum, guidelines suggest that data be in standard formats accompanied with particular metadata.</p><p>Datasets can be a joy to work with when these data quality guidelines are followed, but this requires much upfront work by the publishers of the data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Asking people</head><p>In practice, I find that people learn what's in a spreadsheet through word of mouth, even if the data are already published on the internet in standard formats with good metadata.</p><p>Amanda Hickman teaches journalism and keeps a list of data sources for her students <ref type="bibr" target="#b2">[3]</ref>.</p><p>There are entire conferences about the contents of newly released datasets, such as the annual meeting of the Association of Public Data Users <ref type="bibr" target="#b13">[14]</ref>.</p><p>The Open Knowledge Foundation <ref type="bibr" target="#b14">[16]</ref> and Code for America [2] even conducted data censuses to determine which governments were releasing what data publicly on the internet.</p><p>In each case, volunteers searched the internet and talked to government employees in order to determine whether each dataset was available and to collect certain information about each dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">ACQUIRING LOTS OF SPREADSHEETS</head><p>In order to explore methods for examining thousands of spreadsheets, I needed to find spreadsheets that I could explore.</p><p>Many governments and other large organizations publish spreadsheets on data catalog websites. Data catalogs make it kind of easy to get a bunch of spreadsheets all together. The basic approach is this.</p><p>1. Download a list of all of the dataset identifiers that are present in the data catalog.</p><p>2. Download the metadata document about each dataset.</p><p>3. Download data files about each dataset.</p><p>I've implemented this for the following data catalog software packages.</p><p>• Socrata Open Data Portal</p><p>• Comprehensive Knowledge Archive Network (CKAN)</p><p>• OpenDataSoft</p><p>This allows me to get all of the data from most of the open data catalogs I know about.</p><p>After I've downloaded spreadsheets and their metadata, I often assemble them into a spreadsheet about spreadsheets <ref type="bibr" target="#b5">[6]</ref>. In this super-spreadsheet, each record corresponds to a full sub-spreadsheet; you could say that I am collecting features or statistics about each spreadsheet.</p></div>
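<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three download steps can be sketched for a CKAN catalog roughly as follows. This is a minimal illustration, not the code I actually ran: it assumes the catalog exposes CKAN's standard Action API (the package_list and package_show actions), and the names crawl_ckan and fetch_json are hypothetical helpers introduced only for this example. Socrata and OpenDataSoft catalogs use different endpoints and would need their own variants.</p>

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def fetch_json(url):
    """Fetch a URL and parse the response body as JSON."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def crawl_ckan(base_url, fetch=fetch_json):
    """Yield (dataset_id, metadata, data_file_urls) for every dataset in a CKAN catalog.

    `fetch` is injectable (an assumption of this sketch) so that the crawl
    can be exercised without touching the network.
    """
    # Step 1: list all of the dataset identifiers in the catalog.
    dataset_ids = fetch(base_url + "/api/3/action/package_list")["result"]
    for dataset_id in dataset_ids:
        # Step 2: download the metadata document about this dataset.
        show_url = base_url + "/api/3/action/package_show?id=" + quote(dataset_id)
        metadata = fetch(show_url)["result"]
        # Step 3: the data files are the URLs listed under "resources";
        # downloading each one is left to the caller.
        urls = [r["url"] for r in metadata.get("resources", []) if r.get("url")]
        yield dataset_id, metadata, urls
```

<p>Each yielded tuple can then become one record in the spreadsheet about spreadsheets, with whatever features one cares to compute from the metadata and the data files.</p></div>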
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">CRUDE STATISTICS ABOUT SPREADSHEETS</head><p>My first approach involved running rather crude analyses on this interesting dataset-about-datasets that I had assembled.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">How many datasets</head><p>I started out by simply counting how many datasets each catalog website had.</p><p>The smaller sites had just a few spreadsheets, and the larger sites had thousands.</p><p>The New York City test-result spreadsheets all had the same column names; they were "grade", "year", "demographic", "number tested", "mean scale score", "num level 1", "pct level 1", "num level 2", "pct level 2", "num level 3", "pct level 3", "num level 4", "pct level 4", "num level 3 and 4", and "pct level 3 and 4".</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Meaninglessness of the count of datasets</head><p>These "datasets" can all be thought of as subsets of the same single dataset of test scores.</p><p>If I just take different subsets of a single spreadsheet (and optionally pivot/reshape the subsets), I can easily expand one spreadsheet into over 9000. This is why the dataset count figure is near useless.</p></div>
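<div xmlns="http://www.tei-c.org/ns/1.0"><p>One rough way to notice that many "datasets" are really slices of one dataset is to group them by their column names. The sketch below is only an illustration under my own assumptions: the function name group_by_schema is hypothetical, and treating column names as case-insensitive and order-insensitive is a choice made for this example rather than something the analysis above depends on.</p>

```python
from collections import defaultdict


def group_by_schema(datasets):
    """Group dataset titles by their set of column names.

    `datasets` maps a dataset title to its list of column names.
    Column names are stripped and lowercased, and their order is ignored.
    """
    groups = defaultdict(list)
    for title, columns in datasets.items():
        signature = frozenset(name.strip().lower() for name in columns)
        groups[signature].append(title)
    return dict(groups)
```

<p>A catalog where thousands of titles collapse into a handful of schema groups is probably publishing a few underlying datasets many times over.</p></div>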
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Size of the datasets</head><p>I can also look at how big they are. It turns out that most of them are pretty small.</p><p>• Only 25% of datasets had more than 100 rows.</p><p>• Only 12% of datasets had more than 1,000 rows.</p><p>• Only 5% of datasets had more than 10,000 rows.</p><p>Regardless of the format of these datasets, you can think of them as spreadsheets without code, where columns are variables and rows are records.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">MEASURING HOW WELL DIFFERENT SPREADSHEETS FOLLOW DATA PUBLISHING GUIDELINES</head><p>Having gotten some feel for the contents of these various data catalogs, I started running some less arbitrary statistics. As discussed in section 2.3, many groups have written guidelines as to how data should be published <ref type="bibr" target="#b14">[16,</ref><ref type="bibr" target="#b1">1,</ref><ref type="bibr" target="#b16">18,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr">15]</ref>. I started coming up with measures of adherence to these guidelines and running them across all of these datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">File format</head><p>File format of datasets can tell us quite a lot about the data. I looked at the MIME types of the full data files for each dataset on catalogs running Socrata software and compared them between data catalogs <ref type="bibr" target="#b10">[11]</ref>.</p><p>If datasets are represented as tables inside the Socrata software, they are available in many formats. If they are uploaded in formats not recognized by Socrata, they are only available in their original format.</p><p>I looked at a few data catalogs for which many datasets were presented in their original format. In some cases, the file formats can point out when a group of related files is added at once. For example, the results indicate that the city of San Francisco in 2012 added a bunch of shapefile format datasets to its open data catalog from another San Francisco government website. As another example, most of the datasets in the catalog of the state of Missouri are traffic surveys, saved as PDF files <ref type="bibr" target="#b4">[5]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Licensing</head><p>Many of the data publishing guidelines indicate that datasets should be freely licensed. All of the data catalog websites that I looked at include a metadata field for the license of the dataset, and I looked at the contents of that field. I found that most datasets had no license <ref type="bibr" target="#b9">[10]</ref>, and this is thought to be detrimental to their ability to be shared and reused <ref type="bibr" target="#b14">[16,</ref><ref type="bibr" target="#b1">1,</ref><ref type="bibr" target="#b16">18,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr">15</ref>].</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Liveliness of links</head><p>One common guideline is that data be available on the internet. If a dataset shows up in one of these catalogs, you might think that it is on the internet. It turns out that the links to these datasets often do not work.</p><p>I tried downloading the full data file for each dataset referenced in any of these catalogs and recorded any errors I received <ref type="bibr">[9,</ref><ref type="bibr" target="#b11">12]</ref>. I found most links to be working and noticed some common reasons why links didn't work.</p><p>• Many link URLs were in fact local file paths or links within an intranet.</p><p>• Many link "URLs" were badly formed or were not URLs at all.</p><p>• Some servers did not have SSL configured properly.</p><p>• Some servers took a very long time to respond.</p><p>I also discovered that one of the sites with very alive links, https://data.gov.uk, had a "Broken links" tool for identifying these broken links.</p></div>
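<div xmlns="http://www.tei-c.org/ns/1.0"><p>The failure categories above can be told apart with a small link checker along the following lines. This is a sketch using only the Python standard library, not the tool behind the cited results; the function name check_link, the label strings, and the ten-second timeout are all assumptions made for illustration. The opener argument exists only so that the classification logic can be exercised without network access.</p>

```python
import socket
import ssl
from urllib.error import HTTPError, URLError
from urllib.parse import urlparse
from urllib.request import urlopen


def check_link(url, opener=urlopen, timeout=10):
    """Return a short label describing whether a dataset link works."""
    try:
        parts = urlparse(url)
    except ValueError:
        return "not a web URL"
    if parts.scheme not in ("http", "https") or not parts.netloc:
        # Local file paths, intranet shares, and malformed text end up here.
        return "not a web URL"
    try:
        with opener(url, timeout=timeout) as response:
            if response.status == 200:
                return "ok"
            return "http %d" % response.status
    except HTTPError as error:
        # HTTPError is a subclass of URLError, so it must be caught first.
        return "http %d" % error.code
    except URLError as error:
        # urlopen wraps lower-level connection errors in URLError.reason.
        if isinstance(error.reason, ssl.SSLError):
            return "ssl error"
        if isinstance(error.reason, socket.timeout):
            return "timeout"
        return "unreachable"
    except socket.timeout:
        # A timeout while reading the response is raised unwrapped.
        return "timeout"
```

<p>Running such a function over every link in the spreadsheet about spreadsheets and tabulating the labels by catalog gives a crude picture of which catalogs have the most dead links and why.</p></div>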
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">SEARCHING FOR SPREADSHEETS</head><p>While assessing the adherence to various data publishing guidelines, I kept noticing that it's very hard to find spreadsheets that are relevant to a particular analysis unless you already know that the spreadsheet exists.</p><p>Major search engines focus on HTML format web pages, and spreadsheet files are often not indexed at all. The various data catalog software programs discussed in section 3 include a search feature, but this feature only works within the particular website. For example, I have to go to the Dutch government's data catalog website in order to search for Dutch data.</p><p>To summarize my thoughts about the common means of searching through spreadsheets, I see two main issues. The first issue is that the search is localized to datasets that are published or otherwise managed by a particular entity; it's hard to search for spreadsheets without first identifying a specific publisher or repository. The second issue is that the search method is quite naive; these websites are usually running crude keyword searches.</p><p>Having articulated these difficulties in searching for spreadsheets, I started trying to address them.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Searching across publishers</head><p>When I'm looking for spreadsheets, the publishing organization is unlikely to be my main concern. For example, if I'm interested in data about the composition of different pesticides, I don't really care whether the data were collected by this city government or by that country government.</p><p>To address this issue, I made a disgustingly simple site that forwards your search query to 100 other websites and returns the results to you in a single page <ref type="bibr" target="#b6">[7]</ref>. Lots of people use it, and this says something about the inconvenience of having separate search bars for separate websites.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Spreadsheet-specific search algorithms</head><p>The other issue is that our search algorithms don't take advantage of all of the structure that is encoded in a spreadsheet. I started to address this issue by pulling schema-related features out of the spreadsheets (section 4.2).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Spreadsheets as input to a search</head><p>Taking this further, I've been thinking about what it would mean to have a search engine for spreadsheets.</p><p>When we search for ordinary written documents, we send words into a search engine and get pages of words back.</p><p>What if we could search for spreadsheets by sending spreadsheets into a search engine and getting spreadsheets back?</p><p>The order of the results would be determined by various specialized statistics; just as we use PageRank to find relevant hypertext documents, we can develop other statistics that help us find relevant spreadsheets.</p><p>In the indexing phase, spreadsheets are examined to find all combinations of columns that act as unique indices, that is, all combinations of fields whose values are not duplicated within the spreadsheet. In the search phase, comma search finds all combinations of columns in the input spreadsheet and then looks for spreadsheets that are uniquely indexed by these columns. The results are ordered by how much overlap there is between the values of the two spreadsheets.</p><p>To say this more colloquially, comma search looks for many-to-one join relationships between disparate datasets.</p></div>
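<div xmlns="http://www.tei-c.org/ns/1.0"><p>The idea of unique indices and value overlap can be illustrated with a few lines of Python. This is a toy sketch of the technique as described above, not the commasearch implementation: the function names unique_indices and overlap are hypothetical, spreadsheets are assumed to be lists of dictionaries (for example from csv.DictReader), and the max_width cap on the number of columns per combination is an assumption added here to keep the brute-force search small.</p>

```python
from itertools import combinations


def unique_indices(rows, max_width=2):
    """Return every combination of up to `max_width` columns whose values never repeat."""
    if not rows:
        return []
    columns = sorted(rows[0])
    found = []
    for width in range(1, max_width + 1):
        for combo in combinations(columns, width):
            values = [tuple(row[c] for c in combo) for row in rows]
            # A combination is a unique index when no value tuple is duplicated.
            if len(set(values)) == len(values):
                found.append(combo)
    return found


def overlap(rows_a, rows_b, combo):
    """Count the distinct values of `combo` that appear in both spreadsheets."""
    keys_a = {tuple(row[c] for c in combo) for row in rows_a}
    keys_b = {tuple(row[c] for c in combo) for row in rows_b}
    return len(keys_a.intersection(keys_b))
```

<p>Under these assumptions, indexing amounts to storing the output of unique_indices for each spreadsheet, and a search ranks indexed spreadsheets by overlap with the input spreadsheet on a shared column combination.</p></div>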
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">REVIEW</head><p>I've been downloading lots of spreadsheets and doing crude, silly things with them. I started out by looking at very simple things like how big they are. I also tried to quantify other people's ideas of how good datasets are, like whether they are freely licensed. In doing this, I have noticed that it's pretty hard to search for spreadsheets; I've been developing approaches for rough detection of implicit schemas and for relating spreadsheets based on these schemas.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">APPLICATIONS</head><p>A couple of people can share a few spreadsheets without any special means, but it gets hard when there are more than a couple people sharing more than a few spreadsheets.</p><p>Statistics about adherence to data publishing guidelines can be helpful to those who are tasked with cataloging and maintaining a diverse array of datasets. Data quality statistics can provide a quick and timely summary of the issues with different datasets and allow for a more targeted approach in the maintenance of a data catalog.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: How many datasets (spreadsheets) each data catalog had</figDesc><graphic coords="3,53.80,53.80,502.12,251.06" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: The search engine for words takes words as input and emits words as output</figDesc><graphic coords="4,316.81,108.84,239.10,179.33" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>Many organizations report the count of datasets that they publish, and this number turns out to be nearly useless. As an illustration of this, let's consider a specific group of spreadsheets. Here are the titles of a few spreadsheets in New York City's open data catalog.</figDesc><table><row><cell>• Math Test Results 2006-2012 - Citywide - Gender</cell></row><row><cell>• Math Test Results 2006-2012 - Citywide - Ethnicity</cell></row><row><cell>• English Language Arts (ELA) Test Results 2006-2012 - Citywide - SWD</cell></row><row><cell>• English Language Arts (ELA) Test Results 2006-2012 - Citywide - ELL</cell></row><row><cell>• Math Test Results 2006-2012 - Citywide - SWD</cell></row><row><cell>• English Language Arts (ELA) Test Results 2006-2012 - Citywide - All Students</cell></row><row><cell>• Math Test Results 2006-2012 - Citywide - ELL</cell></row><row><cell>• English Language Arts (ELA) Test Results 2006-2012 - Citywide - Gender</cell></row><row><cell>• Math Test Results 2006-2012 - Citywide - All Students</cell></row><row><cell>• English Language Arts (ELA) Test Results 2006-2012 - Citywide - Ethnicity</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>6.3.1 Schema-based searches. I think a lot about rows and columns. When we define tables in relational databases, we can say reasonably well what each column means, based on names and types, and what a row means, based on unique indices. In spreadsheets, we still have column names, but we don't get everything else. The unique indices tell us quite a lot; they give us an idea about the observational unit of the table and what other tables we can nicely join or union with that table.</figDesc><table /><note>Commasearch <ref type="bibr" target="#b7">[8]</ref> is the present state of my spreadsheet search tools. To use comma search, you first index a lot of spreadsheets. Once you have the index, you may search by providing a single spreadsheet as input.</note></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0">New strategies for searching spreadsheets can help us find data that are relevant to a topic within the context of analysis.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Berners-Lee</surname></persName>
		</author>
		<ptr target="http://www.w3.org/DesignIssues/LinkedData.html" />
		<title level="m">Linked data</title>
				<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Where to Find Data</title>
		<author>
			<persName><forename type="first">A</forename><surname>Hickman</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">JSON API: A standard for building APIs in JSON</title>
		<author>
			<persName><forename type="first">S</forename><surname>Klabnik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Katz</surname></persName>
		</author>
		<ptr target="http://jsonapi.org/" />
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">License-free data in Missouri&apos;s data portal</title>
		<author>
			<persName><forename type="first">T</forename><surname>Levine</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Levine</surname></persName>
		</author>
		<ptr target="http://thomaslevine.com/!/dataset-as-datapoint" />
		<title level="m">Open data had better be data-driven</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">OpenPrism</title>
		<author>
			<persName><forename type="first">T</forename><surname>Levine</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Commasearch</title>
		<author>
			<persName><forename type="first">T</forename><surname>Levine</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Levine</surname></persName>
		</author>
		<ptr target="http://thomaslevine.com/!/data-catalog-dead-links/" />
		<title level="m">Dead links on data catalogs</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Levine</surname></persName>
		</author>
		<ptr target="http://thomaslevine.com/!/open-data-licensing/" />
		<title level="m">Open data licensing</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Levine</surname></persName>
		</author>
		<ptr target="http://thomaslevine.com/!/socrata-formats/" />
		<title level="m">What file formats are on the data portals</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Levine</surname></persName>
		</author>
		<ptr target="http://thomaslevine.com/!/zombie-links/" />
		<title level="m">Zombie links on data catalogs</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Malamud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>O'Reilly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Elin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sifry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Holovaty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">X</forename><surname>O'Neil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Migurski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Allen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tauberer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Lessig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Newman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Geraci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Bender</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Steinberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Moore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Shaw</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Needham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Zuckerman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Palmer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Taylor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Horowitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Exley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Fogel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Hall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hofmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Orban</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Fitzpatrick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Swartz</surname></persName>
		</author>
		<ptr target="http://www.opengovdata.org/home/8principles" />
		<title level="m">8 principles of open government data</title>
				<imprint>
			<publisher>Open Government Working Group</publisher>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m">Association of Public Data Users Annual Conference</title>
		<author>
			<orgName>Association of Public Data Users</orgName>
		</author>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m">Open Data Census</title>
				<imprint>
			<publisher>Open Knowledge Foundation</publisher>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Data packages</title>
		<author>
			<persName><forename type="first">R</forename><surname>Pollock</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Keegan</surname></persName>
		</author>
		<ptr target="http://dataprotocols.org/data-packages/" />
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m">Open Data Policy Guidelines</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
		<respStmt>
			<orgName>Sunlight Foundation</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">OGC® KML</title>
		<author>
			<persName><forename type="first">T</forename><surname>Wilson</surname></persName>
		</author>
		<idno>OGC 07-147r2</idno>
		<ptr target="http://portal.opengeospatial.org/files/?artifact_id=27810" />
		<imprint>
			<date type="published" when="2008">2008</date>
			<publisher>Open Geospatial Consortium Inc</publisher>
		</imprint>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
