<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>How can we figure out what is inside thousands of spreadsheets?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thomas Levine _@thomaslevine.com</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>We have enough data today that it may not be realistic to understand all of them. In hopes of vaguely understanding these data, I have been developing methods for exploring the contents of large collections of weakly structured spreadsheets. We can get some feel for the contents of these collections by assembling metadata about many spreadsheets and running otherwise typical analyses on the data-about-data; this gives us some understanding of patterns in data publishing and a crude understanding of the contents. I have also developed spreadsheet-specific search tools that try to find related spreadsheets based on similarities in implicit schema. By running crude statistics across many disparate datasets, we can learn a lot about unwieldy collections of poorly structured data.</p>
      </abstract>
      <kwd-group>
        <kwd>data management</kwd>
        <kwd>spreadsheets</kwd>
        <kwd>open data</kwd>
        <kwd>search</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>My initial curiosity stemmed from the release of thousands of
spreadsheets in government open data initiatives. I wanted
to know what they had released so that I might find interesting
things in it.</p>
      <p>More practically, I am often looking for data from
multiple sources that I can connect in relation to a particular
topic. For example, in one project I had data about cash
flows through the United States treasury and wanted to join
them to data about the daily interest rates for United States
bonds. In situations like this, I usually need to know the
name of the dataset or to ask around until I find the name.
I wanted a faster and more systematic approach to this.</p>
    </sec>
    <sec id="sec-2">
      <title>2. TYPICAL APPROACHES TO EXPLOR</title>
    </sec>
    <sec id="sec-3">
      <title>ING THE CONTENTS OF SPREADSHEETS</title>
      <p>Before we discuss my spreadsheet exploration methods, let's
discuss some more ordinary methods that I see in common
use today.</p>
    </sec>
    <sec id="sec-4">
      <title>2.1 Look at every spreadsheet</title>
      <p>As a baseline, one approach is to look manually at every
cell in many spreadsheets. This takes a long time, but it is
feasible in some situations.</p>
    </sec>
    <sec id="sec-5">
      <title>2.2 Use standard metaformats</title>
      <p>
        Many groups develop domain-specific metaformats for
expressing a very specific sort of data. For example, JSON API
is a metaformat for expressing the response of a database
query on the web [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Data Packages is a metaformat for
expressing metadata about a dataset [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], and KML is a
metaformat for expressing annotations of geographic maps
[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>Agreement on format and metaformat makes it faster and
easier to inspect individual files. On the other hand, it does
not alleviate the need to acquire lots of different files and
to at least glance at them. We spend less time manually
inspecting each dataset, but we must still manually inspect
lots of datasets.</p>
      <p>The same sort of thing happens when data publishers
provide graphs of each individual dataset. When we provide
some graphs of a dataset rather than simply the standard
data file, we are trying to make it easier for people to
understand that particular dataset, rather than trying to focus
them on a particular subset of datasets.</p>
    </sec>
    <sec id="sec-6">
      <title>2.3 Provide good metadata</title>
      <p>
        Data may be easier to find if we catalog our data well and
adhere to certain data quality standards. With this
reasoning, many "open data" guidelines provide direction as to how
a person or organization with lots of datasets might allow
other people to use them [
        <xref ref-type="bibr" rid="ref1 ref13 ref15 ref16 ref18">16, 1, 18, 13, 15</xref>
        ].
      </p>
      <p>At a basic level, these guidelines suggest that data should
be available on the internet and under a free license; at the
other end of the spectrum, guidelines suggest that data be
in standard formats accompanied with particular metadata.
Datasets can be a joy to work with when these data quality
guidelines are followed, but this requires much upfront work
by the publishers of the data.</p>
    </sec>
    <sec id="sec-7">
      <title>2.4 Asking people</title>
      <p>In practice, I find that people learn what's in a spreadsheet
through word of mouth, even if the data are already
published on the internet in standard formats with good
metadata.</p>
      <p>
        Amanda Hickman teaches journalism and keeps a list of data
sources for her students [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        There are entire conferences about the contents of newly
released datasets, such as the annual meeting of the
Association of Public Data Users [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        The Open Knowledge Foundation [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and Code for America
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] even conducted data censuses to determine which
governments were releasing what data publicly on the internet.
In each case, volunteers searched the internet and talked to
government employees in order to determine whether each
dataset was available and to collect certain information about
each dataset.
      </p>
    </sec>
    <sec id="sec-8">
      <title>3. ACQUIRING LOTS OF SPREADSHEETS</title>
      <p>In order to explore methods for examining thousands of
spreadsheets, I needed to find spreadsheets that I could
explore.</p>
      <p>Many governments and other large organizations publish
spreadsheets on data catalog websites. Data catalogs make
it relatively easy to collect a bunch of spreadsheets all at once.
The basic approach is this.</p>
      <p>1. Download a list of all of the dataset identifiers that are
present in the data catalog.
2. Download the metadata document about each dataset.
3. Download the data files for each dataset.</p>
      <p>I've implemented this for the following data catalog software:
Socrata Open Data Portal,
Common Knowledge Archive Network (CKAN), and
OpenDataSoft.
This allows me to get all of the data from most of the open
data catalogs I know about.</p>
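      <p>As a rough sketch of what this crawl can look like, here is a
minimal Python script for a catalog running CKAN, using CKAN's standard
package_list and package_show API actions; the catalog URL is only a
placeholder.</p>
      <preformat>
# Sketch: crawl a CKAN data catalog (the catalog URL is a placeholder).
import json
import urllib.parse
import urllib.request

CATALOG = 'https://demo.ckan.org'

def api(action, **params):
    url = CATALOG + '/api/3/action/' + action
    if params:
        url = url + '?' + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        return json.load(response)['result']

# 1. List all of the dataset identifiers in the catalog.
identifiers = api('package_list')

# 2. Download the metadata document about each dataset.
metadata = {name: api('package_show', id=name) for name in identifiers}

# 3. Download the data files referenced by each dataset.
for name, dataset in metadata.items():
    for resource in dataset.get('resources', []):
        if resource.get('url'):
            urllib.request.urlretrieve(resource['url'], resource['id'])
</preformat>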
        <p>
          After I've downloaded spreadsheets and their metadata, I
often assemble them into a spreadsheet about spreadsheets
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In this super-spreadsheet, each record corresponds to
a full sub-spreadsheet; you could say that I am collecting
features or statistics about each spreadsheet.
        </p>
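      <p>As a sketch, assuming the downloaded files are CSVs sitting in a
directory, such a super-spreadsheet might be assembled like this; the
directory name and the particular features are just examples.</p>
      <preformat>
# Sketch: one record per downloaded spreadsheet (directory name assumed).
import glob
import pandas

records = []
for path in glob.glob('downloads/*.csv'):
    sheet = pandas.read_csv(path)
    records.append({
        'file': path,
        'n_rows': len(sheet),
        'n_columns': len(sheet.columns),
        'column_names': tuple(sheet.columns),
    })

super_spreadsheet = pandas.DataFrame(records)
super_spreadsheet.to_csv('super-spreadsheet.csv', index=False)
</preformat>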
    </sec>
    <sec id="sec-9">
      <title>4. CRUDE STATISTICS ABOUT SPREAD</title>
    </sec>
    <sec id="sec-10">
      <title>SHEETS</title>
      <p>My first approach involved running rather crude
analyses on this interesting dataset-about-datasets that I had
assembled.</p>
    </sec>
    <sec id="sec-11">
      <title>4.1 How many datasets</title>
      <p>I started out by simply counting how many datasets each
catalog website had.</p>
      <p>The smaller sites had just a few spreadsheets, and the larger
sites had thousands.</p>
    </sec>
    <sec id="sec-12">
      <title>4.2 Meaninglessness of the count of datasets</title>
      <p>Many organizations report this count of datasets that they
publish, and this number turns out to be nearly useless. As
an illustration of this, let's consider a specific group of
spreadsheets. Here are the titles of a few spreadsheets in New York
City's open data catalog.</p>
      <p>Math Test Results 2006-2012 - Citywide - Gender
Math Test Results 2006-2012 - Citywide - Ethnicity
Math Test Results 2006-2012 - Citywide - SWD
Math Test Results 2006-2012 - Citywide - ELL
Math Test Results 2006-2012 - Citywide - All Students
English Language Arts (ELA) Test Results 2006-2012 - Citywide - Gender
English Language Arts (ELA) Test Results 2006-2012 - Citywide - Ethnicity
English Language Arts (ELA) Test Results 2006-2012 - Citywide - SWD
English Language Arts (ELA) Test Results 2006-2012 - Citywide - ELL
English Language Arts (ELA) Test Results 2006-2012 - Citywide - All Students</p>
      <p>These spreadsheets all had the same column names; they
were "grade", "year", "demographic", "number tested",
"mean scale score", "num level 1", "pct level 1", "num level 2",
"pct level 2", "num level 3", "pct level 3", "num level 4",
"pct level 4", "num level 3 and 4", and "pct level 3 and 4".
These "datasets" can all be thought of as subsets of the same
single dataset of test scores.</p>
      <p>If I just take different subsets of a single spreadsheet (and
optionally pivot/reshape the subsets), I can easily expand
one spreadsheet into over 9000. This is why the dataset
count figure is near useless.</p>
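      <p>A sketch of that kind of expansion, using a hypothetical
test-scores spreadsheet with subject and demographic columns:</p>
      <preformat>
# Sketch: carve one spreadsheet into many "datasets" by subsetting.
# The input file and its column names are hypothetical.
import pandas

scores = pandas.read_csv('test-results.csv')

count = 0
for (subject, demographic), subset in scores.groupby(['subject', 'demographic']):
    subset.to_csv('%s - Citywide - %s.csv' % (subject, demographic), index=False)
    count += 1
print(count, 'datasets carved out of one spreadsheet')
</preformat>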
    </sec>
    <sec id="sec-13">
      <title>4.3 Size of the datasets</title>
      <p>I can also look at how big they are. It turns out that most
of them are pretty small.</p>
      <sec id="sec-13-1">
        <title>Only 25% of datasets had more than 100 rows.</title>
      </sec>
      <sec id="sec-13-2">
        <title>Only 12% of datasets had more than 1,000 rows.</title>
      </sec>
      <sec id="sec-13-3">
        <title>Only 5% of datasets had more than 10,000 rows. Regardless of the format of these datasets, you can think of them as spreadsheets without code, where columns are variables and rows are records.</title>
      </sec>
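      <p>Figures like these are simple summaries over the super-spreadsheet
described in section 3; a sketch, assuming it records a row count for each
spreadsheet in a column called n_rows:</p>
      <preformat>
# Sketch: share of datasets above various row counts
# (the super-spreadsheet file and its n_rows column are assumed).
import pandas

meta = pandas.read_csv('super-spreadsheet.csv')
for threshold in (100, 1000, 10000):
    share = meta['n_rows'].gt(threshold).mean()
    print('%.0f%% of datasets have more than %d rows' % (share * 100, threshold))
</preformat>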
    </sec>
    <sec id="sec-14">
      <title>5. MEASURING HOW WELL DIFFERENT</title>
    </sec>
    <sec id="sec-15">
      <title>SPREADSHEETS FOLLOW DATA PUB</title>
    </sec>
    <sec id="sec-16">
      <title>LISHING GUIDELINES</title>
      <p>
        Having gotten some feel for the contents of these various
data catalogs, I started running some less arbitrary
statistics. As discussed in section 2.3, many groups have written
guidelines as to how data should be published [
        <xref ref-type="bibr" rid="ref1 ref13 ref15 ref16 ref18">16, 1, 18, 13,
15</xref>
        ]. I started coming up with measures of adherence to these
guidelines and running them across all of these datasets.
      </p>
    </sec>
    <sec id="sec-17">
      <title>5.1 File format</title>
      <p>
        File format of datasets can tell us quite a lot about the data.
I looked at the MIME types of the full data files for each
dataset on catalogs running Socrata software and compared
them between data catalogs [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
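      <p>A sketch of the sort of tally involved, assuming a plain text file
listing the full-data download URLs for one catalog:</p>
      <preformat>
# Sketch: tally the MIME types reported for each dataset's full data file.
# The input list of download URLs is assumed.
import collections
import urllib.request

def mime_type(url):
    request = urllib.request.Request(url, method='HEAD')
    with urllib.request.urlopen(request) as response:
        return response.headers.get_content_type()

urls = open('download-urls.txt').read().split()
counts = collections.Counter(mime_type(url) for url in urls)
for mime, n in counts.most_common():
    print(n, mime)
</preformat>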
      <p>If datasets are represented as tables inside the Socrata
software, they are available in many formats. If they are
uploaded in formats not recognized by Socrata, they are only
available in their original format.</p>
      <p>
        I looked at a few data catalogs for which many datasets
were presented in their original format. In some cases, the
file formats can point out when a group of related files is
added at once. For example, the results indicate that the
city of San Francisco in 2012 added a bunch of shapefile
format datasets to its open data catalog from another San
Francisco government website. As another example, most of
the datasets in the catalog of the state of Missouri are traffic
surveys, saved as PDF files [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-18">
      <title>5.2 Licensing</title>
      <p>
        Many of the data publishing guidelines indicate that datasets
should be freely licensed. All of the data catalog websites
that I looked at include a metadata field for the license of the
dataset, and I looked at the contents of that field. I found
that most datasets had no license [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and this is thought
to be detrimental to their ability to be shared and reused
[
        <xref ref-type="bibr" rid="ref1 ref13 ref15 ref16 ref18">16, 1, 18, 13, 15</xref>
        ].
      </p>
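      <p>A sketch of that check, assuming the metadata documents gathered in
section 3 were saved as one JSON file of CKAN-style records, which carry a
license_id field:</p>
      <preformat>
# Sketch: count datasets that declare a license.
# Assumes a JSON dump of CKAN-style metadata records.
import json

datasets = json.load(open('metadata.json'))
licensed = sum(1 for dataset in datasets.values() if dataset.get('license_id'))
print(licensed, 'of', len(datasets), 'datasets declare a license')
</preformat>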
    </sec>
    <sec id="sec-19">
      <title>5.3 Liveliness of links</title>
      <p>One common guideline is that data be available on the
internet. If a dataset shows up in one of these catalogs, you
might think that it is on the internet. It turns out that the
links to these datasets often do not work.</p>
      <p>
        I tried downloading the full data file for each dataset referenced
in any of these catalogs and recorded any errors I received
[
        <xref ref-type="bibr" rid="ref12 ref9">9, 12</xref>
        ]. I found most links to be working and noticed some
common reasons why links didn't work.
      </p>
      <p>Many link URLs were in fact local file paths or links
within an intranet.</p>
      <p>Many link "URLs" were badly formed or were not URLs
at all.</p>
      <p>Some servers did not have SSL configured properly.</p>
      <p>Some servers took a very long time to respond.</p>
      <p>I also discovered that one of the sites with very alive links,
https://data.gov.uk, had a "Broken links" tool for identifying
these broken links.</p>
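      <p>A sketch of the link check described above, assuming a plain text file
of download URLs; each URL gets a row recording whatever error, if any,
came back:</p>
      <preformat>
# Sketch: record what happens when each dataset's download link is followed.
# The input list of download URLs is assumed.
import csv
import urllib.error
import urllib.request

results = []
for url in open('download-urls.txt').read().split():
    try:
        urllib.request.urlopen(url, timeout=30)
        error = ''
    except (urllib.error.URLError, ValueError, OSError) as problem:
        error = repr(problem)  # bad URL, SSL trouble, timeout, and so on
    results.append({'url': url, 'error': error})

with open('link-liveliness.csv', 'w') as fp:
    writer = csv.DictWriter(fp, fieldnames=['url', 'error'])
    writer.writeheader()
    writer.writerows(results)
</preformat>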
    </sec>
    <sec id="sec-20">
      <title>6. SEARCHING FOR SPREADSHEETS</title>
      <p>While assessing the adherence to various data publishing
guidelines, I kept noticing that it's very hard to find
spreadsheets that are relevant to a particular analysis unless you
already know that the spreadsheet exists.</p>
      <p>Major search engines focus on HTML format web pages,
and spreadsheet files are often not indexed at all. The
various data catalog software programs discussed in section 3
include a search feature, but this feature only works within
the particular website. For example, I have to go to the
Dutch government's data catalog website in order to search
for Dutch data.</p>
      <p>To summarize my thoughts about the common means of
searching through spreadsheets, I see two main issues. The
first issue is that the search is localized to datasets that are
published or otherwise managed by a particular entity; it's
hard to search for spreadsheets without first identifying a
specific publisher or repository. The second issue is that
the search method is quite naive; these websites are usually
running crude keyword searches.</p>
      <p>Having articulated these di culties in searching for
spreadsheets, I started trying to address them.</p>
    </sec>
    <sec id="sec-21">
      <title>6.1 Searching across publishers</title>
      <p>
        When I'm looking for spreadsheets, the publishing
organization is unlikely to be my main concern. If I'm
interested in data about the composition of different
pesticides, I don't really care whether the data were collected
by this city government or by that country government.
To address this issue, I made a disgustingly simple site
that forwards your search query to 100 other websites and
returns the results to you in a single page [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Lots of people
use it, and this says something about the inconvenience of
having separate search bars for separate websites.
      </p>
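      <p>A minimal sketch of that kind of fan-out, assuming the remote catalogs
run CKAN and expose the standard package_search action; the catalog URLs are
placeholders, and this is not how the site itself is implemented:</p>
      <preformat>
# Sketch: forward one query to several catalogs and merge the results.
# Catalog URLs are placeholders; CKAN's package_search action is assumed.
import json
import urllib.parse
import urllib.request

CATALOGS = ['https://demo.ckan.org', 'https://data.example.gov']

def search(query):
    for catalog in CATALOGS:
        url = (catalog + '/api/3/action/package_search?'
               + urllib.parse.urlencode({'q': query}))
        with urllib.request.urlopen(url) as response:
            for dataset in json.load(response)['result']['results']:
                yield catalog, dataset['title']

for catalog, title in search('pesticides'):
    print(catalog, '-', title)
</preformat>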
    </sec>
    <sec id="sec-22">
      <title>6.2 Spreadsheets-specific search algorithms</title>
      <p>The other issue is that our search algorithms don't take
advantage of all of the structure that is encoded in a
spreadsheet. I started to address this issue by pulling
schema-related features out of the spreadsheets (section 4.2).</p>
    </sec>
    <sec id="sec-23">
      <title>6.3 Spreadsheets as input to a search</title>
      <p>Taking this further, I've been thinking about what it would
mean to have a search engine for spreadsheets.</p>
      <p>
        When we search for ordinary written documents, we send
words into a search engine and get pages of words back.
What if we could search for spreadsheets by sending
spreadsheets into a search engine and getting spreadsheets back?
The order of the results would be determined by various
specialized statistics; just as we use PageRank to find relevant
hypertext documents, we can develop other statistics that
help us find relevant spreadsheets.
      </p>
      <sec id="sec-23-1">
        <title>6.3.1 Schema-based searches</title>
        <p>
          I think a lot about rows and columns. When we define tables
in relational databases, we can say reasonably well what each
column means, based on names and types, and what a row
means, based on unique indices. In spreadsheets, we still
have column names, but we don't get everything else.
The unique indices tell us quite a lot; they give us an idea
about the observational unit of the table and what other
tables we can nicely join or union with that table.
Commasearch [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] is the present state of my spreadsheet
search tools. To use comma search, you first index a lot
of spreadsheets. Once you have the index, you may search
by providing a single spreadsheet as input.
        </p>
        <p>In the indexing phase, spreadsheets are examined to find all
combinations of columns that act as unique indices, that is,
all combinations of fields whose values are not duplicated
within the spreadsheet. In the search phase, comma search
finds all combinations of columns in the input spreadsheet
and then looks for spreadsheets that are uniquely indexed
by these columns. The results are ordered by how much
overlap there is between the values of the two spreadsheets.
To say this more colloquially, comma search looks for
many-to-one join relationships between disparate datasets.</p>
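        <p>This is not the actual commasearch implementation, but a toy sketch
of the idea, for tables represented as lists of dictionaries; the example
tables at the bottom echo the cash-flow and interest-rate datasets from the
introduction:</p>
        <preformat>
# Toy sketch of the comma search idea (not the real commasearch code).
from itertools import combinations

def unique_indices(table, max_columns=3):
    """Yield column combinations whose value tuples are never duplicated."""
    columns = sorted(table[0])
    for size in range(1, max_columns + 1):
        for cols in combinations(columns, size):
            values = [tuple(row[c] for c in cols) for row in table]
            if len(set(values)) == len(values):
                yield cols, frozenset(values)

def build_index(tables):
    """Indexing phase: map unique-index column sets to the tables they index."""
    index = {}
    for name, table in tables.items():
        for cols, values in unique_indices(table):
            index.setdefault(cols, []).append((name, values))
    return index

def search(index, table):
    """Search phase: rank indexed tables by value overlap under a shared index."""
    scores = {}
    columns = set(table[0])
    for cols, entries in index.items():
        if set(cols).issubset(columns):
            mine = set(tuple(row[c] for c in cols) for row in table)
            for name, theirs in entries:
                overlap = len(mine.intersection(theirs))
                scores[name] = max(scores.get(name, 0), overlap)
    return sorted(scores, key=scores.get, reverse=True)

tables = {'interest-rates': [{'date': '2013-01-02', 'rate': 0.07},
                             {'date': '2013-01-03', 'rate': 0.09}]}
index = build_index(tables)
cash_flows = [{'date': '2013-01-02', 'amount': 120.0},
              {'date': '2013-01-03', 'amount': 80.0}]
print(search(index, cash_flows))  # ['interest-rates']
</preformat>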
      </sec>
    </sec>
    <sec id="sec-24">
      <title>7. REVIEW</title>
      <p>I've been downloading lots of spreadsheets and doing crude,
silly things with them. I started out by looking at very
simple things like how big they are. I also tried to quantify
other people's ideas of how good datasets are, like whether
they are freely licensed. In doing this, I have noticed that it's
pretty hard to search for spreadsheets; I've been developing
approaches for rough detection of implicit schemas and for
relating spreadsheets based on these schemas.</p>
    </sec>
    <sec id="sec-25">
      <title>8. APPLICATIONS</title>
      <p>A couple of people can share a few spreadsheets without any
special means, but it gets hard when there are more than a
couple of people sharing more than a few spreadsheets.
Statistics about adherence to data publishing guidelines can
be helpful to those who are tasked with cataloging and
maintaining a diverse array of datasets. Data quality statistics
can provide a quick and timely summary of the issues with
different datasets and allow for a more targeted approach in
the maintenance of a data catalog.</p>
      <p>New strategies for searching spreadsheets can help us find
data that are relevant to a topic within the context of
analysis.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked data</article-title>
          . http://www.w3.org/DesignIssues/LinkedData.html,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          Code for America. U.S. City Open Data Census
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hickman</surname>
          </string-name>
          . Where to Find Data,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Klabnik</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Katz</surname>
          </string-name>
          .
          <article-title>JSON API: A standard for building APIs in JSON</article-title>
          . http://jsonapi.org/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Levine</surname>
          </string-name>
          .
          <article-title>License-free data in Missouri's data portal</article-title>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Levine</surname>
          </string-name>
          .
          <article-title>Open data had better be data-driven</article-title>
          . http://thomaslevine.com/!/dataset-as-datapoint,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Levine</surname>
          </string-name>
          . OpenPrism,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Levine</surname>
          </string-name>
          . commasearch,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Levine</surname>
          </string-name>
          .
          <article-title>Dead links on data catalogs</article-title>
          . http://thomaslevine.com/!/data-catalog-dead-links/,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Levine</surname>
          </string-name>
          .
          <article-title>Open data licensing</article-title>
          . http://thomaslevine.com/!/open-data-licensing/,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Levine</surname>
          </string-name>
          .
          <article-title>What file formats are on the data portals</article-title>
          ? http://thomaslevine.com/!/socrata-formats/,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Levine</surname>
          </string-name>
          .
          <article-title>Zombie links on data catalogs</article-title>
          . http://thomaslevine.com/!/zombie-links/,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.</given-names>
            <surname>Malamud</surname>
          </string-name>
          , T. O'Reilly, G. Elin, M. Sifry, A. Holovaty, D. X. O'Neil,
          M. Migurski, S. Allen, J. Tauberer, L. Lessig, D. Newman,
          J. Geraci, E. Bender, T. Steinberg, D. Moore, D. Shaw,
          J. Needham, J. Hardi, E. Zuckerman, G. Palmer, J. Taylor,
          B. Horowitz, Z. Exley, K. Fogel, M. Dale, J. L. Hall,
          M. Hofmann, D. Orban, W. Fitzpatrick, and
          <string-name>
            <given-names>A.</given-names>
            <surname>Swartz</surname>
          </string-name>
          .
          <article-title>8 principles of open government data</article-title>
          . http://www.opengovdata.org/home/8principles,
          <year>2007</year>
          . Open Government Working Group.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <source>[14] Association of Public Data Users. Association of Public Data Users Annual Conference</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          Open Data Institute. Certificates,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          Open Knowledge Foundation. Open Data Census
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pollock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brett</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Keegan</surname>
          </string-name>
          .
          <article-title>Data packages</article-title>
          . http://dataprotocols.org/data-packages/,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          Sunlight Foundation. Open Data Policy Guidelines
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wilson</surname>
          </string-name>
          . OGC KML.
          <source>Technical Report OGC 07-147r2</source>
          , Open Geospatial Consortium Inc.,
          <year>2008</year>
          . http://portal.opengeospatial.org/files/?artifact_id=
          <fpage>27810</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>