<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>data.world: A Platform for Global-Scale Semantic Publishing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>data.world</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>North Mopac Expressway</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Austin</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>data.world (https://data.world/) is a collaborative web platform whose user base consists primarily of people who are not Semantic Web experts, and whose datasets are not initially semantically annotated or linked. By using web standards for the automated translation of tabular data formats into RDF, data.world leverages the iterative data work done by the users of the platform to build a connected network of linked datasets. data.world is an open platform where anyone can sign up for a free account to work with open data. It was launched in July 2016 and, one year later, is in active use by a community of tens of thousands of users and organizations. An oft-cited claim is that finding, understanding, and preparing data for use can take eighty percent of the time spent on an analysis project. These projects usually involve multiple data sources in a variety of formats. The Semantic Web offers a powerful set of tools (a universal structure for data, federated query) to deal with this diversity. Metadata can be iteratively layered into datasets by different actors at different times. data.world focuses on collaborations where each actor is empowered to participate in the iterative development of the data resource by helping to clean, annotate, and contextualize the data. Data can be worked on in the open, or in access-controlled datasets and projects. Structured data is converted into RDF and can be queried via SPARQL, but the original data remains retrievable as well, so that users can continue working in familiar modes while enjoying the benefits of Semantic Web technology.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Managing diverse data and user base</title>
      <p>The majority of structured data in the world is tabular in nature. CSV and other text
variants, spreadsheets, relational database tables, and many “long tail” data formats
are all representations of tabular data. CSVW (CSV on the Web) provides a model for representing tables
within an RDF graph structure. CSVW tables use RDF Schema and types, can be
mixed and queried together with graph structures defined directly in RDF, and can be
serialized for transmission or storage in any textual or binary form that RDF can take.</p>
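      <p>The idea of carrying a table into a graph can be illustrated with a minimal Python sketch. It is not data.world's ingest code: the function name, the example URL, and the choice of a row URL as subject are assumptions made for illustration (CSVW itself uses blank nodes for rows unless the table metadata says otherwise). Predicates are built as the table URL plus the column name as a fragment, which follows the CSVW default as we understand it.</p>
      <preformat>
```python
import csv
import io
from urllib.parse import quote


def csv_to_triples(csv_text, table_url):
    """Turn each non-empty CSV cell into a (subject, predicate, value) tuple."""
    triples = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        # Hypothetical subject: a fragment URL naming the data row.
        subject = f"{table_url}#row={row_number}"
        for column, value in row.items():
            if column is None or not value:
                continue  # skip overflow cells and empty or missing values
            # Predicate: table URL plus the percent-encoded column name.
            predicate = f"{table_url}#{quote(column, safe='')}"
            triples.append((subject, predicate, value))
    return triples


if __name__ == "__main__":
    sample = "name,home city\nAda,Austin\nBob,\n"
    for triple in csv_to_triples(sample, "http://example.org/people.csv"):
        print(triple)
```
      </preformat>
      <p>Every value is kept as a plain string here. A fuller conversion would attach datatypes from the table schema and serialize the result in a standard RDF syntax.</p>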
      <p>Across projects and domains, and often within a single project, there is a diverse
set of actors. There are knowledge engineers who create ontologies and knowledge
bases; data scientists and statisticians who produce models and visualizations;
analysts and scientists who use spreadsheets or visual analytics platforms; and
end users and stakeholders who consume the conclusions of the work. data.world
emphasizes collaboration among these personas.</p>
    </sec>
    <sec id="sec-3">
      <title>Architecture for Scalable, Heterogeneous Data Publication</title>
      <p>data.world prioritizes query responsiveness over update flexibility. Updates are
handled as bulk ingests, and the output of each ingest is an immutable RDF dataset in the
HDT (Header-Dictionary-Triples) file format. This HDT-based architecture is optimized
for the queries that characterize exploratory usage. It also allows us to treat datasets as
independent graphs that can be loaded together as named graphs for optimized joins.</p>
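      <p>The following Python sketch illustrates the two ideas in this paragraph under loud assumptions: it is neither the HDT file format nor data.world's implementation. A bulk ingest produces an immutable, dictionary-encoded set of triples (terms mapped to integer IDs, triples stored sorted), and several such graphs are held side by side under hypothetical graph names so that a join can span them. All class, function, and graph names are invented for illustration.</p>
      <preformat>
```python
from typing import NamedTuple


class FrozenGraph(NamedTuple):
    """Immutable, dictionary-encoded triples (loosely HDT-like, not real HDT)."""
    terms: tuple    # the position of a term in this tuple is its integer ID
    triples: tuple  # sorted (subject_id, predicate_id, object_id) tuples


def bulk_ingest(raw_triples):
    """Build a FrozenGraph from a list of triples; changes need a new ingest."""
    terms = tuple(sorted({term for triple in raw_triples for term in triple}))
    ids = {term: i for i, term in enumerate(terms)}
    encoded = sorted({tuple(ids[t] for t in triple) for triple in raw_triples})
    return FrozenGraph(terms, tuple(encoded))


def match(graph, s=None, p=None, o=None):
    """Yield decoded triples matching a pattern; None acts as a wildcard.

    This is a linear scan for clarity; a real store would exploit the
    sorted order instead of visiting every triple.
    """
    for triple in graph.triples:
        decoded = tuple(graph.terms[i] for i in triple)
        if all(want is None or want == got
               for want, got in zip((s, p, o), decoded)):
            yield decoded


def join_on_object(named_graphs, left, left_p, right, right_p):
    """Join two named graphs on a shared object value using a hash index."""
    right_index = {}
    for s, _, o in match(named_graphs[right], p=right_p):
        right_index.setdefault(o, []).append(s)
    return [(s, o, other)
            for s, _, o in match(named_graphs[left], p=left_p)
            for other in right_index.get(o, [])]


if __name__ == "__main__":
    named = {
        "g:cities": bulk_ingest([("austin", "inState", "TX")]),
        "g:people": bulk_ingest([("ada", "livesIn", "TX")]),
    }
    print(join_on_object(named, "g:cities", "inState", "g:people", "livesIn"))
```
      </preformat>
      <p>Because each graph is frozen after ingest, readers never contend with writers, which is one plausible reason an immutable layout favors query responsiveness over update flexibility.</p>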
    </sec>
    <sec id="sec-4">
      <title>Future Work and Conclusions</title>
      <p>Dataset versioning and provenance are areas of active research and development
for data.world, and our presentation will cover our current work there. Our HDT-based
architecture works well for exploratory queries, but it is suboptimal for large
analytical (non-selective) queries. We will talk about the work we are doing to
leverage a hybrid query architecture to support both simultaneously.</p>
      <p>Our hypothesis is that to grow the surface area of the web of Linked Data, we need to
nurture the network of people who are working with data. A component of every
data project is collecting, cleaning and preparing the data – and we can use Semantic
Web technology to both facilitate that work and leverage that work to grow the web
of linked data. We have seen promising indicators that support this, and as the
community grows we hope to present a quantitative assessment of that growth. One
group that leverages data.world is Data For Democracy (http://datafordemocracy.org/),
a community of hundreds of data scientists, analysts, and programmers who use data.world to build data
dictionaries, capture cleaned data alongside raw data, and highlight the relationships
between datasets. This work annotates data and enriches metadata, turning
raw data in CSVs and spreadsheets into meaningful sources of linked data.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>