data.world: A Platform for Global-Scale Semantic
                          Publishing

Bryon Jacob1[0000-0003-0470-9300] ,Dave Griffith1[0000-0001-9700-0012], Triet Le1[0000-0001-5619-5802]
         1
           data.world, 7000 North Mopac Expressway #425, Austin, TX 78731 USA
    bryon@data.world dave.griffith@data.world triet.le@data.world


1       Introduction
data.world (https://data.world/) is a collaborative web platform with a user base
consisting primarily of users who are not Semantic Web experts, and datasets that are
not initially semantically annotated or linked. By using web standards for the
automated translation of those tabular data formats into RDF, data.world leverages the
iterative data work done by the users of the platform to build a connected network of
linked datasets. data.world is an open platform where anyone can sign up for a free
account to work with open data - it was launched in July of 2016, and as of a year
later is in active use by a community of tens of thousands of users and organizations.
   An oft-cited fact is that finding, understanding, and preparing data for use can take
eighty percent of the time spent on an analysis project. These projects usually involve
multiple data sources in a variety of formats. Semantic Web offers a powerful set of
tools (universal structure for data, federated query) to deal with this diversity.
Metadata can be iteratively layered into datasets, by different actors at different times.
   data.world focuses on collaborations where each actor is empowered to participate
in the iterative development of the data resource, by helping to clean, annotate, and
contextualize the data. Data can be worked on in the open, or in access-controlled
datasets and projects. Structured data is converted into RDF and can be queried via
SPARQL, but the original data is retrievable as well, so that users can continue
working in familiar modes while enjoying the benefits of semantic web technology.


2       Managing diverse data and user base
The majority of structured data in the world is tabular in nature. CSV and other text
variants, spreadsheets, relational database tables, and many “long tail” data formats
are all representations of tabular data. CSVW provides a model for modeling tables
within an RDF graph structure. CSVW tables use RDF schema and types, can be
mixed and queried together with graph structures defined directly in RDF, and can be
serialized for transmission or storage in any textual or binary form that RDF can take.
   Across projects and domains, and often within a single project, there are a diverse
set of actors. There are knowledge engineers who create ontologies and knowledge
bases; data scientists and statisticians who produce models and visualizations;
analysts and scientists who use spreadsheets or visual analytics platforms; and end-
users and stakeholders consuming the conclusions of the work. data.world
emphasizes collaboration between these personas.
                                                                                          2


3      Architecture for Scalable, Heterogeneous Data Publication
data.world prioritizes query responsiveness over update flexibility. Updates are
handled as bulk ingest, the output of the ingest is an immutable RDF dataset in the
HDT (Header-Dictionary-Triples) file format. This HDT architecture is optimized
for the queries that characterize exploratory usage - and allows us to treat datasets as
independent graphs, but loadable together as named graphs for optimized joins.


Fig. 1. data.world high-level ETL and query architecture overview – data files are ingested
through the etl process, then derivative data rendered as HDT is persisted. Query heads load
and cache HDT to execute SPARQL queries on demand, and query endpoints expose SPARQL
endpoints and SQL query endpoints to the web – the endpoints parse and rewrite SQL into
SPARQL, and parse SPARQL queries to rewrite and route them. Each layer is scalable, with
performance proportional to the size of the query set, not to the overall collection.


4      Future Work and Conclusions
   Dataset versioning and provenance is an area of active research and development
for data.world – our presentation will cover our current work there. Our HDT-based
architecture works well for exploratory queries, but it is suboptimal for large
analytical (non-selective) queries. We will talk about the work we are doing to
leverage a hybrid query architecture to support both simultaneously.
   Our hypothesis is that to the surface area of the web of Linked Data, we need to
nurture of the network of people who are working with data. A component of every
data project is collecting, cleaning and preparing the data – and we can use Semantic
Web technology to both facilitate that work and leverage that work to grow the web
of linked data. We have seen promising indicators that support this, and as the
community grows we hope to present a quantitative assessment of that growth. One
group that leverages data.world is Data For Democracy (http://datafordemocracy.org/)
- hundreds of data scientists, analysts, and programmers using data.world to build data
dictionaries, capture cleaned data alongside raw data, and highlight the relationships
between data. This work is annotating data and enriching metadata, which is turning
raw data in CSVs and spreadsheets into meaningful sources of linked data.