Yuzu: Publishing Any Data as Linked Data

                                   John P. McCrae

       Insight Centre for Data Analytics, National University of Ireland, Galway
                                    john@mccr.ae


        Abstract. Linked data is one of the most important methods for im-
        proving the applicability of data, however most data is not in linked data
        formats and raising it to linked data is still a significant challenge. We
        present Yuzu, an application that makes it easy to host legacy data in
        JSON, XML or CSV as linked data, while providing a clean interface
        with advanced features. The ease-of-use of this framework is shown by
        its adoption for a number of existing datasets including WordNet.

        Keywords: linked data, data frontend, data conversion


1     Introduction
Linked data [1] has been identified as one of the major ways to present data
for knowledge discovery and has been shown to improve the quality and the
usefulness of datasets [12]. However, a major challenge remains the conversion
of datasets into linked data [5, 10]. This is frequently caused by the fact that
data is in legacy formats such as CSV, XML or JSON and the conversion from
these formats into RDF often represents much of the effort of a project. In recent
years, a number of efforts have been made to make RDF and linked data work
with these formats in particular, CSV on the Web [15] and JSON-LD [13], and
these formats should lower the barrier to entry to users of linked data.
    In this paper, we present the Yuzu platform1 , a frontend for linked data, like
Pubby [4] or LodLive [3]. This platform can, in contrast to existing systems, aims
to be free from strict restrictions about the format of the data, instead assuming
that the data can be understood even in legacy formats with a small amount
of metadata. This system also removes the need to run a separate SPARQL
database and instead allows simple SPARQL access to data with some limita-
tions ‘out-of-the-box’. In addition, this platform implements many features that
are required to make data easy-to-work with including content negotiation and
automatic backups based on hashes [7].


2     Handling data in legacy formats
Data can be structured in three main ways: firstly tabular data which is serialized
by means of table format where data is separated typically by a tab or comma.
1
    Yuzu is available at https://github.com/jmccrae/yuzu
Secondly, hierarchical data is structured in a flat tree and XML and JSON are
the two most popular serialization methods. Finally, graph structured data has
the most freedom in its representation, and RDF is the most commonly found
form of this data, but databases based on this model can have a significant
performance gap, which is called the “RDF tax” [2]. The Yuzu model is to keep
documents in the format that is created but enable querying over them as if
they were graph-based linked data. All conversions are provided using existing
standards such wherever possible. and as such the input to Yuzu is the dataset
as a single ZIP file containing all the data files in a some mix of XML, CSV and
JSON.

2.1     JSON-LD and XML
JSON documents in Yuzu can be understood by means of a context document
and it is required that each data either is a JSON-LD document with a @context
element or that in the containing folder there is a context.json, which is used
for indexing and is returned with the Link header [13, §6.8]. XML is mapped
also using the JSON-LD context file and we assume a simple generic mapping
method, whereby attributes and subtags (if there is no text context) are treated
as name/value pairs in an object. If this is not possible the @value special
property is used. Alternatively, a mapping may be provided using the LIXR
mapping language [9].

2.2     CSV
CSV conversion is based on the CSV on the Web standard’s recipe for creating
RDF data [14], which we implement as part of the Yuzu model. Generation of
RDF data from this CSV is provided in standard mode such that extra data
for querying is available to the user. Each CSV file is described by means of an
extra metadata file in the form of the Tabular Data Metadata Vocabulary [11],
which is in fact another JSON-LD file. In the case where no mapping is found
a default empty tabular metadata file is created and used to map the CSV into
RDF. In the interface, data that was originally in CSV is presented to the user
in a tabular form, however the RDF data can be obtained by means of content
negotiation.

3     Cheap, robust SPARQL querying
SPARQL provides a powerful and effective method for querying data on the
Web, however it provides significant challenges for hosts wishing to provide fast
access with limited resources. SPARQL is a very free query language and it is
easy to devise queries that are very hard to answer, and even worse this can
easily be caused by typos2 .
2
    e.g., a typo in a variable name will not be detected in SPARQL and will turn a query
    that could be answered with an inner join to a query that can only be answered with
    the more expensive cross join
    We employ a pre-processor that attempts to find a fixed subset of documents
that have a given property and then creating a mini-dataset to evaluate the
query on. This means certain queries, for example those which rely on FILTER
constraints to do most of the document selection, will not be possible to execute,
but more typical queries, such as documents with a list of properties can be more
readily executed. We believe that this provides a good perfomance balance and
will continue to evaluate this balance in our deployed instances. Of course, a
full SPARQL endpoint can be used along with our this method to support all
SPARQL queries.


4     Hashing, permalinks and backups

A major issue that faces data users is that data frequently becomes unavailable
or has changed in a manner that makes it difficult to reuse. In order to combat
this, Yuzu takes a hash of the overall dataset and a hash of each individual
file in the dataset. The hash of each individual file can be used to look up any
individual resource.
     Secondly, each Yuzu instance may allocate a certain amount of space to back
up parts of other resources. This back-up procedure is implemented by a method
based on the Kademlia [6] protocol. Each Yuzu instance generates at start-up a
unique identifier and checks the identifier of each of its peers (from a fixed list of
peers). Then the files in the dataset are posted to each of the peers and the peers
store those files that are closest in the XOR distance between the file’s hash and
the instance’s hash, up to the limit of files that are there for back-up. Then when
resolving a ‘permalink’, if the hash does not correspond to any of the file in this
resources dataset the system redirects to another host, whose instance hash is
closer to the requested hash.


5     Conclusions and current deployments

The ease-of-use of the Yuzu system has been deployed to host a number of
datasets: originally it was developed for the WordNet dataset3 , and this is still
a supported theme of the system. Since then, Yuzu has been applied to large
datasets, such as Linghub [8], and has been used to host a large number of
smaller datasets. The robust theming and stability of the interface has allowed
datasets to be hosted even on very low-resourced virtual machines, even while
allowing querying using SPARQL.


Acknowledgements

This research was supported by the Science Foundation Ireland under Grant
Number SFI/12/RC/2289 (Insight)
3
    http://wordnet-rdf.princeton.edu
References
 1. Bizer, C., Heath, T., Berners-Lee, T.: Linked data-the story so far. Semantic Ser-
    vices, Interoperability and Web Applications: Emerging Concepts pp. 205–227
    (2009)
 2. Boncz, P., Erling, O., Pham, M.D.: Advances in large-scale RDF data management.
    In: Linked Open Data–Creating Knowledge Out of Interlinked Data, pp. 21–44.
    Springer (2014)
 3. Camarda, D.V., Mazzini, S., Antonuccio, A.: LodLive, exploring the web of data.
    In: Proceedings of the 8th International Conference on Semantic Systems. pp. 197–
    200. ACM (2012)
 4. Cyganiak, R., Bizer, C.: Pubby-a linked data frontend for sparql endpoints (2008),
    http://www4.wiwiss.fu-berlin.de/pubby/
 5. Ehrmann, M., Ceconi, F., Vannella, D., McCrae, J.P., Cimiano, P., Navigli, R.:
    Representing Multilingual Data as Linked Data: the Case of BabelNet 2.0. In:
    Proceedings of the 9th Language Resource and Evaluation Conference. pp. 401–
    408 (2014)
 6. Maymounkov, P., Mazieres, D.: Kademlia: A peer-to-peer information system based
    on the xor metric. In: Peer-to-Peer Systems, pp. 53–65. Springer (2002)
 7. McCrae, J.P., Bordea, G., Buitelaar, P.: Linked data and text mining as an enabler
    for reproducible research. In: Proceedings of the Workshop on Cross-Platform Text
    Mining and Natural Language Processing Interoperability (2016)
 8. McCrae, J.P., Cimiano, P.: Linghub: a linked data based portal supporting the dis-
    covery of language resources. In: Joint Proceedings of the Posters and Demos Track
    of 11th International Conference on Semantic Systems-SEMANTiCS 2015 and 1st
    Workshop on Data Science: Methods, Technology and Applications (DSci15). pp.
    88–91 (2015)
 9. McCrae, J.P., Cimiano, P.: LIXR: Quick, succinct conversion of XML to RDF and
    back again. In: Proceedings of the ISWC 2016 Posters and Demo Track (2016)
10. O’Riain, S., Curry, E., Harth, A.: XBRL and open data for global financial ecosys-
    tems: A linked data approach. International Journal of Accounting Information
    Systems 13(2), 141–162 (2012)
11. Pollock, R., Tennison, J., Kellogg, G., Herman, I.: Metadata vocabulary for tabular
    data. W3C recommendation, World Wide Web Consortium (2015)
12. Schultz, A., Matteini, A., Isele, R., Mendes, P.N., Bizer, C., Becker, C.: Ldif-
    a framework for large-scale linked data integration. In: 21st International World
    Wide Web Conference (WWW 2012), Developers Track, Lyon, France (2012)
13. Sporny, M., Longley, D., Kellogg, G., Lanthaler, M., Lindstrm, N.: Json-ld 1.0.
    W3C recommendation, World Wide Web Consortium (2014)
14. Tandy, J., Herman, I., Kellogg, G.: Generating RDF from tabular data on the web.
    W3C recommendation, World Wide Web Consortium (2015)
15. Tennison, J., Kellogg, G., Herman, I.: Model for tabular data and metadata on the
    web. W3C recommendation, World Wide Web Consortium (2015)