Decentralizing the Persistence and Querying of RDF Datasets Through Browser-Based Technologies

Blake Regalia - blake@geog.ucsb.edu
STKO Lab, University of California, Santa Barbara, USA

Abstract. Even though linked open data are inherently capable of persisting in a decentralized fashion, the Semantic Web community currently lacks effective workflows for seamlessly sharing and querying RDF datasets beyond relatively small scales. The most accessible methods for cloning and subsequently querying an RDF dataset are quite involved, often requiring users to locate and download a data dump, install and run a local SPARQL engine, load the data into the triplestore, and then query their endpoint through some interface. While these actions may qualify as a first step towards decentralization, such a user has little apparent incentive to then make their local endpoint available to the World Wide Web and thus act as an auxiliary host or 'mirror'. In this work, we answer the intelligent client challenge by presenting a browser-based RDF query engine coupled with web-based peer-to-peer networks to illustrate the decentralized persistence and use of RDF datasets via intelligent peers.

1 Introduction and Motivation

It is no surprise that in recent years, web services and standalone web applications have been introduced as viable competitors to native desktop applications thanks to innovative Web technologies [7]. When it comes to hosting and querying RDF datasets, however, native applications are almost always involved, typically on the server side. While this technique is trivial for centralized architectures, it does not promote the decentralization of persisting and querying RDF datasets. [10] demonstrates the value of decentralization for the Semantic Web by thoroughly examining the methods and applications found at the intersection of the Semantic Web and Peer-to-Peer. Additionally, as several previous works have demonstrated [12,13], intelligent clients play an important role in improving both the accessibility of RDF datasets and the reliability of obtaining query results. In this work, we strive towards the decentralization of persisting and querying RDF datasets by creating intelligent peers that may simultaneously act as data publishers, data hosts, and data consumers.

2 Peer-to-Peer Communication

Using peer-to-peer (p2p) networks to disseminate large RDF datasets has two main advantages for the Semantic Web community: (1) data publishers who operate with limited resources may circumvent hosting costs once their dataset reaches a sufficient demand, i.e., a substantial number of seeds (peers that upload content for other peers), and (2) data consumers should see the availability of datasets increase with larger swarm sizes (a swarm being all active peers in a p2p network sharing the same torrent). From the data publisher's perspective, a p2p content delivery network (CDN) operates at a fraction of the cost of the equivalent centralized approach because a p2p CDN only incurs the cost of the initial seeds. From the data consumer's perspective, as the number of seeds increases, so does the probability of faster content delivery for any new peer. While other works in the literature have focused on distributed RDF repositories [2,3,1,4], our approach is to deliver the entire dataset, or a complete selection of the dataset, to each peer so that applications can continue to function offline. We have also yet to see a p2p implementation of RDF persistence or querying in the browser.
Web Real-Time Communication (WebRTC, https://webrtc.org) is a set of protocols and APIs currently being standardized by the W3C to enable communication between clients, within the browser, over p2p connections. A demonstration of this technology, which is already implemented in most major browsers, is provided by the open-source JavaScript library WebTorrent (https://webtorrent.io), which offers many of the same features as traditional torrent programs. For example, incoming torrent data are available for use on arrival, meaning that applications can start operating on fragments of a dataset before the full file is downloaded. Although our prototype does not yet take advantage of this capability, a use case is demonstrated by the streaming video playback on the WebTorrent home page. We employ the WebTorrent library to demonstrate how WebRTC can be used to deliver RDF dataset content to clients within the browser.
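To make this delivery step concrete, the following minimal sketch outlines how a browser client might retrieve a dataset file over WebTorrent. The magnet URI, the single-file assumption, and the loadDataset hand-off are illustrative placeholders rather than a description of our actual implementation.

// A minimal sketch (not our production code) of fetching a binary-serialized
// RDF dataset from a p2p swarm inside the browser. Assumes webtorrent.min.js
// has been loaded, exposing the global `WebTorrent`.
const client = new WebTorrent();

// Magnet URI that a data publisher would announce alongside the dataset (placeholder).
const DATASET_MAGNET = 'magnet:?xt=urn:btih:...';

client.add(DATASET_MAGNET, (torrent) => {
  // We assume the torrent carries a single dataset file.
  const file = torrent.files[0];

  // getBuffer() fetches the file with high priority and invokes the callback
  // once the complete contents are available as a Buffer.
  file.getBuffer((err, buffer) => {
    if (err) throw err;
    loadDataset(buffer); // hypothetical hand-off to a browser-side query engine
  });

  // The client keeps seeding the dataset to other peers while the tab stays open.
  torrent.on('wire', () => console.log('connected to a new peer'));
});

Because every client that downloads the dataset this way also seeds it for as long as the page remains open, each additional data consumer enlarges the swarm described above.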
3 Querying RDF Datasets in the Browser

While several browser-based triplestores and query engines already exist [8,11,6,13], the major shortcomings of these JavaScript libraries concern the size of the datasets they can reliably handle, their query performance, or the overhead of acquiring a dataset prior to querying. These shortcomings can be narrowed down to the fact that no client-side implementation makes use of a binary serialization format that applies succinct data structures. Such serializations are intended to create compact storage footprints while also enabling high-performance query operations without the need for decompression. HDT [5,9] is the pioneering binary serialization format for encoding RDF datasets using compact data structures that can be queried in place, i.e., without transformations.

In order to drive forward real-time Semantic Web applications that take place in the browser, intelligent clients should be able to quickly and efficiently transmit, receive, and query RDF datasets with little overhead. As a proof of concept, we provide a working prototype (http://phuzzy.link/query/) that combines our browser-based RDF query engine, a connection to a p2p network, and RDF datasets encoded in a custom binary serialization format based on HDT. The web application initially downloads an RDF dataset by connecting to a p2p network via WebTorrent. We do not go into the details of our custom binary serialization format here, but the encoding software is written entirely in JavaScript and can therefore also be used by intelligent peers to create new RDF datasets from within the browser. Result times will vary depending on the machine used to trial our web application demo, but we believe that its performance illustrates the potential of this type of approach.

In its current state, our query engine works exclusively on basic graph patterns, i.e., sets of triple patterns and filters. In order to generate graph patterns, we designed a JavaScript API that follows a graph traversal query paradigm, much like the Gremlin Graph Traversal Language (http://tinkerpop.apache.org/gremlin.html). This low-level approach of building queries through an API offers several compelling advantages over the more user-friendly alternative of a query language such as SPARQL. First, the data structures we use to store RDF terms and RDF triples, as well as their complementary search algorithms, make API-built queries straightforward to construct and evaluate. More importantly, supporting a high-level query language such as SPARQL requires some library to handle parsing, query planning, and query optimization. With a graph traversal query paradigm, the costs associated with these processes are either avoided or significantly mitigated by the query paradigm itself, e.g., parsing is handled directly by the runtime system and query planning is essentially performed by the query author. Reducing costs at this stage of query execution is pivotal to our goal of competing with the time it takes distant endpoints to return query results. In the broader scheme of providing a usable query engine, we wish to start by offering the bare essentials for querying in the form of an API. Support for a more advanced query language such as SPARQL is therefore supplemental and appropriate for future work. A sample query is shown in Listing 1, and its equivalent SPARQL query in Listing 2. Documentation about our query API is available online (http://phuzzy.link/query/documentation).

(graph) => graph.pattern()
  .object('dbo:City').edgeIn('rdf:type')
  .subjects().mark('city')
  .forkOut({
    'dbo:isPartOf': e => e.object('dbr:California'),
    'rdfs:label': e => e.literals()
      .data('language', k => k.in('en')).mark('name'),
    'dbo:populationTotal': e => e.literals()
      .data('number', k => k.is('>=', 1e5).is('<', 2e5).save('population')),
  })
  .exit();

Listing 1. A sample query that shows how our low-level JavaScript API is used to create graph patterns. This query selects all cities in California that have a population between 100,000 and 200,000 from a selection of the DBpedia dataset.

select * {
  ?city a dbo:City ;
    dbo:isPartOf dbr:California ;
    rdfs:label ?name ;
    dbo:populationTotal ?population .
  filter(isLiteral(?name))
  filter(langMatches(lang(?name), "en"))
  filter(isNumeric(?population))
  filter(?population >= 100000 && ?population < 200000)
}

Listing 2. The equivalent SPARQL query to the graph pattern shown in Listing 1.
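To give a sense of how such a pattern might be driven from application code, the following hypothetical sketch evaluates the graph pattern of Listing 1 against a dataset delivered over the p2p network. The Dataset constructor, the query() method, and the shape of the result rows are assumptions made for illustration and do not reflect our engine's actual API (see the documentation linked above).

// Hypothetical usage sketch; `Dataset`, `query()`, and the row shape are
// assumptions for illustration, not the engine's actual API.

// `buffer` holds the binary-serialized dataset delivered via WebTorrent (Section 2),
// and `cityPattern` is the arrow function shown in Listing 1.
const dataset = new Dataset(buffer);
const rows = dataset.query(cityPattern);

// Each result row would bind the names captured by .mark() / .save() in the pattern.
for (const row of rows) {
  console.log(row.city, row.name, row.population);
}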
To illustrate how such an RDF query engine compares to a traditional triplestore, the query shown in Listing 1 yielded 45 results from a dataset consisting of 4.15 million triples (all things of type dbo:City) in 329 ms on a modern laptop while consuming 421 MiB of memory (see Figure 1), compared to the same 45 results obtained in 365 ms on a cold, local Fuseki instance loaded with the exact same dataset while consuming 2.43 GiB of memory. The most likely explanation for the comparable performance and smaller memory footprint of our browser-based query engine, and the reason such comparisons are not fair to draw conclusions from, is that our system operates on a read-only snapshot of an efficiently packed dataset file with pre-built indexes. Nonetheless, the comparable query result times are a good indication that browser-based query engines may indeed have the strength to compete with distant endpoints, where clients would normally be confronted by network latency, network throughput, and server-side resource limitations.

Fig. 1. A snapshot of the demo interface showing the sample query from Listing 1 on the left, evaluated on a selection of the DBpedia dataset consisting of 4.15 million triples, with the results shown on the right.

As these tools are still under active research and development, we are not yet ready to publish an evaluation of this work. However, we hope to demonstrate the potential implications of the associated methods. The source code is made available under the development branch on GitHub (https://github.com/blake-regalia/graphy.js/tree/development).

4 Closing Remarks

In this work, we took steps towards decentralizing the Semantic Web from entirely within the browser. We combined the concept of intelligent clients with peer-to-peer communication to create intelligent peers. We introduced our RDF query engine, which operates on binary serializations of RDF, in an effort to compete with the round-trip times it takes distant endpoints to process and return query results to the client. We hope to continue our research in this direction so that datasets persisting in a decentralized architecture can come to fruition within the browser and be used and queried by data consumers offline.

References

1. Arumugam, M., Sheth, A.P., Arpinar, I.B.: Towards peer-to-peer semantic web: A distributed environment for sharing semantic knowledge on the web (2002)
2. Cai, M., Frank, M.: RDFPeers: A scalable distributed RDF repository based on a structured peer-to-peer network. In: Proceedings of the 13th International Conference on World Wide Web, pp. 650-657. ACM (2004)
3. Cai, M., Frank, M., Yan, B., MacGregor, R.: A subscribable peer-to-peer RDF repository for distributed metadata management. Web Semantics: Science, Services and Agents on the World Wide Web 2(2), 109-130 (2004)
4. Dimartino, M.M., Calì, A., Poulovassilis, A., Wood, P.T.: Peer-to-peer semantic integration of linked data. In: CEUR Workshop Proceedings, vol. 1330, pp. 213-220 (2015)
5. Fernández, J.D., Martínez-Prieto, M.A., Gutiérrez, C., Polleres, A., Arias, M.: Binary RDF representation for publication and exchange (HDT). Web Semantics: Science, Services and Agents on the World Wide Web 19, 22-41 (2013)
6. Hernández, A.G., García, M.: A JavaScript RDF store and application library for linked data client applications. In: Devtracks of the WWW2012 Conference, Lyon, France (2012)
7. Kun, Y., Xiao-Ling, W., Ao-Ying, Z.: Underlying techniques for web services: A survey. Journal of Software. Citeseer (2004)
8. Maccioni, A., Collina, M.: Graph databases in the browser: Using LevelGraph to explore New Delhi. Proceedings of the VLDB Endowment 9(13), 1469-1472 (2016)
9. Martínez-Prieto, M.A., Gallego, M.A., Fernández, J.D.: Exchange and consumption of huge RDF data. In: Extended Semantic Web Conference, pp. 437-452. Springer (2012)
10. Staab, S., Stuckenschmidt, H.: Semantic Web and Peer-to-Peer: Decentralized Management and Exchange of Knowledge and Information. Springer (2006)
11. Uchida, H., Swick, R., Sambra, A.: The web browser personalization with the client side triplestore. In: International Semantic Web Conference, pp. 470-485. Springer (2014)
12. Verborgh, R., Vander Sande, M., Colpaert, P., Coppens, S., Mannens, E., Van de Walle, R.: Web-scale querying through linked data fragments. In: LDOW (2014)
13. Verborgh, R., Vander Sande, M., Hartig, O., Van Herwegen, J., De Vocht, L., De Meester, B., Haesendonck, G., Colpaert, P.: Triple pattern fragments: A low-cost knowledge graph interface for the web. Web Semantics: Science, Services and Agents on the World Wide Web 37, 184-206 (2016)