Querying RDF data in Networks of Web
                   Browsers

             Arnaud Grall1,2 , Hala Skaf-Molli1 , and Pascal Molli1
                    1
                      LS2N – University of Nantes, France
         {arnaud.grall,hala.skaf, pascal.molli}@univ-nantes.fr
      2
        GFI Informatique - IS/CIE, Nantes, France {arnaud.grall@gfi.fr}


      Abstract. Web browsers represent an under-exploited data deposit. In
      this paper, we propose Snob, a decentralized query execution engine for
      SPARQL query execution over RDF data hosted in a P2P network of
      web browsers.


1   Introduction
Web Browsers are certainly the most deployed execution environment in the
world and currently represent an under-exploited data deposit. Browsers are
in direct contact with end-users, they are able to capture their location, their
history of browsing and their perceptions of the real world. However, querying
data over a large-scale network of browsers is challenging.
    Many works address the problems of data management in Peer-to-Peer net-
works [6]. However, decentralizing data in browsers raises some specific issues.
First, a network of browsers gathers a very large number of browsers with few
RDF data hosted in browsers. This raises the problem of source selection when
executing a SPARQL query over a very large number of relevant sources [3].
Second, a large number of browsers are now running on mobile phones. The
query execution has to save bandwidth and battery. Finally, connections be-
tween browsers rely on the WebRTC3 standard that does not have routing.
Consequently, communication costs with distant neighbors are expansive. There-
fore, during query execution direct neighbors must be considered as privileged
relevant data sources.
    In this paper 4 , we propose Snob, an approach for executing SPARQL queries
over a network of browsers5 . We assume that browsers are connected in an un-
structured network based on periodic peer-sampling [5], i.e., each browser is
connected to a bounded random subset of the network that is renewed periodi-
cally during a shuffling. A SPARQL query running in a browser can be seen as a
federated SPARQL query executed after each shuffling where neighbors are the
3
  https://www.w3.org/TR/webrtc/
4
  This work was partially funded by the French ANR project O’Browser (ANR-16-
  CE25-0005-01). Mr. Grall is funded by the GFI Informatique compagny (145 Boule-
  vard Victor Hugo, 93400, Saint Ouen, France).
5
  Snob is described with more details in [2]
data sources. As data sources are renewed every shuffling, the query execution
will eventually continue to produce new results after each shuffling. This ensures
that the number of messages exchanged with neighbors is constant per shuffling
and every query makes progress at each shuffling. However, as progression can
be slow, we build a second overlay network where browsers are connected to a
fixed number of browsers that process similar queries.
    This paper presents the following contributions: (i) Snob a SPARQL query
execution model over RDF data hosted in a network of browsers. (ii) A semantic
overlay network based on query containment. (iii) Experimentation shows that
the number of results produced by queries grows with the number of running
queries in the network. The semantic overlay network is able to speed up the
number of produced results when fewer queries are running simultaneously.


2    The Snob Approach

Snob relies on four key ideas:
    1. Browsers are connected through an unstructured network (RPS network)
based on periodic peer-sampling [5], i.e., periodically, each browser shuffles its
local view on the network with the local view of its neighbors. This prevents
the network to be partitioned. Compared to structured networks, unstructured
network tolerates high churn and high expressiveness of queries [6][chapter 16].
    2. A browser evaluates its queries as federated queries using only its direct
neighbors as data sources. Later, it waits for the next shuffling that will bring
new data sources. Such approach regulates the network traffic, i.e., a browser
can only send a number of messages bounded to the number of direct neighbors
per shuffling.
    3. To speed up query execution, we promote the sharing of intermediate re-
sults. Sharing intermediate results replicates and aggregates data. Consequently,
the probability that a browser meets another one with relevant intermediate re-
sults after l shuffling is improved [4].
    4. A Semantic Overlay Network (SON) [1] built on top of the unstructured
network that selects as neighbors, for each browser, the k-best browsers process-
ing similar queries. The profile of a browser is used for ranking them. A profile
is the set of the triple patterns of the SPARQL queries under processing. The
similarity of two profiles is based on triple patterns containment relationships
(v). Given 2 triple patterns tpi , tpj , we define a scoring function St (tpi, tpj) : N
such that : St (tpi v tpj ∧ tpj v tpi)  St (tpi v tpj)  St (tpi w tpj). So
given
P       two browsers profiles Bi , Bj , we define a scoring function Sb (Bi , Bj ) =
   tpi ∈Bi ,tpj ∈Bj St (tpi , tpj ). After each shuffling, each browser recomputes its k-
best neighbors with new neighbors discovered by the RPS.
    Figure 1 shows browsers hosting data from Diseasome and Linked MDB.
Browsers B1 − B4 executes queries over Diseasome, B6 − B9 over LinkedMDB
and B5 over both. The RPS network ensures that all browsers are connected
through a random graph, browsers profiles make the clustered network converg-
ing towards two communities. B1 − B4 will be highly connected because they
                              b6
               b3
     b1                                      b9
                         b5                       SON
                                   b7   b8
          b2
                    b4

                              b6
               b3
     b1                                      b9
                         b5                       RPS
                                   b7   b8
          b2
                    b4


    Fig. 1: Random Peer Sampling and Se-
    mantic Overlay Networks              Fig. 2: Average query completeness.
execute queries over Diseasome, while B6 − B9 will be grouped together due
their interest in LinkedMDB. B5 is connected to both communities.

Query Execution At a given time, a browser Bi has a fixed number of neigh-
bors: k neighbors in the random peer sampling layer, and l neighbors in the
semantic overlay network. Typical values of k and l are logarithmic to the size of
the whole network. Each browser exposes to its neighbors a triple pattern inter-
face. The browser is able to process an incoming triple pattern query and return
results to the triple pattern query originator. Incoming triple pattern queries are
processed over local data and intermediate results.
    Suppose that the browser Bi processes the query Qi . Qi is processed like a
federated query over its k+l neighbors considered as data sources. In this setting,
there are few differences with federated query processing: (i) The federation is
incomplete, i.e., the federation will change at the next shuffling. Therefore, Bi
has to keep intermediate results for continuing processing when new sources are
discovered. (ii) Shuffling could bring already visited data sources, even these
browsers have been already visited, queries have to be re-executed. Because
intermediate results hosted by visited browsers could have been changed.
    As the size of local data is small in our context, after each shuffling, Bi
processes as following: (i) for each triple pattern tpj of query Qi , evaluates tpj
over all its neighbors and updates its datasets. (ii) Execute Qi and eventually
produces new results. (iii) Sleep until next shuffling and only answers to triple
pattern requests from direct neighbors.


3     Experimental Study

Snob source code is available at 6 . Snob uses RDFStore-js as a local data store
and local query engine. We use the real datasets Diseasome7 and Linked Movies
Database8 (LinkedMDB). We generated 100 queries per dataset using PATH and
STAR shaped templates with two to eight triple patterns that are instantiated
with random values from the dataset. We removed the queries that caused the
6
  https://github.com/folkvir/webrtc-dequenpeda
7
  https://old.datahub.io/dataset/fu-berlin-diseasome
8
  http://data.linkedmdb.org/
query engine to abort execution. This results in 100 queries from LinkedMDB
and 96 queries from Diseasome. For these queries, we extracted triple patterns,
each triple pattern of the query is executed as a SPARQL construct query and
produces a fragment that we split into two sub-fragments. Sub-fragments are
randomly distributed across the clients, each client hosts at least one fragment.
     We set up a network of 196 clients in two configurations: one with only RPS
where each client has 10 random neighbors, and another with RPS+SON where
each browser has 5 random neighbors and 5 neighbors in SON. Results presented
are the average of three successive executions of 100 rounds. A round corresponds
to a query execution after a periodic shuffle where the shuffling time is set to 10
minutes. Figure 2 presents the pourcentage of completeness of query answers,
i.e. the pourcentage of produced answers w.r.t the total number of answers, per
round for three different configurations. The total number of queries answers
is computed in a centralized setting where all data are loaded. Firstly for 49
queries (quarter), secondly for 98 queries (half) and finally for 196 queries (all).
As we can see, RPS+SON provides better query completeness when less queries
are running in the network. When every browser runs a query, then proportional
replication of intermediate results is enough.


4    Conclusion
In this paper, we proposed Snob, a query execution model for SPARQL query
over RDF data hosted in a network of browsers. Snob allows semantic appli-
cation developpers to exploit RDF stored in browsers. This work opens several
perspectives. First, data exchange between browsers has to be optimized when a
data source is revisited to optimize the traffic. Second, intermediate results could
be streamed on the semantic overlay network. Finally, a DHT service could be
implemented in browsers to speed up the discovery of similar queries in the
network.


References
1. Crespo, A., Garcia-Molina, H.: Semantic overlay networks for p2p systems. In: In-
   ternational Workshop on Agents and P2P Computing. pp. 1–13. Springer (2004)
2. Grall, A., Skaf-Molli, H., Molli, P.: SPARQL Query Execution in Networks of Web
   Browsers (Jun 2018), http://hal.univ-nantes.fr/hal-01805154
3. Grubenmann, T., Bernstein, A., Moor, D., Seuken, S.: Challenges of source selection
   in the wod. In: International Semantic Web Conference. pp. 313–328. Springer (2017)
4. Lv, Q., Cao, P., Cohen, E., Li, K., Shenker, S.: Search and replication in unstruc-
   tured peer-to-peer networks. In: Proceedings of the 16th international conference
   on Supercomputing. pp. 84–95. ACM (2002)
5. Nédelec, B., Tanke, J., Frey, D., Molli, P., Mostéfaoui, A.: An adaptive peer-sampling
   protocol for building networks of browsers. World Wide Web pp. 1–33 (2017)
6. Özsu, M.T., Valduriez, P.: Principles of distributed database systems. Springer
   (2011)