Storage Balancing in P2P Based Distributed RDF Data Stores

Maximiliano Osorio and Carlos Buil-Aranda*

Universidad Técnica Federico Santa María, Valparaíso, Chile.
{mosorio,cbuil}@inf.utfsm.cl

* Supported by the Millennium Nucleus Center for Semantic Web Research under Grant NC120004.

Abstract. Centralized RDF repositories have been designed to support RDF data storage and retrieval. However, they suffer from the traditional limitations of centralized approaches, namely scalability and fault tolerance, especially in a Web scenario. Peer-to-Peer (P2P) networks can provide the scalability, fault tolerance and robustness that current solutions for local RDF storage do not. A common strategy in state-of-the-art P2P RDF data stores is to store each triple at three locations, so that it can be found by a look-up on its subject, predicate or object identifier. A major issue with this strategy is the lack of load balancing, since the values occurring in triples are not uniformly distributed. Consequently, this issue leads to an unbalanced query-processing load and an unfair storage load in the network. To solve this problem we propose a new scheme that splits the data of overloaded nodes across neighboring nodes, and we use a Prefix Hash Tree, a distributed trie in which each node is labeled with a recursively defined prefix, to access such data. We provide an empirical evaluation of our approach and compare it with other state-of-the-art systems for storage balancing, showing the feasibility of our approach.

1 Introduction

Semantic Web applications rely on data that is stored in centralized RDF repositories and look-up systems. These systems have been designed to support RDF data storage and access; however, they normally suffer from the traditional limitations of such approaches, namely scalability and fault tolerance [1]. These limitations become more evident in a Web environment due to the large number of clients concurrently accessing the data stored in them. As an alternative to these centralized systems, P2P approaches have been proposed to overcome some of their limitations by building (fully) decentralized data storage and retrieval systems [2].

P2P networks provide the scalability, fault-tolerance and robustness features needed by Internet applications. However, it is crucial for a P2P system to distribute the data evenly over the network; otherwise, the P2P system degenerates into a centralized one. This problem already exists in some of the state-of-the-art RDF-P2P systems such as Atlas [3], and is a major concern in such networks. In this paper, we propose a structure for indexing triples using Distributed Hash Tables (DHTs) and a Prefix Hash Tree (PHT). Our approach is based on evenly distributing the excess data of overloaded nodes across neighboring nodes based on a dynamically changing threshold. To maintain fast access to the stored data, we propose the use of a ring-based P2P system and a distributed data structure called Prefix Hash Tree (PHT). In the PHT, each node of the trie has a label with a prefix that is defined recursively, allowing fast access to the nodes in the network that contain the related data.

The rest of the paper is organized as follows: in Section 2 we describe the existing state of the art related to RDF storage in P2P networks. Next, in Section 3 we present our solution to the problems identified in Section 2. We evaluate our approach in Section 4 and finally we present our conclusions in Section 5.
2 Related work

The Resource Description Framework (RDF) [4] is the data model recommended by the W3C, aiming to improve the World Wide Web with machine-processable semantic data for data interchange on the Web. The notion of an RDF triple is the basic building block of the RDF model. It consists of a subject (s), a predicate (p) and an object (o). More precisely, given a set of IRI [5] references I, a set of blank nodes B, and a set of literals L, a triple (s, p, o) ∈ (I ∪ B) × I × (I ∪ B ∪ L). The subject of a triple denotes the resource that the statement is about, the predicate denotes a property of the subject and the object presents the value of the property.

Centralized RDF repositories and lookup systems such as Jena^1, RDFDB^2 or Virtuoso^3 have been designed to support RDF data storage and retrieval. Although these systems are highly optimized RDF stores, they suffer from the traditional limitations of centralized approaches, such as scalability and fault tolerance in an open Web scenario [1]. As an alternative to these centralized systems, P2P approaches [6,7,8] have been proposed to overcome some of these limitations by building decentralized data storage and retrieval systems [2].

^1 http://jena.apache.org/documentation/rdf/
^2 https://github.com/scor/rdfdb
^3 https://virtuoso.openlinksw.com/

P2P networks, and especially distributed hash tables (DHTs [9], hash tables in which the responsibility for maintaining the mapping from keys to values is distributed among the nodes in the network), have gained much attention recently [2]. The reasons are their scalability, fault tolerance and robustness, key features for most Internet and Web applications. In such networks, the responsible node for a key is the node that stores the data items (here, RDF triples) associated with that key. Some RDF systems use DHTs to store and query RDF data at Internet scale, such as RDFPeers [10], Atlas [3], 3rdf [11] and GridVine [12]. We can classify these solutions according to their overlay structure (ring-based, cube-based, tree-based and generic).

RDFPeers [10] and Atlas [3] use a ring-based structure. In a DHT, the identifier of each peer and data item (e.g., the network address or the file name) is hashed to a key in the key space [0, 2^m), for a typical constant m = 128, i.e., 128-bit keys. A peer is then assigned all the data whose hash key lies between its own hash key and the next larger hash key of another peer in the key-space ring. In this way, with high probability, the key-space range assigned to any peer exceeds the average by at most a factor of O(log n).

GridVine [12] and 3rdf [11] are distributed RDF systems using search-tree-based overlay networks, namely the P2P systems P-Grid [13] and 3nuts [14]. The difference between a search tree and a hash table is that the former omits any hashing of data keys. The objective of this approach is to preserve the order of data keys in the key space, which enables efficient range queries. The trade-off is that this class of overlays is more complex, with a larger routing structure that is more difficult to maintain [15].

Ring-based solutions index each triple three times, once per triple component: triples are disseminated to the nodes in the network by computing the hash of the subject, the predicate and the object, and sending the triple to the corresponding responsible nodes. This indexing technique makes it possible to find triples by any search criterion, as long as there exists at least one constant in the triple pattern [16].
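To make this three-fold indexing concrete, the following Python sketch (our own illustration, not code from any of the cited systems; dht_key, responsible_node and index_triple are hypothetical names) hashes each component of a triple into the key space [0, 2^m) and assigns the triple to the successor node of each resulting key:

```python
import hashlib

M = 128  # key-space size is [0, 2**M), as in the ring overlays above

def dht_key(value: str) -> int:
    """Hash one triple component into the DHT key space [0, 2**M)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()  # 128-bit digest
    return int.from_bytes(digest, "big") % (2 ** M)

def responsible_node(key: int, node_ids: list) -> int:
    """Chord-style successor rule: the first node id >= key,
    wrapping around the ring if the key exceeds every node id."""
    for nid in sorted(node_ids):
        if nid >= key:
            return nid
    return min(node_ids)  # wrap around the ring

def index_triple(s: str, p: str, o: str, node_ids: list) -> list:
    """Return the three (component, responsible node) placements of a triple."""
    return [(c, responsible_node(dht_key(c), node_ids)) for c in (s, p, o)]
```

Note that under this scheme every triple sharing a component, e.g. the predicate rdf:type, lands on the same responsible node, which is exactly the source of the skew discussed next.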
The frequency distribution of triple components is not uniform; thus, it is possible that a responsible peer becomes heavily loaded. In [10] the authors approach the storage balancing problem by simply not indexing the most common triple components, such as the rdf:type predicate; the requesting node must then find an alternative way of resolving the query by asking another target node. The cost of such an approach is that part of the complete result may be lost.

To balance RDF triples more efficiently, the authors in [17] propose the use of four RDF databases in each peer of the network (local triples, received triples, generated triples, and replica triples databases). The local triples database stores RDF triples that originate at the particular node. The received triples database stores triples received from other nodes: local triples are regularly disseminated through the network by computing the hash of their subjects, predicates and objects and sending them to the corresponding responsible nodes. Each node also hosts a database for generated triples, which originate from forward chaining. Finally, the replica database holds a copy of the data of the nodes whose IDs are nearest to the target hash value determined by the hash function. The authors address load balancing by building an overlay tree over the DHT: if a node detects that it is overloaded, it performs a split operation in which half of the triples remain in the local part of the received triples database and half are moved to a new node. This approach allows easier access to the data; however, the overhead generated by maintaining multiple databases in each node of the network, together with the use of a forward-chaining approach for RDF(S) reasoning, results in higher storage and bandwidth costs for a single peer [18].

Atlas [8] is a distributed RDF system that uses DHTs for distributing RDF data across the peers. For storing an RDF triple, Atlas sends three DHT put requests, using the subject, the predicate and the object respectively as keys and the triple itself as the item. Each key is hashed to create the identifier that leads to the responsible node where the triple is stored. Atlas also suffers from load imbalances [18]: since some triple components are much more frequent than others, for instance rdf:type and rdfs:label, the peer responsible for such a key stores far more triples, and the built-in load balancing is not able to even out this higher load.

In [16] the authors propose a "3-tuple index" scheme in which three combined routing indexes are created over the subject, predicate and object components of triples. The resulting combinations are subject+predicate, predicate+object and object+subject.

In summary, the main approach for storing and accessing RDF triples in a P2P network is to use DHTs. However, one critical problem is that the amount of data stored at each node is not correctly balanced, leading to extra processing by some nodes. We now proceed to describe our solution to this problem for ring-based systems, taking Atlas as the basis.

3 System model and data model

An obvious approach to solve the node saturation problem described above is to add a threshold indicating the maximum number of triples that a node can store. Once a node reaches that threshold, it splits its data into two child nodes, redistributing the data keys equally among the children as in P-Grid [13] (without using most of P-Grid's features, such as its routing layer). After the split, each child stores half of the triples, and a child can itself be split again if it reaches the threshold.
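The following Python sketch (our own illustration of the scheme just described; the Node class and its methods are hypothetical, and keys are assumed to be binary strings over {0, 1}) shows the threshold-based split: a leaf that exceeds B triples creates two children labeled with its prefix extended by 0 and 1, and redistributes its triples between them by the next bit of each key.

```python
B = 1000  # maximum number of triples a node may store (the threshold)

class Node:
    def __init__(self, label=""):
        self.label = label     # binary prefix labeling this trie node
        self.triples = []      # (key_bits, triple) pairs stored locally
        self.children = None   # None for a leaf; (left, right) after a split

    def insert(self, key_bits, triple):
        """Store a triple under its binary key, splitting when full."""
        if self.children is not None:                 # internal node: route by
            branch = int(key_bits[len(self.label)])   # the next bit of the key
            self.children[branch].insert(key_bits, triple)
            return
        self.triples.append((key_bits, triple))
        if len(self.triples) > B:                     # threshold exceeded
            self._split()

    def _split(self):
        """Create children labeled label+'0' and label+'1' and redistribute
        the locally stored triples between them by their next key bit."""
        children = (Node(self.label + "0"), Node(self.label + "1"))
        pending, self.triples, self.children = self.triples, [], children
        for key_bits, triple in pending:
            children[int(key_bits[len(self.label)])].insert(key_bits, triple)
```

Each split refines the prefix by one bit, so a heavily skewed key ends up isolated in a deeper, narrower subtree instead of overloading a single DHT node.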
Now the problem is to find a triple in the tree, since a triple can be placed at any node of the tree. To solve this problem, we propose to use a distributed data structure called Prefix Hash Tree (PHT). In a PHT, each node of the trie has a label with a prefix that is defined recursively, allowing fast access to the nodes in the network that contain the related data. Consider a node with label l, whose left and right children have labels l0 and l1, respectively. The tree satisfies the following properties:

1. Each node has either 0 or 2 children.
2. A key K is stored at the leaf node whose label is a prefix of K.
3. Each leaf node stores at most B keys.
4. Each internal node contains at least (B + 1) keys in its sub-tree.

The PHT structure is essentially a binary trie in which each trie node is mapped to a node of the DHT ring. Atlas uses a mapping dictionary, which assigns a unique integer value to each triple component. Since URIs and literals may consist of long strings, these are mapped to integer values, which are then used in the triple storage; query evaluation is performed over these integer values [8]. Following this idea, the domain for the indexing is {0, 1}^D, and triples are stored according to the binary representation of their integer value. Figure 1 illustrates two cases: in the first case, the triple (artist, o, p) is stored at node n2, which is able to store it; in the second case, the triple (student, o, p) should be stored at node n3, but n3 has reached its maximum size, so it uses the PHT and the triple is stored at n6.

Fig. 1: Triple (artist, o, p) is stored in n2; in this case node n2 can store the triple. Triple (student, o, p) has to be stored in n3; n3 cannot store the triple, so n3 uses the PHT and the triple is stored in n6.

Notice that this problem cannot be solved by simply redistributing the triples across the network again, since the hash function used to distribute the data would keep returning the same keys and some nodes would still be overloaded, besides generating more complexity at search time.

3.1 Operations

Lookup. Given a key K (a part of a triple), the PHT lookup returns the unique leaf node leaf(K). To find the triple, our implementation uses a binary search over prefix lengths: if the current prefix corresponds to an internal node, the search tries a longer prefix, and if the current prefix does not correspond to an existing node, the search tries a shorter prefix; the search terminates when it reaches a leaf. A sketch of this procedure is shown below.

Query. Forwarding queries within the tree does not require expensive DHT routing and can be done via direct communication. A query reaches its destination in O(log(N) + d) steps, where N is the number of nodes in the DHT network and d is the depth of the tree.

Insert/Delete. Insertion is a common operation in the system. When a new triple has to be inserted, the inserting node computes the hash of the subject, the predicate and the object, and sends the triple to the responsible nodes. When a node receives a triple for either insertion or deletion, it verifies whether the threshold limit B has been reached. If not, the node stores the triple; otherwise, the node performs a lookup to find the new responsible node and saves the triple there, provided that node has not itself reached the maximum number of triples B. The corresponding integer value of the mapping dictionary is saved in the Atlas local database. Similarly, a deletion can cause a subtree to collapse into a single leaf node, after which the corresponding balancing operations are executed.
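The following Python sketch illustrates the Lookup operation (our own illustration, reusing the hypothetical Node class from the previous sketch and assuming a dht_get(prefix) primitive, hypothetical here, that returns the trie node published under a prefix label, or None if no such node exists):

```python
def pht_lookup(key_bits, dht_get):
    """Binary search over prefix lengths to locate leaf(K).

    dht_get(prefix) is assumed to return the trie node published under
    that prefix label, or None when no such node exists in the DHT.
    """
    lo, hi = 0, len(key_bits)
    while lo <= hi:
        mid = (lo + hi) // 2
        node = dht_get(key_bits[:mid])
        if node is None:                 # no node under this prefix: go shorter
            hi = mid - 1
        elif node.children is not None:  # internal node: the leaf lies deeper
            lo = mid + 1
        else:                            # leaf whose label is a prefix of K
            return node
    return None  # unreachable in a well-formed PHT covering the key space
```

Each probe is an independent DHT look-up, so the search issues O(log D) probes in the worst case, where D is the length of the binary key.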
4 Performance Analysis

We compared our solution for storage balancing with the Atlas algorithm and with the approach in [16]. To do so, we simulated a P2P network with 1,000 nodes and distributed 1,000,000 triples from DBpedia among these nodes; the frequency of the most common triple components is shown in Table 1. We evaluated how well the data is distributed across these 1,000 nodes, and we leave the evaluation of query execution over a complete SPARQL benchmark as future work. The source code for our evaluation and the data used can be found at http://inf.utfsm.cl/~mosorio/p2p_rdf_balance.

Frequency   subject/object/predicate
1000000     rdf:type
7945        Category:Living people
1868        Category:Articles containing video clips
1805        Category:Townships in Minnesota
1343        Category:American films
1246        Category:Towns in Wisconsin
1212        Category:English-language films
1045        Category:Townships in Michigan
944         Category:Townships in Pennsylvania
928         Category:Cities in Iowa
917         Category:Villages in Illinois
915         Category:Cities in Texas
890         Category:Towns in New York
842         Category:Cities in Minnesota

Table 1: Frequency per triple component

Storage balancing. We analyze the effect of the data indexing algorithms of Atlas and [16]. The goal is to evaluate the quality of the distribution produced by each solution. Figure 2 presents the results for each algorithm when indexing 1 million triples from DBpedia. Figures 2a and 2b show the results for Atlas's algorithm. In Figure 2a there is an outlier point corresponding to the predicate rdf:type. Figure 2b shows the same results after removing the node responsible for rdf:type, but multiple outliers remain. Figure 2c shows the results for the mixed approach proposed in [16]; compared with Atlas's results we see an improvement in the distribution, but multiple outliers are still present. Finally, Figure 2d shows the results for our algorithm using a maximum of B = 1000 triples per node; in comparison with the other two algorithms, the triples are well distributed and there are no outliers in the results.

In our solution, the parameter B determines when data redistribution should happen (and can be adjusted automatically). We therefore evaluate the algorithm with different configurations of B. Figures 2d, 2e and 2f show the results for different values of B. In our approach, the tree's depth varies with the value of B. Table 2 shows the tree depths obtained for different values of B and different dataset sizes; all of them are low. Moreover, the depth of the tree decreases as B increases. If the number of triples per node increases over time, the parameter B can be adjusted dynamically, making the scheme adaptive.

B       100,000 triples   1 million triples   10 million triples
1000          2                   6                   10
5000          2                   2                    6
10000         1                   2                    5

Table 2: Tree depth for different numbers of triples and values of B

5 Conclusions

In this paper we discussed indexing techniques for RDF in P2P networks. We identified a storage balancing problem in Distributed Hash Table (DHT) based RDF stores, and we proposed a solution to it using a distributed data structure called the Prefix Hash Tree (PHT). Using the PHT we are able to achieve a better storage distribution than the existing techniques in the literature, as we show empirically. We evaluated the PHT using different configurations, demonstrating that it can achieve a fair and stable distribution of the RDF data across the P2P network.
Finally, we showed that the tree's depth remains low and, considering that the worst-case message complexity of this solution is O(log(N) + d), we can conclude that the new scheme does not significantly increase time complexity. In summary, the proposed algorithm limits the maximum number of triples stored by each node; if a node reaches that maximum, it performs a split operation. Then, using the PHT, a node can query and find any triple in the tree. The time overhead of using the PHT together with the threshold is low: O(log(N) + d).

Fig. 2: Indexing 1 million triples from DBpedia (each plot shows the number of triples stored per node). (a) Algorithm used by Atlas: we see a node which is storing more than 1M triples, containing the rdf:type predicate. (b) If we remove the node responsible for the rdf:type predicate we see a more balanced distribution of data; however, there are several nodes which store large amounts of triples (10,000+ triples). (c) Using the 3-tuple index algorithm [16] we observe a data distribution similar to 2b. (d) Using our algorithm with a maximum of B=1000 triples per node we do not see any node storing large amounts of triples. (e) Using our algorithm with a maximum of B=5000 triples per node we see a few nodes storing medium amounts of triples. (f) Using our algorithm with a maximum of B=10000 triples per node we observe a data distribution similar to 2e.

References

1. C. Buil-Aranda, A. Hogan, J. Umbrich, and P.-Y. Vandenbussche, "SPARQL Web-Querying Infrastructure: Ready for Action?," in ISWC 2013, pp. 277–293, 2013.
2. I. Filali, F. Bongiovanni, F. Huet, and F. Baude, "A survey of structured P2P systems for RDF data storage and retrieval," in Transactions on Large-Scale Data- and Knowledge-Centered Systems III, pp. 20–55, Springer, 2011.
3. Z. Kaoudi, M. Koubarakis, K. Kyzirakos, I. Miliaraki, M. Magiridou, and A. Papadakis-Pesaresi, "Atlas: Storing, updating and querying RDF(S) data on top of DHTs," Web Semantics: Science, Services and Agents on the World Wide Web, vol. 8, no. 4, pp. 271–277, 2010.
4. "RDF - Semantic Web Standards." https://www.w3.org/RDF/. (Accessed on 06/20/2017).
5. M. Dürst and M. Suignard, "Internationalized Resource Identifiers (IRIs)," tech. rep., 2004.
6. W. Nejdl, B. Wolf, C. Qu, S. Decker, M. Sintek, A. Naeve, M. Nilsson, M. Palmér, and T. Risch, "Edutella: a P2P networking infrastructure based on RDF," in Proceedings of the 11th International Conference on World Wide Web, pp. 604–615, ACM, 2002.
7. E. Della Valle, A. Turati, and A. Ghioni, "PAGE: A distributed infrastructure for fostering RDF-based interoperability," in DAIS, vol. 6, pp. 347–353, Springer, 2006.
8. Z. Kaoudi, I. Miliaraki, and M. Koubarakis, "RDFS reasoning and query answering on top of DHTs," The Semantic Web - ISWC 2008, pp. 499–516, 2008.
9. H. Balakrishnan, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica, "Looking up data in P2P systems," Communications of the ACM, vol. 46, no. 2, pp. 43–48, 2003.
10. M. Cai and M. Frank, "RDFPeers: a scalable distributed RDF repository based on a structured peer-to-peer network," in Proceedings of the 13th International Conference on World Wide Web, pp. 650–657, ACM, 2004.
11. L. Ali, T. Janson, and G. Lausen, "3rdf: Storing and querying RDF data on top of the 3nuts overlay network," in Database and Expert Systems Applications (DEXA), 2011 22nd International Workshop on, pp. 257–261, IEEE, 2011.
12. K. Aberer, P. Cudré-Mauroux, M. Hauswirth, and T. Van Pelt, GridVine: Building Internet-Scale Semantic Overlay Networks, pp. 107–121. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004.
13. K. Aberer, P. Cudré-Mauroux, A. Datta, Z. Despotovic, M. Hauswirth, M. Punceva, and R. Schmidt, "P-Grid: a self-organizing structured P2P system," ACM SIGMOD Record, vol. 32, no. 3, pp. 29–33, 2003.
14. T. Janson, P. Mahlmann, and C. Schindelhauer, "3nuts: A locality-aware peer-to-peer network combining random networks, search trees, and DHTs," 2009.
15. L. Ali, T. Janson, G. Lausen, and C. Schindelhauer, "Effects of network structure improvement on distributed RDF querying," in International Conference on Data Management in Cloud, Grid and P2P Systems, pp. 63–74, Springer, 2013.
16. L. Ali, T. Janson, and C. Schindelhauer, "Towards load balancing and parallelizing of RDF query processing in P2P based distributed RDF data stores," in Parallel, Distributed and Network-Based Processing (PDP), 2014 22nd Euromicro International Conference on, pp. 307–311, IEEE, 2014.
17. D. Battré, F. Heine, A. Höing, and O. Kao, "Load-balancing in P2P based RDF stores," in 2nd Workshop on Scalable Semantic Web Knowledge Base Systems, 2006.
18. Z. Kaoudi, Distributed RDF Query Processing and Reasoning in Peer-to-Peer Networks. PhD thesis, 2011.