<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>C-Set: a Commutative Replicated Data Type for Semantic Stores</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Khaled Aslan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pascal Molli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hala Skaf-Molli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stephane Weiss</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>INRIA Rennes</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Nantes</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Web 2.0 tools are currently evolving to embrace semantic web technologies. Blogs, CMSs, wikis, social networks and real-time notification services integrate ways to provide semantic annotations and therefore contribute to linked data and, more generally, to the semantic web vision. This evolution generates many semantic datasets of varying quality and trust levels that are only partially replicated. This raises the issue of managing consistency among these replicas. The issue is challenging because semantic data-spaces can be very large, they can be managed by autonomous participants, and the number of replicas is unknown. A new class of algorithms called Commutative Replicated Data Types (CRDTs) is emerging for ensuring eventual consistency of highly dynamic content on P2P networks. In this paper, we define C-Set, a CRDT specifically designed to be integrated into triple stores. C-Set allows efficient P2P synchronisation of an arbitrary number of autonomous semantic stores.</p>
      </abstract>
      <kwd-group>
        <kwd>scalability</kwd>
        <kwd>synchronisation</kwd>
        <kwd>peer-to-peer</kwd>
        <kwd>consistency</kwd>
        <kwd>replication</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        Web 2.0 tools are currently evolving to embrace semantic web technologies.
Blogs, CMSs, wikis, social networks and real-time notification services integrate ways
to provide semantic annotations and therefore contribute to linked data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
and, more generally, to the semantic web vision. This evolution generates many
semantic datasets of varying quality and trust levels that are only partially
replicated.
      </p>
      <p>
        The problem of updating and synchronizing data in the semantic web was
raised by Berners-Lee in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Efficient synchronization of semantic stores
organized in a peer-to-peer network can alleviate scalability problems and
allows different autonomous participants to collaboratively combine, improve and
enrich semantic datasets.
      </p>
      <p>
        Synchronizing semantic stores is challenging. Semantic data-spaces can be
very large, they can be managed by autonomous participants, and the number
of replicas is unknown and potentially high. Only a few replication algorithms can
fulfill these constraints, and they all belong to the optimistic replication class
of algorithms [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. An optimistic replication model considers an unknown
number of sites, each holding a copy of a shared object. An object can be modified
by applying an operation locally. Next, this operation is broadcast to the other
sites in order to be re-executed. The system is correct if it ensures the eventual
consistency property [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], i.e. all replicas are identical when the system is idle.
The Thomas write rule [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] was the first algorithm to ensure eventual consistency in
duplicated databases. However, the Thomas write rule requires knowing the
number of participants (in order to provide a safe garbage collection scheme).
This constraint is not compatible with our context.
      </p>
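      <p>The core of the Thomas write rule is a last-writer-wins policy: an assignment is applied only if its timestamp is newer than the one already stored. The following minimal Python sketch (all names are illustrative, not taken from [5]) shows this policy and why replicas converge regardless of reception order; it deliberately omits the deletion-notification scheme whose garbage collection requires knowing all participants:</p>

```python
# Minimal last-writer-wins register, sketching the Thomas write rule.
# Illustrative only: timestamps are plain integers; ties are ignored.
class LWWStore:
    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def apply(self, key, timestamp, value):
        """Apply a local or remote assignment only if it is newer."""
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

# Two replicas receiving the same assignments in different orders converge.
a, b = LWWStore(), LWWStore()
ops = [("x", 1, "draft"), ("x", 2, "final")]
for key, ts, val in ops:
    a.apply(key, ts, val)
for key, ts, val in reversed(ops):
    b.apply(key, ts, val)
assert a.store == b.store == {"x": (2, "final")}
```

      <p>The rule itself needs no site membership knowledge; it is only the safe removal of deleted entries that does, which is the objection raised above.</p>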
      <p>
        Recently, a new class of optimistic replication algorithms called CRDTs has
emerged [
        <xref ref-type="bibr" rid="ref14 ref7">7,14</xref>
        ]. CRDT stands for Commutative Replicated Data Type. CRDTs are
abstract data types that provide commutative operations, and they therefore
ensure eventual consistency regardless of the order in which operations are
received. CRDTs have been defined for arrays, linear sequences and trees.
      </p>
      <p>In this paper, we define a specific data type for sets called C-Set that can
be applied to a triple store, and we prove that this type ensures eventual
consistency. With a traditional set, the insert and delete operations do not commute. C-Set
provides commutative insert and delete operations, does not require causal
reception of operations, and alleviates some garbage collection problems. C-Set
is designed to be integrated within a semantic store in order to provide P2P
synchronisation of autonomous semantic stores.</p>
      <p>The paper is organized as follows: Section 2 presents background and related
work. Section 3 defines a new CRDT for sets designed for triple stores.
Section 4 discusses our approach. Section 5 summarizes contributions and presents
future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Background and related work</title>
      <p>Most previous work on replication in semantic P2P systems focused on sharing
RDF resources; it did not enable collaborative maintenance of RDF
stores. In the context of data sharing, some sites publish their RDF data while
other sites consume it, so there are no concurrent modifications or
operations. Collaborative systems, in contrast, consider an object shared between
peers. This implies that peers can modify the object concurrently and then
run a consolidation algorithm to ensure the consistency of the shared
object.</p>
      <p>
        Tim Berners-Lee and Dan Connolly proposed an ontology for the distribution
of differences between RDF graphs called Delta [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Their approach relies
on the standard serialization of RDF graphs into text files and then runs a diff
between the resulting text files. However, the authors do not detail how eventual
consistency can be efficiently reached by their algorithms.
      </p>
      <p>
        RDFGrowth [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and Publish/Subscribe Networks [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] focus on semantic data
sharing where only one peer can modify the shared knowledge while the others can
read it. However, as mentioned earlier, sharing is different from
collaboration. In sharing, some peers publish data while others can only read these
data, and concurrent updates are not managed. In collaborative working
environments, some peers publish data, others can read and write these data, and a
synchronization algorithm integrates concurrent updates. Collaborative
environments improve the quality of data, as the experience of the collaborative Wikipedia
demonstrates.
      </p>
      <p>
        Edutella [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] proposes an RDF-based metadata infrastructure for P2P
applications. It focuses on querying RDF metadata stored in distributed RDF
repositories, with the objective of providing access to distributed digital educational
resources. Edutella distinguishes between two kinds of peers, simple peers and super
peers. Simple peers provide data resources along with their schemas. Super peers are
used for several purposes, including data mediation, integration and query
routing. Edutella proposes a replication service; however, it does not specify how
metadata are replicated and synchronized.
      </p>
      <p>
        RDFSync [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] synchronizes a target RDF graph with a source graph. RDF
graphs are decomposed unequivocally into minimal subsets of triples (Minimum
Self-Contained Graphs, MSGs) and canonically represented by ordered lists of
the identifiers (hashes) of their composing MSGs. The synchronization algorithm
performs a diff between the ordered MSG lists of the source and the target.
RDFSync can perform three kinds of synchronization: in Target Growth Sync
(TGS) the target becomes equal to the merge of both graphs, in Target Erase
Sync (TES) the target deletes information unknown to the source, and
in Target Change Sync (TCS) the target becomes equal to the source. Figure 1
shows an example of RDF graph synchronization using RDFSync.
      </p>
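      <p>The hash-based synchronization can be pictured as set operations over MSG hash lists. The following Python sketch is illustrative only: it elides the MSG decomposition entirely and treats each MSG as a plain string, so the function names and structure are our own, not RDFSync's API:</p>

```python
import hashlib

# Illustrative sketch of RDFSync-style synchronisation: each graph is
# reduced to the hashes of its MSGs, and the three modes become set
# operations over those hashes. An "MSG" here is just a string.
def msg_hashes(msgs):
    return {hashlib.sha1(m.encode()).hexdigest(): m for m in msgs}

def sync(source_msgs, target_msgs, mode):
    src, tgt = msg_hashes(source_msgs), msg_hashes(target_msgs)
    if mode == "TGS":    # Target Growth: target gains everything from source
        tgt.update(src)
    elif mode == "TES":  # Target Erase: drop MSGs unknown to the source
        tgt = {h: m for h, m in tgt.items() if h in src}
    elif mode == "TCS":  # Target Change: target becomes equal to the source
        tgt = src
    return sorted(tgt.values())

assert sync(["a", "b"], ["b", "c"], "TGS") == ["a", "b", "c"]
```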
      <p>
        RDFPeers [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is a scalable distributed RDF repository based on a
structured P2P network. To enable fault tolerance, RDFPeers uses partial replication
of RDF data: every RDF triple is stored at three nodes of the network. However,
it is not explicitly specified what happens in the case of concurrent updates on
copies.
      </p>
      <p>
        The Thomas write rule [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] presents techniques by which a number of loosely
coupled processes can maintain duplicate copies of a database, despite the
unreliability of their only means of communication, and keep the copies
consistent. However, in order to remove old deleted entries ("garbage
collection"), the authors propose the following scheme: each site notifies the other
sites whenever it hears about a deletion. If these notifications are transmitted in
order with the "normal" sequence of modifications, then upon receipt of such a
notification a site can be sure that the sending site has delivered any outstanding
assignments to the deleted entry, has marked it as deleted, and will not generate
any new assignments to it. This implies knowledge of all the sites in the
system, a constraint that is not compatible with the P2P network context.
      </p>
      <p>In summary, P2P Semantic Web replication research focuses on knowledge
sharing and querying. No algorithm has been designed for collaborative editing
of RDF graphs; consequently, concurrent updates on RDF graphs are not taken
into account.</p>
    </sec>
    <sec id="sec-3">
      <title>C-Set definition</title>
      <p>
        A CRDT is a data structure where all concurrent operations commute, so it
requires no merge algorithm or integration procedure to enforce consistency.
The replicas of a CRDT converge automatically, without complex concurrency
control. Examples of algorithms that implement this kind of data type are Woot [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
TreeDoc [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Logoot [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and Logoot-Undo [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. None of the existing algorithms
handles the set data type.
      </p>
      <p>A set with operations insert(element) and delete(element) is not a CRDT.
A counter-example is presented in figure 2. The example illustrates how three
sites can receive operations op1, op2, op3 in different orders. Site 1 executes the
sequence [op1; op2; op3] while Site 3 executes the sequence [op1; op3; op2]. Since the
operations do not commute, eventual consistency is violated:
the set on site 1 is empty while the set on site 3 contains {x}.</p>
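      <p>This non-commutativity can be checked in a few lines of Python. The concrete assignment of op1, op2, op3 below is illustrative (figure 2 is not reproduced here), but it produces exactly the divergence described above:</p>

```python
# A plain set is not a CRDT: insert and delete do not commute.
# The assignment of op1, op2, op3 below is illustrative only.
def apply_ops(ops):
    s = set()
    for kind, e in ops:
        if kind == "ins":
            s.add(e)
        else:            # "del" is a no-op when e is absent
            s.discard(e)
    return s

op1, op2, op3 = ("ins", "x"), ("ins", "x"), ("del", "x")
site1 = apply_ops([op1, op2, op3])  # sequence [op1; op2; op3]
site3 = apply_ops([op1, op3, op2])  # sequence [op1; op3; op2]
assert site1 == set() and site3 == {"x"}  # replicas diverge
```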
      <p>C-Set Data Structure
We want to define a CRDT for the set data type, where each element of the set
corresponds to an RDF triple. We represent the set S as a set
of pairs (e : element, x : Z). We define four operations on this set:
ins(e : element), del(e : element), rins((e : element, k : Z)) and
rdel((e : element, k : Z)), detailed in the following section.</p>
      <p>C-Set Algorithms
The operation ins(e : element) is executed locally and immediately. It
generates and sends a remote insert operation rins((e : element, k : Z)) that is executed
remotely. In our model, the ins operation takes precedence over the del
operation: whenever an ins operation is executed, it compensates for the effect of all
previously received del operations.</p>
      <sec id="sec-3-1">
        <title>Algorithm 1: ins algorithm</title>
        <p>ins(e : element):
if (∃k ∈ Z : (e, k) ∈ S) then
  if (k ≤ 0) then
    S = (S/{(e, k)}) ∪ {(e, 1)};
    send(rins((e, |k| + 1)));
  else // k &gt; 0
    S = (S/{(e, k)}) ∪ {(e, k + 1)};
    send(rins((e, +1)));
  endif
else
  S = S ∪ {(e, 1)};
  send(rins((e, +1)));
endif</p>
      </sec>
      <sec id="sec-3-2">
        <title>Algorithm 2: rins algorithm</title>
        <p>rins((e : element, k : Z∗)):
if (∃i ∈ Z : (e, i) ∈ S) then
  S = (S/{(e, i)}) ∪ {(e, i + k)};
else
  S = S ∪ {(e, k)};
endif</p>
        <p>The operation del(e : element) is executed locally. It generates and sends the
remote delete operation rdel((e : element, k : Z∗)) that is executed remotely.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Algorithm 3: del algorithm</title>
        <p>del(e : element):
if (∃k ∈ Z : (e, k) ∈ S) then
  if (k ≤ 0) then
    S = (S/{(e, k)}) ∪ {(e, k − 1)};
    send(rdel((e, −1)));
  else // k &gt; 0
    S = S/{(e, k)};
    send(rdel((e, −k)));
  endif
endif</p>
      </sec>
      <sec id="sec-3-4">
        <title>Algorithm 4: rdel algorithm</title>
        <p>rdel((e : element, k : Z∗)):
if (∃i ∈ Z : (e, i) ∈ S) then
  S = (S/{(e, i)}) ∪ {(e, i + k)};
else
  S = S ∪ {(e, k)};
endif</p>
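        <p>The four algorithms can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: S is represented as a dictionary from elements to counters, send stands for broadcasting to the other sites, a removed pair is simply dropped from the dictionary, and the elements() read (returning elements with a positive counter) is our own assumption. Since rins and rdel both add k to the counter, a single remote method covers both:</p>

```python
# Sketch of C-Set (illustrative names). S maps each element to a counter in Z.
class CSet:
    def __init__(self, send=lambda op: None):
        self.S = {}       # element -> counter k in Z
        self.send = send  # broadcast callback; a no-op by default

    def ins(self, e):
        k = self.S.get(e)
        if k is None:                    # element absent: first insertion
            self.S[e] = 1
            self.send(("rins", e, 1))
        elif k > 0:                      # already visible: bump the counter
            self.S[e] = k + 1
            self.send(("rins", e, 1))
        else:                            # k is zero or negative: revive it
            self.S[e] = 1
            self.send(("rins", e, abs(k) + 1))

    def delete(self, e):                 # 'del' is a Python keyword
        k = self.S.get(e)
        if k is None:
            return
        if k > 0:                        # visible: drop the pair entirely
            self.S.pop(e)
            self.send(("rdel", e, -k))
        else:                            # already deleted: deepen the tombstone
            self.S[e] = k - 1
            self.send(("rdel", e, -1))

    def remote(self, op):                # rins and rdel: both add k in (Z, +)
        _, e, k = op
        self.S[e] = self.S.get(e, 0) + k

    def elements(self):                  # visible elements (our assumption)
        return {e for e, k in self.S.items() if k > 0}

# Remote operations commute: two sites applying them in any order converge.
a, b = CSet(), CSet()
ops = [("rins", "x", 1), ("rdel", "x", -1), ("rins", "x", 2)]
for op in ops:
    a.remote(op)
for op in reversed(ops):
    b.remote(op)
assert a.S == b.S
```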
        <p>The proof that C-Set preserves eventual consistency is straightforward. On
each site, C-Set generates a sequence of additions of elements of (Z, +).
As addition in (Z, +) is commutative, C-Set converges.</p>
        <p>Discussions
C-Set is a CRDT for sets that ensures eventual consistency. However,
as shown in figure 4, C-Set, like the Thomas write rule, relies on tombstones. This
means that a deleted element remains in the store. In the example in figure 4,
the final value {(x, −1)} is clearly a tombstone. Garbage collecting tombstones
is a challenging issue in a dynamic P2P network such as ours: it requires
each site to have knowledge about the state of all the other sites. Consequently,
it makes synchronisation dependent on the number of sites and requires
procedures for joining and leaving synchronisation groups.</p>
        <p>An interesting property of C-Set is that tombstones can be removed locally
on one site, without communication with other sites, when the counters associated
with set elements are equal to 0. Tombstones remain only if the same elements are
concurrently deleted on several sites, which is not the most frequent scenario.
Extensive experimentation is needed to validate the efficiency of this hypothesis.</p>
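        <p>Assuming the set is stored as a map from elements to counters, this local purge is a one-line filter (an illustrative sketch, not part of the C-Set definition above):</p>

```python
# Local tombstone purge (illustrative): a pair whose counter is exactly 0
# carries no information and can be dropped without contacting other sites.
def purge(counters):
    return {e: k for e, k in counters.items() if k != 0}

state = {"x": 0, "y": 2, "z": -1}   # "x" is a purgeable tombstone
assert purge(state) == {"y": 2, "z": -1}
```

        <p>Note that the pair (z, −1) survives the purge: a negative counter records concurrent deletions that a later insertion must still compensate for.</p>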
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and perspectives</title>
      <p>In this paper we presented C-Set, a CRDT for sets that ensures eventual
consistency. C-Set is designed to be integrated within a semantic store in order
to provide P2P synchronisation of autonomous semantic stores. C-Set is more
efficient than the Thomas write rule, especially in managing delete operations.</p>
      <p>
        Future work will integrate C-Set within an existing semantic store. We
will then be able to evaluate the overhead of C-Set using real-world semantic web
data such as DBpedia [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          , Tom Heath, and
          <string-name>
            <given-names>Tim</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked Data - The Story So Far</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems</source>
          ,
          <volume>4</volume>
          (
          <issue>2</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          ,
          <year>January 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          , Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Hellmann</surname>
          </string-name>
          .
          <article-title>DBpedia - A crystallization point for the Web of Data</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          ,
          <volume>7</volume>
          (
          <issue>3</issue>
          ):
          <fpage>154</fpage>
          -
          <lpage>165</lpage>
          ,
          <year>September 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Min</given-names>
            <surname>Cai</surname>
          </string-name>
          and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Frank</surname>
          </string-name>
          .
          <article-title>RDFPeers: a scalable distributed RDF repository based on a structured peer-to-peer network</article-title>
          .
          <source>In Proceedings of the 13th international conference on World Wide Web</source>
          , pages
          <fpage>650</fpage>
          -
          <lpage>657</lpage>
          . ACM,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>P.A.</given-names>
            <surname>Chirita</surname>
          </string-name>
          , Stratos Idreos, Manolis Koubarakis, and
          <string-name>
            <given-names>Wolfgang</given-names>
            <surname>Nejdl</surname>
          </string-name>
          .
          <article-title>Publish-subscribe for rdf-based p2p networks</article-title>
          .
          <source>The Semantic Web: Research and Applications</source>
          , pages
          <fpage>182</fpage>
          -
          <lpage>197</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>P.</given-names>
            <surname>Johnson</surname>
          </string-name>
          and R. Thomas. RFC677:
          <article-title>The maintenance of duplicate databases</article-title>
          .
          <year>1976</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Wolfgang</given-names>
            <surname>Nejdl</surname>
          </string-name>
          , Boris Wolf, Changtao Qu, and
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Decker</surname>
          </string-name>
          .
          <article-title>EDUTELLA: a P2P networking infrastructure based on RDF</article-title>
          .
          <source>Proceedings of the</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Gérald Oster, Pascal Urso, Pascal Molli, and
          <string-name>
            <given-names>Abdessamad</given-names>
            <surname>Imine</surname>
          </string-name>
          .
          <article-title>Data Consistency for P2P Collaborative Editing</article-title>
          . In Conference on Computer-Supported Cooperative Work,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Nuno</given-names>
            <surname>Preguica</surname>
          </string-name>
          , Joan Manuel Marques,
          <string-name>
            <given-names>Marc</given-names>
            <surname>Shapiro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Mihai</given-names>
            <surname>Letia</surname>
          </string-name>
          .
          <article-title>A Commutative Replicated Data Type for Cooperative Editing</article-title>
          .
          <source>2009 29th IEEE International Conference on Distributed Computing Systems</source>
          , pages
          <fpage>395</fpage>
          -
          <lpage>403</lpage>
          ,
          <year>June 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Yasushi</given-names>
            <surname>Saito</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marc</given-names>
            <surname>Shapiro</surname>
          </string-name>
          .
          <article-title>Optimistic replication</article-title>
          .
          <source>ACM Computing Surveys</source>
          ,
          <volume>37</volume>
          (
          <issue>1</issue>
          ):
          <fpage>42</fpage>
          -
          <lpage>81</lpage>
          ,
          <year>March 2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Tim</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Connolly</surname>
          </string-name>
          .
          <article-title>Delta: an ontology for the distribution of differences between RDF graphs</article-title>
          . http://www.w3.org/DesignIssues/Diff,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Giovanni</given-names>
            <surname>Tummarello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Morbidoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bachmann-Gmur</surname>
          </string-name>
          , and Orri Erling.
          <article-title>RDFSync: efficient remote synchronization of RDF models</article-title>
          .
          <source>Proceedings of the ISWC/ASWC2007</source>
          , pages
          <fpage>537</fpage>
          -
          <lpage>551</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Giovanni</given-names>
            <surname>Tummarello</surname>
          </string-name>
          , Christian Morbidoni, Joakim Petersson, Paolo Puliti, and
          <string-name>
            <given-names>F.</given-names>
            <surname>Piazza</surname>
          </string-name>
          .
          <article-title>RDFGrowth, a P2P annotation exchange algorithm for scalable Semantic Web applications</article-title>
          . The First International Workshop on Peer-to-
          <source>Peer Knowledge Management</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Stéphane Weiss, Pascal Urso, and
          <string-name>
            <given-names>Pascal</given-names>
            <surname>Molli</surname>
          </string-name>
          .
          <article-title>Logoot : a scalable optimistic replication algorithm for collaborative editing on p2p networks</article-title>
          .
          <source>In International Conference on Distributed Computing Systems (ICDCS)</source>
          . IEEE,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Stephane</given-names>
            <surname>Weiss</surname>
          </string-name>
          , Pascal Urso, and
          <string-name>
            <given-names>Pascal</given-names>
            <surname>Molli</surname>
          </string-name>
          .
          <article-title>Logoot-undo: Distributed collaborative editing system on p2p networks</article-title>
          .
          <source>IEEE Transactions on Parallel and Distributed Systems</source>
          ,
          <volume>21</volume>
          (
          <issue>8</issue>
          ),
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>