Concurrency Effects Over Variable-size Identifiers in
                Distributed Collaborative Editing

                  Brice Nédelec, Pascal Molli, Achour Mostefaoui, Emmanuel Desmontils
                                                      LINA, 2 rue de la Houssinière
                                                    BP92208, 44322 Nantes Cedex 03
                                                        first.last@univ-nantes.fr


ABSTRACT                                                                1.   INTRODUCTION
   Distributed collaborative editors such as Google Docs or                Distributed collaborative editors allow to distribute the
Etherpad allow to distribute the work across time, space                work across space, time and organizations. Some trending
and organizations. In this paper, we focus on distributed               editors such as Google Docs [5] use the Operational Trans-
collaborative editors based on the Conflict-free Replicated             form (OT) approach [14, 15]. However, an alternative based
Data Type approach (CRDT). CRDTs encompass a set of                     on Conflict-free Replicated Data Types (CRDTs) [12, 13]
well-known data types such as sets, graphs, sequences, etc.             exists. Compared to OT, CRDTs are more decentralized
CRDTs for sequences model a document as a set of ele-                   and scale better.
ments (character, line, paragraph, etc.) with unique iden-                 The CRDTs belong to the optimistic replication [10, 11]
tifiers, providing two commutative update operations: in-               approach. Therefore, replicas involved in the collaboration
sert and delete. The identifiers of elements can be either of           are guaranteed eventual convergence to an identical state.
fixed-size or variable-size. Recently, a strategy for assigning         To provide convergence in a replicated sequence, CRDTs use
variable-size identifiers called LSEQ has been proposed for             unique identifiers to link elements. When the sequence is a
CRDTs for sequences. LSEQ lowers the space complexity                   document, the elements can be characters, lines, paragraphs,
of variable-size identifiers CRDTs from linear to sub-linear.           etc.
While experiments show that it works locally, it fails to pro-             We distinguish two types of sequence CRDTs: (i) The
vide this bound with multiple users and latency. In this                tombstone CRDTs [1, 3, 7, 8, 9, 16, 18, 19] use fixed-size
paper, we propose h-LSEQ, an improvement of LSEQ that                   identifiers. However, the delete operations only mark and
preserves its space complexity among multiple collaborators,            hide elements to the users. Consequently, these deleted el-
regardless of the latency. Ultimately, this improvement al-             ements permanently affect performances. (ii) The variable-
lows to safely build distributed collaborative editors based            size identifiers CRDTs [8, 17] directly encode the order re-
on CRDTs. We validate our approach with simulations in-                 lation in the identifiers. However, their identifiers can grow
volving latency and multiple users.                                     linearly depending on the editing behaviour. Consequently,
                                                                        they remain unsafe in the distributed collaborative editing
Categories and Subject Descriptors                                      context.
                                                                           To overcome their respective limitations, these approaches
   I.7.1 [Document and Text Processing]: Document                       need an additional protocol [4, 9] related to garbage col-
and Text Editing—Document management; C.2.4 [Computer-                  lection mechanisms. Nevertheless, these protocols require
Communication Networks]: Distributed Systems—Dis-                       global knowledge over participants. In the general case, such
tributed applications; D.2.8 [Software Engineering]: Met-               knowledge remains prohibitively expensive in the context of
rics—Complexity measures                                                distributed networks subject to churn.
                                                                           Recently, LSEQ [6] allowed variable-size identifiers CRDTs
Keywords                                                                to get rid of this costly additional protocol by lowering their
                                                                        upper bound on space complexity from linear to sub-linear.
  Distributed Documents; Document Authoring Tools and
                                                                        Yet, LSEQ does not guarantee a safe allocation. Indeed, it
Systems; Distributed Collaborative Editing; Real-time Edit-
                                                                        uses multiple antagonist strategies which, without any coor-
ing; Conflict-free Replicated Data Types
                                                                        dination between collaborators, leads to an quadratic growth
                                                                        of identifiers.
                                                                           The contributions of this paper are: (i) experiments that
                                                                        highlight the negative effect of multiple users edition on the
                                                                        size of LSEQ identifiers (ii) experiments that highlight that
                                                                        an increasing latency does not have a negative impact on
                                                                        the size of LSEQ identifiers. (iii) an improvement of LSEQ
                                                                        called h-LSEQ that extends the single user sub-linear upper
This work is licensed under the Creative Commons Attribution-
                                                                        bound on space complexity to any number of users, regard-
ShareAlike 3.0 Unported License (CC BY-SA 3.0). To view a copy
of the license, visit http://creativecommons.org/licenses/by-sa/3.0/.   less of the latency, and without any additional cost.
                                                                           Ultimately, h-LSEQ allows distributed collaborative edi-
DChanges 2013, September 10th, 2013, Florence, Italy.                   tors to safely use variable-size identifiers CRDTs.
ceur-ws.org Volume 1008, http://ceur-ws.org/Vol-1008/paper5.pdf.
   The rest of the paper is organized as follow: Section 2 de-      previously mandatory garbage collecting protocol. Three
tails variable-size identifiers CRDTs for sequences, the allo-      aspects define LSEQ:
cation strategy LSEQ, and highlights the motivation of this            – base doubling: regarding the underlying tree model,
paper. Section 3 presents h-LSEQ a combination of LSEQ                   each node can have twice more children than its par-
and a hash-based choice strategy to overcome the LSEQ                    ent. It follows the properties of exponential trees [2].
limitations. Section 4 shows the results of experiments over           – two antagonist allocation strategies: boundary+ and
LSEQ and h-LSEQ considering latency and multiple users.                  boundary– that are designed for end-edition and front-
Finally, Section 5 reviews the related works.                            edition respectively.
                                                                       – random strategy choice: it randomly assigns an allo-
                                                                         cation strategy to each depth of the tree. It follows
2.   BACKGROUND                                                          the intuition: since we have no prior knowledge of the
   Conflict-free Replicated Data Types belong to the opti-               editing behaviour, the strategy choice should not favor
mistic replication approach [10, 11]. In the context of dis-             any editing behaviour. Consequently, the frequencies of
tributed collaborative editing, the replicated data is a doc-            appearance of each allocation strategy have to be equal.
ument, where: (1) each insert/delete operation is prepared             Despite the fact that numerous experiments have been
locally and broadcast, (2) each remote replica receives and         performed using LSEQ including real documents extracted
integrates the changes, (3) all involved replicas eventually        from Wikipedia, they did not include any concurrency. Al-
converge to an identical state.                                     though, Wikipedia’s revisions already present a serialization
   To provide the convergence property, CRDTs for sequences         of the document without explicit concurrency, experiments
model a document as a set of couples helt, idi where id ∈           with this type of documents are of great importance [1].
(I, <id ), <id being a strict and dense total order over I,
and elt any element, e.g., a character, a line. Two commu-                                        #insert operations
tative operations allow to update the document:                                           10     100    200    500   1000
   – insert(p ∈ I, elt, q ∈ I) that generates an identifier idelt    1 user (bit/id)      6.5    26.8  32.7    56.0   64.2
     with p < idelt < q and adds helt, idelt i to the document.      10 users (bit/id)    9.5   125.8 377.0 1962.1 5468.0
   – delete(idelt ) that removes helt, idelt i from the document.
   The variable-size identifiers CRDTs [8, 17] define identi-       Table 1: Average bit-length of LSEQ identifiers for single
fiers as a series of numbers that can designate paths in a          and multiple user(s) and the generation of documents of 10,
tree. An allocation strategy is in charge of choosing these         100, 200, 500, and 1000 lines.
sequences of numbers. It aims to keep these identifiers as
small as possible for the sake of performance.
                                                                       Table 1 shows the average bit-length of the identifiers as-
                                                                    signed by LSEQ on different size synthetic documents cre-
                                                                    ated either by a single user or by a group of 10 collaborators.
                  0                          99
                        20   21   50                                Both documents were edited at the end. We observe that
                                                                    while the identifiers resulting from a single author are sub-
                                                                    linearly upper-bounded, the bit-length of identifiers gener-
     Begin                                        (1)   End         ated by 10 users are quadratically increasing.

                                                                      Definition 1 (Problem statement). Let D be a doc-
             10   11
                                                                    ument on which n insert operations have been performed. Let
                                                                    I(D) = {id|( , id) ∈ D}. The function alloc(idp ,idq ) should
                                                                    provide identifiers such as:
                  (2)
                                                                                          P log2 (id)
                                                                                                 n
                                                                                                      < O(n)
                                                                                         id∈I
Figure 1: Underlying 100-ary tree model of variable-size
identifiers CRDTs. It contains 5 identifiers. 3 identifiers at         Definition 1 from [6] states the alloc function property: a
depth-1: [20], [21], [50] and 2 identifiers at depth-2: [20.10],    sub-linear upper-bound on the average bit-length of identi-
[20.11]. For the sake of simplicity the elements linked to          fiers. By excluding any reference to the two phases of opti-
identifiers are not displayed.                                      mistic replication operations, it encompasses both the local
                                                                    generation of identifiers and the integration of remote oper-
   Figure 1 illustrates the underlying model of variable-size       ations. According to Table 1, LSEQ only partially answers
CRDTs. If an insertion operation is performed at (1), the           the problem statement.
allocation strategy chooses an identifier [X] with 50 < X <            The identified problem concerns the preservation of the
99. When the insert operation is performed at (2), there is         space complexity of LSEQ from single user to multiple users
no room for another identifier, therefore, the new identifier       edition. Using multiple antagonist strategies without any co-
will be a sequence of three numbers: [20.10.X], where 0 <           ordination between collaborators leads to a quadratic growth
X < 100.                                                            of identifiers. Consequently, providing a mechanism of agree-
   These identifiers can grow linearly depending on the po-         ment in the choice of strategies would greatly improve LSEQ
sition of insert operations.                                        as well as any composition of allocation strategies. Further-
   Recently, LSEQ [6] lowered the space complexity of these         more, such improvement would make variable-size identifiers
CRDTs from linear to sub-linear without favouring any edit-         CRDTs actually usable in the distributed collaborative edit-
ing behaviour. This improvement aims to get rid of the              ing context.
3.     H-LSEQ                                                         Algorithm 1 h-LSEQ allocation function
   The h-LSEQ allocation strategy is mainly based on LSEQ,            1: let boundary := 10      Any constant
they differ in the strategy choice component. Using LSEQ,             2: let seed := 123456789       Init with document
each replica involved in the collaboration makes independent          3: let S := {h0, boundary+i, h1, boundary−i};
random choices. However, Section 2 showed that such strat-             4:       map<id, allocation strategy>
egy fails to provide sub-linearly upper-bounded identifiers in
                                                                       5:
collaboration involving multiple users.                                6: function alloc(p, q ∈ I)
   The solution is to reach a global agreement over replicas           7:    let depth := 0; interval := 0;
on which is the allocation strategy used at each depth of the          8:    while (interval < 1) do          Not enough for 1 insert
underlying tree model. Since LSEQ aims to get rid of proto-            9:        depth + +;
cols related to garbage collecting mechanisms, any solution           10:         interval := pref ix(q, depth) − pref ix(p, depth) − 1;
that requires additional communication is inconceivable.              11:     end while
                                                                      12:     let step := min(boundary, interval);          Process the
   We propose to use a hash strategy choice component which              maximum step to stay between p and q
have the same local behaviour than the original random                13:     let idStrat := h(depth);
strategy choice, but it also provides the required implicit
                                                                      14:          Call the hash function
global agreement. Indeed, instead of synchronizing the set
                                                                      15:     let id := S.get(idStrat).invoke(p, q, depth, step);
of users, we provide an a priori agreement by the mean of
                                                                      16:          Call the allocation strategy according to hash
a hash function. This agreement avoids the possibility of             17:    return id;
antagonist choices which would have led to a bad global al-           18: end function
location of identifiers. Furthermore, it does not introduce           19:
any additional cost to LSEQ.                                          20: function h ( depth ∈ N∗ )
   Algorithm 1 details the h-LSEQ allocation strategy. The            21:     return Random(seed ∗ depth).nextInt(0, 1);
changes over LSEQ are simply highlighted. This section ex-            22: end function
plains this algorithm. Extracted from [6], the prefix function        23:
return a copy of the identifier id, i.e., a series of numbers         24: function boundary+(p, q ∈ I, depth, step ∈ N∗ )
truncated at depth or increased until depth. The function             25:    let addV al := RandInt(0, step) + 1;
carefully encodes each number of this copy in a base depend-          26:    return pref ix(p, depth) + addV al;
ing on its depth in order to directly apply the arithmetic            27: end function
                                                                      28:
operations of line 26 and line 31. For instance, assuming             29: function boundary–(p, q ∈ I, depth, step ∈ N∗ )
that the departure base is 25 , a call to pref ix([13.42.37], 2)      30:    let subV al := RandInt(0, step) + 1;
encodes [13.42] using 5 + 6 bits.                                     31:    return pref ix(q, depth) − subV al;
   The functions boundary+ and boundary− are two allo-                32: end function
cation strategies, note that they do not strictly respect the         33:
alloc function signature in order to factorize the computa-           34: function prefix(id ∈ I, depth ∈ N∗ )
                                                                      35:    let idCopy := [ ];
tion of depth and step.                                               36:    for (cpt := 1 to depth) do
                                                                      37:        if (cpt < id.size) then     Copy the value
3.1      Locally                                                      38:            idCopy := idCopy.append(id.at(cpt));
   The strategy choice function is surjective and its signature       39:        else      Add 0 encoded in the right base
                                                                      40:            idCopy := idCopy.append(0base(cpt) );
is: h(depth ∈ N∗ ) : N. The returned value corresponds to an
                                                                      41:        end if
allocation strategy unique identifier. The algorithm of the           42:    end for
allocation strategy h-LSEQ follows 3 steps:                           43:    return idCopy;
     1. lines 8-11: computes the depth of the future identifiers      44: end function
        to allocate
     2. line 13: calls the strategy choice function h using the
        depth                                                         3.2    Remotely
                                                                        Each collaborator generates the same hash function thanks
     3. line 15: calls the allocation strategy using the identifier
                                                                      to the shared seed. Also, all collaborators use the same map-
        returned by h
                                                                      ping id → allocation strategy (cf. line 3). Consequently, all
   The strategy choice function returns strategy identifiers          hash functions map the same depth to the same allocation
following a uniform law. Thus, frequencies of appearance              strategies. The idea is to reach a consensus on which strate-
of each strategy are equal and does not favor any editing             gies to employ in order to avoid the waste of identifiers’ space
behaviour. Furthermore, the unpredictability due to ran-              due to the choice of antagonist strategies.
domness preserves the model from intentional and malicious              Figure 2 highlights the differences between the original
attacks.                                                              random strategy choice and our hash strategy choice. Both
   The hash function fulfills the requirement of the strategy         cases present two collaborators inserting 2 elements one af-
choice function h. However, contrarily to the random strat-           ter the other. The expected result is the sequence “abcd”.
egy choice component, it requires an initialization. Indeed,          Hence, Collaborator 1 generates the “a” and “c” and Collab-
the creator of the document must share a hidden seed (cf.             orator 2 generates the “b” and “d”. In both cases they draw
line 2) within the document that each collaborator will use           the same numbers. However, in the case of random strategy
to generate the hash function specific to this document. In           choice, one collaborator randomly chooses boundary+ twice
Algorithm 1 at line 21, we simply use the common random               while the other randomly chooses boundary– twice. When
function initialized with the seed and the depth.                     the editing behaviour is monotonic, the size of identifiers
                                    LSEQ h-LSEQ                                                    100
                                                                                                               revision
                                                                                                    90
                                                                                                    80
              9            30                           9   10   17   18                            70
                                                                                                    60


                                                                                   n rev
     0                                   31     0                            31
                                b                                                                   50
Begin     a       20 ids             End      Begin a       b    c         d End                    40
                                                                                                    30
                            7       63
                                                                                                    20
                                                                                                    10
                            c            d                                                           0
                                55 ids                                                                   0                20   40            60   80   100
                                                                  Collaborator 1
                                                                  Collaborator 2                   250
                                                                                                             LSEQ+rand
                                                                                                             LSEQ+hash
                                                                                                   200
Figure 2: Representation of the underlying exponential tree


                                                                                   id bit-length
model of LSEQ and h-LSEQ. Two collaborators insert two                                             150

elements at the end of the document in order to get the                                            100

sequence “abcd”. On the random side, the collaborators em-                                         50
ploy antagonist strategies. On the hash side, the collabora-
                                                                                                    0
tors use the same allocation strategy. Both sides draw the                                               0                20   40            60   80   100
                                                                                                                                    n line
same random numbers.
                                                                                   Figure 3: Experiments on a synthetic document of 100 lines
                                                                                   edited at the end by 10 collaborators with 10 operations
quickly grows. On the opposite, the collaborators using h-                         per user. The top figure shows the revision number of the
LSEQ implicitly agree on using boundary+ at depth-1, con-                          operation. The bottom figure shows the bit-length of each
sequently the allocation follows the properties of the one                         identifier allocated to each line. On average, the bit-length
user edition presented in [6].                                                     of LSEQ and h-LSEQ identifiers are 94.7 and 39.1 bit/id
  The next section aims to corroborate our assumptions by                          respectively.
performing experiments on different setups with varying the
number of collaborators and the latency of the network.
                                                                                   (hash). Both of these setups use boundary+ and boundary–
                                                                                   and have the same variable values boundary = 10 and base =
4.       EXPERIMENTS                                                               24+depth . A set of 10 collaborators produces a document of
   This section is composed of two parts. First, the experi-                       100 lines by performing 10 insert operations each.
mentation focuses on the influence of the number of collab-                           Results: Figure 3 shows on the top part the spectrum of
orators on the size of LSEQ and h-LSEQ identifiers. A set                          the synthetic document, i.e., the revision date of each line.
of synthetic collaborators generates a document by succes-                         Since the bars are getting taller when they are closer of the
sively performing insert operations at the end. This experi-                       end of the document, it indicates that the users edited at
ment aims to show the behaviour of the two strategy choices,                       the end monotonically. On the second part of the Figure 3,
and compare the results with the expected sub-linear space                         it shows the identifier bit-length associated to each line. We
complexity.                                                                        observe that the identifiers of rand setup quickly increase
   The second part of experiments consists in highlighting                         while the hash identifiers remain similar to the ones in the
the effect of latency on a document edited by multiple users.                      experiment made for one user in [6]. Consequently, the hash
Once again, it aims to compare LSEQ and h-LSEQ and                                 setup is the most suitable setup in the context of distributed
to show the impact of concurrency on the average size of                           collaborative editing.
identifiers in the document.                                                          Reasons: both setups use boundary+ and boundary– al-
   The experiments focus on the digit bit-length of generated                      location strategies. However, each collaborator in the rand
identifiers. Indeed, all variable-size CRDTs rely on source                        setup makes independent choices of allocation strategies when
and clock to guarantee the unicity of identifiers, neverthe-                       required. Thus, if a collaborator chooses a particular strat-
less, the space complexity mainly depends on the digit choice                      egy, and another user chooses the antagonist strategy, then
made by the allocation strategy.                                                   a large number of identifiers is wasted when their operations
   To perform these experiments, we implemented a simu-                            are delivered to each other. On the other hand, when the
lation framework called HumbleSimulator. The sources are                           same hash function spread over all collaborators generates
available on the Github platform under the terms of the                            the same strategy choices, it keeps the random behaviour lo-
GPL licence 1 .                                                                    cally and also makes an implicit agreement on which strategy
                                                                                   to employ at any given depth.
4.1      Multiple users experiment
   Objective: show that the random strategy choice with-                           4.2                       Latency experiment
out taking into account the strategies employed by other                              Objective: show that increasing the latency, i.e., the
collaborators leads to a quick growth in the size of identi-                       time between the generation and the delivery of an oper-
fiers. On the opposite, when all collaborators uses a common                       ation, does not imply a growth in the size of identifiers. On
allocation strategy at a given depth, the space complexity                         the contrary, when the latency increases, the average size of
remains sub-linearly upper-bounded.                                                allocated identifiers decreases. This behaviour is expected
   Description: we evaluate LSEQ and h-LSEQ strategies                             for LSEQ and h-LSEQ strategies.
on a group of 10 collaborators. Thus, the setups are (i) a ran-                       Description: we simulate the production of a 100 lines
dom strategy choice (rand) and (ii) a hash strategy choice                         document by a set of 10 collaborators. Each one of these col-
     1. https://github.com/Chat-Wane/HumbleSimulator                               laborators performs 10 insert operations at the end of the
                160
                                                                        LSEQ+rand
                                                                                          allocation strategy permits to reach this agreement. Conse-
                                                                        LSEQ+hash
                                                                                          quently, the improvement of LSEQ called h-LSEQ can be
                140
                                                                                          safely used in distributed collaborative editing. The latency
                                                                                          experiment shows that increasing the latency parameter only
                120
                                                                                          results in sharing the identifier space, and consequently low-
                100
                                                                                          ers the size of identifiers. As a corollary, considering an im-
                                                                                          mediate delivery of operations constitutes an upper-bound
id bit-length


                80                                                                        on the behaviour of the overall system for any latency.

                60
                                                                                          5.   RELATED WORK
                40                                                                           Current distributed collaborative editors use optimistic
                                                                                          replication [10, 11] to provide availability of data at a con-
                20                                                                        sistency cost. Optimistic replication designs operations in
                                                                                          two phases. First a replica prepares the result of an opera-
                 0
                      0       100     200   300             400   500       600     700
                                                                                          tion and then broadcasts it to other collaborators. Second,
                                                  latency                                 remote replicas integrate the newly received data.
                                                                                             Operational Transform approach (OT) and Conflict-free
Figure 4: Experiments of latency effects over the average                                 Replicated Data Type approach (CRDT) belong to the op-
bit-length of identifiers. The x-axis shows the latency varia-                            timistic class of replication. While CRDTs share the com-
tion in number of rounds, i.e., the average time between the                              putational cost of operations between the local and remote
send and the receive of an operation. The y-axis shows the                                parts of the replication protocol, the OT approaches prefer
average bit-length of identifiers. A group of 10 collaborators                            to offer free local generation and pay an additional cost on
generates a synthetic document of 100 lines by performing                                 the integration of remote operations. Since the number of
10 operations each. Two configurations of LSEQ: the orig-                                 remote operations quadratically grows when the number of
inal random strategy choice and the hash strategy choice                                  collaborators increases, the CRDTs scale better than OT
(h-LSEQ).                                                                                 approaches.
                                                                                             We distinguish two classes of CRDTs for sequences. The
                                                                                          tombstone based CRDT approaches use death certificate on
                                                                                          removed elements. Henceforth, these elements are hidden
document. The users generate operations following the Pois-                               to the users, but still remain in the underlying model and
son distribution of parameter λ = 100, i.e., users more likely                            undermine the performance. For instance, the number of
generate an operation after 100 rounds (arbitrary unit) from                              tombstones in popular or controversial pages in Wikipedia
their previous operation. Measurements are done at a la-                                  pages would be very high due to vandalism and recoveries.
tency of 5, 10, 20, 40, 80, 160, 320, 640 rounds. They concern                            The representatives of this class of CRDTs are WOOT [7],
the average bit-length of identifiers in the document. We                                 WOOTO [16], WOOTH [1], CT [3], Treedoc [8], PPS [18],
evaluate two setups: the random strategy choice (rand) and                                RGA [9] and [19].
the hash strategy choice (hash). Both setups use the alloca-                                 To get rid of these marked elements, the tombstone based
tion strategies boundary+ and boundary− with boundary =                                   CRDTs require additional protocols related to garbage col-
10 and departure base = 24+depth .                                                        lecting mechanisms. In [4, 9] they propose approaches to
   Results: Figure 4 shows the results of the latency ex-                                 safely purge the tombstones. However, it requires a global
periments. As expected, the two setups allocate identifiers                               knowledge over collaborators. In [9], they describe an ap-
with an average bit-length that quickly decreases when the                                proach that obtains the global knowledge by maintaining a
latency increases. Therefore, the latency does not badly                                  vector clock with one entry per participant although it is
affect LSEQ and h-LSEQ allocation strategies. Also, it con-                               too costly in large distributed network where collaborators
firms observations made in the previous experiment: hash                                  can enter and leave the system. In [4], they describe the
performs better than rand, especially under low latency.                                  core nebula approach that permits to reach the consensus
   Reasons: increasing the latency delays the arrival of op-                              over participants. However, this approach constrains the
erations to the remote collaborators. Thus, each collabora-                               network topology to make the consensus reachable and uses
tor works locally until operations generated by other users                               a costly catch up protocol.
are delivered. Considering the extreme case with the maxi-                                   The alternative to tombstones is the variable-size identi-
mum latency, each collaborator generates his 10 insertions,                               fiers CRDT approach. These CRDTs directly encode the
and then merges the others’ 90 operations. Consequently,                                  total order of elements into their unique identifiers. Thus,
the 100 operations share the same space, i.e., the digit part                             they do not require tombstones. However, the identifiers
of identifiers can be the same for multiple elements, and the                             can grow. In Logoot [17] and Treedoc [8], identifiers have
total order is maintained by the source and clock.                                        a linear space complexity compared to the number of in-
                                                                                          sert operations. Furthermore, their size heavily depends on
4.3                       Synthesis                                                       the position of insertions. For example, performing inser-
   The experiments evaluated the effect of concurrency on                                 tions repetitively at the beginning of a document leads to
the average bit-length of identifiers by varying the number                               identifiers with a size equals to the number of insert opera-
of collaborators and the latency. They showed that an im-                                 tions. Consequently, such approaches require an additional
plicit and a priori agreement on allocation strategy choice                               re-balance protocol, like tombstone CRDTs. However, their
is required to maintain the sub-linear upper-bound on space                               underlying allocation strategies can be replaced by LSEQ [6]
complexity. Employing the same hash function to choose the                                that provides a sub-linear upper-bound on the space com-
plexity regardless of the editing behaviour.                         Symposium on Document Engineering, pages 103–112,
   The LSEQ approach uses two antagonist allocation strate-          Mountain View, California, États-Unis, Sept. 2011.
gies. Although paper [1] argues about the importance of con-
currency in CRDTs experiments, the original paper of LSEQ         [2] A. Andersson and M. Thorup. Dynamic ordered sets
does not consider this aspect. In this paper, we showed               with exponential search trees. J. ACM, 54(3), June
on synthetic documents that its original configuration does           2007.
not scale in the number of users. The added value of syn-         [3] V. Grishchenko. Deep hypertext with embedded
thetic documents over real collaborative documents is the             revision control implemented in regular expressions. In
total flexibility on their parameters, especially on the edit-        Proceedings of the 6th International Symposium on
ing behaviour. Indeed, the number of available real collab-           Wikis and Open Collaboration, WikiSym ’10, pages
orative documents (including concurrency) is very limited.            3:1–3:10, New York, NY, USA, 2010. ACM.
Moreover, such experiments are costly to set up, and often
biased by the nature of the task to perform.                      [4] M. Letia, N. Preguiça, and M. Shapiro. Crdts:
   In this paper, we propose to replace the random strategy           Consistency without concurrency control. Arxiv
choice component of LSEQ by a hash-based choice strategy              preprint arXiv:0907.0929, 2009.
to reach an implicit consensus on which strategy to employ.
This configuration called h-LSEQ, does not require addi-          [5] D. A. Nichols, P. Curtis, M. Dixon, and J. Lamping.
tional computation and extends the sub-linear complexity              High-latency, low-bandwidth windowing in the jupiter
bound of LSEQ from a single user to multiple collaborators            collaboration system. In Proceedings of the 8th annual
regardless of the latency. Thus, h-LSEQ constitutes a safe            ACM symposium on User interface and software
allocation strategy for sequences CRDTs, even in concurrent           technology, pages 111–120. ACM, 1995.
cases.                                                            [6] B. Nédelec, P. Molli, A. Mostefaoui, and
                                                                      E. Desmontils. LSEQ: an Adaptive Structure for
                                                                      Sequences in Distributed Collaborative Editing. In
6.   CONCLUSION                                                       ACM, editor, 13th ACM Symposium on Document
   In this paper, we presented an improvement of the alloca-          Engineering, Sept. 2013.
tion strategy LSEQ called h-LSEQ. Contrarily to the origi-
nal approach, our configuration supports multiple users re-       [7] G. Oster, P. Urso, P. Molli, and A. Imine. Data
gardless of the latency. Indeed, replacing the random strat-          consistency for p2p collaborative editing. In
egy choice by the hash-based choice strategy allows to reach          Proceedings of the 2006 20th anniversary conference
a global sub-linear upper-bound on space complexity. As a             on Computer supported cooperative work, pages
consequence, distributed collaborative editors can safely use         259–268. ACM, 2006.
h-LSEQ and show better scalability than current trending          [8] N. Preguiça, J. M. Marquès, M. Shapiro, and
editors such as Google Docs, Etherpad, etc.                           M. Letia. A commutative replicated data type for
   The hash-based choice strategy is a common function over           cooperative editing. In Distributed Computing
participants that maps a depth in the underlying tree to an           Systems, 2009. ICDCS’09. 29th IEEE International
allocation strategy identifier. This surjective function ini-         Conference on, pages 395–403. Ieee, 2009.
tialized with a shared seed allows to reach an agreement
on strategies with no additional synchronization cost. Con-       [9] H.-G. Roh, M. Jeon, J.-S. Kim, and J. Lee. Replicated
sequently, there is no loss in the identifiers’ space due to          abstract data types: Building blocks for collaborative
antagonist allocation strategies.                                     applications. Journal of Parallel and Distributed
   This paper highlights the importance of multiple users             Computing, 71(3):354–368, 2011.
analysis in CRDTs for sequences, particularly in the case of
multiple underlying allocation strategies. We also showed        [10] Y. Saito and M. Shapiro. Replication: Optimistic
that latency does not badly affect the size of identifiers in         Approaches. technical report, 2002.
variable-size identifiers CRDTs. However, we performed ex-       [11] Y. Saito and M. Shapiro. Optimistic replication. ACM
periments with synthetic documents in order to show the               Comput. Surv., 37(1):42–81, Mar. 2005.
general behaviour of the allocation without considering se-
mantically consistent documents.                                 [12] M. Shapiro, N. Preguiça, C. Baquero, and
   Future works include the formal proof of the sub-linear            M. Zawirski. A comprehensive study of Convergent
upper-bound on space complexity of h-LSEQ. A probability              and Commutative Replicated Data Types. Rapport de
analysis of the worst-case scenario is mandatory to show that         recherche RR-7506, INRIA, Jan. 2011.
it seldom happens. We plan to experiment h-LSEQ on a cor-
pus of real documents including multiple collaborators and       [13] M. Shapiro, N. Preguiça, C. Baquero, and
concurrency. Finally, we aim to build a real distributed col-         M. Zawirski. Conflict-free replicated data types.
laborative editor based on a variable-size identifiers CRDT           Stabilization, Safety, and Security of Distributed
using h-LSEQ as allocation strategy.                                  Systems, pages 386–400, 2011.
                                                                 [14] C. Sun and C. Ellis. Operational transformation in
                                                                      real-time group editors: issues, algorithms, and
References                                                            achievements. In Proceedings of the 1998 ACM
 [1] M. Ahmed-Nacer, C.-L. Ignat, G. Oster, H.-G. Roh,                conference on Computer supported cooperative work,
     and P. Urso. Evaluating CRDTs for Real-time                      CSCW ’98, pages 59–68, New York, NY, USA, 1998.
     Document Editing. In ACM, editor, 11th ACM                       ACM.
[15] C. Sun, X. Jia, Y. Zhang, Y. Yang, and D. Chen.
     Achieving convergence, causality preservation, and
     intention preservation in real-time cooperative editing
     systems. ACM Transactions on Computer-Human
     Interaction (TOCHI), 5(1):63–108, 1998.
[16] S. Weiss, P. Urso, and P. Molli. Wooki: A p2p
     wiki-based collaborative writing tool. In
     B. Benatallah, F. Casati, D. Georgakopoulos,
     C. Bartolini, W. Sadiq, and C. Godart, editors, Web
     Information Systems Engineering – WISE 2007,
     volume 4831 of Lecture Notes in Computer Science,
     pages 503–512. Springer Berlin Heidelberg, 2007.
[17] S. Weiss, P. Urso, and P. Molli. Logoot: a scalable
     optimistic replication algorithm for collaborative
     editing on p2p networks. In Distributed Computing
     Systems, 2009. ICDCS’09. 29th IEEE International
     Conference on, pages 404–412. IEEE, 2009.
[18] Q. Wu, C. Pu, and J. Ferreira. A partial persistent
     data structure to support consistency in real-time
     collaborative editing. In Data Engineering (ICDE),
     2010 IEEE 26th International Conference on, pages
     776–779, 2010.
[19] W. Yu. A string-wise crdt for group editing. In
     Proceedings of the 17th ACM international conference
     on Supporting group work, GROUP ’12, pages
     141–144, New York, NY, USA, 2012. ACM.