Time-limited data storing in peer to peer networks

Konstantin Scherbakov
SPbSU
Saint Petersburg, Russia
konstantin.scherbakov@gmail.com

ABSTRACT

Different techniques of using peer-to-peer networks are mentioned. The process of data distribution in a peer-to-peer network on a single node's initiative, for the purpose of later downloading by that node's owner, is described. A general model of a special peer-to-peer network (UDD network) designed for single-node-initiated data distribution is introduced, together with its main characteristics and algorithms. Some ways of extending existing peer-to-peer networks to comply with the UDD network specification are mentioned. Experimental results of single-node-initiated data distribution in one of the existing peer-to-peer networks (with our improvements applied) are analyzed.

Introduction and related work

Pure and hybrid peer-to-peer (p2p) networks are now widely used. There are at least two main ways to use them:

    -    for grid/cloud computing and storage (Amazon S3, Nirvanix SDN, Gridnut, Wuala etc.) [1,5,9,13,19,25]

    -    for data exchange between interconnected nodes (BitTorrent, Exeem, ed2k, Kademlia, Gnutella 1/2 etc.) [2,4,6,7,21,22]

If a p2p network is designed for data exchange, new nodes usually connect to it to retrieve files they are interested in, or to share files with other nodes. But there is at least one more way to use this type of p2p network: a node can connect to the network to store its own data/files on other nodes, so that the node's owner can later retrieve this data from any other place. Such data may be uninteresting to the other nodes of the network and may be encrypted, so let us call this type of data u-data. This usage is rare because existing p2p networks lack data access control, data availability control and fast u-data distribution mechanisms.

Our main goal is to introduce the architecture of a prototype of such a p2p network, together with search request routing, data storing and indexing protocols for it, that allow any node to store its u-data securely over the network for a fixed time interval, grant the node's owner the ability to retrieve this data from any other location during that interval, and control the availability of this data over the network during its limited storing period. A particular node's use of these mechanisms should depend on a coefficient of that node's usefulness for the whole network, so that nodes that are more useful for the network have the opportunity to use these mechanisms more often and more fully.

UDD P2P network model

Let's assume that a person or group of persons has some u-data that is very useful to them, but not interesting to the other participants of our future u-data distribution (UDD) p2p network. This person wants to have access to the data from any place where he can connect to the internet (and to our UDD p2p network). He also wants the ability to extend the data's storing time, and he expects the network to use some mechanism that, with high probability, prevents his data from being deleted from all network nodes during its limited storing time.

What properties do we expect our UDD p2p network to have? Let's assume that our network has the properties listed below:

    1.   Network nodes have their own limited long-term memory.

    2.   Network nodes have access to some computing resources (for example, their own CPU) to calculate data content and keyword hashes.

    3.   The network has mechanisms for storing u-data and for distributing and deleting it effectively.

    4.   The network provides a mechanism that prevents data disappearance with high probability.

    5.   The network has mechanisms for effective data search request routing and metadata storing [3,8,10-12,14,15,17,20,23,24,26,27].

    6.   Network nodes may connect to and disconnect from the network at any time, so the network provides some mechanism for estimating the node connect/disconnect rate.

This is an ideal network for our purposes. Real p2p networks (especially file sharing ones) often have 3-4 of these properties, but we won't examine them at this point. It is only important to note that some of these networks can be extended rather easily to obtain all the properties of our ideal p2p u-data storing network [16]. Let us now introduce and describe the method of u-data distribution in a UDD network initiated by a specific node.
In a usual p2p network, the data distribution process has the following general stages:

    -    Data hashing (data content hashing, data description keyword extraction and hashing etc.)

    -    Data announcing (more precisely, this stage should be called "data hashes announcing")

    -    Waiting for external requests for the announced data from other network nodes

    -    Uploading the data to the external network nodes interested in it

This is a good distribution process for data that may be interesting and useful to other peers in a usual file sharing p2p network, where the waiting-for-external-requests stage does not last indefinitely for the data being distributed. But we assumed that our data is uninteresting to other nodes. Furthermore, our data may be encrypted by its owner to prevent other node owners from making effective use of it. Therefore the third stage of the usual distribution process would last indefinitely for u-data in a usual file sharing network. So how can we distribute data effectively in our ideal network? Nodes have limited long-term memory, so if we try to send our data to all accessible nodes, some of them may reject it for lack of free disc space. Our first goal is to share the u-data between N nodes (including the initial one), where N is defined by the data and by the owner of the initial node n_init (N should be limited according to n_init's usefulness coefficient, mentioned above). Let's assume n_init has M neighbors. If M > N, we ask them about their free space and ask them to reserve sizeof(u-data) bytes for our u-data, which takes M queries and yields M_ok satisfying answers. If M_ok < N, we ask these nodes for their neighbor lists and repeat the first step. If M < N, we request the neighbor lists right away. Once enough nodes have reserved space, we send them the u-data together with its content hash, search keywords, etc. This process is briefly described in the code listing below.

Listing 1:

$neighbors_arr = get_neighbors(); // neighbor list
$m_ok = 0;
foreach ($neighbors_arr as $neighbor_num => $neighbor_value)
{
    if (reservespacerequest(&$neighbors_arr[$neighbor_num], &$m_ok,
                            $space, $datahash) == 1)
        $neighbors_arr[$neighbor_num]['reserved'] = 1;
    $neighbors_arr[$neighbor_num]['checked'] = 1;
}
$cur_neighbor = get_firstkey(&$neighbors_arr);
while ($m_ok < $n)
{
    addneighbors_callback(&$neighbors_arr, &$m_ok, $cur_neighbor,
                          'reservespacerequest');
    if (($cur_neighbor = get_nextkey(&$neighbors_arr,
                                     $cur_neighbor)) === false) break;
}
$invited = send_download_invitation(&$neighbors_arr);
/* some code deleted from listing */
$resultnodes = send_data_info(&$neighbors_arr);

Which nodes should be allowed to use this distribution mechanism, and to what extent? In usual file sharing networks a node's usefulness coefficient is commonly computed as fair_coef_usf = node_upl / node_dwn if node_dwn > 0, and infinity otherwise. When a node downloads some data it needs, its node_dwn value increases, and when it uploads some data to other nodes, its node_upl value increases. These values never decrease. For our extended network, we can change this formula to

fair_coef_usf = (u_usual + deltads*d_special) / (d_usual + deltaus*u_special),
where u_usual is the outgoing traffic value for usual data, d_usual is the incoming traffic value for usual data, u_special and d_special are the outgoing and incoming traffic values for special data (distributed via our u-data distribution mechanism), and deltads, deltaus are special weight coefficients introduced to correct the value of fair_coef_usf after a node has used its possibility to distribute u-data.
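For illustration, the extended coefficient could be computed as follows (a minimal sketch of our formula; the function name, counter array and sample values are ours, not part of the protocol):

// Sketch: the extended usefulness coefficient.
// $t holds a node's lifetime traffic counters (bytes);
// $deltads / $deltaus are the network-wide weight coefficients.
function fair_coef_usf(array $t, float $deltads, float $deltaus): float
{
    $num = $t['u_usual'] + $deltads * $t['d_special'];
    $den = $t['d_usual'] + $deltaus * $t['u_special'];
    // As in the basic formula, a node that has not downloaded
    // anything yet is treated as maximally useful.
    return ($den > 0) ? $num / $den : INF;
}

$coef = fair_coef_usf(['u_usual' => 5e9, 'd_usual' => 2e9,
                       'u_special' => 1e9, 'd_special' => 3e9], 0.5, 2.0);

The value of N allowed to a node could then, for example, grow with $coef: storing other participants' u-data (d_special) raises the coefficient, while distributing one's own u-data (u_special) lowers it.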
Now let's return to the end of the u-data distribution process. Let's assume that our network has some special nodes that store data indices (this is true if we extend a present-day BitTorrent or ed2k network). All M nodes storing our data send the data hash, the storing finish date and the search keywords to those of their neighbor nodes that are marked as indexing nodes; the indexing nodes store these values in the special data structure shown in scheme 1 below.
As we can see in this scheme, a special node has two hash tables. The first hash table contains hashes of search keywords, stored as keys of a dynamic array. Each value in this array links to a list of pointers to objects, where each object contains the content hash of the data relevant to this keyword, the data note and filename, and the ids or addresses of the nodes where this data is stored. We store ids for nodes with dynamic ip addresses and raw ip addresses for nodes with static ones. But if we only had the keyword hash table, our special nodes would be useless for routing search-by-content-hash requests, so we create a second hash table for this purpose. It maps known data content hashes to the head of a special object list, whose entries link to the same real objects (with data hash, keyword hashes, node ids and/or ips) mentioned above. So, when a keyword search request arrives, we split it into keywords and look their hashes up in the first table; we then rate keyword relevance in some way, and send the data filenames and data notes back to the node that initiated the request, letting it decide what to do with the results retrieved. When a hash search request arrives, we send back to the request initiator the ids / ip addresses of the nodes that may store this data.
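The two tables can be sketched in PHP as follows (a simplified model; the class and function names are ours, and PHP object handles stand in for the lists of pointers described above):

// One record per announced piece of u-data.
class DataRecord
{
    public $datahash;     // content hash of the data
    public $keywords;     // hashes of the data's search keywords
    public $note;         // data note
    public $filename;     // filename
    public $nodes;        // node ids (dynamic ip) or raw ips (static)
    public $storingtime;  // UNIX time until which the data is stored
}

$hashstructure1 = []; // table 1: keyword hash => list of records
$hashstructure2 = []; // table 2: content hash => list of records

function index_record(array &$t1, array &$t2, DataRecord $r)
{
    foreach ($r->keywords as $kwhash)
        $t1[$kwhash][] = $r;      // the same object, not a copy
    $t2[$r->datahash][] = $r;
}

// Keyword search: look every keyword hash up in table 1.
function search_by_keywords(array $t1, array $kwhashes)
{
    $found = [];
    foreach ($kwhashes as $kwhash)
        foreach ($t1[$kwhash] ?? [] as $r)
            $found[$r->datahash] = $r;
    return $found;
}

A search-by-content-hash request is served the same way from $hashstructure2, returning the node ids/ips of the matching records.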

How can we use this structure to update the storing time of u-data distributed in our network? Node n_init should send a special request with the data hash and the new storing time to all known nodes holding our hash table structures. These special nodes should find the record for this data by its content hash, update the storing time in this record, and send update requests to the known nodes where the data is stored. This is a non-guaranteed way to update the data storing time, but our goal is only to update the data on some nodes; the rest of the work is done by the current data availability coordinator.

Listing 2:

$rectype = get_rec_type($request);
if ($rectype == 'updatetime' && ($newtime > time() + $delay))
{
    $dataobj = 0;
    search_by_datahash(&$hashstructure2, $hash, &$dataobj);
    update_storingtime_delayed(&$dataobj, $newtime, $delay);
    mark_for_update(&$dataobj);
    send_updatetimerec_delayed(&$dataobj, $newtime);
}
elseif ($rectype == 'deleterec')
{
    $dataobj = 0;
    search_by_datahash(&$hashstructure2, $hash, &$dataobj);
    send_deleterec_delayed(&$dataobj, $delay);
    delete_delayed(&$dataobj, $delay);
}
/* $delay represents the time in seconds after
   which an object is actually deleted, once it
   has been disabled for search requests */

In listing 2 we can see code that allows indexing nodes to update or delete objects holding data hashes and node id/ip lists. It is significant that objects are not deleted immediately: we only mark them as deleted and set a delay time, after which they are actually removed during the indexing node's regular maintenance process. This maintenance should run regularly during periods of low CPU/network load on the indexing node. During maintenance, the indexing node checks the data storing time of every object and, if necessary, physically deletes expired objects whose undelete flag is not set. If that flag is set, the indexing node instead clears the delete flag and updates the storing time, provided the new time is later than the current one and there is no delete request for this or a later storing time of this data hash; otherwise the node deletes the object permanently.

So, how can we use this structure for searching u-data distributed over the network, and for updating its storing time, non-guaranteed deletion, etc.? The process looks like this:

Listing 3:

if (is_server_load_low())
{
    foreach ($hashstructure2 as $hkey => $hvalue)
    {
        $actionlist = getactions(&$hashstructure2[$hkey]);
        if ($actionlist)
            writelog(executeactions(&$hashstructure2[$hkey],
                                    $actionlist));
        while (!is_server_load_low())
        {
            sleep(5);
        }
    }
}

We have mentioned the data availability coordinator above; let's briefly describe its functions. All nodes storing our u-data regularly send requests to the current coordinator to check its availability, report that they themselves are available, and receive a list of the other nodes storing the u-data. The coordinator monitors these requests and keeps a list of the currently active nodes holding the u-data. If the coordinator suddenly disconnects, the other nodes elect a new one using the special data id each of them holds, working from the last node list they received. The node with the minimum data id (this may be a UNIX timestamp, for example) wins the election and becomes the new coordinator. When the old coordinator comes back to the network, it sends requests to the known nodes storing the u-data, obtains the new coordinator's address, tells the new coordinator to inform the other nodes about its return, receives the current node list, and becomes a regular node.
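A minimal sketch of this election rule (illustrative only; in the real network each node evaluates it locally on the last node list received from the coordinator):

// $nodes maps node address => special data id (e.g. a UNIX timestamp).
// Every surviving node applies the same deterministic rule to the same
// list, so all of them agree on the winner without extra communication.
function elect_coordinator(array $nodes)
{
    $winner = null;
    $min_id = PHP_INT_MAX;
    foreach ($nodes as $addr => $data_id) {
        if ($data_id < $min_id) {
            $min_id = $data_id;
            $winner = $addr;
        }
    }
    return $winner; // becomes the new coordinator
}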
Experimental results

Now let's experiment with our data distribution mechanism in a real network. We have extended the existing BitTorrent client BitTornado and the torrent tracker TBDev [2,21,22], and written a special tool (in PHP) to emulate a large number of different nodes. We used 2 servers: a C2D E4400, 3GB RAM, CentOS 5.0 and a C2D E6400, 3GB RAM, FC6. We know some statistics for the normal BitTorrent network based on these 2 servers: average network size = 4100 peers; about 1700 peers stay connected to the network for intervals of more than 24 hrs; average peer renewal speed: 0.08 peers per second; average incoming/outgoing speed of all peers = 1.4 mbit/s.

Since nobody besides us was given the modified p2p network client, we create virtual nodes on server 1, make server 2 the indexing node, and start distributing data over a virtual network whose main characteristics match the real one.

First, we assume that all peers have enough disc space for our data, and we do not actually save it on their discs: we only save data hashes, storing times and indices instead. 1700/4100 * 100% is roughly 41%, so in all our experiments 41% of peers are always connected to the network.

In the first two sets of experiments we fix the coordinator update interval at 1 hour and distribute data to a varying number of nodes. We also vary the average number of connected nodes (in experiment set two). Our goal is to determine the conditions under which the last node holding our data leaves the network, making the data inaccessible, and how many nodes we need so that 25% of the nodes still hold the data at the end of one coordinator update period. We then also vary the coordinator update interval and determine the same conditions under which all stored data disappears. At first we assume that our network has no constantly connected (core) nodes. We repeat every experiment 100 times to determine the best and worst results for the given conditions.
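The churn model used by our emulation tool can be sketched as follows (simplified; function and array names are illustrative, not the tool's actual code):

// One update period of the virtual network: $churn random non-core
// nodes leave and the same number of offline nodes connect, so the
// number of connected nodes stays constant on average.
function churn_step(array &$connected, array &$offline, int $churn)
{
    for ($i = 0; $i < $churn; $i++) {
        $leaving = array_rand($connected);
        if ($connected[$leaving]['core']) { // core nodes never leave
            $i--;
            continue;
        }
        $offline[$leaving] = $connected[$leaving];
        unset($connected[$leaving]);

        $joining = array_rand($offline);
        $connected[$joining] = $offline[$joining];
        unset($offline[$joining]);
    }
}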
                                             Diagram 1. First set of experiments.

In the first set of experiments (100 nodes connected to the network on average; coordinator update interval = 1 hour; every hour 59 random nodes leave the network and 59 others connect)¹ we can see that once N (the total number of nodes to which our data is initially distributed) reaches 14, more than 25% of those nodes still hold our data after 1 hour, even in the worst run. Theoretically N should reach 60 to give us a guarantee that the data survives one coordinator update period on at least one node, but the practical results are better, because the probability of the event "all nodes holding the data leave the network within one coordinator update period" decreases exponentially with N (it is on the order of 10^(-59*2) for N=59). So in a real network we do not need to distribute the data to such a huge number of nodes to preserve it with probability very close to 1.

¹ Raw data for this set of experiments can be found at http://195.70.211.9/syrcodis09_set1.txt
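To see why this probability falls so fast, consider a simplified model (our own illustration, assuming nodes leave the network independently): if each node holding the data leaves during one update period with probability p < 1, then

P(all N copies lost in one period) = p^N.

In the first experiment p is roughly 59/100, so p^14 ≈ 0.59^14 ≈ 6*10^-4: losing all 14 copies is expected in well under one run out of 100, which agrees with the worst results observed.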
Diagram 2. Second set of experiments.

Looking at diagram 2 (the second set of experiments, with 41 persistently connected nodes and the other properties equal to set 1)² we can see a slightly higher minimum number of nodes required for the data to survive on 25% of the nodes at the end of a coordinator update period. It is close to 25 nodes, because in this set of experiments 59 unique nodes leave the network, while in set 1 this value is distributed over the interval (1,59).

² Raw data for this set of experiments can be found at http://195.70.211.9/syrcodis09_set2.txt

Now let's vary the total number of connected nodes, with the other properties of the network equal to set 2. In diagram 3, which represents the 3rd set of experiments³, we can see the results for 4000 connected nodes and for N from 1 to 2996 with step 85. Once N reaches 86, more than 50% of the nodes still hold our data after a coordinator update period.

³ Raw data for this set of experiments can be found at http://195.70.211.9/syrcodis09_set3.txt

Diagram 3. Third set of experiments.


With the network connect/disconnect rate fixed, it is better to have more non-persistently connected nodes: this minimizes the number of nodes to which our data must be distributed to achieve a survival probability very close to 1 after one coordinator update period. This probability decreases very slowly with time, so it will still be high enough at the end of the data storing interval, as long as that interval is finite. More experiments are required to determine the maximum storing time interval that still gives a high probability of data survival. We can also say that varying the coordinator update interval can help us increase the probability of data survival. It is important to choose the update interval correctly before distributing any data in the network, and to vary this interval over the data's life cycle, in order to minimize the number of nodes to which the data is distributed and to save network bandwidth and the nodes' computing resources. We can do this, for example, by sending special requests to the indexing nodes (we have one of them in our experiments: an extended BitTorrent tracker). These actions add a little extra non-data-transfer traffic to the network, but save significantly more traffic between nodes.

Conclusions

We have described a model of an ideal p2p network for data distribution initiated by a single node. We also introduced some methods that should be used in this type of network and performed three series of experiments with good results. Our future work will concern deeper examination of routing mechanisms in this type of network, and methods that make data distribution and retrieval more secure.

REFERENCES

[1] Amazon S3 official documentation: http://docs.amazonwebservices.com/AmazonS3/2006-03-01/

[2] BitTorrent full specification (version 1.0). http://wiki.theory.org/BitTorrentSpecification

[3] A. Crespo and H. Garcia-Molina. Routing Indices for Peer-to-Peer Systems. In ICDCS, July 2002.

[4] Exeem project specification: http://www.exeem.it

[5] I. Foster. Peer to Peer & Grid Computing. Talk at Internet2 Peer to Peer Workshop, January 30, 2002.

[6] Gnutella project specification: http://www.gnutella.com

[7] Kademlia: A Design Specification. http://xlattice.sourceforge.net/components/protocol/kademlia/specs.html

[8] V. Kalogeraki, D. Gunopulos, D. Zeinalipour-Yazti. A Local Search Mechanism for Peer-to-Peer Networks. In CIKM, 2002.

[9] J. Kubiatowicz, D. Bindel, Y. Chen. OceanStore: An architecture for global-scale persistent storage. In ASPLOS, 2000.

[10] C. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker. Search and Replication in Unstructured Peer-to-Peer Networks. In ICS, 2002.

[11] D. Menasce and L. Kanchanapalli. Probabilistic Scalable P2P Resource Location Services. SIGMETRICS Perf. Eval. Review, 2002.

[12] I. Nekrestyanov. Distributed search in topic-oriented document collections. In SCI'99, volume 4, pages 377-383, Orlando, Florida, USA, August 1999.

[13] Nirvanix SDN official documentation: http://nirvanix.com/sdn.aspx

[14] S. Ratnasamy, P. Francis, M. Handley. A scalable content-addressable network. In ACM SIGCOMM, August 2001.
[15] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Middleware, 2001.

[16] K. Scherbakov. Search request routing in Bittorrent and other P2P based file sharing networks. SYRCoDIS 2008, Saint-Petersburg, Russia.

[17] I. Stoica, R. Morris, D. Karger. Chord: A scalable peer-to-peer lookup service for internet applications. In Proc. ACM SIGCOMM, 2001.

[18] M. Stokes. Gnutella2 Specifications Part One. http://gnutella2.com/gnutella2_search.htm

[19] D. Talia, P. Trunfio. A P2P Grid Services-Based Protocol: Design and Evaluation. Euro-Par 2004.

[20] A. S. Tanenbaum. Computer Networks. Prentice Hall, 1996.

[21] TBSource official documentation: http://www.tb-source.info

[22] TorrentPier official documentation: http://torrentpier.info

[23] D. Tsoumakos and N. Roussopoulos. Adaptive Probabilistic Search for Peer-to-Peer Networks. In 3rd IEEE Int'l Conference on P2P Computing, 2003.

[24] D. Tsoumakos, N. Roussopoulos. Analysis and comparison of P2P search methods. Proceedings of the 1st international conference on Scalable information systems, Hong Kong, 2006.

[25] Wuala project official documentation: http://www.wuala.com/en/about/

[26] B. Yang and H. Garcia-Molina. Improving Search in Peer-to-Peer Networks. In ICDCS, 2002.

[27] B. Zhao, J. Kubiatowicz, and A. Joseph. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Technical Report UCB/CSD-01-1141, Computer Science Division, U. C. Berkeley, April 2001.