=Paper=
{{Paper
|id=Vol-454/paper-2
|storemode=property
|title=Time-limited data storing in peer to peer networks
|pdfUrl=https://ceur-ws.org/Vol-454/paper2.pdf
|volume=Vol-454
}}
==Time-limited data storing in peer to peer networks==
Konstantin Scherbakov
SPbSU
Saint Petersburg, Russia
konstantin.scherbakov@gmail.com
ABSTRACT

Different techniques of using peer to peer networks are mentioned. The process of data distribution in a peer to peer network on a single node's initiative, for the purpose of later downloading by this node's owner, is described. A common model of a special peer to peer network (a UDD network), developed for single node initiated data distribution, is introduced together with its main characteristics and algorithms. Some ways of improving existing peer to peer networks to make them comply with the UDD network specification are mentioned. Experimental results of single node initiated data distribution in one of the existing peer to peer networks (with our improvements applied) are analyzed.

Introduction and related work

Pure and hybrid peer to peer (p2p) networks are now widely used. There are at least two main ways to use them:

- for grid/cloud computing and storage (Amazon S3, Nirvanix SDN, Gridnut, Wuala etc.) [1,5,9,13,19,25]
- for data exchange between interconnected nodes (bittorrent, exeem, ed2k, kademlia, gnutella 1/2 etc.) [2,4,6,7,21,22]

If a p2p network is designed for data exchange, new nodes usually connect to it to retrieve files they are interested in, or to share files with other nodes. But there is at least one more way to use this type of p2p network: a node can connect to the network to store its own data/files on other nodes, so that the node's owner can later retrieve this data from any other place. Such data may be uninteresting to the other nodes of the network and may be encrypted; let's call it u-data. This usage is rare because existing p2p networks lack data access control, data availability control and fast u-data distribution mechanisms.

Our main goal is to introduce the architecture of a prototype of such a p2p network, with search request routing, data storing and indexing protocols that allow any node to store its u-data securely over the network for some fixed time interval, grant the node's owner the ability to retrieve the data from any other location during this period, and let him control the availability of the data over the network during its limited storing period. The use of these mechanisms by a particular node should depend on some coefficient of the node's usefulness for the whole network, so that nodes which are more useful for the network get the opportunity to use these mechanisms more often and more completely.

UDD P2P network model

Let's assume that a person or a group of persons has some u-data that is very useful for them, but not interesting for the other participants of our future u-data distribution (UDD) p2p network. This person wants to have access to the data from any place where he can connect to the internet (and to our UDD p2p network), wants to be able to extend the data storing time, and expects the network to use some mechanism that prevents, with high probability, his data from being deleted from all network nodes during its limited storing time.

What properties do we expect from our UDD p2p network? Let's assume that our network has the properties listed below:

1. Network nodes have their own limited long-term memory.
2. Network nodes have access to some computing resources (for example their own CPU) to calculate data content and keyword hashes.
3. The network has mechanisms for u-data storing and for its effective distribution and deletion.
4. The network provides a mechanism that prevents data disappearance with high probability.
5. The network has mechanisms for effective data search request routing and metadata storing [3,8,10-12,14,15,17,20,23,24,26,27].
6. Network nodes may connect and disconnect constantly, therefore the network provides some mechanism for estimating node connect/disconnect rates.

This is an ideal network for our purposes. Real p2p networks (especially file sharing ones) often have 3-4 of these properties, but we won't examine them at this point. It is only important to note that some of these networks can rather easily be extended to have all the properties of our ideal p2p u-data storing network [16]. It is important to introduce and describe the method of u-data distribution in a UDD network initiated by a specific node.
In a usual p2p network the data distribution process has some general stages:

- Data hashing (data content hashing, data description keyword extraction and hashing etc.)
- Data announcing (more precisely this stage should be called "data hashes announcing")
- Waiting for external requests for the announced data from other network nodes
- Data uploading to the external network nodes interested in the announced data

This is a good distribution process for data that may be interesting and useful for other peers in a usual file sharing p2p network, where the waiting stage does not last forever for the data being distributed. But we assumed that our data is uninteresting for other nodes. Furthermore, our data may be encrypted by its owner to prevent its effective use by other nodes' owners. Therefore the third stage of the usual distribution process would last forever for u-data in a usual file sharing network. So how can we distribute data effectively in our ideal network? Nodes have limited long-term memory, so if we try to send our data to all accessible nodes, some of them may reject it for lack of free disc space. Our first goal is to share the u-data between N nodes (including the initial one), where N is defined by the data and by the owner of the initial node n_init (N should be limited according to the usefulness coefficient of n_init mentioned above). Let's assume that n_init has M neighbors. If M > N, we ask them about their free space and ask them to reserve sizeof(u-data) bytes for our u-data, issuing M queries and getting M_ok satisfying answers. If M_ok < N, we ask these nodes for their neighbor lists and repeat the first step. Once M_ok >= N, we send the data together with its content hash, keywords, etc. This process is briefly described in the code listing below.

Listing 1:

$neighbors_arr = get_neighbors(); // neighbor list
$m_ok = 0;
foreach ($neighbors_arr as $neighbor_num => $neighbor_value)
{
    if (reservespacerequest(&$neighbors_arr[$neighbor_num],
            &$m_ok, $space, $datahash) == 1)
        $neighbors_arr[$neighbor_num]['reserved'] = 1;
    $neighbors_arr[$neighbor_num]['checked'] = 1;
}
$cur_neighbor = get_firstkey(&$neighbors_arr);
while ($m_ok < $n)
{
    addneighbors_callback(&$neighbors_arr, &$m_ok, $cur_neighbor,
        'reservespacerequest');
    if (($cur_neighbor = get_nextkey(&$neighbors_arr, $cur_neighbor)) === false)
        break;
}
$invited = send_download_invitation(&$neighbors_arr);
/* some code deleted from listing */
$resultnodes = send_data_info(&$neighbors_arr);
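Listing 1 relies on several helper functions that the paper does not show. The sketch below is an illustrative guess at the space reservation step: the name reservespacerequest and its parameters come from listing 1, but the body, the send_request_to_node() transport helper and the message format are our assumptions, not the paper's protocol.

// Assumed transport stub: a real client would perform a network round
// trip here; this stub just simulates a neighbor that accepts any request.
function send_request_to_node($addr, array $msg)
{
    return array('ok' => true);
}

// Hypothetical sketch of the reservation helper used in listing 1: ask one
// neighbor to reserve $space bytes for the u-data identified by $datahash.
// Returns 1 on success (and increments $m_ok), 0 otherwise.
function reservespacerequest(array &$neighbor, &$m_ok, $space, $datahash)
{
    $reply = send_request_to_node($neighbor['addr'], array(
        'type'     => 'reservespace',
        'bytes'    => $space,
        'datahash' => $datahash,
    ));
    if (!empty($reply['ok'])) {
        $m_ok++;  // one more node agreed to hold the u-data
        return 1;
    }
    return 0;
}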
In many file sharing p2p networks every node has a fairness (usefulness) coefficient of the form

fair_coef_usf = node_upl / node_dwn when node_dwn > 0, and infinity otherwise.

When a node downloads some data it needs, its node_dwn value increases, and when it uploads some data to other nodes, its node_upl value increases; these values never decrease. For our extended network, we can change this formula to

fair_coef_usf = (u_usual + deltads * d_special) / (d_usual + deltaus * u_special),

where u_usual is the outgoing traffic value for usual data, d_usual is the incoming traffic value for usual data, u_special and d_special are the outgoing and incoming traffic values for special data (distributed over our u-data distribution mechanism), and deltads, deltaus are special weight coefficients introduced to correct the value of fair_coef_usf after some node has used the possibility to distribute u-data.
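As a worked example, the extended coefficient can be computed directly from the formula above. The INF convention for a zero denominator mirrors the basic formula; the default weight values for deltads and deltaus are placeholders chosen for illustration, since the paper does not fix them.

// Compute the extended fairness coefficient fair_coef_usf.
function fair_coef_usf($u_usual, $d_usual, $u_special, $d_special,
                       $deltads = 0.5, $deltaus = 0.5)
{
    $denominator = $d_usual + $deltaus * $u_special;
    if ($denominator == 0) {
        return INF;  // same convention as node_dwn == 0 in the basic formula
    }
    return ($u_usual + $deltads * $d_special) / $denominator;
}

// Example: a node uploaded 10 GB of usual data, downloaded 5 GB, accepted
// 2 GB of other nodes' u-data and distributed 1 GB of its own u-data:
echo fair_coef_usf(10.0, 5.0, 1.0, 2.0);  // (10 + 0.5*2) / (5 + 0.5*1) = 2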
Now let's return to the end of the u-data distribution process. Let's assume that our network has some special nodes that store data indices (this is true if we extend the bittorrent or ed2k networks as they exist now). All the nodes that store our data send the data hash, the storing finish date and the search keywords to those of their neighbor nodes that are marked as indexing nodes, and the indexing nodes store these values in the special data structure shown on scheme 1 below.

As we can see on this scheme, our special node has two hash tables. The first hash table contains the hashes of search keywords, stored as keys of a dynamic array. Each value in this array contains a link to a special list of pointers to objects, where each object contains the content hash of the data relevant to this keyword, the data note and filename, and the ids or addresses of the nodes where this data is stored. We can store ids for nodes with dynamic ip addresses and raw ip addresses for nodes with static ones. But if we only have a hash table for keywords, our special nodes will be useless for routing search requests by data content hash, so we create a second hash table for this purpose. It contains the known data content hashes with links to the heads of special object lists, which in turn contain links to the real objects holding the data hash, keyword hash and node ids and/or ips mentioned above. So, when a search request arrives, we split it into keywords and look their hashes up in the first table. Then we get the data filenames, the keyword relevance estimated in some way, and the data notes, and send them back to the node that initiated the request, letting it decide what to do with the retrieved results. When a hash search request arrives, we send back to the request initiator the ids / ip addresses of the nodes that can store this data.
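To make this structure concrete, here is a minimal sketch of the two tables as PHP arrays. The name $hashstructure2 for the content hash table matches listings 2 and 3; $hashstructure1, the field names and the keyword search helper are our assumptions, since scheme 1 itself is not reproduced here.

// One index object describing a piece of distributed data; both hash
// tables point at the same objects.
$object = array(
    'datahash' => 'a3f1c2...',                  // data content hash
    'note'     => 'sample data note',
    'filename' => 'sample.dat',
    'nodes'    => array('node17', '10.0.0.5'),  // ids (dynamic ip) or raw ips
    'endtime'  => 1244019003,                   // storing finish date (UNIX time)
);

// Table 1: keyword hash => list of links to relevant objects.
$hashstructure1 = array(md5('sample') => array(&$object));

// Table 2: data content hash => list of links to the same objects,
// used to route search-by-content-hash requests.
$hashstructure2 = array($object['datahash'] => array(&$object));

// Keyword search: split the request into keywords, look their hashes up
// in table 1 and collect filenames, notes and node lists for the reply.
function keyword_search(array $hashstructure1, $request)
{
    $results = array();
    foreach (preg_split('/\s+/', trim($request)) as $word) {
        $key = md5($word);
        if (!isset($hashstructure1[$key])) {
            continue;
        }
        foreach ($hashstructure1[$key] as $obj) {
            $results[] = array($obj['filename'], $obj['note'], $obj['nodes']);
        }
    }
    return $results;
}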
How can we use this structure for updating the storing time of u-data distributed in our network? Node n_init should send a special request with the data hash and the new storing time to all known nodes holding our hash table structures. These special nodes should find the record about this data by its content hash. Finally, they should update the storing time of this record and send an update request to the known nodes on which this data is stored. This is a non-guaranteed way to update the data storing time, but our goal is only to update the data on some nodes; all the remaining work should be done by the current data availability coordinator.

Listing 2:

$reqtype = get_rec_type($request);
if ($reqtype == 'updatetime' && ($newtime > time() + $delay))
{
    $dataobj = 0;
    search_by_datahash(&$hashstructure2, $hash, &$dataobj);
    update_storingtime_delayed(&$dataobj, $newtime, $delay);
    mark_for_update(&$dataobj);
    send_updatetimerec_delayed(&$dataobj, $newtime);
}
elseif ($reqtype == 'deleterec')
{
    $dataobj = 0;
    search_by_datahash(&$hashstructure2, $hash, &$dataobj);
    send_deleterec_delayed(&$dataobj, $delay);
    delete_delayed(&$dataobj, $delay);
}
/* $delay represents time in seconds, after which this object is
   actually deleted after it was disabled for search requests */
In listing 2 we can see code that allows indexing nodes to update or delete objects with data hashes and node id/ip lists. It is significant that objects are not deleted immediately: we only mark them as deleted and set some delay time, after which they are actually deleted during the indexing node's regular maintenance process. This process should take place regularly, in periods of low CPU/network/etc. load on the indexing node. During maintenance, the indexing node should check the data storing time of every object and, if necessary, physically delete it, provided the undelete flag is not set. If this flag is set, the indexing node should instead remove the delete flag and update the storing time, if the new time is higher than the current one and there is no delete flag for this or a higher storing time for this data hash; otherwise the node should delete the object permanently.
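The maintenance rule above can be summarized in code. The sketch below is one possible reading of that paragraph; the field names (deleted, undelete, endtime, new_endtime) and the physically_delete() stub are hypothetical, not taken from the paper.

// Decide the fate of one index object during maintenance ($now is the
// current UNIX time).
function maintain_object(array &$obj, $now)
{
    if (empty($obj['deleted'])) {
        return;  // object is alive, nothing to do
    }
    if (!empty($obj['undelete']) && $obj['new_endtime'] > $now) {
        // an update arrived while the object was marked: revive it
        $obj['deleted']  = 0;
        $obj['undelete'] = 0;
        $obj['endtime']  = $obj['new_endtime'];
    } elseif ($now >= $obj['endtime']) {
        physically_delete($obj);  // storing time expired, delete permanently
    }
}

// Assumed stub; a real node would also unlink the object from both tables.
function physically_delete(array &$obj)
{
    $obj = array();
}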
So, how can we use this structure for searching the u-data distributed over the network, for updating its storing time, for non-guaranteed deletion and so on? Here you can see this process:

Listing 3:

if (is_server_load_low())
{
    foreach ($hashstructure2 as $hkey => $hvalue)
    {
        $actionlist = getactions(&$hashstructure2[$hkey]);
        if ($actionlist)
            writelog(executeactions(&$hashstructure2[$hkey],
                $actionlist));
        while (!is_server_load_low())
        {
            sleep(5);
        }
    }
}

We have mentioned the data availability coordinator above; let's briefly describe its functions. All nodes storing our u-data regularly send requests to the current coordinator to check its availability, to tell it that they are available, and to receive the list of the other nodes on which the u-data is stored. The coordinator monitors these requests and keeps a list of the currently active nodes on which the u-data is present. If the coordinator suddenly disconnects, the other nodes elect a new one using the special data id they hold, working from the last node list they received. The node with the minimal data id (this may be a UNIX time, for example) wins the election and becomes the new coordinator. When the old coordinator comes back to the network, it sends requests to the known nodes storing the u-data and receives the new coordinator's address. It then asks the new coordinator to tell the other nodes that the old coordinator has returned, sends it the current node list, and becomes a regular node.
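A minimal sketch of this election rule, assuming every surviving node applies it to the last node list received from the old coordinator (the list format is our assumption):

// Elect a new coordinator: $nodelist maps node address => the data id that
// node reported; the node with the minimal data id (e.g. a UNIX time) wins.
function elect_coordinator(array $nodelist)
{
    $winner = null;
    $min_id = PHP_INT_MAX;
    foreach ($nodelist as $addr => $data_id) {
        if ($data_id < $min_id) {
            $min_id = $data_id;
            $winner = $addr;
        }
    }
    return $winner;
}

// Example: three survivors from the last received node list.
echo elect_coordinator(array(
    '10.0.0.5' => 1244019003,
    '10.0.0.7' => 1244018001,  // minimal data id: becomes the coordinator
    '10.0.0.9' => 1244020500,
));  // prints 10.0.0.7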
Experimental results

Now let's run some experiments with our data distribution mechanism in a real network. We have extended the existing bittorrent client bittornado and the torrent tracker tbdev [2,21,22], and written a special tool to emulate a large number of different nodes (this tool is written in PHP). We use two servers: a C2D E4400, 3GB RAM, CentOS 5.0 and a C2D E6400, 3GB RAM, FC6. We know some statistics for the normal bittorrent network based on these two servers: the average network size is 4100 peers; about 1700 peers stay connected to the network for intervals of more than 24 hrs; the average peer renewal speed is 0.08 peers per second; the average incoming/outgoing speed over all peers is 1.4 mbit/s.

Nobody besides us was given the modified p2p network client, so we create virtual nodes on server 1, make server 2 an indexing node, and start to distribute data over our virtual network, whose main characteristics are kept equal to the real one.

Firstly, we assume that all peers have appropriate disc space for our data, and we don't actually save it on their discs: we only save data hashes, storing times and indexes instead. 1700/4100 * 100% is roughly 41%, so in all our experiments 41% of the peers are always connected to the network.

In the first two sets of experiments we fix the coordinator update interval at 1 hour and start to distribute data to a varying number of nodes. We also vary the average number of connected nodes (in experiment set two). Our goal is to determine the conditions under which the last node holding our data leaves the network, making the data inaccessible, and how many nodes we need so that 25% of the nodes still hold the data at the end of one coordinator update period. Later we also vary the coordinator update interval and determine the same conditions under which all stored data disappears. At first we assume that our network has no constantly connected (core) nodes. We repeat every experiment 100 times to determine the best and worst results for the given conditions.

In the first set of experiments (100 nodes connected to the network on average, coordinator update time = 1 hour; every hour 59 random nodes leave the network and 59 others connect)1 we can see that when N (the total number of nodes to which our data is distributed initially) reaches 14, more than 25% of the nodes still have our data after 1 hour even in the worst result acquired. Theoretically N should reach 60 to guarantee that the data survives on at least one node after one coordinator update period, but the practical results are better, because the probability of the event "all nodes with the data leave our network during one coordinator update period" decreases exponentially (and is about O(10^(-59*2)) for N=59). So in a real network we don't need to distribute the data to such a huge number of nodes to save it with probability very close to 1.

1 Raw data for this set of experiments can be found at http://195.70.211.9/syrcodis09_set1.txt

Diagram 1. First set of experiments.
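To illustrate why this probability falls off so quickly, the script below estimates the chance that all N data holders are among the 59 nodes (of 100) that leave during one update period, under a simplifying assumption of ours: the leaving nodes are drawn uniformly, which gives the hypergeometric probability C(59, N) / C(100, N).

// Probability that all $n nodes holding the data are among the $leaving
// nodes (out of $connected) that depart during one update period.
// The uniform-leaving model is our simplification, not the paper's.
function prob_all_leave($n, $connected = 100, $leaving = 59)
{
    if ($n > $leaving) {
        return 0.0;  // more holders than leavers: at least one must stay
    }
    $p = 1.0;
    for ($i = 0; $i < $n; $i++) {
        // the i-th holder must also be a leaver, given the previous ones are
        $p *= ($leaving - $i) / ($connected - $i);
    }
    return $p;
}

echo prob_all_leave(14);  // about 3.0e-4 under this simplified model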
Looking at diagram 2 (the second set of experiments, where we have 41 persistently connected nodes and the other properties are equal to set 1)2 we can see a slightly higher minimal number of nodes required for the data to survive on 25% of the nodes at the end of a coordinator update period. It is close to 25 nodes, because in this set of experiments 59 unique nodes leave the network, while in set 1 this value is distributed over the interval (1,59).

2 Raw data for this set of experiments can be found at http://195.70.211.9/syrcodis09_set2.txt

Diagram 2. Second set of experiments.

Now let's vary the total number of connected nodes, with the other properties of the network equal to set 2. On diagram 3, which represents the third set of experiments3, we can see the results for 4000 connected nodes and for N from 1 to 2996 with step 85. When N reaches 86, more than 50% of the nodes with our data are still alive after a coordinator update period.

3 Raw data for this set of experiments can be found at http://195.70.211.9/syrcodis09_set3.txt

Diagram 3. Third set of experiments.
While the network connect/disconnect rate is fixed, it is better to have more non-persistently connected nodes, as this minimizes the number of nodes to which our data must be distributed to keep the probability of saving it very close to 1 after one coordinator update period. This probability decreases very slowly with time, so it is still high enough at the end of our data storing interval, because that interval is not infinite. More experiments are required to determine the maximum satisfying storing time interval for a high probability of data saving. We can also say that varying the coordinator update interval can help us to increase the probability of data saving. It is important to determine the update interval correctly before distributing any data in the network, and to vary this interval during the data life cycle, in order to minimize the number of nodes to which the data is distributed and to save network bandwidth and the nodes' computing resources. We can do this, for example, by sending special requests to the indexing nodes (we have one of them in our experiments: an extended bittorrent tracker). These actions add a little non data-transfer traffic over the network, but save us significantly more traffic between nodes.
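Such a request could reuse the request format of listing 2. The 'updateinterval' type and the helper below are purely illustrative and rely on the assumed send_request_to_node() transport stub from the earlier sketches; they are not part of the paper's protocol.

// Ask an indexing node to change the coordinator update interval for one
// piece of u-data (hypothetical request type).
function request_interval_update($indexing_node, $datahash, $new_interval)
{
    return send_request_to_node($indexing_node, array(
        'type'     => 'updateinterval',  // assumed request type
        'datahash' => $datahash,
        'interval' => $new_interval,     // seconds between coordinator updates
    ));
}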
Conclusions

So, we have described a model of an ideal p2p network for data distribution initiated by a single node. We have also introduced some methods that should be used in this type of network, and performed three series of experiments with good results. Our next work will be concerned with a deeper examination of routing mechanisms in this type of network and with methods that make data distribution and retrieval more secure.
REFERENCES

[1] Amazon S3 official documentation: http://docs.amazonwebservices.com/AmazonS3/2006-03-01/

[2] BitTorrent full specification (version 1.0). http://wiki.theory.org/BitTorrentSpecification

[3] A. Crespo and H. Garcia-Molina. Routing Indices for Peer-to-Peer Systems. In ICDCS, July 2002.

[4] Exeem project specification: http://www.exeem.it

[5] I. Foster. Peer to Peer & Grid Computing. Talk at Internet2 Peer to Peer Workshop, January 30, 2002.

[6] Gnutella project specification: http://www.gnutella.com

[7] Kademlia: A Design Specification. http://xlattice.sourceforge.net/components/protocol/kademlia/specs.html

[8] V. Kalogeraki, D. Gunopulos, D. Zeinalipour-Yazti. A Local Search Mechanism for Peer-to-Peer Networks. In CIKM, 2002.

[9] J. Kubiatowicz, D. Bindel, Y. Chen. OceanStore: An architecture for global-scale persistent storage. In ASPLOS, 2000.

[10] C. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker. Search and Replication in Unstructured Peer-to-Peer Networks. In ICS, 2002.

[11] D. Menasce and L. Kanchanapalli. Probabilistic Scalable P2P Resource Location Services. SIGMETRICS Perf. Eval. Review, 2002.

[12] I. Nekrestyanov. Distributed search in topic-oriented document collections. In SCI'99, volume 4, pages 377-383, Orlando, Florida, USA, August 1999.

[13] Nirvanix SDN official documentation: http://nirvanix.com/sdn.aspx

[14] S. Ratnasamy, P. Francis, M. Handley. A scalable content-addressable network. In ACM SIGCOMM, August 2001.

[15] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Middleware, 2001.

[16] K. Scherbakov. Search request routing in Bittorrent and other P2P based file sharing networks. SYRCoDIS 2008, Saint-Petersburg, Russia.

[17] I. Stoica, R. Morris, D. Karger. Chord: A scalable peer-to-peer lookup service for internet applications. In Proc. ACM SIGCOMM, 2001.

[18] M. Stokes. Gnutella2 Specifications Part One. http://gnutella2.com/gnutella2_search.htm

[19] D. Talia, P. Trunfio. A P2P Grid Services-Based Protocol: Design and Evaluation. Euro-Par 2004.

[20] A. S. Tanenbaum. Computer Networks. Prentice Hall, 1996.

[21] TBSource official documentation: http://www.tb-source.info

[22] TorrentPier official documentation: http://torrentpier.info

[23] D. Tsoumakos and N. Roussopoulos. Adaptive Probabilistic Search for Peer-to-Peer Networks. In 3rd IEEE Int'l Conference on P2P Computing, 2003.

[24] D. Tsoumakos, N. Roussopoulos. Analysis and comparison of P2P search methods. Proceedings of the 1st international conference on Scalable information systems, Hong Kong, 2006.

[25] Wuala project official documentation: http://www.wuala.com/en/about/

[26] B. Yang and H. Garcia-Molina. Improving Search in Peer-to-Peer Networks. In ICDCS, 2002.

[27] B. Zhao, J. Kubiatowicz, and A. Joseph. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Technical Report UCB/CSD-01-1141, Computer Science Division, U. C. Berkeley, April 2001.