<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A scalable architecture for peer privacy on the Web</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Spyros Kotoulas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruud Stegers</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Articial Intelligence, VU University Amsterdam</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We consider the privacy implications of exchanging Web data and present the design of a peer-to-peer discovery mechanism with strong privacy guarantees. Its benets are that it does not rely on a trusted third party, spreads the computational cost among participants and places minimal restrictions on privacy policies. Furthermore, it provides scalability both in terms of system size and level of privacy oered. We outline an implementation that relies on two overlays: a distributed hash table for discovery and an anonymising overlay. Finally, we evaluate the proposed mechanism against a number of privacy threats and analyse its complexity.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Organizations holding personal information (e.g. government, social networking
websites, banks) provide a Web Service where user agents, after authenticating,
can retrieve and modify information that concerns them. User agents cannot keep
track of the organizations with data that concerns them since the latter is not
inserted or maintained by them. How can these Web Services be located while
maintaining the privacy of users and data? The basic challenges in designing
such a system are that the number of such Web Services may be large, thus it is
not possible to contact all of them for each query and that user agents and web
services should only be able to see data and queries they are authorized to.</p>
      <p>
        Centralized architectures where a single organization maintains an index of
all data and enforces access control policies suers from the following drawbacks:
This organization must be completely trusted by all participants. Additionally,
the maintenance of such a gargantuan volume of data would not be an easy,
let alone nancially viable, task. Finally, the centralized infrastructure should
be able to enforce the access control and privacy policies of the information
providers. This inhibits or precludes the use of locally calculated policies and
policies that use private criteria. For example, some Web Service wants to
authenticate users itself and does not share user passwords with the central index.
This is an especially important restriction for the Social Web or any system
where we do not have a priori trust agreements [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. For these reasons, we
advocate the use of a peer-to-peer paradigm, where information providers will be
able to keep data locally, or at least in the same authority domain and enforce
their own privacy and access control policies.
      </p>
      <p>The focus of this work is on scalable privacy, with scalable referring to both
the size of the system and the level of the privacy guarantees provided. The
threat to the privacy of an individual is related to the ability of an adversary to
associate information or other individuals with it. Instead of a one-size-ts-all
approach, the level of privacy in our method is tunable. We explore the
tradeo between privacy and performance ranging from high performance, providing
similar performance and slightly better privacy guarantees than current resource
discovery systems and search engines to high privacy, providing anonymity for
the participating parties and non-disclosure of the content, its description or a
unique identier for it. Note that the focus of this work is not on privacy and
access control policies, but on an architecture and mechanisms that allow a more
open set of such policies.</p>
      <p>We propose a Peer-to-Peer framework for private discovery and querying of
(Semantic Web) data and develop a method to locate data providers without
revealing their location, identity or the descriptions of the content. To this end we
describe a distributed scalable index, implemented on top of a Distributed Hash
Table (DHT). Additionally we show a method to obfuscate and approximate
the descriptors in the index, so as to prevent disclosure of resource descriptions
and combine it with an anonymising network to protect the identities of the
parties involved. We test a selection of the possible privacy settings against a
set of possible attacks and analyse their performance, paying special attention
to the trade-o between the additional privacy mechanisms and the associated
computational and network overhead.</p>
      <p>In section 2 we describe the privacy threats covered by our work. Related work
is described in section 3. In section 4 we introduce our architecture for scalable
privacy, including a description of the technologies we employ to achieve this.
Finally, we analyse our architecture in relation to a set of privacy attacks and
the associations dened in section 2.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Privacy threats</title>
      <p>We present the threats within a discovery/sharing system in terms of the ability
of an adversary to associate aspects of content (e.g. a descriptor or the content
itself) to a user. We use the term provider for a peer that shares content and
seeker for a peer that consumes content. Content may be anything (e.g. RDF
data or Social networking prole or a medical record) while a content identier
is a unique identier for some content (e.g. a cryptographic hash of the content
or a URI). A discovery mechanism facilitates the sharing process by allowing a
seeker to either locate content or potential providers based on a set of descriptors
(e.g. keywords or derivatives). The discovery mechanism usually maintains an
index of descriptors to providers.</p>
      <p>Table 1 presents possible privacy impairing association. We observe that: (a)
Any association in the top cells is detrimental for privacy (e.g. a mapping from
content to identity ). (b) It is possible to derive more sensitive information from
less sensitive information, moving from the bottom of the table to the top (e.g.
an adversary that knows the set of URIs that appear in a SPARQL query can get
a good estimate of the query itself). (c) Even perceived insignicant associations
can lead to signicant privacy breaches (e.g. two users that are associated with
CCoonntteenntt (deegs.crdiopctiuomne(netg). keywords) Identity (eg real name)
QQuueerryy (deegs.crSiPptAioRnQ(Legq. useertyo)f URIs) () Location (eg. IP address)
SCyosntteemnt(IuDsa(geego.fG) UID) Any user in the system
Table 1. Privacy impairing associations. This table shows the dierent associations
that can be made between content and identity ordered by descending signicance.
similar content IDs have similar interests, thus user proling is possible). (d)
dierent association types can be combined. For example, an adversary that
knows the IP addresses where the system is accessed from ( system - location
association) and has access to an anonymous query log ( query - any user in the
system association). She can infer a (weak) association between queries and IP
addresses. We will refer back to this table in section 5.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Background</title>
      <p>Discovery/sharing systems are widely used both in closed domains and on the
public web. Although encryption schemes exist to ensure security, signicant
challenges concerning privacy and scale remain. Search engines can scale but
have abysmal privacy guarantees. Peer-to-peer systems satisfy the basic principle
of lacking a central trusted party, but mechanisms to guarantee privacy are not
available yet.</p>
      <p>We provide an overview of the privacy implications (for seekers and providers)
of current discovery/sharing mechanisms in combination with their scalability
properties:
Centralized index An index of all content or content descriptions is kept at a
collection of centrally managed servers. Information seekers post queries to
the index and receive the content, a content identier or a provider to ask
for the content. This can be a scalable solution, assuming adequate
nancial resources to maintain this infrastructure. Nevertheless, it requires that
all parties (both seekers and providers) trust this central index with their
privacy, since it has complete control and access to all content and queries.
Furthermore, providers should entrust the index to correctly enforce the
policies specied with the data (if any).</p>
      <p>
        Broadcast-based unstructured overlays Unstructured overlays are a type
of peer-to-peer overlay where peers maintain a set of ad-hoc connections to
other peers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Queries are broadcasted to all peers, or a signicant subset
of peers in the system. They provide good privacy guarantees concerning
the information providers, since there is no index that can be used to derive
privacy sensitive information from. Additionally, every provider maintains
full control of it own data. The privacy of the seekers on the other hand is
not protected since their queries can be seen by any peer in the network.
Broadcast-based systems are not scalable since queries need to be sent to a
large subset of the total peers in the system.
      </p>
      <p>
        Structured Overlays Structured peer-to-peer overlays impose a global
structure on the peer connections [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. They can be used to implement ecient
and scalable indexes spread over a large number of nodes. They are very
scalable but fare poorly with provider privacy since the index is readable by
a large number of possibly untrusted nodes, which can be used to derive
privacy sensitive information about the provider. Any single compromised
index node is a direct threat to privacy of the seekers that uses it, since the
queries are exposed..
      </p>
      <p>Condential indexes Condential indexes aim at hiding information from the
indexed content. The general pattern is to introduce noise to the results in
order to avoid strong associations. We describe two recent implementions of
condential indexes that are very relevant to our work.</p>
      <p>
        The condential index proposed by Bawa et al [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] relies on a public index
that divides the providers over a set of privacy groups. Within the group,
they collaboratively create a bloom lter with the descriptors of their
content using a randomized process so as to avoid exposure of the content of
individual providers. These lters, along with the list of all providers in the
group are sent to a central server, where an index is created that maps bits
in the bloom lter to privacy groups. The privacy of the providers is
protected since the privacy groups contain providers that have or do not have
the requested content (introduction of false positives).
      </p>
      <p>
        Zerber [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] uses a set of largely untrusted servers to maintain an inverted
index of descriptors. Access control is enforced on the index. To defend against
compromised index nodes, no single index server is given enough information
to reconstruct a descriptor by itself. To this end, descriptors are encrypted
using k out of n cryptography. This means that every descriptor is split and
encrypted into n dierent parts, each owned by a dierent index node. Since
every index node enforces access control locally, as many as k compromised
index nodes are required for an adversary to obtain a complete descriptor.
For additional protection, the index maps sets of terms instead of single
terms to document descriptors (called a merged posting list mechanism).
The decrypted descriptors contain the original index terms (so the irrelevant
entries returned as result of the merging can be ltered out) and the location
of the document. Zerber+R [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] additionally provides top-k ranked results.
Our own method falls into this category and is further described in section
4.1. The major improvement over these systems is that our system
additionally provides anonymity and can scale to a much larger index, since it is
maintained by system participants.
      </p>
      <p>Encrypted indexes Encrypted indexes protect the content by encrypting
sensitive parts of the index entries. Since this generally requires complex key
management schemes, scalability is a problem.</p>
    </sec>
    <sec id="sec-4">
      <title>Our Architecture</title>
      <p>
        We are using an index-assisted peer-to-peer model similar to the one in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. To
have their information indexed, providers (a’) extract a content descriptor (e.g.
keywords, hashes of keywords, ...) and (b’) store it in the index together with
an identier to contact the provider. To locate and access content, seekers need
to take the following steps: (a) they create a set of content descriptors based
on their query (e.g. a set of keywords, hashes of keywords, ...), (b) they send
this set to the index which in turn (c) returns a number of providers which are
(possibly) able to provide the information. (d) The seeker selects a number of
these providers based on local preferences and (e) negotiates directly with the
provider to retrieve the requested information. Our method places no restriction
on the privacy and access control policies for step (e), except that providers
need to be able to enforce them themselves locally. The focus of this work is
on dening a scalable architecture for this model and develop mechanisms to
guarantee privacy and anonymity.
4.1
      </p>
      <p>Scalable architecture
We will outline a distributed and scalable implementation. The meaning of the
word scalable is two-fold: First, the performance of the system does not
deteriorate severely as the number of participants increases. Second, dierent levels
of privacy are supported. The infrastructure that will be presented functions
through the synergy of two peer-to-peer overlays: the Indexing overlay and the
Anonymising Overlay . The former provides a global scalable index on top of
a DHT. The latter provides a distributed and scalable mechanism to hide the
identity of peers, whenever this is required.</p>
      <p>Index-assisted query routing The indexing overlay is responsible for
providing mappings from the set of descriptors in a query to a set providers that
possibly have content that matches these descriptors. It should be able to store
a very large number of descriptors, thus a scalable implementation is required.</p>
      <p>
        Distributed Hash Tables(DHTs) are a type of structured peer-to-peer overlays
that impose a global structure on the peer connections [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Typically, each item
stored in the DHT is associated with a hash ID chosen from a large key space.
This space is partitioned in a way similar to hash tables, but instead of bins, they
have peers. This distributed data-structure is self-maintaining, self-organizing
and guarantees lookups and insertions using, in most cases, O(log(N )) messages,
where N is the number of peers in the DHT overlay [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>DHTs are well suited for implementing a large, global index, maintained by
the participants of the system. This is, in fact, straightforward: for every element
of the descriptor, peers insert in the index a hdescriptor; peer addressi pair. For
querying, seekers make one search for each of the query descriptors to retrieve
the relevant providers.</p>
      <p>The performance gain aside, using a DHT to maintain the index shifts the
responsibility from one organization to multiple organisations, which is both a
blessing and a curse: the index is no longer held in one location, meaning that
no single entity has complete control of the entire index. On the other hand, the
fact that it is partitioned means that many entities have control over parts of it.
Obviously this has consequences for privacy.</p>
      <p>Hiding the descriptors Since we do not trust the index nodes, we cannot
index the content descriptors in plaintext. We identify some methods to
reduce the information conveyed in the indexed descriptors and make them
nonunderstandable to prevent direct association of content and query descriptors to
users or locations.</p>
      <p>Obfuscation Instead of indexing the descriptor itself, we index its secure hash
(e.g. using SHA-1) using hash function h. For example, for the descriptor h Soccer i =
134:23:32:2, we will store hh( Soccer )i = 134:32:42:2. When querying, we use
the same hash function for the query descriptors. Note that secure hashes are a
one-way function, so it is very dicult to retrieve the value that was hashed to
produce a certain result.</p>
      <p>Approximation It is possible to perform a dictionary attack by calculating the
hashes of all descriptors the attacker is interested in and matching them to the
indexed descriptors. To alleviate this problem, we introduce false positives in
our results. Let P (t) be the set of providers that actually have content with
descriptor t and P 0 (t) the false positives. We dene exposure for an index entry
with descriptor t as the ratio between the number of true positives relative to the
total number of answers et = jP 0(tjP)j+(tj)Pj (t)j , a low e meaning that the providers
for t have low exposure (high privacy) and e = 1 that they are fully exposed.</p>
      <p>
        This is implemented by varying the size of the hash, so as to increase
collisions. In previous work [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we have shown that decreasing the length of the hash
leads to a proportional increase in privacy. If we truncate hash values to l bits,
any query would match a portion of r = 1=2l of all descriptors in the system.
This is directly translatable to an average exposure of
e =
1
2l
(1)
      </p>
      <p>
        For maximum privacy, l will be equal to 0, i.e. any query would match all
descriptors stored in the index and return all peers in the system. For maximum
performance, 2l should be much higher than the total number of descriptors in
the system, so as to completely avoid collisions. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]we have also shown that
using an SHA-1 secure hash we can guarantee the privacy of individual terms
for a typical webpage description corpus.
      </p>
      <p>
        Anonymising network Hiding the descriptors in the index is not enough:
identities and locations can be used to perform association attacks (see table 1
and section 5). To provide stronger security guarantees, it is desirable to hide the
identity of the peers in the network. Onion routing emerged from this general
wish [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Onion routing anonymises communication channel and protects the
identities and locations of the participants.
      </p>
      <p>
        In onion routing, a set of nodes forms an anonymising routing overlay. The
key principle, as described by Chaum[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], is that every router over which a packet
is send is only aware of the identity of the previous router and the next router.
This ensures that a single trusted router is in essence enough to protect the total
path. This is achieved using layered encryption. Starting from the destination
router, the peer will add the address of the router An to the data, and encrypt
it with the public key Kn+ 1 of the router before that in the routing chain. The
resulting package is an onion containing several layers of encrypted addresses
and data which at each level can only be read by the router with the correct
private key:
      </p>
      <p>
        When a router receives a packet, it will decrypt it with its own private key,
and use the now readable address of the next router to send the contained
(encrypted) data part to the next node. The destination will nd the original
payload when decrypting the last layer. The simplied scheme described here has
been extended with nonces, symmetric keys and other mechanisms to properly
ensure robust and scalable anonymity[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>The original onion routing proposal dened ’return envelopes’ to protect the
identity of the client from the server. However, this method does not scale well.
Next generation implementations provide other means to publish anonymously,
e.g. the Hidden Services in the TOR project, which hide the identities both of
the server and the client.</p>
      <p>We can use this mechanism to provide volatile peer identities, i.e. pseudonyms.
Instead of publishing hdescriptor; ip addressi pairs on the index, it is now
possible to publish hdescriptor; pseudonymi pairs. After a seeker has obtained a
psuedonym, it will use the anonymising network to contact the provider to
retrieve the content anonymously. This will protect the privacy of the provider by
making association of content to a real-world identity or location
impossible .</p>
      <p>In order to protect the descriptor from statistical correlation attacks, it is
possible for a provider to publish using multiple pseudonyms. We will expand
on this in section 5.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Analysis</title>
      <p>
        Here we dene four classes of settings based on dierent combinations of the
techniques discussed above.
1. Minimal A DHT is used as an index. Keys are extracted from content
and queries. They are published unmodied and associated with the (IP)
address of the information provider. This setting protects associations with
the content and the query, since only their descriptions are published or
queried.
2. Obfuscate and approximate The key is obfuscated using a secure hash
function. The level of privacy and the performance impact are inversely
proportional to the length of the hash l. This setting addtionally protects
query and content descriptors.
3. Anonymise The key is published unmodied but the identity of the
information provider and the seeker is concealed using an anonymising network.
Instead of a unique address, a pseudonym to contact the peer is published.
The level of privacy can be chosen by the provider at indexing time, in
the choice of how many terms are connected to a single service descriptor.
Compared to setting 1, this setting additionally protects associations with
identities or locations.
4. Full The key is obfuscated and an anonymising network is used. This
setting oers the highest privacy guarantees. Both the level of obfuscation and
anonymity can be chosen. This settings protects against all associations in
table 1, except for the association that some unknown user is using the
system and the association that the system is used by some location. Even
though it is not signicant privacy-wise, even these last associations could
be prevented by using a public anonymising network (e.g. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ])
Attacks We analyse the susceptibility of the aforementioned approaches to a
set of common attacks. We use table 1 as grounding to describe the consequences
of successful attacks.
      </p>
      <p>A1: Compromised index An adversary can seize control of the index. In our
approach, this is possible by compromising the DHT (how this is possible is
implementation-dependent and beyond the scope of this paper). The privacy
implications for a read-compromised index are negligible: the data in the index
was already readable for everyone in the rst place, thus no extra information
is gained by taking over one or more index nodes. All settings are equally
susceptible to compromising the DHT. For the rest of our analysis, we will assume
that the DHT is read-compromised, i.e. that the full index is readable by the
adversary.</p>
      <p>
        A2: Compromised anonymiser Anonymity is provided only while the
anonymising overlay is not compromised. Low-latency anonymisers are highly desirable
for performance reasons, however they generally are susceptible to end-to-end
communication correlation attacks [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Nevertheless, in order to perform this
attack, the adversary needs to be able to eavesdrop on a large part of the
network. Generally speaking, any compromized communication channel in setting
3 and 4 reduces the privacy for the peer(s) involved in that communication to
the same level as in setting 1 and 2 respectively. The degradation of privacy is
gradual and dependent on the power of the adversary to observe communication
channels and the size of the system. Approaches exist to reduce these risks [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
A3: Taking the intersection of the returned providers for correlated descriptors
An attacker may exploit the fact that descriptors from the same provider are
correlated. By posting several related queries and taking the intersection of the sets
of returned providers, the attacker can associate provider locations or identities
with descriptors.
      </p>
      <p>For a set of n queries with one descriptor t1:::tn each , the intersection of the
sets of peers returned by the system is:
the set P (t1) \ P (t2) ::: \ P (tn) represents all true positives for the intersection,
while the 2n 1 conjunctions represent the false positives, since each conjunction
contains at least one P 0. From previously given denition of exposure (formula
1), we have that:
e =</p>
      <p>jP (t1) \ P (t2) ::: \ P (tn)j
j[P (t1) \ P (t2) ::: \ P (tn)] [ ::: [ [P 0 (t1) \ P 0 (t2) ::: \ P 0 (tn)]j
We will consider the worst-case scenario where the adversary somehow knows
that descriptors t1:::tn are completely correlated in the index, i.e. they appear
exactly on the same providers. Thus, P (t1) = P (t2) = ::: = P (tn). In this case,
P (t1) \ P (t2) ::: \ P (tn) = P (t1) and
e =</p>
      <p>jP (t1)j
j[P (t1)] [ [P (t1) \ P 0 (t2)] [ ::: [ [P 0 (t1) \ P 0 (t2) ::: \ P 0 (tn)]j
In the denominator, all conjunctions with P (t1) will be absorbed by the
disjunction with P (t1). Thus, e = jP (t1)[[P 0(t1j)P\(Pt10()tj2):::\P 0(tn)]j . We also have that
P (t) \ P 0 (t) = , thus
e =</p>
      <p>jP (t1)j
jP (t1)j + jP 0 (t1) \ P 0 (t2) ::: \ P 0 (tn)j</p>
      <p>The attacker has no control over the false positives P 0 (t1)\P 0 (t2) :::\P 0 (tn) ;
which depend only on the properties of the dataset. Nevertheless, since A
A \ B, the size jP 0 (t1) \ P 0 (t2) ::: \ P 0 (tn)j will decrease as the number of
conjunctive terms increases. On the other hand, it becomes increasingly dicult to
nd a larger set of correlated descriptors.</p>
      <p>Note that this attack is not possible in settings 3 and 4 since the providers
are using pseudonyms. It is enough to publish each correlated descriptor under
a dierent pseudonym to leave the adversary with no additional information
about the set P (t1) \ P (t2) ::: \ P (tn). If it is not possible for providers to know
which terms are correlated, they can publish every descriptor under a dierent
pseudonym.</p>
      <p>The vulnerability of setting 2 depends on the length of the hash l (since the
ratio of true positives P (t)to false positives P 0 (t) is dependent on l).
A4: Malicious provider tries to prole seekers A compromised index may
collude with a provider to prole seekers based on their searches. Similar to the
previously described attack, but proling using queries instead of content thus
attacking the privacy of the seeker instead of the provider. Seekers may prevent
this by using several index nodes.</p>
      <p>DHT
dd.</p>
      <p>ad.</p>
      <p>r
e
s
i
m
y
n
o
n</p>
      <p>A
aa.</p>
      <p>pa.
sa.</p>
      <sec id="sec-5-1">
        <title>Providers</title>
      </sec>
      <sec id="sec-5-2">
        <title>Seekers</title>
        <p>A5: Censorship and denial of service by index nodes Though strictly speaking
not a privacy attack, censorship and denial of service are related attacks that
can be prevented by providing peer privacy.</p>
        <p>A malicious index node may censor all descriptors that correspond to a given
descriptor, by not returning any answer to queries for those. In settings 1 and
3, this is easy for any subject, since the descriptors are stored in plaintext. In
settings 2 and 4, this is more dicult: since multiple descriptors map to the same
hash values, the malicious node would have to censor all of them, since there
is no way of knowing which ones refer to the given descriptor. The number of
censored descriptors would be 1=2l N , where l is the length of the hash and N the
total number of descriptors in the system. Detecting such a malicious index node
is easy: it is enough to create some honeypot providers advertising descriptors
that are likely to be censored and match results returned by the suspect index
node.</p>
        <p>Denying services to a given seeker or provider is possible in settings 1 and 2:
a seeker could be presented with wrong or limited results, and the index could
refuse storing the descriptors for a given provider. In settings 3 and 4, this is not
possible, since providers use pseudonyms and seekers are anonymous. Assuming
that malicious nodes also deny service to particular provider pseudonyms, the
level of protection depends on the number of pseudonyms per provider.
Performance We will analyse the performance of our architecture focusing on
the trade-o between privacy and performance. Furthermore, we will focus on
its scalability in terms of network size, index size and the throughput of the
anonymiser.</p>
        <p>In gure 1, we can see the communication channels in our system. Note
that the circles represent roles fullled by a physical host. E.g. a single physical
host may be part of the DHT, part of the anonymising network and a content
provider. In fact, we will assume that the number of hosts participating in the
DHT and the anonymiser increases linearly with the number of providers and
seekers.</p>
        <p>We have the following communication channels: dd and aa for communication
among DHT nodes and anonymiser nodes respectively, ad for DHT nodes and
anonymiser nodes, sa for seekers and anonymiser nodes and pa for providers
and anonymiser nodes. The cost associated with sending one message through
a communication channel is dened as cChannelName. E.g. the cost of doing a
DHT search is cdd.</p>
        <p>Considering that DHTs and anonymiser networks can have an arbitrary
number of contact points (or entry nodes), channels ad, pa and sa will not be
congested. The probable scalability issue may lie within the DHT network
maintaining the index(dd) or the anonymiser network(aa).</p>
        <p>The communication resulting from a query with x descriptors is as follows:
The seeker will divide it to x sub-queries with one descriptor each. Each of these
will have to be routed through the anonymiser to reach the DHT, where a lookup
will take place. The DHT will return s providers through the anonymiser, with
a ratio r of false positives. The protocol the seeker will use to negotiate with the
providers is beyond the scope of our approach. We will assume that negotiation
has a cost of cn.</p>
        <p>Thus, the cost associated with the query is:</p>
        <p>cq = x (csa + caa + cad + cdd + caa + csa) + s cn (csa + cpa)
A typical DHT lookup cost, in terms of IP messages is O(cdd) = O(log(NDHT )),
where NDHT is the number of nodes in the DHT and a typical anonymiser
routing cost is O(caa) = O(1). Sending messages over channels sa, ad and pa has
a cost of one, since they are directly on top of the underlying network protocol.
From these, we can bound the total cost of a query to:</p>
        <p>O(cq) = O(x (csa + caa + cad + cdd + caa + csa) + s cn (csa + cpa)) =
x [2 O(1) + 2 O(1) + O(log(NDHT ))]+2 s cn O(1) = O(x log(NDHT )+2 s cn)</p>
        <p>From the denition of exposure and equation 1, we have that e = as = 21l ,
thus, s = a 2l where a is the average number of providers that match a query(true
positives). We consider the cost of the negotiation constant, thus O (cn) = O(1).
Then, an upper bound for query cost is:</p>
        <p>O(cq) = O(x log NDHT + a 2l+1)
where x is the number of descriptors in the query, NDHT is the number of nodes
in the DHT, a is the number of providers that match the query and l is the
length of the hash.</p>
        <p>It is interesting to note that the cost for using the anonymiser does not
increase the overall complexity for querying. Moreover, the size of the DHT,
and thus the size of the index, increase only logarithmically, indicating good
scalability. The size of the hash l also plays an important role, signifying a
trade-o between privacy and scalability. The limiting factor in our system is
the number of matching peers a, since it is expected to grow linearly with the
number of providers in the system. This makes the need for ranking in the index
evident, but we will have to leave this for future work.</p>
        <p>The cost of indexing consists of merely indexing every descriptor using the
anonymiser. Due to space restrictions, we only present the nal result:</p>
        <p>O(ci) = O(x log NDHT )
where x is the number of descriptors and NDHT the number of nodes in the
DHT. This clearly scales well.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>
        As pointed out in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], privacy control has moved well beyond settings where an
access control list or group access control would apply. Parties are no longer
known in advance and privacy policies and trust have to be calculated
on-they. In an open (Semantic) Web, information sharing consists of locating and
acquiring the data. Most privacy control approaches have focused on the latter.
Our approach focuses on the former, providing an open discovery
infrastructure where providers and queriers reveal minimal information. We have shown
that our design is scalable in terms of system size and privacy level provided
and that associations between aspects of content and identity can be protected.
Furthermore, we have explored the trade-o between privacy and performance
and analysed the message complexity for querying. A general recommendation
about which setting to choose cannot be given, since it depends entirely on the
requirements of an application and costs involved.
      </p>
      <p>Future work includes implementation of this approach as a Sesame plug-in
and integration of privacy policies.</p>
      <p>We would like to thank Eyal Oren for the fruitful discussions and pointers
and Frank van Harmelen for reviewing our work. This work is supported by the
European Commission under the LarKC project (FP7-215535).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Mayank</given-names>
            <surname>Bawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Roberto J. Bayardo</given-names>
            <surname>Jr.</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Rakesh</given-names>
            <surname>Agrawal</surname>
          </string-name>
          .
          <article-title>Privacy-preserving indexing of documents on the network</article-title>
          .
          <source>In In Proc. of the VLDB</source>
          , pages
          <fpage>922933</fpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>David</given-names>
            <surname>Chaum</surname>
          </string-name>
          .
          <article-title>Untraceable electronic mail, return addresses, and digital pseudonyms</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>4</volume>
          (
          <issue>2</issue>
          ),
          <year>February 1981</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Roger</given-names>
            <surname>Dingledine</surname>
          </string-name>
          , Nick Mathewson, and
          <string-name>
            <given-names>Paul</given-names>
            <surname>Syverson</surname>
          </string-name>
          . Tor:
          <article-title>The second-generation onion router</article-title>
          .
          <source>In In Proceedings of the 13th USENIX Security Symposium</source>
          , pages
          <fpage>303320</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>G.</given-names>
            <surname>Kan. Gnutella</surname>
          </string-name>
          . In A. Oram, editor, Peer-to-
          <source>Peer: Harnessing the Power of Disruptive Technologies</source>
          , pages
          <fpage>94122</fpage>
          .
          <source>O'Reilly and Associates</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Spyros</given-names>
            <surname>Kotoulas</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ronny</given-names>
            <surname>Siebes</surname>
          </string-name>
          .
          <article-title>Scalable discovery of private resources</article-title>
          .
          <source>In IEEE SECOVAL at SECURECOMM</source>
          , Nice, France,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Keong</given-names>
            <surname>Lua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Crowcroft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sharma</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Lim</surname>
          </string-name>
          .
          <article-title>A survey and comparison of peer-to-peer overlay network schemes</article-title>
          .
          <source>IEEE Communications Surveys &amp; Tutorials</source>
          , pages
          <fpage>7293</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Alexandre</given-names>
            <surname>Passant</surname>
          </string-name>
          , Philipp Krger, Michael Hausenblas, Daniel Olmedilla, Axel Polleres, and
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Decker</surname>
          </string-name>
          .
          <article-title>Enabling trust and privacy on the social web</article-title>
          .
          <source>In W3C Workshop on the Future of Social Networking</source>
          , Barcelona, Spain,
          <year>January 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Sergej</given-names>
            <surname>Zerr</surname>
          </string-name>
          , Elena Demidova, Daniel Olmedilla, Wolfgang Nejdl, Marianne Winslett, and
          <string-name>
            <given-names>Soumyadeb</given-names>
            <surname>Mitra</surname>
          </string-name>
          .
          <article-title>Zerber: r-condential indexing for distributed documents</article-title>
          .
          <source>In 11th International Conference on Extending Database Technology (EDBT</source>
          <year>2008</year>
          ) , volume
          <volume>261</volume>
          of ACM International Conference Proceeding Series , pages
          <fpage>287298</fpage>
          , Nantes,France,
          <year>March 2008</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Sergej</surname>
            <given-names>Zerr</given-names>
          </string-name>
          , Daniel Olmedilla, Wolfgang Nejdl, and
          <string-name>
            <given-names>Wolf</given-names>
            <surname>Siberski</surname>
          </string-name>
          . Zerber+r:
          <article-title>Top-k retrieval from a condential index</article-title>
          .
          <source>In 12th International Conference on Extending Database Technology (EDBT/ICDT</source>
          <year>2009</year>
          ) , ACM International Conference Proceeding Series, Saint-Petersburg, Russia,
          <year>March 2009</year>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>