Interlinking Distributed Social Graphs∗

                                                      Matthew Rowe
                                                      OAK Group
                                             Department of Computer Science
                                                  University of Sheffield
                                            Regent Court, 211 Portobello Street
                                            S1 4DP Sheffield, United Kingdom
                                                m.rowe@dcs.shef.ac.uk

ABSTRACT                                                         of the person, who their contacts are with useful contact
The rise in use of the social web has forced web users to        information, and an identity reference point through a URI.
duplicate their identity in fragmented information spaces.
Commonly these spaces contain rich identity representations      Work by the linked data1 community has linked together
hidden within walled garden data silos. This paper presents      existing accessible social web data sources (Flickr exporter,
work to export social graphs from such data silos as RDF         DBPedia) and published the linked content. Despite social
datasets, and provide linkage between these social graphs        web platforms and services offering advanced functionalities
according to a graph matching paradigm. Our work con-            to enhance a person’s online identity, exporting the inter-
tributes to the linked data movement by providing a de-          nal social graph is not supported therefore inhibiting ac-
centralised social graph containing linked data describing       cess to useful datasets. Such sites are described as ’data
fragmented identity components.                                  silos’, where data is hidden within a walled garden. Ex-
                                                                 ported and aggregated social graphs from different services
Categories and Subject Descriptors                               could be used when signing up to a new web service to im-
H.4 [Information Systems Applications]: General                  port the user’s existing social network, trust networks could
                                                                 be created based on transitive relationships in the social
                                                                 network, and recommendation systems could retrieve sug-
General Terms                                                    gestions based on the imported social graph (based on the
Linked Data                                                      actions of social network members).

Keywords                                                         This aggregation also creates a dentralised description of a
Semantic Web, Social Web, Linked Data, RDF, FOAF                 person’s online identity. The aggregation contains linked
                                                                 data which can be referenced for additional identity infor-
1.   INTRODUCTION                                                mation from distributed data silos. Thus exporting informa-
                                                                 tion from closed data services and opening up information
The social web has allowed web users to replicate their of-
                                                                 for reuse [7].
fline lives and actions in an online environment. Sharing
photos, messaging friends and networking to build relation-
                                                                 In this paper we present our contribution to the linked data
ships both socially and professionally has drawn users into
                                                                 project by describing an approach to export social data in
using social web platforms to organise their lives. This has
                                                                 a semantic graph format using RDF from a range of social
brought about the generation of rich social data attached to
                                                                 web services, and aggregate the generated RDF by providing
individual web users, describing their online identity. Prop-
                                                                 links between the exported datasets. The former component
erties of this identiy include biographical information; name,
                                                                 relies on mapping XML schemas to the appropriate ontology,
address, date of birth, and social infrormation; relation-
                                                                 whereas the latter is a graph matching problem providing
ships, contacts and affiliations. The modern web user has
                                                                 links between person instances in separate graphs that are
a fragmented identity distributed over multiples platforms
                                                                 found to refer to the real world person or entity.
and services. Each service contains a different social graph,
aggregating each graph would generate a clear description
                                                                 Deciding when a link should be created is an issue due to
∗Copyright is held by the author/owner(s).                       the non-existence of global unique identifiers for people, us-
LDOW2009, April 20, 2009, Madrid, Spain.                         ing simply the name of a person is problematic due to name
                                                                 ambiguity. Therefore we utilise as much additional seman-
                                                                 tic information as possible to aid the linking process. We
                                                                 present our own approach to matching person instances us-
                                                                 ing low-level reasoning, and evaluate its success against two
                                                                 formal graph matching functions.


                                                                 2.     SOCIAL WEB LINKED DATA INTIATIVES
                                                                 1
                                                                     http://linkeddata.org/
Representing social data found within social web services         must fulfill:
and platforms has been investigated by the SIOC (Semantically-
Interlinked Online Communities) project2 . Work by Bojars
et al presented in [1] demonstrates how social web platforms           1. Export social data contained within data silos into the
such as blogging services and social bookmarking platforms                same semantic form.
can express their content using the SIOC specification, pro-
viding semantic formalisations for authors, and their gener-           2. Link person instances from separate social networks
ated content. SIOC extends the FOAF (Friend of a Friend)                  referring to the same real world person.
specification [3] designed to describe personal data consist-          3. Maximise the number of correct links while minimising
ing of biographical information and express relationships                 the number of incorrect links.
with other people in a social network. FOAF is well suited
to our work, as each social web platform is built on the social        4. Publish a decentralised linked social graph.
model of making relationships and linking people together.
                                                                  4.     SOCIAL GRAPH EXPORTATION
Although such formalisations exist it is essential to export      Exporting social graphs from walled garden data silos com-
information from data silos into the suitable format. Work        monly involves the trivial task of mapping XML schemas
by Alexandre Passant presented in [9] describes how infor-        offered by the web service to a semantic specification. We
mation is exported from Flickr3 , the photo sharing platform,     extend our previous work in [11] by incorporating OpenID11
as RDF using the SIOC and FOAF specifications. Similarly          to address person resolution, and enable information link-
work has also been carried out to produce an exporter4 of         age. At a low-level this involves the requirement of an
RDF using FOAF from the microblogging site Twitter5 , al-         OpenID resource for the social graph owner, which is then
though the exportation of information is somewhat limited         assigned to the foaf:Person instance in the graph using the
by not extracting any geographical or additional biograph-        foaf:openid relation. For exporting social graphs, we use
ical information for social network members. Recent work          the FOAF specification to describe available social informa-
by QDos6 has created a FOAF Builder application7 capable          tion in a semantic format. In the remainder of this section
of providing linked data representations between a person’s       we present several exporters developed to extract RDF from
interests with the DBPedia dataset. However, in the context       several web platforms12 .
of this paper it fails to aggregate social graphs from data si-
los, instead pointing links to the existence of accounts within
those silos for a given person.                                   4.1      Social Networking Sites
                                                                  We have created social graph exporters for two social net-
Social data extraction was described by Halpin in [6] as ex-      working sites. The first exporter extracts social data from
porting information into RDF using GRDDL8 from XHTML,             Facebook by mapping the returned XML response to the
providing social markup existed. The markup had to be             FOAF ontology thus capturing the identity information, and
mapped to a well defined vocabulary such as FOAF or SIOC          the social network consisting of instances of foaf:Person
depending on the information used. I.e., the XFN Microfor-        linked by the foaf:knows property to provide relationship
mat9 would be mapped to FOAF. In our previous work [11]           ties. The FOAF ontology is well suited to capturing social
we describe our methodology for exporting social data from        information due its extensive expression and definition of
the social networking site Facebook10 , producing RDF ac-         identity information. Thus we consider XML schemas used
cording to the FOAF specification. We extend this work in         by social web services to be subsets of the FOAF specifica-
the context of this paper.                                        tion. We found that there were no properties that we could
                                                                  not map from the XML schema to concepts from the FOAF
                                                                  ontology.
3.   REQUIREMENTS
Our approach to aggregating social graphs is split into two       Geographical information including city and country is form-
distinct stages: The first stage generates RDF using the nec-     liased as an instance of geo:Feature by assigning the per-
essary ontologies from various social web services, and the       son’s city and country to the geo:name and geo:inCountry
second stage then aggregates these social graphs. RDF pro-        properties from the Geonames ontology13 . We chose the
vides a useful formalisation to describe information within       Geonames ontology due to its adoption by the Semantic
each data silo, capturing biographical information and social     Web community as a standard for describing geographical
network information. Where possible our approach must             concepts. Each social network member is assigned a URI
provide links to social network members in separate net-          using the user identification number from the service, un-
works that are the same person. Based on these function-          fortunately the exportation of email addresses and web sites
alities we have defined four requirements that our approach       is not allowed which would serve as a useful dereferencing
                                                                  point. The following is a snippet of the RDF exported from
2
   http://sioc-project.org/                                       Facebook.
3
   http://www.flickr.com
 4
   http://tools.opiumfield.com/twitter/mattroweshow/rdf
 5
   http://www.twitter.com                                         <foaf:Person rdf:ID="#me">
 6                                                                  <foaf:name>Matthew Rowe</foaf:name>
   http://qdos.com/
 7                                                                  <foaf:givenname>Matthew</foaf:givenname>
   http://foafbuilder.qdos.com/
 8                                                                11
   http://www.w3.org/2004/01/rdxh/spec                               http://openid.net
 9                                                                12
   http://www.gmpg.org/xfn/                                          http://ext.dcs.shef.ac.uk/˜u0057/SocialGraphAggregator/
10                                                                13
   http://www.facebook.com                                           http://www.geonames.org/ontology/
  <foaf:family_name>Rowe</foaf:family_name>
  <foaf:openid rdf:resource="http://matthewroweshow.blogspot.com/"/>
  <foaf:knows>
    <foaf:Person rdf:about="#617555567">
      <foaf:name>Sam Chapman</foaf:name>
      <foaf:givenname>Sam</foaf:givenname>
      <foaf:family_name>Chapman</foaf:family_name>
      <foaf:img
        rdf:resource="http://profile.ak.facebook.com/617555567.jpg"/>
      <foaf:holdsAccount>
        <foaf:OnlineAccount>
          <foaf:accountServiceHomepage
            rdf:resource="http://www.facebook.com/" />
          <foaf:accountName>617555567</foaf:accountName>
        </foaf:OnlineAccount>
      </foaf:holdsAccount>
      <foaf:based_near>
        <geo:Feature>
          <geo:name>Sheffield</geo:name>
          <geo:inCountry>United Kingdom</geo:inCountry>
        </geo:Feature>
      </foaf:based_near>
    </foaf:Person>
  </foaf:knows>
  <foaf:knows>
    ...
  </foaf:knows>
</foaf:Person>
                                                                                      Figure 1: Twitter Social Graph


We applied the same approach when exporting social graphs               Following the exportation of social graphs from their host-
from the social networking site MySpace14 . As MySpace                  ing service, it is essential to enrich the graph where pos-
follows the OpenSocial15 specification, our exportation tool            sible. The first stage of the exportation process only cre-
can be adapted to easily extract social graph information               ates a geographical reference using the place name which
from several other social networking platforms supporting               can cause problems with ambguity. Therefore we enrich
the same specification (Bebo, Orkut, Hi5). As the OpenSo-               this representation by resolving the place name with unique
cial specification and MySpace’s data accessibility are still in        identifiers. To perform this process we query the Geonames
development at the time of writing this paper and conduct-              Web Service16 (which accesses the Geonames dataset) using
ing our work, we were unable to extract rich social network             the geo:name and geo:inCountry properties assigned to the
data. This reduced the exportation process to only the per-             geo:Feature instance for each social network member. This
son name of each social network member described using                  returns a list of possible URIs for the location. We select
foaf:name and the user identification number from the site              the most relevant URI from the list and then assign this to
which was employed as a URI, each is assigned to an in-                 the geo:Feature instance, and the latitude and longitude
stance of foaf:Person which we bind to the social graph                 of the location which are assigned using the geol:lat and
owner using the foaf:knows property. The ability to only                geol:long properties from the Geolocations vocabulary [2].
extract person names and a URI meant that we were unable                This additional enrichment offers extra information for per-
to export geographical information.                                     son resolution in the graph space, essential for the following
                                                                        social graph aggregation procedures.
4.2       Microblogging Platforms
For exporting RDF from the micro-blogging site Twitter we               5.      SOCIAL GRAPH AGGREGATION
followed the same methodology as the exportation of social              Aggregating social graphs identifies matching foaf:Person
graphs from the previous social networking sites. Twitter               instances in separate graphs and provides links between these
also offers an XML response (without the required authen-               instances using the owl:sameAs property. The main chal-
tication) enabling a mapping to be made between the XML                 lenge we face is deciding if two instances of foaf:Person
schema to concepts from the FOAF ontology. As Twitter al-               with the same name in different social networks refer to the
lows access to geographical information we model this data              same real world person or entity. To make this decision
as an instance of geo:Feature, and assign the city and coun-            we formalise a graph from each instance of foaf:Person
try to the geo:name and geo:inCountry properties respec-                and all outgoing properties and relations in the social net-
tively. Each social network member is described using an                work: This graph contains biographical information about
instance of foaf:Person, and the geo:Feature instance is                the person, which can then be compared using graph match-
assigned using the foaf:based_near property. We also used               ing techniques to derive a similarity measure and therefore
the display name used by each member of the social net-                 the possibility of a match.
work as their URI. Figure 1 shows an example social graph
exported from Twitter.                                                  Formally we define the person graph as G = hV, Ei where
                                                                        V denotes all the nodes within the graph represented as
4.3       Graph Enrichment                                              resources and literals extracted from the foaf:Person in-
                                                                        stance in the social network, and E denotes the edges con-
14
     http://www.myspace.com
15                                                                      16
     http://code.google.com/apis/opensocial/                                 http://www.geonames.org/export/
necting those nodes represented as semantic relations and                 between music datasets for artists, and records. In the con-
properties. Therefore the social network found within a                   text of our work we follow a similar line of application by
given FOAF file contains a set of person graphs, each de-                 attempting to provide links between members of different
scribing a different real world entity. Imagine we have two               social networks, essentially held in separate datasets. There-
FOAF files F1 and F2 where g ∈ F1 and h ∈ F2 describes                    fore we can adapt this approach for our work by comparing
all the graphs within F1 and F2 respectively, and a similiar-             person graphs, and deriving matches based on the best pos-
ity function sim(g, h) that measures the similarity of the two            sible node mappings according to the cumulative similarity
graphs: Our task is to find the graphs in both F1 and F2 that             score: This score like node/edge overlap must exceed a pre-
achieves the maximum similarity measure whilst exceeding                  defined threshold for a match to be valid.
a given threshold, and therefore provide a match. This re-
duces the aggregation problem to a graph matching task; to                5.3     Graph Reasoning
investigate this procedure we now present three alternative               Due to the semantic structure of the graphs we are compar-
methods for computing graph similarity. We evaluate and                   ing it is possible to utilise semantic metadata to detect a
compare the success of each method later.                                 positive match using some basic low-level reasoning: Imag-
                                                                          ine we have two graphs that we wish to compare: We ex-
5.1    Node/Edge Overlap                                                  tract the string literal from both graphs connected from
Graph matching using node and edge overlap as described                   the foaf:Person instance using the foaf:name property,
in [8] utilises the Jaccard distance [4] between two graphs to            thereby returning the person name of each graph. We com-
derive a similarity measure, the intuition behind this method             pare the names using the Levenstein string metric to derive
being that the fewer edits required to transform one graph                a match. If the names match we then move on to compar-
into another, then the more similar the graphs are. Essen-                ing other properties from each graph to confirm that the
tially this method counts the number of edit operations to                foaf:Person instances are in fact referring to the same real
perform the transform, which are then normalised by sum-                  world object.
ming the node and edge counts from each graph. Therefore
the edit distance is described as:                                        5.3.1    Unique Identifiers
                                                                          Unique identifiers, where available, can be exported from
                                    |Vg ∪ Vh | + |Eg ∪ Eh |
                                                                          social web services and defined using the foaf:homepage,
              sim(g, h) = 1 − 2                                           foaf:mbox and foaf:phone properties for the website, email
                                  |Vg | + |Vh | + |Eg | + |Eh |
                                                                          address and telephone number respectively within the social
                                                                          graph. We find the edges in each graph that point to such
When generating the intersection of the node set from g                   unique identifiers, and compare them. Our intuition is that
and h we used the Levenstein string similarity measure [5]                a match would provide suffficient confidence to confirm the
to derive term similarity. If the similarity measure is above             link between foaf:Person instances in both social networks.
a predefined threshold then the nodes are classed as equiva-              However, should an edge only exist in one graph there is not
lent. String matching is not required for finding the intersec-           sufficient knowledge to confirm the link, therefore we move
tion between the edge sets due to their semantic types being              on to analysing further semantic information defined in the
formally defined, this reduces the comparison to a trivial                graph space.
matter of binary comparison of objects. It is important to
note that for consistency in our work, we used the same Lev-
enstein string similarity measure when comparing literals in
                                                                          5.3.2    Geographical Reasoning
each graph matching method.                                               When deciding if two people from different social networks
                                                                          refer to the same real world entity we rely on geograph-
                                                                          ical information as another useful information source for
5.2    Node Mapping                                                       reaching a match decision. We follow the intuition that the
Work by [10] provides a method to match graphs by provid-                 owner of several social networks would not be friends with
ing possible mappings between nodes in each graph. In a                   two or more people who share the same name and live in
similar approach to our work, semantic triples are modeled                the same place. For example, a person named ”Matthew
as edges in a graph describing a link between an object and               Rowe” is friends with ”Sam Chapman” on Facebook, and
a subject by a predicate taking the form (s, p, o). This ap-              friends with ”Sam Chapman” on Twitter. It is likely that
proach derives similarity measures between every possible                 ”Sam Chapman” refers to the same person in each social net-
combination of object nodes (with an outgoing edge) and                   work. On Facebook ”Sam Chapman” is described as living in
also between every combination of subject nodes (with an                  ”Sheffield” and on Twitter ”Sam Chapman” is also described
incoming edge) in separate graphs. Thus creating a list of                as living in ”Sheffield”. Therefore we believe such additional
all possible node mapping combinations between two graphs                 information is sufficient to confirm that both instances of
along with the similarity measure of the two nodes. The set               ”Sam Chapman” are the same person, and should therefore
of node mappings between two graphs is chosen that max-                   be linked.
imises the cumulative similarity score. Therefore the simi-
larity between two graphs is defined as:                                  We extract the geo:Feature class attached to each instance
                                                                          of foaf:Person by the foaf:based_near property. In the
              Pn                 i    .      Pn               j    .
                                                                          previous section, we added additional geographical edges to
                 i=1 max(strsim(sg , sh )) +  j=1 max(strsim(og , oh ))   the geo:Feature instance to describe the latitude and lon-
sim(g, h) =
                                       #mappings
                                                                          gitude of the location, together with a URI obtained from
                                                                          the geonames web service. Given that we wish to com-
The application of the approach in [10] is to detect links                pare instances of foaf:Person sharing the same name, we
first compare the location URI assigned to the geo:Feature               return match
class. Should the URIs match then we confirm that both
foaf:Person instances refer to the same real world entity.            Get location URI geoidg from g assigned to geo:Feature
                                                                      Get location URI geoidh from h assigned to geo:Feature
However, if the URIs are different we compare the geograph-           If geoidg = geoidh
ical proximity of the locations. Our reasoning behind this               return match
comparison is that people will divulge more sensitive infor-          Else If geoidg = null and geoidh = null
mation in different social networks, for example in a walled             return maybematch
garden social networking site the user feels safer, averting          Else
prying eyes and would therefore state what suburb they re-               Get city name cg from g using geo:name
side in. Conversely, on a micro-blogging platform the user               Get city name ch from h using geo:name
may only define the city they reside in. One method is to                return checkSuburb(cg , ch )
derive the geographical distance using the latitude and lon-        Else
gitude described by the geo:lat and geo:long properties,              return nomatch
and calculate the distance between the two points using the
Haversine formula from [13]. Should the derived distance be
less than a predefined threshold then the person instances         checkSuburb(cg , ch ) :
are deemed to be the same. However, following several ex-            Get districts of city label lg for cg
periments we found such a method to be prone to making               If strsim(lg , ch )
incorrect matches in rural areas. Therefore decided to use              return match
the semantics of the location to better effect:                      Else
                                                                        return maybematch
We analyse the place names to derive a relation between
the locations and discover the semantics of that relation, if
one exists. For example, a person named “Matthew Rowe”
may reside in “Crookes” whereas another person, also named         As we demonstrate above, the sim() function returns three
“Matthew Rowe” may reside in “Sheffield”. We derive the            classes of match: match, maybematch, and nomatch. If the
locality of Crookes to analyse if there exists a relation to       foaf:name and geo:Feature URI match in each graph then
Sheffield. To do this we query DBPedia17 using the following       we are fairly confident that the foaf:Person instances refer
SPARQL query:                                                      to the same real world entity, we also believe that the same
                                                                   level of confidence should be attached to matching a suburb
                                                                   using the the checksuburb() function. In both cases we re-
SELECT ?city WHERE {                                               turn match. However, we are less confident when location
  <http://dbpedia.org/resource/Crookes>
    <http://www.w3.org/2004/02/skos/core#subject>
                                                                   infomation is not available for either person, likewise if we
      ?districts .                                                 cannot discover if either foaf:Person instance’s location is
  ?districts                                                       somehow related, therefore we only return a maybematch.
    <http://www.w3.org/2004/02/skos/core#prefLabel>
      ?city
                                                                   Only when the foaf:name properties do not match are we
}                                                                  confident that the foaf:Person instances refer to different
                                                                   people.

This query returns the literal “Districts of Sheffield” describ-
ing the label for the DBPedia category containing all dis-
tricts in Sheffield. The geo:name property can then be matched
                                                                   6.   PRODUCING LINKED DATA
                                                                   Following the previous section we now have a set of matched
against this literal to confirm that the location “Crookes” is
                                                                   graphs where each graph refers to the same real world entity
a district of “Sheffield” and therefore the graphs should be
                                                                   or person. We produce links between these representations
linked as they refer to the same real world entity. We define
                                                                   in the form of a new RDF graph describing the aggregated
the similarity function using the following pseudocode:
                                                                   social graph content. We do not wish to duplicate infor-
                                                                   mation contained within each exported social graph, but
sim(g, h) :                                                        instead provide links to this information for later reuse (we
  Get person name ng from g using foaf:name                        explain our reasoning behind this decision in the preced-
  Get person name nh from h using foaf:name                        ing subsection). For biographical information we aggregate
  If strsim(ng , nh ) > threshold                                  all available properties from each social graph to generate a
                                                                   complete identity representation. For example the Facebook
       Get mboxg from g using foaf:mbox                            social graph contains identity information such as name,
       Get mboxh from h using foaf:mbox                            and data of birth, whereas the Twitter and MySpace so-
       If mboxg =mboxh                                             cial graphs contain the homepage, aggregating this contain
          return match                                             defragments this person identity to generate a complete pro-
                                                                   file.
       Get homepageg from g using foaf:homepage
       Get homepageh from h using foaf:homepage                    A new social network is created containing the aggrega-
       If homepageg =homepageh                                     tion of individual social networks from each social graph,
                                                                   matched instances of foaf:Person are merged to create a
17
     http://dbpedia.org                                            new instance as follows:
                                                                  latter option being particularly well suited to this setting
                                                                  where users are allowed access to a social graph depending
                                                                  on their authenticated FOAF file containing a trusted per-
                                                                  son that matches the requested social graph.

                                                                  Another motivation behind this design decision was for the
                                                                  allowance of extensibility and addition of new triples into the
                                                                  aggregated social graph. If we imagine that a user has cre-
                                                                  ated their aggregated social graph containing exported social
                                                                  data from Facebook and Twitter, meanwhile they have also
                                                                  recently built up a rich social graph of contacts and informa-
                                                                  tion on the professional networking site LinkedIn20 . Linking
                                                                  this new social graph to the existing aggregated graph only
                                                                  requires the addition of new triples to the exisiting sctruc-
                                                                  ture, which in essence would be either a new foaf:Person
                                                                  instance linked by the foaf:knows property for a new person
                                                                  in the aggregated graph space, or a new owl:sameAs relation
               Figure 2: Linked Social Graphs                     pointing to the foaf:Person instance in the exported social
                                                                  graph from LinkedIn.

<foaf:knows>
  <foaf:Person rdf:about="#samchapman">                           7. EXPERIMENTS
    <foaf:name>Sam Chapman</foaf:name>
    <owl:sameAs
                                                                  7.1 Data Sets
      rdf:about="http://namespace.com/fb.rdf#617555567"/>         In order to evaluate and compare the success of our graph
    <owl:sameAs                                                   matching methodology against the two alternative methods
      rdf:about="http://namespace.com/twitter.rdf#samchapman"/>   we generated three files containing valid RDF according to
  </foaf:Person>
</foaf:knows>                                                     the FOAF ontology. Each file was obtained from a different
                                                                  web service (Facebook, MySpace, and Twitter) using the
                                                                  exportation methodology described in section 4, and each
For this instance we only include the foaf:name property          file holds information related to one web user who has an
and an identifier. We use hash values for identifiers accord-     account with each social web service.
ing to guidelines described in [12] due to the relatively small
size of the datasets being considered for linkage. Where          We analyse the success of the three matching methods when
possible we reuse the identifiers from the available social       attempting to match instances of foaf:Person in the dif-
graphs. As figure 2 demonstrates we reuse the identifier          ferent datasets by evaluating for type I errors (false posi-
from the Twitter social graph for the merged foaf:Person          tives) and type II errors (false negatives). Positives indicate
instances. For foaf:Person instances that contain no ag-          a match and therefore where the datasets should be inter-
gregated content (I.e., only appear in the Facebook social        linked for that concept, negatives indicate where a match
graph), we simply reuse the identifier from the accompany-        should not take place and therefore no linked data should
ing social graph (eg. User identification number). For each       exist. The optimum method should produce neither of these
merged instance of foaf:Person we include a reference us-         error types.
ing the owl:sameAs property to the resource containing the
instance.                                                         Figure 3 demonstrates how each individual dataset used in
                                                                  the experiment contains possible overlaps which should be
6.1       Social Graph Control                                    linked together. These overlaps consist of social network
By providing linked data representations for each instance of     members from each dataset who are the same person. For
foaf:Person in the aggregated social graph we attempt to          example, ”Sam Chapman” appears in the Facebook dataset,
minimise the duplication of personal information while offer-     and ”Sam Chapman” also appears in the Twitter dataset,
ing decentralisation through linked data. This minimisation       therefore a decision must be made whether a link could be
is an essential component of the defragmentation of iden-         established. The lack of an intersection between the Twitter
tity information. It also passes responsibility for data access   and MySpace datasets is due to the fragmentation of identity
to separate locations and therefore separate access policies.     information in each data silo. The user in question whose in-
For example, if a person may wish for their Facebook so-          formation we extracted from each service, used Twitter for
cial graph to remain separate and only accessible by people       professional purposes, Facebook for both professional and
they know and trust then access responsibility is delegated       social, and MySpace for music and social purposes. There-
to the hosting service. Work by Yeung et al presented in [7]      fore the music and professional elements should not overlap,
presents a case for the control of social data using trusted      thus separating the Twitter and Myspace datasets.
hosting services, thereby delegating access responsibility to
the hosting party. Recent advancements in access to so-           7.2       Results
cial data now allows authentication to be controlled by such      Tables 1 and 2 show the results obtained from our analysis
services such as OAuth18 and recently FOAF+SSL19 . The            together with the gold standard indicated in the final col-
18                                                                umn. As we can see from Table 1 node/edge overlap returns
     http://oauth.net/
19                                                                20
     http://esw.w3.org/topic/foaf+ssl                                  http://www.linkedin.com
                                                                                N’/E’ Overlap   N’ Mapping   G’ Reasoning   GS’
                                                                    True Pos          2              10           5          10
                                                                    True Neg         662            555          660        662
                                                                    False Pos         0             107           2           0
                                                                    False Neg         8               0           5           0

                                                               Table 2:         Matching the Facebook and MySpace
                                                               Datasets


                                                               ments none of the three graph matching methods produced
                                                               links between these datasets, therefore we decided to omit
             Figure 3: Social Graph Overlap
                                                               the results as there was nothing to present.
             N’/E’ Overlap   N’ Mapping   G’ Reasoning   GS’
 True Pos          7              8            11        11    8.       CONCLUSION AND FUTURE WORK
 True Neg         387            388          389        389
                                                               This paper presents our work investigating the exportation
 False Pos         2              1            0          0
 False Neg         4              3            0          0    of social data described using semantic ontologies, and the
                                                               linking of this data where possible. Comparing the out-
Table 1:      Matching the Facebook and Twitter                come of this work with the previously detailed requirements
Datasets                                                       it is clear that social data has been exported from data
                                                               silos in a semantic form, and person instances from sepa-
                                                               rate social networks are linked together where possible. Our
the poorest results by correctly matching the fewest per-      method to perform low-level reasoning when matching per-
son graphs and also incorrectly matching the most person       son graphs yields good results in comparison with similar
graphs. Node mapping performs well but still falsely classi-   methods by maximising the number of correctly matched
fies 3 instances of foaf:Person as being no matches. Graph     person instances and minimising the number of incorrect
Reasoning outperforms both previous methods by produc-         matches.
ing the most correct links between the person graphs. The
reason for this method’s outperformance is due to the large    The produced RDF containing linked data describes links
number of triples available in each person graph. Both the     between matched person instances. Our decision not to ag-
Facebook and Twitter datasets contain rich social data that    gregate all biographical information for each social network
can be exported from each service, namely geographical in-     member is due to the privacy policies that we believe social
formation that can be used to classify positive matches.       data must adhere to. Instead, links to the existence of this
In the experiments we permitted links to be created for        data are provided. The data contained within the exported
foaf:Person concepts that returned a maybematch when           social graph data can then be controlled by a separate access
performing graph reasoning.                                    policy. This fits in nicely with recent work to address pri-
                                                               vacy and trust within the Semantic Web community, where
The matching of graphs within the Facebook and MySpace         technologies such as FOAF+SSL (to control social graph ac-
datasets yields interesting results. As Table 2 demonstrates   cess) and POWDER21 (to describe social graph properties)
node/edge overlap performed poorly by only finding 2 in-       are being adopted.
stances of foaf:Person that intersected the datasets. Node
mapping derived 10 positive links between foaf:Person in-      Our future work will include additional social graph expor-
stances, but incorrectly created 107 links. Graph Reason-      tation tools for other web services, and also the release of the
ing did not classify as many correct links as node mapping     aggregation service. We also plan to test our graph matching
yet only falsely generated two links between the datasets.     approach on additional larger datasets, and hope that exist-
The results from this part of the evaluation throw up some     ing XML schemas expand to allow additional information to
interesting points: When we consider that we wish to min-      be exported.
imise the number of false links, then the most naive method;
node/edge overlap is the most reliable, however this pro-
cedure creates very few correct links. The reason for the
                                                               9.       ACKNOWLEDGEMENTS
                                                               We would like to thank both Harry Halpin and Sam Chap-
poor performance of node mapping (generated 107 incorrect
                                                               man for allowing their information to be presented in this
links) and graph reasoning (only generating 5 correct links)
                                                               paper as examples. And also Neil Ireson for his meticulous
is due to the triples available in each dataset. Unlike the
                                                               proof reading expertise.
Twitter dataset, the MySpace dataset only contains a name
property for each instance of foaf:Person. This means that
the graph matching task has very few triples and therefore a   10.       REFERENCES
limited graph structure to perform low-level reasoning with,        [1] U. Bojars, A. Passant, R. Cyganiak, and J. Breslin.
and in the case of node mapping must rely heavily on the                Weaving SIOC into the Web of Linked Data. In
string similarity metric to derive the mapping which as the             Proceedings of the WWW 2008 Workshop Linked Data
results demonstrates is unreliable.                                     on the Web (LDOW2008), Beijing, China, Apr 2008.
                                                                    [2] D. Brickley. Basic geo (wgs84 lat/long) vocabulary,
As mentioned previously when discussing the datasets for                January 2003.
use during the experiment, the Twtiter and MySpace datasets
                                                               21
do not contain any overlap. When performing the experi-             http://www.w3.org/TR/2008/WD-powder-dr-20081114/
 [3] D. Brickley and L. Miller. FOAF vocabulary
     specification. Technical report, FOAF project, May
     2007. Published online on May 24th, 2007 at.
 [4] H. Bunke, P. Dickinson, M. Kraetzl, and W. D.
     Wallis. A Graph-Theoretic Approach to Enterprise
     Network Dynamics. Birkhauser, 2006.
 [5] S. Chapman, B. Norton, and F. Ciravegna. Armadillo:
     Integrating knowledge for the semantic web. In
     Proceedings of the Dagstuhl Seminar in Machine
     Learning for the Semantic Web, February 2005.
 [6] H. Halpin. Social semantic mashups: Exploring
     networks using microformats and grddl. In Proceedings
     of XML Conference, 2006.
 [7] C. man Au Yeung, I. Liccardi, K. Lu, O. Seneviratne,
     and T. Berners-Lee. Decentralization: The future of
     online social networking. In W3C Workshop on the
     Future of Social Networking Position Papers, 2009.
 [8] P. Papadimitriou, A. Dasdan, and H. Garcia-Molina.
     Web graph similarity for anomaly detection (poster).
     In WWW ’08: Proceeding of the 17th international
     conference on World Wide Web, pages 1167–1168.
     ACM, 2008.
 [9] A. Passant. :me owl:sameAs flickr:33669349@N00. In
     Proceedings of the WWW 2008 Workshop Linked Data
     on the Web (LDOW2008), Beijing, China, Apr 2008.
[10] Y. Raimond, C. Sutton, and M. Sandler. Automatic
     interlinking of music datasets on the semantic web. In
     Proceedings of the WWW 2008 Workshop Linked Data
     on the Web (LDOW2008), Beijing, China, Apr 2008.
[11] M. Rowe and F. Ciravegna. Getting to me: Exporting
     semantic social network from facebook. In Proceedings
     of the ISWC 2008 Workshop Social Data on the Web
     (SDOW2008), October 2008.
[12] L. Sauermann, R. Cyganiak, and M. Voelkel. Cool uris
     for the semantic web. Technical report, Deutsches
     Forschungszentrum fuer Kuenstliche Intelligenz
     GmbH, 2007.
[13] R. W. Sinott. Virtues of the haversine, 1984.