=Paper=
{{Paper
|id=None
|storemode=property
|title=Inferring Web Citations using Social Data and SPARQL Rules
|pdfUrl=https://ceur-ws.org/Vol-595/paper1.pdf
|volume=Vol-595
}}
==Inferring Web Citations using Social Data and SPARQL Rules==
Matthew Rowe
OAK Group, Department of Computer Science, University of Sheffield, Regent Court,
211 Portobello Street, S1 4DP Sheffield, United Kingdom
m.rowe@dcs.shef.ac.uk
Abstract. Web users who disseminate their personal information risk
falling victim to malevolent web practices such as lateral surveillance
and identity theft. To avoid such practices, web users must manually
search for web resources which may cite them and then perform analyses
to decide which resources do cite them; a time-consuming and frequently
repeated practice. This paper presents an automated rule-based technique
to identify web citations, intended to alleviate this manual process. Seed
data is leveraged from user profiles on various Social Web platforms, and
from this seed data SPARQL rules are constructed and
applied to infer web citations. An evaluation of this technique against
humans performing the same task shows higher levels of precision.
1 Introduction
A large proportion of Web platforms now encourage web users to become in-
volved and participate online through the provision of interactive functionalities.
Blogging platforms such as Live Journal allow web users who lack the technical
expertise to build their own web page to publish their thoughts and musings in
an online space, and news services such as the BBC now allow web users to be-
come involved by sharing content and joining topical discussions. The
increased involvement of web users has led to a reduction in online privacy and
an increase in the publication of personal information on the World Wide Web.
The sensitive nature of this information has in turn led to a rise in malevolent
web practices such as lateral surveillance [1] and identity theft - which currently
costs the UK economy £1.2 billion per annum1. To avoid falling victim to such practices
web users are forced to manually search for web resources (i.e. web pages) which
may cite them and identify which web resources do refer to them.
Manually performing the task of identifying web citations is time consum-
ing, laborious and information intensive - given the ever increasing amount of
information which is published on the Web. Furthermore the rate at which the
Web grows requires this process to be repeated frequently so that new web ci-
tations can be identified and the correct action taken (i.e. removing sensitive
information). This presents a clear motivation for the application of automated
1 http://www.identitytheft.org.uk/faqs.asp
techniques to identify web citations. To be effective, such techniques must be
provided with seed data which describes the identity of the person whose web ci-
tations are to be found. However producing this seed data manually is expensive
- given the rich identity descriptions which are required to enhance the ability
of the technique to detect web citations.
This paper presents an approach to identify web citations using a rule-based
technique. Social graphs are exported from user profiles built on individual Social
Web platforms using the Resource Description Framework (RDF) and the Friend
of a Friend (FOAF) 2 and GeoNames3 ontologies to describe the semantics of the
exported data. Seed data is built by combining several social graphs together,
from distributed Social Web platforms, into a single RDF model. SPARQL rules
[6] are then constructed from RDF instances within the seed data using a tech-
nique inspired by existing general-to-specific rule induction techniques from the
state of the art. Web resources are gathered and the SPARQL rules are applied
to the resources in order to infer web citations by matching information within
the seed data - which describes a person’s identity on the Social Web - with
a given web resource. One of the governing intuitions behind this approach is
that a given person will appear in web resources with other individuals that
he/she knows - this is reflected in similar identity disambiguation literature [4,
5]. User profiles from Social Web platforms provide the necessary data to
support the application of this intuition. We present a thorough evaluation
of this technique and compare its performance against several baseline measures
including human processing.
We have structured the paper as follows: section 2 discusses related work,
explaining current rule induction techniques and rule-based classification ap-
proaches. Section 3 presents the approach to detecting web citations by describ-
ing the generation of seed data, the construction of SPARQL rules and their
application. Section 4 presents an evaluation of the technique against humans
performing the same task. Section 5 describes the conclusions and plans for
future work within this area.
2 Related Work
Identifying web citations has been attempted in [2] using background knowledge
provided as a list of names representing the social network of a person; possible
web citations are then gathered by searching the Web and clustered, based on
their link structures, as either citing the person or not. Similar work by [12] gath-
ers and labels web pages as either containing an entity reference or not: seed data
is provided as labelled training examples from which a machine learning classifier
is learnt and applied to unlabelled web pages. Several commercial services have
been proposed to identify web citations in order to tackle identity theft. For example
Garlik’s Data Patrol4 service searches across various data sources for personal
2 http://xmlns.com/foaf/0.1/
3 http://www.geonames.org/ontology/
4 http://www.garlik.com/dpindividuals.php
information about the individual and uses information provided when the user signs up
to the service to aid the process (including biographical information such as the
name and address of the person). The Trackur5 service monitors social media
on the Web for references and citations with the intended goal of monitoring the
reputation of an individual. Similar to Data Patrol, Trackur requires background
information about the person whose reputation is to be monitored.
Rules provide a systematic means to infer a web citation and are built us-
ing two distinct types of induction technique: general-to-specific and specific-to-
general. The former use a sequential covering strategy by gradually specialis-
ing a general skeleton rule resulting in coverage of the example set for a given
class label. For instance the FOIL [8] algorithm constructs a general rule to
match examples from a given set and adds literals which are most gainful to
the antecedent of the rule thus specialising it. Specific-to-general rule induction
algorithms employ a simultaneous covering strategy by constructing many rules
simultaneously from the examples and then generalising those rules. The C4.5
algorithm [7] creates a decision tree from the labelled examples; rules are then
built from paths within the tree. As the produced rules overfit the examples,
C4.5 performs post-pruning to generalise the rules by reducing literals.
3 SPARQL Rules: Identifying Web Citations
Identifying web citations of a given person involves matching known informa-
tion which describes the person's identity with information present within web
resources. Our approach to identifying web citations uses the intuition that a
person will appear in web resources with other people he/she knows (e.g. work
pages, electoral rolls) by gathering seed data which describes both biographical
and social network information from the Social Web. Should information appear
in a web resource which is related to the given individual - via the seed data -
then this information is matched and a web citation is inferred. Fig. 1 presents
an overview of the approach, which is divided into three stages: first, seed data
describing the individual whose web citations are to be found is gathered from
user profiles on the Social Web. Second, we query the World Wide Web
and the Semantic Web using the person’s name to gather possible web citations.
Third, we build rules from the seed data and apply them to the web resources to
infer web citations. We now explain the various stages of the approach in greater
detail.
3.1 Generating Seed Data
We build seed data by exporting individual social graphs from several disparate
Social Web platforms - Facebook6, Twitter7 and MySpace8. An exported social
5 http://www.trackur.com/
6 http://www.facebook.com
7 http://www.twitter.com
8 http://www.myspace.com
Fig. 1. An approach to build and apply SPARQL rules to identify web citations
graph defines the profile of the user which is visible within a given platform.
The digital identity of a web user is fragmented across the Social Web between
different platforms; therefore, by exporting various user profiles, we are able
to capture a more complete identity representation of the person. Social Web
platforms allow access to data through APIs and expose data in proprietary
formats (i.e. XML responses using an XML schema unique to the platform). We
export social graphs as RDF models using concepts from the FOAF ontology
to describe biographical and social network information and concepts from the
Geonames ontology to define location information, thereby providing a consistent
interpretation of identity information from disparate profile sources.
Social graphs are then linked together to form a single RDF model which is to
be used as seed data. Interlinking is performed using reasoning over the available
information. We use handcrafted rules which compare instances of foaf:Person
in disparate graphs and infer a match based on the available data (e.g. same
homepage, same location). This provides a more detailed social graph from which
SPARQL rules can be built. A full explanation of this technique falls outside
the scope of this paper; instead we refer the reader to [9]. As Fig. 1 shows,
the compilation of seed data is finalised once the single social graph is built. This
graph contains the features from which SPARQL rules are constructed, a snippet
of which would look as follows (using n3 syntax):
# Note: the resource URIs (in angle brackets) were lost from this snippet
# during extraction; example.org placeholders are used in their place.
<http://example.org/rowe#me> rdf:type foaf:Person ;
    foaf:name "Matthew Rowe" ;
    foaf:homepage <http://example.org/rowe> ;
    foaf:mbox <mailto:m.rowe@dcs.shef.ac.uk> ;
    foaf:knows <http://example.org/ciravegna#me> ;
    foaf:knows <http://example.org/chapman#me> .
<http://example.org/ciravegna#me> rdf:type foaf:Person ;
    foaf:name "Fabio Ciravegna" ;
    foaf:mbox <mailto:ciravegna@example.org> ;
    foaf:homepage <http://example.org/ciravegna> .
<http://example.org/chapman#me> rdf:type foaf:Person ;
    foaf:name "Sam Chapman" ;
    foaf:mbox <mailto:chapman@example.org> ;
    foaf:homepage <http://example.org/chapman> .
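To illustrate the interlinking step, the following SPARQL CONSTRUCT query is a
minimal sketch of one such handcrafted rule; matching on a shared homepage and
deriving an owl:sameAs link are illustrative assumptions rather than the exact rules
described in [9]:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
# Illustrative interlinking rule: if two person instances from different
# exported social graphs share the same homepage, infer that they denote
# the same individual.
CONSTRUCT { ?a owl:sameAs ?b }
WHERE {
  ?a a foaf:Person ;
     foaf:homepage ?home .
  ?b a foaf:Person ;
     foaf:homepage ?home .
  FILTER (?a != ?b)
}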
3.2 Gathering Possible Web Citations
To identify web citations for a given person we must gather a set of possible
citations so that our rules can then be used to identify which cite the person.
We search the World Wide Web using Google and Yahoo and the Semantic
Web using Watson9 and Sindice10 by querying for the person’s name whose web
citations are to be found. The results from the queries are then gathered together
to provide a collection of web resources which may cite the person. In order to
apply SPARQL rules we create RDF models describing the knowledge structure
of the web resources - this allows triple patterns to be matched which associate
information within the social graph to information within the web resource.
For XHTML documents containing lightweight semantics such as RDFa we use
Gleaning Resource Descriptions from Dialects of Language (GRDDL)11 to apply
XSL Transformations to the documents, thereby gleaning an RDF model. An
example gleaned model from the OAK Group member page12 looks as follows:
# As above, resource URIs were lost during extraction; placeholders based on
# the OAK Group member page are used here for illustration.
<http://oak.dcs.shef.ac.uk/people> foaf:topic <http://oak.dcs.shef.ac.uk/people#ciravegna> ;
    foaf:topic <http://oak.dcs.shef.ac.uk/people#rowe> ;
    foaf:topic <http://oak.dcs.shef.ac.uk/people#chapman> .
<http://oak.dcs.shef.ac.uk/people#ciravegna> rdf:type foaf:Person ;
    foaf:name "Fabio Ciravegna" ;
    foaf:homepage <http://example.org/ciravegna> .
<http://oak.dcs.shef.ac.uk/people#rowe> rdf:type foaf:Person ;
    foaf:name "Matthew Rowe" ;
    foaf:homepage <http://example.org/rowe> .
<http://oak.dcs.shef.ac.uk/people#chapman> rdf:type foaf:Person ;
    foaf:name "Sam Chapman" ;
    foaf:homepage <http://example.org/chapman> .
For HTML documents we build RDF models from person features by using
DOM manipulation to identify context windows and then extracting person in-
formation from those windows. A single context window contains information
about a single person: his/her name, optionally together with an email, web
address and/or location. We extract these attributes from the window using Hidden
Markov Models trained for the task, and then build an RDF model of the web
resource containing these features using the same ontologies as the seed data (i.e.
creating an instance of foaf:Person). An extensive discussion of this technique
falls outside the scope of the paper; instead we refer the reader to [10]. Meta-
data models returned from querying the Semantic Web are left intact, given
that machine-readable descriptions are already provided.
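As a simple illustration (not part of the approach itself), a gleaned web-resource
model such as the one above can be queried directly; the following SPARQL ASK
query merely checks whether a resource mentions a person with a given name,
whereas the rules in Section 3.3 combine such name matches with social network
evidence:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
# Does the model of this web resource mention a person named "Matthew Rowe"?
# A name match alone is weak evidence of a citation.
ASK
WHERE {
  ?person a foaf:Person ;
          foaf:name "Matthew Rowe" .
}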
3.3 Inferring Web Citations using SPARQL Rules
Seed data is provided in the form of a social graph, defined using RDF, describ-
ing both the biographical and social network information of a given person. This
9 http://kmi-web05.open.ac.uk/WatsonWUI/
10 http://www.sindice.com
11 http://www.w3.org/TR/grddl/
12 http://oak.dcs.shef.ac.uk/people
single RDF model provides the solitary example from which rules can be con-
structed - given that this is the only information which is known to describe the
person whose web citations are to be identified. This limits the ability of state of
the art rule induction techniques such as FOIL and C4.5 to function effectively,
given their reliance on sufficiently large example sets. Rather than relying on a
large example set, rules are instead constructed from RDF instances within the
social graph.
RDF Instance Extraction An RDF instance represents a resource within a
given RDF model which can either be an anonymous node or identified by a URI.
An instance is a unique object, which in the case of the seed data can be a social
network member - identified as an instance of foaf:Person - or a location related
to a given person - identified as an instance of geo:Feature. Such instances form
a useful basis for inferring a web resource as referring to an individual, given
that if information describing the instance is also found within a web resource
then the web resource can be identified as citing the person. To leverage RDF
instances from a given RDF model we use a Resource Leaves construct (Equation (1))
which selects all the triples (⟨r, p, o⟩) attributed to a given resource
(r) from a given RDF model/graph (G) where the object (o) of each triple does
not act as the subject of other triples (⟨o, p′, o′⟩). This returns a set of resources
and literals which form leaves surrounding the resource (r) such that no paths
extend beyond those leaves - these in turn provide the features from which rules are
built.

RLS_G(r) = { ⟨r, p, o⟩ | ⟨r, p, o⟩ ∈ G ∧ ¬∃ p′, o′ : ⟨o, p′, o′⟩ ∈ G }    (1)
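As an illustration (not taken from the paper), the Resource Leaves construct can
be approximated by a SPARQL query over the seed graph; the resource URI below
is a placeholder standing in for r:
# Select the leaf triples of a resource: triples whose object does not
# itself act as the subject of any further triple in the graph.
SELECT ?p ?o
WHERE {
  <http://example.org/instance> ?p ?o .
  OPTIONAL { ?o ?p2 ?o2 }
  FILTER (!bound(?p2))
}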
Building SPARQL Rules SPARQL provides a mechanism for querying RDF
models by matching graph patterns. SPARQL rules allow a given graph to be
derived - denoted by the triples within the CONSTRUCT clause of the rule - by
matching triples within the WHERE clause of the rule. SPARQL rules are built
using a general-to-specific strategy inspired by FOIL [8] and only a single positive
example - the social graph - to construct general skeleton rules which are then
specialised. The strategy for building rules works according to Algorithm 1,
which will now be explained.
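Before stepping through the algorithm, the following sketch shows the kind of rule
this strategy aims to produce for the seed data of Section 3.1; the derived
foaf:primaryTopic triple and the use of foaf:topic to link a document to the people
it mentions are illustrative assumptions rather than the exact rule form used in the
approach:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
# Skeleton rule specialised with one social network member: if a web
# resource mentions a person named "Matthew Rowe" alongside a person
# named "Fabio Ciravegna", infer that the resource cites Matthew Rowe.
CONSTRUCT { ?doc foaf:primaryTopic ?person }
WHERE {
  ?doc foaf:topic ?person .
  ?person a foaf:Person ;
          foaf:name "Matthew Rowe" .
  ?doc foaf:topic ?member .
  ?member a foaf:Person ;
          foaf:name "Fabio Ciravegna" .
}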
Each resource is extracted from the supplied social graph (seed data) (line
2). A skeleton SPARQL rule is created for each resource (line 3) comprised of the
name of the social graph owner together with triples identifying a person within
a web resource with the same name. The Resource Leaves construct (RLS_G) is
then used to extract a tripleset for the resource (r) from the social graph (line
4) - this forms the information which is used to build the rules. The algorithm
then goes through the tripleset and builds the rules as follows: if the resource
(r) is an instance of foaf:Person and is not the social graph owner - i.e. it is a social
network member - then the corresponding triple is added to the WHERE clause of R
(line 7). Additional triples are added to relate the social graph owner with the
Algorithm 1 buildRules(r_p, G): Induces rules from RDF instances. Input to
this algorithm is the social graph (G) and the person whose web resources are
to be disambiguated (r_p). Rules are induced and added to the rule base (RB).
Input: G, r_p
Output: RB
1: RB = ∅
2: for each resource r ∈ G do
3: R = CONSTRUCT {} WHERE {