=Paper=
{{Paper
|id=None
|storemode=property
|title=Data.dcs: Converting Legacy Data into Linked Data
|pdfUrl=https://ceur-ws.org/Vol-628/ldow2010_paper01.pdf
|volume=Vol-628
|dblpUrl=https://dblp.org/rec/conf/www/Rowe10a
}}
==Data.dcs: Converting Legacy Data into Linked Data==
Matthew Rowe
OAK Group
Department of Computer Science
University of Sheffield
Regent Court, 211 Portobello Street
S1 4DP Sheffield, United Kingdom
m.rowe@dcs.shef.ac.uk
ABSTRACT
Data.dcs is a project intended to produce Linked Data describing the University of Sheffield's Department of Computer Science. At present the department's web site contains important legacy data describing people, publications and research groups. This data is distributed and is provided in heterogeneous formats (e.g. HTML documents, RSS feeds), making it hard for machines to make sense of such data and query it. This paper presents an approach to convert such legacy data from its current form into a machine-readable representation which is linked into the Web of Linked Data. The approach describes the triplification of legacy data, coreference resolution and interlinking with external linked datasets.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: General

General Terms
Linked Data

Keywords
Linked Data, Triplification, Coreference Resolution

The research leading to these results has received funding from the EU project WeKnowIt (ICT-215453). Copyright is held by the author/owner(s). LDOW2010, April 27, 2010, Raleigh, USA.

1. INTRODUCTION
Recent work has addressed the issue of producing linked data from data sources conforming to well-structured relational databases [2]. In such cases the data already follows a logical schema, making the creation of linked data a case of schema mapping and data transformation. The majority of the Web, however, does not conform to such a rigid representation; instead, the heterogeneous structures and formats which it exhibits make it hard for machines to parse and interpret such data. This therefore limits the process of producing linked data.

In this paper we use the case of the University of Sheffield's Department of Computer Science (DCS). The DCS web site contains information about people - such as their name, email address, web address and location - research groups and publications. The department provides a publication database, located separately from the main site, on which DCS members manually upload their papers. Each member of the department is responsible for their own personal web page; this has led the formatting and presentation of legacy data to vary greatly between pages, where some pages contain RDFa and others are plain HTML documents with the bare minimum of markup. This greatly impacts the usability of the site in general and slows the process by which information can be acquired. For instance, finding all the publications which two or more research groups have worked on in the past year would take a large amount of filtering and data processing. Furthermore, the publication database is rarely updated to reflect publications by the department and its members.

This use case presents a clear motivation for generating a richer representation of legacy data describing the DCS. We define legacy data as data which is present in proprietary formats and which describes important information about the department - i.e. publications. Leveraging legacy data from the HTML documents which make up the DCS web site and converting this data into a machine-readable form using formal semantics would link together related information. It would link people with their publications and research groups with their members, and would allow co-authors of research papers to be found. Furthermore, linking the dataset into the Web of Linked Data would allow additional information to be inferred - such as the conferences which members of the DCS have attended - and would provide up-to-date publication listings by linking to popular bibliographic databases such as DBLP, thereby avoiding the current slow update process.
In this paper we document our current efforts to convert this legacy data to linked data. We present our approach to pursue this goal, which is comprised of three stages: first we perform triplification of legacy data found within the DCS - by extracting person information from HTML documents and publication information from the current bibliography system. Second we perform coreference resolution and interlinking of the produced triples - thereby linking people with their publications and fusing data from separate HTML documents together. Third we connect our produced dataset to distributed linked datasets in order to provide additional information to agents and humans browsing the dataset.

We have structured the paper as follows: section 2 describes related work in the field of producing linked data from legacy data and discusses efforts similar to our problem setting explored within the information extraction community. Section 3 presents a brief overview of our approach and the pipeline of the architecture which is employed. Section 4 describes the triplification process which generates triples from legacy data within HTML documents and the publication database. Section 5 presents the SPARQL rules we employed to discover coreferring entities. Section 6 describes our preliminary method for weaving our dataset into the linked data cloud. Section 7 finishes the paper with the conclusions which we have learnt from this work and our plans for future work.

2. RELATED WORK
Recent efforts to construct linked data from legacy data include Sparqplug [4], where linked data models are constructed based on Document Object Model (DOM) structures of HTML documents. The DOM is parsed into an RDF model which then permits SPARQL [11] queries to be processed over the model and relevant information returned. Although this work is novel in its approach to semantifying web documents, the approach is limited by its lack of rich metadata descriptions attributed to elements within the DOM. Existing work by [2] presents an approach to expose linked data from relational databases by creating lightweight mapping vocabularies. The effect is such that data which previously corresponded to a bespoke schema is provided as RDF according to common ontological concepts. Metadata generation - so-called triplification - is discussed extensively in [10] in order to generate metadata describing conferences, their proceedings, attendees and participating organisations. Due to the wide variation in the provided data formats - i.e. Excel spreadsheets, table documents - metadata was generated by hand. Despite this, such work provides a blueprint for generating metadata by describing the process in detail and the challenges faced.

The challenges faced when converting legacy data devoid of metadata and semantic markup into a machine-processable form involve exposing such legacy data and then constructing metadata models describing the data. In the case of the DCS web site our goal is to generate metadata describing members of the department, therefore we must extract this legacy data to enable the relevant metadata descriptions to be built. Work within the field of information extraction provides scenarios similar to the problems which we face. For instance, extraction of person information from within HTML documents has been addressed in [14] by segmenting HTML documents into components based on the DOM of the web pages. Person information is then extracted using wrappers induced from labelled personal pages. [15] uses manually created name patterns to match person names within a web page and then, using a context window surrounding the match, extracts contextually relevant information surrounding the name. The DOM of HTML documents is utilised in work by [17] to provide clues to regions within the documents from which person information should be extracted. Once regions of extraction have been identified, extraction patterns are used to extract relevant information based on its proximity in the document. An effort to extract personal information (name, email, homepage, telephone number) from within web pages has been presented in [3] using a system called "Armadillo". A lexicon of seed person names is compiled from several repositories and is then used to guide the information extraction process. Heuristics are used to extract person information surrounding a name which appears within a given web page.

Work by [18] has explored the application of Hidden Markov Models to extract medical citations from a citation repository by inputting a sequence of tokens and then outputting the relevant labels for those tokens based on the HMM's predicted states: Title, Author, Affiliation, Abstract and Reference. Prior to applying the HMMs, windows within HTML documents are derived, known as component zones or context windows; these zones within the HTML document are considered for analysis in order to extract information from. Similar work by [7] has applied HMMs to the task of extracting citation information. Work within the field of attribute extraction has placed emphasis on the need to extract information describing a given person from within web pages. For instance, [9] uses extraction patterns (i.e. regular expressions) defined for different person attributes to match content within HTML documents. An approach by [16] to extract person attributes from HTML documents first identifies a list of candidate attributes within a given web page using hand-crafted regular expressions - these are related to different individuals. All HTML markup is then filtered out, leaving the textual content of the documents. Attributes which appear closest to a given person name are then assigned to that name.

3. CONVERTING LEGACY DATA INTO LINKED DATA
In order to convert legacy data into linked data we have implemented a pipeline approach. Figure 1 shows the overview of this approach, which is divided into three stages (a minimal skeleton of the pipeline is sketched below, after Figure 1):

• Triplification: the approach begins by taking as input an RSS feed describing the publications by DCS members and the DCS web site. Context windows are identified within the RSS feed - where each context window contains information about a single publication - and in the HTML documents - where each context window contains information about a single person. Information is extracted from these context windows and is then converted into triples, describing instances of people and publications within the department.

• Coreference Resolution: SPARQL queries are processed over the entire graph to discover coreferring entities: e.g. the same people appearing in different web pages.

• Linking to the Web of Linked Data: the Web of Linked Data Cloud is queried for coreferring entities and related information resources, and links are created from the produced dataset.
Figure 1: Three staged approach to convert legacy data to linked data
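To make the shape of the pipeline concrete, the following Python sketch shows how the three stages could be chained together. It is only an illustration: the function names and their stub bodies are assumptions made for this example and are not part of the Data.dcs implementation.

from rdflib import Graph

def triplify(html_documents, publication_feed):
    # Stage 1 (stub): extract people and publications from context windows and emit triples.
    return Graph()

def resolve_coreference(source_graph):
    # Stage 2 (stub): run SPARQL CONSTRUCT rules (smushing, co-occurrence) over the graph.
    return source_graph

def interlink(dataset_graph):
    # Stage 3 (stub): query external linked datasets (e.g. DBLP) and add owl:sameAs links.
    return dataset_graph

def convert_legacy_data(html_documents, publication_feed):
    # The three stages applied in sequence, as in Figure 1.
    return interlink(resolve_coreference(triplify(html_documents, publication_feed)))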
Each of the stages of the approach contains various steps and processes which are essential to the production of a linked dataset. We will now present each of these stages in greater detail, beginning with the triplification of legacy data.

4. TRIPLIFICATION OF LEGACY DATA
The DCS web site contains listings of members of the department - staff, researchers and students - and their associated information (name, email address, web address) provided within HTML documents. Such documents lack metadata descriptions, which limits the applicability of automated processes to parse and interpret the data. Therefore we require some method to leverage legacy data which can then be converted into triples to allow machine-processing, for instance by associating a person with his/her name, email address, etc. For publications we are confronted with a slightly different problem. We are provided with an RSS feed (http://pubs.dcs.shef.ac.uk) containing the publications within the department; this feed should be well structured with declarative elements for each attribute of a publication (i.e. title, authors, year, etc). Instead we are returned the following:

<item>
  <title>Interlinking Distributed Social Graphs</title>
  <link>http://publications.dcs.shef.ac.uk/show.php?record=4161</link>
  <description><![CDATA[Proceedings of Linked Data on the Web Workshop, WWW 2009,
    Madrid, Spain. (2009). Madrid, Madrid, Spain.
    Edited by Sarah Duffy on Tue, 08 Dec 2009 09:31:30 +0000.]]></description>
  <pubDate>Mon, 07 Dec 2009 17:03:27 +0000</pubDate>
  <author>Sarah Duffy <s.duffy@dcs.shef.ac.uk></author>
</item>

In the above XML the <title> element contains the title of the paper; however, other paper attributes are not placed within suitable elements - there is, for instance, no dedicated element for the author of the paper. Instead all the data which describes the paper is stored within the <description> element. A technique is required which is able to extract the information from the <description> element which corresponds to the relevant attributes of the paper, for instance by extracting "Interlinking Distributed Social Graphs" for the title attribute.

Unlike publications, however, extracting person information from HTML documents requires the derivation of context windows which contain person attributes - this is akin to being provided with the content within the above <description> element. We must identify such context windows within a HTML document to enable the correct information to be extracted. To address this problem we rely on the markup used within HTML documents to segment disjoint content. For instance, in many web pages layout elements such as <td> elements are used to contain information about a single entity; another <td> element is then used to contain information about another entity. Using such elements provides the necessary means through which context windows can be identified - through the use of layout elements within a DOM - and information extraction techniques can be applied to leverage the legacy data. We now explain how we generate context windows from HTML documents.

4.1 Generating Context Windows
To derive a set of context windows from a given HTML document, we first tidy the HTML document into a parseable form using Apache Maven's HTML Parser (http://htmlparser.sourceforge.net/). HTML is often messy and contains poorly structured markup where HTML tags are opened and not closed. This reduces its ability to be parsed, where such techniques require a well-formed DOM. Once tidied, the DOM is used as input to Algorithm 1 as follows: first a list of name patterns is loaded and applied to the DOM model; for each pattern the list of DOM elements which that pattern matches is collected (line 5). The patterns detect the appearance of a person name within a given body of text. Each of the collected DOM elements is then verified as not having been processed before (line 6) - as different name patterns may match the same person name at the same position in the document. The trigger string is extracted from the element (line 8), noting the person's name that was matched using the name pattern. The parent node type of the DOM element (e) is then assessed to see if it is a hyperlink: it is common for a person name to appear within a HTML document as a hyperlinked element. If it is hyperlinked, then the grandparent of the element is considered as a possible area from which the context window can be gathered. However, should the parent node of the element (e) not be hyperlinked (line 12), then the parent is passed onto the domManip function for assessment together with the trigger string.

Algorithm 2 (domManip) takes the trigger string and a node from within the DOM and manipulates the DOM structure to derive a suitable DOM element from which the context window should be derived.
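The matching step that Algorithm 1 relies on can be pictured with a short Python sketch. It uses BeautifulSoup rather than the HTML Parser library named above, and the name pattern shown (two capitalised tokens) is only an illustrative stand-in for the patterns used in the paper; none of this is the authors' code.

import re
from bs4 import BeautifulSoup

# Illustrative name pattern: a capitalised forename followed by a capitalised surname.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+\s+[A-Z][a-z]+\b")

def find_trigger_elements(html):
    """Return (trigger string, containing element) pairs for person-name matches."""
    soup = BeautifulSoup(html, "html.parser")
    triggers = []
    for text_node in soup.find_all(string=NAME_PATTERN):
        match = NAME_PATTERN.search(text_node)
        element = text_node.parent
        # As in Algorithm 1: if the name is hyperlinked, step up to the grandparent.
        if element.name == "a" and element.parent is not None:
            element = element.parent
        triggers.append((match.group(0), element))
    return triggers

# Example usage on a fragment resembling a personnel listing.
html = "<table><tr><td><a href='...'>Matthew Rowe</a><br/>Ph.D. Student</td></tr></table>"
print(find_trigger_elements(html))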
Algorithm 1 cwFind(dom): Given the DOM of a HTML document, returns a set of context windows
Input: dom
Output: set of context windows C
 1: N = person name patterns
 2: C = ∅
 3: visited = ∅
 4: for each n ∈ N do
 5:   E = getElements(dom, n)
 6:   for each e ∈ E do
 7:     if e.startIndex ∉ visited then
 8:       trig = extract(e, n)
 9:       if e.parent.type == <a> then
10:         c = domManip(e.parent.content, e.parent.parent)
11:         C = C ∪ c
12:       else
13:         c = domManip(trig, e.parent)
14:         C = C ∪ c
15:       end if
16:       visited = visited ∪ e.startIndex
17:     end if
18:   end for
19: end for
20: return C

Algorithm 2 domManip(trig, node): Given a trigger string and the node which contains the trigger, derives the suitable DOM element to extract the window from
Input: trig and node
Output: window
1: if node.type == <td> then
2:   window = extractWin(trig, node.parent.content.substring(trig))
3: else if node.type is a style element then
4:   domManip(trig, node.parent)
5: else
6:   window = extractWin(trig, node.content.substring(trig))
7: end if

Algorithm 3 extractWin(trig, content): Given a trigger string and a DOM element's content, extracts the window from the trigger onwards
Input: trig and content
Output: window
 1: maps = ∅
 2: N = person name patterns
 3: remove(content, trig)
 4: for each n ∈ N do
 5:   if match(content, n) then
 6:     maps = maps ∪ <n, i>
 7:   end if
 8: end for
 9: if |maps| > 0 then
10:   order(maps)
11:   <n, i> = maps_1
12:   return trig + content.substring(0, i)
13: else
14:   return content
15: end if

First the node type is checked to see if it is a <td> element - denoting that the trigger appeared within a table in the HTML document (line 1). If this is the case then the trigger string is passed to extractWin together with the parent of the <td> element: the <tr> element of which the <td> is a child. If the node is a style element (<b>, <i>, <font>, etc) (line 3), then domManip is recursively called using the trigger and the parent node. Such elements control the presentation and styling of a HTML document but do not control or segment the layout like <td>, <div> or <p> elements do. If neither of the above cases is true, and the node is a layout element by process of elimination (line 5), then the content of the node and the trigger string are passed on to the window extractor. It is worth noting that the content of the nodes which is passed onto extractWin contains HTML markup along with textual content. Unlike existing work within the attribute extraction state of the art, markup is maintained as it provides clues which can aid the process of information extraction (i.e. HTML tags acting as delimiters between person attributes).

Algorithm 3 (extractWin) takes the trigger string (i.e. the person name) and the HTML content string and derives the context window. First a mapping set is initialised and the list of person name patterns is loaded (lines 1-2). The preamble of the HTML content string contains the trigger string, therefore this is removed to enable the name patterns to match the remaining content. Each pattern is applied to the content string (line 4); if a match occurs (line 5) then the name pattern is added to the mapping along with the index within the content string where the pattern match starts. Once all patterns have been applied to the content string, the mappings, if there are any, are ordered by their start matching points within the content string. The first mapping is then chosen from the ordered set of mappings, given that this provides the nearest point to the start of the content string where a person name appears. The content from this earliest match onwards is then removed from the content string (line 12), the trigger string is appended back on to the start and this is returned as the context window. Should no mappings be found (line 13) then the content string is returned as the context window.

The context window derived by extractWin feeds back to cwFind and populates the set of context windows for the given HTML document. These algorithms for context window derivation provide a conservative strategy to identify areas of a HTML document from which person information can be extracted. It is conservative in the sense that it does not look above certain DOM element types; instead it relies on the logical segmentation of the document to provide the necessary features which can be utilised to identify context windows. Applying this approach to a HTML document containing markup such as that shown below would be triggered by the person name Matthew Rowe: the algorithms would traverse two nodes up from the element containing the trigger string until a <td> element is found. The context window is then returned as the substring of the content within the <td> element from the trigger string - the person name Matthew Rowe onwards - until the end of the element's content, returning the following:

Matthew Rowe
Ph.D. Student
-
http://www.dcs.shef.ac.uk/~mrowe/
-
M.Rowe@dcs.shef.ac.uk
Researching Identity Disambiguation and Web 2.0
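A compact Python rendering of the extractWin step (Algorithm 3) might look as follows. It is a sketch under the assumption that the name patterns are plain regular expressions; it is not the authors' implementation.

import re

def extract_window(trigger, content, name_patterns):
    """Cut the context window at the next person-name match after the trigger (cf. Algorithm 3)."""
    # The content passed in starts with the trigger, so remove it first (line 3).
    remainder = content[len(trigger):] if content.startswith(trigger) else content

    # Collect the start index of the earliest match of any name pattern (lines 4-11).
    starts = [m.start() for p in name_patterns for m in [p.search(remainder)] if m]
    if starts:
        # Keep only the text up to the nearest following person name (line 12).
        return trigger + remainder[:min(starts)]
    return content  # No further names: the whole content is the window (line 14).

# Example: the window for "Matthew Rowe" stops where the next person name begins.
patterns = [re.compile(r"[A-Z][a-z]+\s+[A-Z][a-z]+")]
print(extract_window("Matthew Rowe",
                     "Matthew Rowe<br/>Ph.D. Student<br/>M.Rowe@dcs.shef.ac.uk John Smith ...",
                     patterns))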
4.2 Extracting Legacy Data using Hidden Markov Models
Given a set of context windows derived from a HTML document, person information must now be extracted from the windows. Person information consists of four attributes: name, email, web page and location. The appearance and order in which these attributes appear in the context window can vary (e.g. (name, email, www) or (email, name, location)). Context windows for publications are also provided, using the content from all of the <description> elements within the RSS publication feed. Publication information also consists of four attributes: title, author, year and book. We use the bookTitle attribute to define where the publication appears; this could be a thesis - in which case it would be the university - or a journal paper - in which case it would be the name of the journal publisher.

In order to extract both person and publication information from their relevant context windows, Hidden Markov Models (HMMs) are used. HMMs provide a suitable solution to this problem setting by taking as input a sequence of observations (e.g. tokens within a context window) and outputting the most likely sequence of states, where each state corresponds to a piece of information to be extracted. HMMs use Markov chains to work out the likelihood of moving from one state to another (s_i → s_j) and emitting a symbol (σ when in state s_j). A HMM is described as hidden in that it is given a known sequence of observations with hidden states; it must therefore label these hidden states, which correspond to the person or publication attributes to be extracted. A HMM consists of a set of states S = {s_1, s_2, ..., s_m}, a vocabulary of symbols Σ = {σ_1, σ_2, ..., σ_n}, a transition probability matrix A (where a_ij = P(s_j|s_i)), an emission probability matrix B (where b_iσ = P(σ|s_i)) and a start probability vector π (where π_i = P(s_i|s_start)). These parameters must be built, or estimated, from known information, essentially training the HMM from previous context windows to allow information to be extracted from future context windows.

4.2.1 HMM States
The topology of the HMM defines what states are to be used and how those states are connected together. States within the HMM fall into two categories: major and minor states. For person information extraction there are 4 major states, which constitute the four person attributes. 13 minor states are defined in order to provide clues to the HMM and enhance the process of deriving the state sequence. Of those 13 minor states there are 2 pre-major states (pre_email and pre_www), 10 separator states (e.g. between email and name) and 1 after state, which contains the symbols emitted at the end of the window. Emissions made within the minor states offer clues as to the order of the state sequence and what information is to follow.

Similarly to the person topology, the topology of the HMM for publication information extraction also contains 4 major states corresponding to the four publication attributes. 3 minor states are used to separate the major states and a single after state is included. Figure 2 shows the topology of the HMM used for publication information extraction.

Figure 2: Topology of HMM for publication information extraction

4.2.2 Parameter Estimations
Once the states have been decided for the information attributes, the remaining parameters of the HMM must be estimated. We must train the HMM to detect which transitions are more likely than others and to calculate the probability of emitting a given symbol whilst in a given state. The transition probability matrix A is built from labelled training data by counting the number of times a given state s_i transits to state s_j. This count is then normalised by the total number of transitions from state s_i. Formally, A is populated as follows:

    a_ij = c(s_j | s_i) / Σ_{s ∈ S} c(s | s_i)

Similar to A, the emission probability matrix B is built from labelled training data. Counts are made of how many times a given symbol is observed in a given state; this count is then normalised by the total number of symbols emitted in that state. B is therefore defined as follows:

    b_iσ_n = c(σ_n | s_i) / Σ_{σ ∈ Σ} c(σ | s_i)

The start probability vector is built from the training data by counting how many times a given state is started in. This is then normalised by the total number of start states observed:

    π_i = c(s_i | s_start) / Σ_{s ∈ S} c(s | s_start)

4.2.3 Smoothing
When estimating the parameters of the HMM from labelled training data it is likely that certain state-to-state transitions, or emissions whilst in a given state, are not observed. The trained HMM, when applied to test data, may find previously unknown paths, or symbols emitted in states in which they have previously not been witnessed. The model must be able to deal with such possibilities by smoothing the transition and emission probabilities to cope with unseen observations.
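The parameter estimation described in Section 4.2.2 can be written down directly. The Python sketch below estimates A, B and π from state-labelled token sequences and offers an optional additive (add-k) smoothing term; the add-k form is an assumption based on the "Additive Smoothing" named in Tables 1 and 2, and the paper's Naive Smoothing variant is not reproduced here.

from collections import defaultdict

def estimate_hmm(labelled_windows, states, vocabulary, k=0.0):
    """Estimate start (pi), transition (A) and emission (B) probabilities from
    state-labelled context windows; k > 0 applies additive (add-k) smoothing."""
    start = defaultdict(float)
    trans = defaultdict(lambda: defaultdict(float))
    emit = defaultdict(lambda: defaultdict(float))

    for window in labelled_windows:              # window = [(token, state), ...]
        labels = [state for _, state in window]
        start[labels[0]] += 1                    # c(s_i | s_start)
        for token, state in window:
            emit[state][token] += 1              # c(sigma | s_i)
        for prev, nxt in zip(labels, labels[1:]):
            trans[prev][nxt] += 1                # c(s_j | s_i)

    def normalise(counts, keys):
        total = sum(counts[key] for key in keys) + k * len(keys)
        if total == 0:
            return {key: 0.0 for key in keys}
        return {key: (counts[key] + k) / total for key in keys}

    pi = normalise(start, states)
    A = {s: normalise(trans[s], states) for s in states}
    B = {s: normalise(emit[s], vocabulary) for s in states}
    return pi, A, B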
The Viterbi algorithm [5] is used to derive the most likely state sequence from the likelihood estimates of moving from one state to another and then emitting a given symbol. It uses the learnt HMM, and its estimated parameters, as background knowledge of known transitions and emissions, and assesses the input sequence to find clues as to the order of states. This allows consistencies in the layout and presentation of person information to be utilised to extract information in future tasks. For instance, it is common for a person to hyperlink their name with their web address; learning such patterns allows future similar cases to be recognised and the correct information extracted.
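For the labelling step itself a standard Viterbi decoder over the estimated parameters suffices. The sketch below assumes the pi, A and B dictionaries produced by the estimation sketch above; the small floor probability for unseen transitions and emissions is an assumption, as the paper's exact handling is not reproduced here.

def viterbi(tokens, states, pi, A, B, floor=1e-9):
    """Return the most likely state sequence for a token sequence (cf. [5])."""
    # best[s] holds (probability, path) for the best path ending in state s.
    best = {s: (pi.get(s, floor) * B[s].get(tokens[0], floor), [s]) for s in states}
    for token in tokens[1:]:
        new_best = {}
        for s in states:
            prob, path = max(
                (best[r][0] * A[r].get(s, floor) * B[s].get(token, floor), best[r][1])
                for r in states
            )
            new_best[s] = (prob, path + [s])
        best = new_best
    # Probabilities are multiplied directly for brevity; log-space would avoid underflow.
    return max(best.values())[1]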
4.5 Evaluation
We evaluated the HMMs using context windows containing person information from the DCS web site (http://www.dcs.shef.ac.uk) and publication information from the <description> elements in the publication RSS feed, labelling each of the tokens in each window with its respective states for training and choosing another 200 windows randomly for testing.
Table 1: Accuracy levels of extracting person information using Hidden Markov Models with different smoothing methods

            Naive Smoothing           Additive Smoothing
Attribute   P      R      F1         P      R      F1
Name        0.903  0.875  0.889      0.928  0.703  0.800
Email       1.000  0.867  0.928      0.578  0.688  0.628
WWW         0.849  0.833  0.841      0.714  0.714  0.714
Location    0.888  0.444  0.592      0.421  0.211  0.281
Average     0.910  0.754  0.825      0.660  0.579  0.616

Table 2: Accuracy levels of extracting publication information using Hidden Markov Models with different smoothing methods

            Naive Smoothing           Additive Smoothing
Attribute   P      R      F1         P      R      F1
Title       0.941  0.698  0.801      0.901  0.589  0.712
Year        1.000  0.716  0.835      1.000  0.678  0.808
Author      0.952  0.717  0.818      0.934  0.687  0.792
Book Title  0.982  0.652  0.783      0.956  0.500  0.657
Average     0.969  0.696  0.810      0.948  0.614  0.745

4.5.1 Results
As the results from Table 1 and Table 2 show, Naive Smoothing achieves, on average, higher f-measure levels with respect to the alternative smoothing method. Additive Smoothing yields poorer scores, particularly for labelling web addresses and locations. Both smoothing techniques perform poorly in terms of recall when extracting location information. In terms of publication information extraction, the results are similar to the performance when applying HMMs for person information extraction. Naive Smoothing yields higher f-measure scores overall and almost perfect precision - indicating that the extracted information rarely contains mistakes. However, particularly for the paper title and the book title, several tokens are missed, leading to incomplete titles. This is something which must be addressed in future work, as the error will scale up to become detrimental to data quality.

4.6 Building RDF Models from Legacy Data
Using HMMs together with Naive Smoothing we build an RDF dataset describing all instances of people and publications within the department. This dataset provides the source dataset from which we build our linked dataset for deployment. We apply the above techniques to the entire dataset collected from the DCS web site in order to build metadata models describing person information found within each web document. We also apply the technique to build RDF models describing publications within the department. In each case we use temporary URIs to provide unique RDF instances constructed from the extracted legacy data. We use a namespace to identify the RDF instance as denoting a person (http://data.dcs.shef.ac.uk/person/) and append an incremented integer to form a new URI for a given person. For each person found within a given HTML document we create an instance of foaf:Person and assign their name to the instance using foaf:name, their hashed email address using foaf:sha1_sum and their homepage using foaf:homepage. We associate the person instance to the web page within the department's web site on which the instance appeared using foaf:topic. An example instance of foaf:Person is as follows (using Notation 3 syntax):

<http://data.dcs.shef.ac.uk/person/1>
    rdf:type foaf:Person ;
    foaf:name "Matthew Rowe" .
<page-URI>
    foaf:topic <http://data.dcs.shef.ac.uk/person/1> .

For publications we model the extracted information using the Bibtex ontology (http://zeitkunst.org/bibtex/0.1/bibtex.owl#) by creating an instance of bib:Entry for each publication instance. We use a temporary URI for each publication instance by taking the publication namespace http://data.dcs.shef.ac.uk/paper/ and appending an incremented integer for each new publication. We then assign the relevant attributes to the instance using concepts from the Bibtex ontology. For the title we use bib:title, for the year we use bib:hasYear and for the book title we use bib:hasBookTitle. For each paper author we create a blank node typed as an instance of foaf:Person, assign the author name to the instance using foaf:name and associate this instance with the publication instance using foaf:maker. Referring back to the example from the beginning of this section, the RSS feed provided by the publication base contained publication information - containing all four attributes - within a single <description> element. This legacy data, once extracted and converted into triples, is provided as follows (again using Notation 3 syntax):

<http://data.dcs.shef.ac.uk/paper/1>
    rdf:type bib:Entry ;
    bib:title "Interlinking Distributed Social Graphs." ;
    bib:hasYear "2009" ;
    bib:hasBookTitle "Proceedings of Linked Data on the Web Workshop, WWW 2009, Madrid, Spain." ;
    foaf:maker _:a1 .
_:a1
    foaf:name "Matthew Rowe" .
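For concreteness, the modelling described in this subsection can be reproduced with a few lines of rdflib. The snippet below is only a sketch and not the Data.dcs code: the page URL, index arguments and helper names are illustrative, and the hashed-mailbox handling is an assumption - the text writes foaf:sha1_sum, while the snippet uses the standard FOAF property foaf:mbox_sha1sum and hashes the mailto: form of the address following the usual FOAF convention, which the paper does not spell out.

import hashlib
from rdflib import BNode, Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF, RDF

PERSON = Namespace("http://data.dcs.shef.ac.uk/person/")
PAPER = Namespace("http://data.dcs.shef.ac.uk/paper/")
BIB = Namespace("http://zeitkunst.org/bibtex/0.1/bibtex.owl#")

def add_person(graph, index, name, email, homepage, page_url):
    person = PERSON[str(index)]                        # temporary URI: namespace + counter
    graph.add((person, RDF.type, FOAF.Person))
    graph.add((person, FOAF.name, Literal(name)))
    sha1 = hashlib.sha1(("mailto:" + email).encode("utf-8")).hexdigest()
    graph.add((person, FOAF.mbox_sha1sum, Literal(sha1)))
    graph.add((person, FOAF.homepage, URIRef(homepage)))
    graph.add((URIRef(page_url), FOAF.topic, person))  # the page the person was found on
    return person

def add_publication(graph, index, title, year, book_title, author_names):
    paper = PAPER[str(index)]
    graph.add((paper, RDF.type, BIB.Entry))
    graph.add((paper, BIB.title, Literal(title)))
    graph.add((paper, BIB.hasYear, Literal(year)))
    graph.add((paper, BIB.hasBookTitle, Literal(book_title)))
    for name in author_names:
        author = BNode()                               # blank node per author, as in the text
        graph.add((author, RDF.type, FOAF.Person))
        graph.add((author, FOAF.name, Literal(name)))
        graph.add((paper, FOAF.maker, author))
    return paper

g = Graph()
add_person(g, 1, "Matthew Rowe", "m.rowe@dcs.shef.ac.uk",
           "http://www.dcs.shef.ac.uk/~mrowe/", "http://www.dcs.shef.ac.uk/people")
print(g.serialize(format="n3"))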
5. COREFERENCE RESOLUTION
Following conversion of the DCS web site and publication database we are provided with an RDF dataset containing 17896 foaf:Person instances and 1088 bib:Entry instances. Using this dataset we must discover coreferring instances, such as equivalent people appearing in separate web pages, and identify the publications which people have published. This stage in the approach starts the process of compiling the linked dataset which will be deployed for consumption. Therefore we perform coreference resolution to identify equivalent instances in the dataset and fuse data together - this will provide rich instance descriptions when a resource is looked up in our linked dataset.

5.1 Building Research Groups
Our produced linked dataset is intended to contain information about research groups and their members and publications. Therefore we generate an instance of foaf:Group for each research group and assign the group a minted URI, using the group namespace http://data.dcs.shef.ac.uk/group and appending an abbreviation of the group name (e.g. nlp for the Natural Language Processing Group). We then assign a name to the group using foaf:name and the URL of the group web page using foaf:workplaceHomepage. This produces the following:

<http://data.dcs.shef.ac.uk/group/oak>
    rdf:type foaf:Group ;
    foaf:name "Organisations, Information and Knowledge Group" ;
    foaf:workplaceHomepage <group-web-page-URI> .
Once we have constructed all of the group instances we then query our source dataset for all the people who appear on each of the group personnel pages. This provides us with the members of the DCS whose information is going to be compiled and deployed as linked data. This step in the approach acts as seeding for the forthcoming coreference resolution processes by compiling a set of members. It is worth noting, however, that in doing so we are only considering a subset of the entire collection of foaf:Person instances. We plan to analyse this data in future work; for now we are concerned with producing linked data describing the DCS.

5.2 Person Disambiguation
We are provided with a set of people who are members of the DCS, who either work or study there. We perform person disambiguation to identify other instances of foaf:Person in separate web documents which are in fact the same people as the DCS members. Our first person disambiguation method uses Instance Smushing [13] to discover equivalent instances. This technique works by matching resources associated with disparate RDF instances, where the resources are associated with the instances using properties which are defined as owl:inverseFunctionalProperty. An example of instance smushing is the identification of equivalent person instances using the email address of the person. In essence, Instance Smushing uses the declarative characteristics of such properties to detect coreference. We smush instances of foaf:Person which appear on research group personnel pages using the following SPARQL rule for the foaf:homepage property:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT {
  ?x owl:sameAs ?y .
  ?x foaf:page ?p
}
WHERE {
  <oak-personnel-page-URI> foaf:topic ?x .
  ?x foaf:homepage ?h .
  ?p foaf:topic ?y .
  ?y foaf:homepage ?h
  FILTER (<oak-personnel-page-URI> != ?p)
}

The triples within the CONSTRUCT clause infer an owl:sameAs relation between a member of the OAK group and another person instance on a separate web page, and that the page cites the group member - expressed using foaf:page.

Our second person disambiguation technique employs person co-occurrence to identify coreferring instances of foaf:Person. We assume that if a group member appears on a web page with a coworker then that page will refer to them - this is a basic intuition used throughout person disambiguation approaches. Therefore we define a SPARQL rule to infer the same triples as the previous rule, but this time modifying the graph pattern within the WHERE clause to match the name of a member of the OAK group - listed on the group's personnel page - and the name of a colleague on a separate page.

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT {
  ?x owl:sameAs ?y .
  ?x foaf:page ?p
}
WHERE {
  <oak-personnel-page-URI> foaf:topic ?x .
  <oak-personnel-page-URI> foaf:topic ?y .
  ?p foaf:topic ?z .
  ?p foaf:topic ?u .
  ?x foaf:name ?n .
  ?y foaf:name ?m .
  ?z foaf:name ?n .
  ?u foaf:name ?m .
  FILTER (<oak-personnel-page-URI> != ?p)
}

Using the above rules identifies web pages within the DCS which cite the group members, and their equivalent instances from those pages. New instances of foaf:Person are constructed for each member of the research groups within the department. For each group member we take the instances of foaf:Person and assign the information from the instance descriptions to the new foaf:Person instance. This fuses the data from separate instances to provide a richer description. Also, for each group member we take each page where an equivalent instance was found and relate this page to the new foaf:Person instance using foaf:page. When the instance is dereferenced this will provide links to all the web pages which cite the person. For each group member we create a new minted URI according to "Cool URIs for the Semantic Web" (http://www.w3.org/TR/2007/WD-cooluris-20071217/#cooluris). We use the same person namespace as for the temporary URIs, but take the person name as it appears on the group personnel page (with titles removed) and append this to the namespace to produce a URI for the DCS member (e.g. <http://data.dcs.shef.ac.uk/person/Matthew+Rowe>).
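Rules of this kind can be executed over the source dataset with any SPARQL engine. As an illustration, the hypothetical Python snippet below runs the smushing CONSTRUCT query with rdflib and merges the inferred triples back into the dataset; the personnel-page URI and the input filename are placeholders, not values from the paper.

from rdflib import Graph

SMUSH_RULE = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT {
  ?x owl:sameAs ?y .
  ?x foaf:page ?p
}
WHERE {
  <http://example.org/oak-personnel-page> foaf:topic ?x .
  ?x foaf:homepage ?h .
  ?p foaf:topic ?y .
  ?y foaf:homepage ?h
  FILTER (<http://example.org/oak-personnel-page> != ?p)
}
"""

def apply_rule(dataset, rule):
    # Run the CONSTRUCT query and add the inferred triples back into the dataset.
    inferred = dataset.query(rule).graph
    for triple in inferred:
        dataset.add(triple)
    return len(inferred)

dataset = Graph().parse("data-dcs-source.rdf")   # placeholder filename
print(apply_rule(dataset, SMUSH_RULE), "triples inferred")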
5.3 Assigning People to Publications
Our linked dataset now contains instances of foaf:Group and foaf:Person describing research groups and their members. We must now identify the publications which have been written by the group members. We implement a basic strategy of name matching using abbreviated forms of the names of the group members. For instance, for the name "Matthew Rowe" we break the name down into several citation formats: "M Rowe", "Rowe M", "M. Rowe". The publication database has no single strategy for naming and therefore several different formats must be accounted for. It is worth noting the imprecision such a strategy would lead to if it was applied when interlinking data. However, in this context it is applicable to use such a technique given the localised context of the publication database - as it only stores publications by members of the department. Using the above example we formulate queries based on several name abbreviations in order to match a group member with the publications he/she has written (a helper for deriving such abbreviations is sketched after the rule's explanation below). An example rule is as follows:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT {
  <http://data.dcs.shef.ac.uk/person/Matthew+Rowe> foaf:made ?p
}
WHERE {
  ?p rdf:type bib:Entry .
  ?p foaf:maker ?x .
  ?x foaf:name ?n
  FILTER regex(?n, "M.*Rowe", "i")
}

This rule finds an instance of bib:Entry which has an author whose name matches the above regular expression. The inferred triple then constructs a relation between the group member and the publication using foaf:made - indicating that the paper was produced by the person. For each paper that is found to have been authored by a group member we place the paper and its description within the linked dataset. We maintain the same URI as before (containing the paper namespace and the increment of the paper count).
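The abbreviation step can be pictured with a small helper that derives the citation-style variants of a member's name and a permissive matching pattern of the kind used in the FILTER above. This is an illustrative sketch, not the actual matching code.

import re

def citation_variants(full_name):
    """Derive abbreviated citation forms such as 'M Rowe', 'Rowe M' and 'M. Rowe'."""
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    initial = first[0]
    return {f"{initial} {last}", f"{last} {initial}", f"{initial}. {last}", full_name}

def name_pattern(full_name):
    """A permissive regex in the style of the FILTER above (e.g. 'M.*Rowe')."""
    parts = full_name.split()
    return re.compile(parts[0][0] + ".*" + parts[-1], re.IGNORECASE)

print(citation_variants("Matthew Rowe"))
print(bool(name_pattern("Matthew Rowe").search("M. Rowe and J. Smith")))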
Figure 3 shows a snippet of the compiled dataset. By enriching data with formal semantics - where the data is leveraged from heterogeneous sources - we are provided with a rich interpretation of legacy data. This allows SPARQL queries to be performed over the dataset in order to extract knowledge - something which was previously limited without a large amount of manual processing. For instance, we can ask which groups have worked together on papers and what the papers were called.

Figure 3: A snippet of the interlinked dataset following coreference resolution

6. LINKING TO THE WEB OF LINKED DATA
At this stage in our approach we have extracted legacy data describing people within the DCS, their publications and the research groups; this dataset must now be linked into the Web of Data to provide relations with equivalent resources and related information in distributed datasets. The advantage of this - from the perspective of members of the DCS - is that once equivalent person instances are found within external bibliography databases, all the papers written by that person which do not appear in the DCS publication database will be provided by looking up the URI of the DCS member.

According to [8], author disambiguation is one of the common problems faced by the linked data community. In certain cases, wrongly created owl:sameAs links result in the incorrect collection of publications being returned when an author URI is looked up. For now we implement a conservative strategy to link members of the DCS with publications which they have authored and which are contained within external datasets. We use a person co-occurrence strategy similar to the one used previously when detecting equivalent foaf:Person instances. We assume that a person will author papers with the same people that they work with. We construct a SPARQL rule which uses the notion of a networked graph [12] to query the DBLP linked dataset (http://www4.wiwiss.fu-berlin.de/dblp/). The rule works as follows:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT {
  ?q foaf:made ?paper .
  ?p foaf:made ?paper .
  ?q owl:sameAs ?x .
  ?p owl:sameAs ?y
}
WHERE {
  ?group foaf:member ?q .
  ?group foaf:member ?p .
  ?q foaf:name ?n .
  ?p foaf:name ?c .
  GRAPH <http://www4.wiwiss.fu-berlin.de/dblp/>
  {
    ?paper dc:creator ?x .
    ?x foaf:name ?n .
    ?paper dc:creator ?y .
    ?y foaf:name ?c .
  }
  FILTER (?p != ?q)
}

For each group within the linked dataset the above SPARQL rule gathers all the group members and checks their names against the networked graph for publications on which those people have worked together. The URI of the paper which matches the query is then assigned to the group members using the foaf:made relation. The authors of the paper within the DBLP dataset are also detected as referring to the group members and are associated with those foaf:Person instances using owl:sameAs. Using such a query produces the following relations:

<dcs-member-URI>
    owl:sameAs <dblp-author-URI> ;
    foaf:made <dblp-publication-URI> ;
    foaf:made <dblp-publication-URI> .

In order to expose linked data we have deployed our dataset using static RDF files according to Recipe 1 from "How to Publish Linked Data" (http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/) and Recipe 2 for slash namespaces from the "Best Practices for Publishing RDF Vocabularies" (http://www.w3.org/TR/swbp-vocab-pub/). This serves our purpose, as URIs are dereferenceable in our published data, and it will allow this deployment to be upgraded to more advanced setups, such as Drupal, in the near future without the URIs returning a 404 response.

7. CONCLUSIONS
This paper has presented an approach currently in use to convert legacy data to linked data. The paper places emphasis on the first stage of the process, the triplification of legacy data, as this has been the most thoroughly investigated portion of the work. We believe that the results from the evaluation demonstrate the effectiveness of using trained Hidden Markov Models to extract legacy data from HTML documents and RSS feeds. Although we have used such techniques to extract person information, the approach could be applied to other domains in which legacy data is locked away within HTML documents and devoid of machine-processable markup. In such cases the HMMs would be trained for the specific information which is to be extracted. The triplification and coreference resolution stages of the approach have provided a stable testbed on which we plan to explore statistical methods for interlinking our dataset into the Web of Linked Data. Our future work will investigate such methods in order to contribute to the Linked Data community.
Once we have linked our linked dataset to additional datasets, we then plan to also provide VoiD descriptions [1] of those links to enable easier consumption of the data. At present the top-level components in the produced dataset are the research groups. We plan to use this project as the blueprint for producing linked data from all departments and faculties in the university, described using the Academic Institution Internal Structure Ontology (http://vocab.org/aiiso/schema#).

8. REFERENCES
[1] K. Alexander, R. Cyganiak, M. Hausenblas, and J. Zhao. Describing Linked Datasets - On the Design and Usage of voiD, the 'Vocabulary of Interlinked Datasets'. In WWW 2009 Workshop: Linked Data on the Web (LDOW2009), Madrid, Spain, 2009.
[2] S. Auer, S. Dietzold, J. Lehmann, S. Hellmann, and D. Aumüller. Triplify - light-weight linked data publication from relational databases. In 18th International World Wide Web Conference (WWW2009), April 2009.
[3] F. Ciravegna, S. Chapman, A. Dingli, and Y. Wilks. Learning to harvest information for the semantic web. In Proceedings of the 1st European Semantic Web Symposium (ESWS-2004), May 2004.
[4] P. Coetzee, T. Heath, and E. Motta. Sparqplug: Generating linked data from legacy HTML, SPARQL and the DOM. In Linked Data on the Web (LDOW2008), 2008.
[5] G. D. Forney. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, 1973.
[6] D. Freitag and A. K. McCallum. Information extraction with HMMs and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction, pages 31–36, 1999.
[7] E. Hetzner. A simple method for citation metadata extraction using hidden Markov models. In JCDL '08: Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 280–284, New York, NY, USA, 2008. ACM.
[8] A. Jaffri, H. Glaser, and I. Millard. URI disambiguation in the context of linked data. In Linked Data on the Web (LDOW2008), 2008.
[9] M. Lan, Y. Z. Zhang, Y. Lu, J. Su, and C. L. Tan. Which who are they? People attribute extraction and disambiguation in web search results. In 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference, 2009.
[10] K. Möller, T. Heath, S. Handschuh, and J. Domingue. Recipes for Semantic Web dog food - the ESWC and ISWC metadata projects. In 6th International and 2nd Asian Semantic Web Conference (ISWC2007+ASWC2007), pages 795–808, November 2007.
[11] E. Prud'hommeaux and A. Seaborne. SPARQL Query Language for RDF. Technical report, W3C, 2006.
[12] S. Schenk and S. Staab. Networked graphs: a declarative mechanism for SPARQL rules, SPARQL views and RDF data integration on the web. In WWW '08: Proceedings of the 17th International Conference on World Wide Web, pages 585–594, New York, NY, USA, 2008. ACM.
[13] L. Shi, D. Berrueta, S. Fernandez, L. Polo, and S. Fernández. Smushing RDF instances: are Alice and Bob the same open source developer? In ISWC2008 Workshop on Personal Identification and Collaborations: Knowledge Mediation and Extraction (PICKME 2008), October 2008.
[14] D. Thamvijit, H. Chanlekha, C. Sirigayon, T. Permpool, and A. Kawtrakul. Person information extraction from the web. In 6th Symposium on Natural Language Processing, 2005.
[15] X. Wan, J. Gao, M. Li, and B. Ding. Person resolution in person search results: WebHawk. In CIKM '05: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 163–170, New York, NY, USA, 2005. ACM.
[16] K. Watanabe, D. Bollegala, Y. Matsuo, and M. Ishizuka. A two-step approach to extracting attributes for people on the web. In 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference, 2009.
[17] B. Zhou, W. Liu, Y. Yang, W. Wang, and M. Zhang. Effective metadata extraction from irregularly structured web content. Technical report, HP Laboratories, 2008.
[18] J. Zou, D. Le, and G. R. Thoma. Structure and content analysis for HTML medical articles: a hidden Markov model approach. In DocEng '07: Proceedings of the 2007 ACM Symposium on Document Engineering, pages 199–201, New York, NY, USA, 2007. ACM.