Data.dcs: Converting Legacy Data into Linked Data∗

Matthew Rowe
OAK Group, Department of Computer Science, University of Sheffield
Regent Court, 211 Portobello Street, S1 4DP Sheffield, United Kingdom
m.rowe@dcs.shef.ac.uk

ABSTRACT
Data.dcs is a project intended to produce Linked Data describing the University of Sheffield's Department of Computer Science. At present the department's web site contains important legacy data describing people, publications and research groups. This data is distributed and is provided in heterogeneous formats (e.g. HTML documents, RSS feeds), making it hard for machines to make sense of such data and query it. This paper presents an approach to convert such legacy data from its current form into a machine-readable representation which is linked into the Web of Linked Data. The approach describes the triplification of legacy data, coreference resolution and interlinking with external linked datasets.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: General

General Terms
Linked Data

Keywords
Linked Data, Triplification, Coreference Resolution

1. INTRODUCTION
Recent work has addressed the issue of producing linked data from data sources conforming to well-structured relational databases [2]. In such cases data already follows a logical schema, making the creation of linked data a case of schema mapping and data transformation. The majority of the Web, however, does not conform to such a rigid representation; instead the heterogeneous structures and formats which it exhibits make it hard for machines to parse and interpret such data. This therefore makes the process of producing linked data limited.

In this paper we use the case of the University of Sheffield's Department of Computer Science (DCS). The DCS web site contains information about people - such as their name, email address, web address and location - research groups and publications. The department provides a publication database, located separately from the main site, on which DCS members manually upload their papers. Each member of the department is responsible for their own personal web page; this has led to the formatting and presentation of legacy data varying greatly between pages, where some pages contain RDFa and others are plain HTML documents with the bare minimum of markup. This impacts greatly on the usability of the site in general and slows the process by which information can be acquired. For instance, finding all the publications which two or more research groups have worked on in the past year would take a large amount of filtering and data processing. Furthermore the publication database is rarely updated to reflect publications by the department and its members.

This use case presents a clear motivation for generating a richer representation of legacy data describing the DCS. We define legacy data as data which is present in proprietary formats and which describes important information about the department - i.e. publications. Leveraging legacy data from the HTML documents which make up the DCS web site and converting this data into a machine-readable form using formal semantics would link together related information. It would link people with their publications, research groups with their members and allow co-authors of research papers to be found.
Furthermore by linking the dataset into exhibits makes it hard for machines to parse and interpret the Web of Linked Data would allow additional information ∗ The research leading to these results has received funding to be inferred such as the conferences which members of the from the EU project WeKnowIt10 (ICT-215453). DCS have attended and provide up-to-date publications list- ∗Copyright is held by the author/owner(s). ings - thereby avoiding the current slow update process by LDOW2010, April 27, 2010, Raleigh, USA. linking to popular bibliographic databases such as DBLP. In this paper we document our current efforts to convert this legacy data to linked data. We present our approach to pursue this goal which is comprised of three stages: first we perform triplification of legacy data found within the DCS - by extracting person information from HTML documents and publication information from the current bibliography system. Second we perform coreference resolution and inter- linking of the produced triples - thereby linking people with their publications and fusing data within separate HTML clues to regions within the documents from which person documents together. Third we connect our produced dataset information should be extracted. Once regions of extrac- to distributed linked datasets in order to provide additional tion have been identified then extraction patterns are used information to agents and humans browsing the dataset. to extract relevant information based on its proximity in the document. An effort to extract personal information (name, We have structured the paper as follows: section 2 describes email, homepage, telephone number) from within web pages related work in the field of producing linked data from legacy has been presented in [3] using a system called ”Armadillo”. data and discusses similar efforts to our problem setting ex- A lexicon of seed person names is compiled from several plored within the information extraction community. Sec- repositories which are then used to guide the information tion 3 presents a brief overview of our approach and the extraction process. Heuristics are used to extract person in- pipeline of the architecture which is employed. Section 4 de- formation surrounding a name which appears within a given scribes the triplication process which generates triples from web page. legacy data within HTML documents and the publication database. Section 5 presents the SPARQL rules we em- Work by [18] has explored the application of Hidden Markov ployed to discover coreferring entities. Section 6 describes Models to extract medical citations from a citation reposi- our preliminary method for weaving our dataset into the tory by inputting a sequence of tokens and then outputting linked data cloud. Section 7 finishes the paper with the con- the relevant labels for those tokens based on the HMM’s clusions which we have learnt from this work and our plans predicted states: Title, Author, Affiliation, Abstract and for future work. Reference. Prior to applying the HMMs, windows within HTML documents are derived known as component zones, 2. RELATED WORK or context windows, these zones within the HTML document are considered for analysis in order to extract information Recent efforts to construct linked data from legacy data from. Similar work by [7] has applied HMMs to the task include Sparqplug [4] where linked data models are con- of extracting citation information. 
Work within the field structed based on Document Object Model (DOM) struc- of attribute extraction has placed emphasis on the need to tures of HTML documents. The DOM is parsed into an extract information describing a given person from within RDF model which then permits SPARQL [11] queries to web pages. For instance [9] uses extraction patterns (i.e. be processed over the model and relevant information re- regular expressions) defined for different person attributes turned. Although this work is novel in its approach to se- to match content within HTML documents. An approach mantifying web documents, the approach is limited by its by [16] to extract person attributes from HTML documents lack of rich metadata descriptions attributed to elements first identifies a list of candidate attributes within a given within the DOM. Existing work by [2] presents an approach web page using hand crafted regular expressions - these are to expose linked data from relational databases by creating related to different individuals. All HTML markup is then lightweight mapping vocabularies. The effect is such that filtered out leaving the textual content of the documents. data which previously corresponded to a bespoke schema is Attributes which appear closest to a given person name are provided as RDF according to common ontological concepts. then assigned to that name. Metadata generation - so called triplification - is discussed extensively in [10] in order to generate metadata describing conferences, their proceedings, attendees and organisations 3. CONVERTING LEGACY DATA INTO participating. Due to the wide variation in the provided LINKED DATA data formats - i.e. excel spreadsheets, table documents - In order to convert legacy data into linked data we have im- metadata was generated by hand. Despite this such work plemented a pipeline approach. Figure 1 shows the overview provides a blue print for generating metadata by describing of this approach which is divided into three stages: the process in detail and the challenges faced. The challenges faced when converting legacy data devoid of • Triplification: the approach begins by taking as input metadata and semantic markup into a machine-processable an RSS feed describing the publications by DCS mem- form involves exposing such legacy data and then construct- bers and the DCS web site. Context windows are iden- ing metadata models describing the data. In the case of tified within the RSS feed - where each context win- the DCS web site our goal is to generate metadata describ- dow contains information about a single publication ing members of the department, therefore we must extract - and in the HTML documents - where each context this legacy data to enable the relevant metadata descrip- window contains information about a single person. tions to be built. Work within the field of information ex- Information is extracted from these context windows traction provides similar scenarios to the problems which and is then converted into triples, describing instances we face, For instance extraction of person information from of people and publications within the department. within HTML documents has been addressed in [14] by seg- menting HTML documents into components based on the • Coreference Resolution: SPARQL queries are processed Document DOM of the web pages. Person information is over the entire graph to discover coreferring entities: then extracted using induced wrappers from labelled per- e.g. the same people appearing in different web pages. 
sonal pages. [15] uses manually created name patterns to match person names within a web page and then, using a • Linking to the Web of Linked Data: the Web of Linked context window surrounding the match, extract contextu- Data Cloud is queried for coferring entities and related ally relevant information surrounding the name. The DOM information resources, and links are created from the of HTML documents is utilised in work by [17] to provide produced dataset. Figure 1: Three staged approach to convert legacy data to linked data Each of the stages of the approach contains various steps and element. We must identify such context windows within a processes which are essential to the production of a linked HTML document to enable the correct information to be dataset. We will now present each of these stages in greater extracted. To address this problem we rely on the markup detail, beginning with the triplication of legacy data. used within HTML documents to segment disjoint content. For instance in many web pages layout elements such as 4. TRIPLIFICATION OF LEGACY DATA
elements are used to contain information about a sin- The DCS web site contains listings of members of the depart- gle entity. Another
element is then used to contain ment: staff, researchers and students, and their associated information about another entity. Using such elements pro- information (name, email address, web address) provided vides the necessary means through which context windows within HTML documents. Such documents lack metadata can be identified - through the use of layout elements within descriptions which limits the applicability of automated pro- a DOM - and information extraction techniques can be ap- cesses to parse and interpret the data. Therefore we require plied to leverage the legacy data. We now explain how we some method to leverage legacy data which can then be con- generate context windows from HTML documents. verted into triples to allow machine-processing, for instance by associating a person with his/her name, email address, 4.1 Generating Context Windows etc. For publications we are confronted with a slightly differ- To derive a set of context windows from a given HTML doc- ent problem. We are provided with an RSS feed1 containing ument, we first tidy the HTML document into a parseable the publications within the department, this feed should be form using Apache Maven’s HTML Parser2 . HTML is often well structured with declarative elements for each attribute messy and contains poorly structured markup where HTML of a publication (i.e. title, authors, year, etc). Instead we tags are opened and not closed. This reduces its ability to be are returned the following: parsed where such techniques require a well-formed DOM. Once tidied the DOM is used as input to Algorithm 1 as follows: first a list of name patterns is loaded and applied to Interlinking Distributed Social Graphs http://publications.dcs.shef.ac.uk/show.php?record=4161 the DOM model, for each pattern the list of DOM elements which that pattern matches are collected (line 5). The pat- Proceedings of Linked Data on the Web Workshop, WWW 2009, Madrid, Spain. (2009). Madrid, Madrid, Spain.
tect the appearance of a person name within a given body
of text. Each
of the collected DOM elements is then verified as having not Mon, 07 Dec 2009 17:03:27 +0000 Sarah Duffy <s.duffy@dcs.shef.ac.uk> been been processed before (line 6) - as different name pat- terns may match the same person name at the same position in the document. The trigger string is extracted from the In the above XML the element contains the title element (line 8) noting the person’s name that was matched of the paper, however other paper attributes are not placed using the name pattern. The parent node type of the DOM within suitable elements - i.e. using <author> element for element (e) is then assessed to see if it is a hyperlink: it is the author of the paper. Instead all the data which de- common for a person name to appear within a HTML docu- scribes the paper is stored within the <descrption> element. ment as a hyperlinked element. If it is hyperlinked then the A technique is required which is able to extract informa- grandparent of the element is considered as a possible area tion from the <description> element which corresponds to from which the context window can be gathered. However the relevant attributes of the paper, for instance by extract- should the parent node of the element (e) not be hyperlinked ing ”Interlinking Distributed Social Graphs” for the title at- (line 12) then the parent is then passed onto the domManip tribute. function for assessment together with the trigger string. Unlike publications however, extracting person information Algorithm 2 (domManip) takes the trigger node and a node from HTML documents requires the derivation of a context from within the DOM and manipulates the DOM structure windows which contain person attributes - this akin to being to derive a suitable DOM element from which the context provided with the content within the above <description> window should be derived. First the node type is checked 1 2 http://pubs.dcs.shef.ac.uk http://htmlparser.sourceforge.net/ Algorithm 1 cwFind(dom) : Given the DOM of a HTML Algorithm 3 extractWin(trig,content) : Given a trigger document, returns a set of context windows string and a DOM element’s content, extracts the window Input: dom from the trigger onwards Output: Set of context windows C Input: trig and content 1: N =person name patterns Output: window 2: C = ∅ 1: maps = ∅ 3: visited = ∅ 2: N = person name patterns 4: for each n ∈ N do 3: remove(content, trig) 5: E = getElements(dom, n) 4: for each n ∈ N do 6: for each e ∈ E do 5: if match(content, n) then 7: if e.startIndex ∈/ visited then 6: maps = maps∪<n.startMatchIndex, n> 8: trig = extract(e, n) 7: end if 9: if e.parent.type ==< a > then 8: end for 10: c = domManip(e.parent.content, e.parent.parent) 9: if |maps| > 0 then 11: C = C ∪c 10: order(maps) 12: else 11: <i, n> = maps1 13: c = domManip(trig, e.parent) 12: return trig + content.substring(0, i) 14: C = C ∪c 13: else 15: end if 14: return content 16: visited = visited ∪ e.startIndex 15: end if 17: end if 18: end for 19: end for 20: return C the content string where the pattern match starts. Once all patterns have been applied to the content string, the map- pings, if there are any, are ordered by their start matching Algorithm 2 domManip(trig,node) : Given a trigger string points within the content string. 
The first mapping is then and the node which contains the trigger, derives the suitable chosen from the ordered set of mappings, given that this DOM element to extract the window from Input: trig and node provides the nearest point to the start of the content string Output: window where a person name appears. The content string is then 1: if node.type ==< td > then removed of the content following the earlier match (line 12), 2: window = extractWin(trig, node.parent.content.substring(trig)) the trigger string is appended back on to the start and this 3: else if node.type ==< style > then is returned as the context window. Should no mappings be 4: domManip(trig, node.parent) 5: else found (line 13) then the content string is returned as the 6: window = extractWin(trig, node.content.substring(trig)) context window. 7: end if The derived context window from extractWin feeds back to cwFind and populates the set of context window set for the to see if it is a <td> element - denoting that the trigger given HTML document. These algorithms for context win- appeared within a table in the HTML document (line 1). dow derivation provide a conservative strategy to identify If this is the case then the trigger string is passed to ex- areas of a HTML document from which person information tractWin together within the parent of the <td> element: can be extracted. It is conservative in the sense that it does the <tr> element which <td> is a child of. If the node is a not look above certain DOM element types (i.e. <div>) in- style element (<b>,<h1>,<span>, etc) (line 3) then dom- stead it relies on the logical segmentation of the document to Manip is recursively called using the trigger and the parent provide the necessary features which can be utilised to iden- node. Such elements control the presentation and styling tify context windows. Applying this approach to a HTML of a HTML document but do not control or segment the document containing the following markup would by trig- layout like <div>, <li> or <td> elements do. If neither of gered by the person name Matthew Rowe, the algorithms the above cases are true, and the node is a layout element would traverse two nodes up from the element containing via the process of elimation (line 5) then the content of the the trigger string until a <div> element is found. The con- node and the trigger string is passed on to the window ex- text window is then returned as the substring of the con- tractor. It is worth noting that the content of the nodes tent within the <div> element from the trigger string - the which is passed onto extractWin contains HTML markup person name Matthew Rowe onwards - until the end of the along with textual content. Unlike exisiting work within the <div> element’s content, returning the following: attribute extraction state of the art, markup is maintained as it provides clues which can aid the process of information <div> extraction (i.e. HTML tags acting as delimiters between Matthew Rowe</h4> person attributes). <img src="../images/people/matt.jpg"/> <p class="position">Ph.D. Student</p> <ul> Algorithm 3 (extractWin) takes the trigger string (i.e. the <li> person name) and HTML content string and derives the con- <a href="http://www.dcs.shef.ac.uk/ mrowe/"> text window. First a mapping set is initialised and the list of http://www.dcs.shef.ac.uk/ mrowe/ person name patterns is loaded (lines 1-2). 
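A rough sketch of the same idea is given below for illustration only. It is an assumption on our part: it uses Python with BeautifulSoup rather than the HTML Parser library named above, and the name pattern and layout-tag list are placeholders. The sketch finds each name-pattern match, climbs the DOM to the nearest layout element and slices the window from the trigger up to the next name match.

# Simplified sketch of Algorithms 1-3: find name matches, climb to the
# nearest layout element, cut the window from the trigger to the next name.
# BeautifulSoup, the name pattern and the layout-tag set are illustrative
# assumptions, not the implementation described in the paper.
import re
from bs4 import BeautifulSoup

NAME_PATTERN = re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+")  # placeholder name pattern
LAYOUT_TAGS = {"div", "td", "tr", "li"}                # elements that segment layout

def context_windows(html):
    """Return (trigger, window_markup) pairs for a tidied HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    windows = []
    for text_node in soup.find_all(string=NAME_PATTERN):
        trigger = NAME_PATTERN.search(str(text_node)).group(0)
        # climb past hyperlinks and style elements (<a>, <b>, <span>, ...)
        # until a layout element is reached, as domManip does
        node = text_node.parent
        while node is not None and node.name not in LAYOUT_TAGS:
            node = node.parent
        if node is None:
            continue
        markup = str(node)
        start = markup.find(trigger)
        if start == -1:
            continue
        # keep the markup after the trigger, truncated at the next person name
        rest = markup[start + len(trigger):]
        nxt = NAME_PATTERN.search(rest)
        window = trigger + (rest[:nxt.start()] if nxt else rest)
        windows.append((trigger, window))
    return windows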
The preamble of </a> </li> the HTML content string contains the trigger string, there- <li> fore this is removed to enable the name patterns to match <a href="mailto:m.rowe@dcs.shef.ac.uk"> the remaining content. Each pattern is applied to the con- M.Rowe@dcs.shef.ac.uk </a> tent string (line 4), if a match occurs (line 5) then the name </li> pattern is added to the mapping along with the index within </ul> <p>Researching Identity Disambiguation and Web 2.0</p> </div> 4.2 Extracting Legacy Data using Hidden Markov Models Figure 2: Topology of HMM for publication infor- Given a set of context windows derived from a HTML docu- mation extraction ment person information must now be extracted from the windows. Person information consists of four attributes: name, email, web page and location. The appearance and or- to the person topology, the topology of the HMM for pub- der in which these attributes appear in the context window lication information extraction also contains 4 major states can vary (e.g. (name,email,www ) or (email,name,location). corresponding to the four publication attributes. 3 Minor Context windows for publications are also provided using states are used to separate the major states and a single the content from all of the <description> elements within after state is included. Figure 2 shows the topology of the the RSS publication feed. Publication information also con- HMM used for publication information extraction. sists of four attributes: title, author, year and book. We use the bookTitle attribute to define where the publication ap- 4.2.2 Parameter Estimations pears, this could be a thesis - in which case it would be the Once the states have been decided for the information at- university - or a journal paper - in which case it would be tributes the remaining parameters of the HMM must be esti- the name of the journal publisher. mated. We must train the HMM to detect what transitions are more likely than others and to calculate the probabil- In order to extract both person and publication information ity of omitting a given symbol whilst in a given state. The from their relevant context windows Hidden Markov Models transition probability matrix A is built from labelled train- (HMMs) are used. HMMs provide a suitable solution to this ing data by counting the number of times a given state si problem setting by taking as input a sequence of observa- transits to state sj . This count is then normalised by the tions (e.g. tokens within a context window) and outputting total number of transitions from state si . Formally A is the most likely sequence of states where each state corre- populated as follows: sponds to a piece of information to be extracted. HMMs use Markov chains to work out the likelihood of moving from one state to another (si → sj ) and outputting symbol (σ when c(sj |si ) aij = P s∈S c(s|si ) in state sj ). A HMM is described as hidden in that it is given a known sequence observations with hidden states, it must therefore label these hidden states which correspond to Similar to A, the omission probability matrix B is built from person or publication attributes which are to be extracted. labelled training data. Counts are made of how many times A HMM consists of a set of States; S = {s1 , s2 , ..., sm }, a given symbol is observed in a given state, this count is a vocabulary of symbols; Σ = {σ1 , σ2 , ..., σn }, a transition then normalised by the total number of symbols omitted in probability matrix; A (where aij = P (sj |si )), an omission that state. 
B is therefore defined as follows: probability matrix; B (where biσ = P (σ|si )) and a start probability vector (where π where πi = P (si |sstart )). These σn |c(si ) parameters must be built, or estimated, from known infor- biσn = P σ∈Σ c(σ|si ) mation, essentially training the HMM from previous context windows to allow information to be extracted from future context windows. The start probability vector is built from the training data by counting how many times a given state is started in. This is then normalised by the total number of start states 4.2.1 HMM States observed: The topology of the HMM defines what states are to be used and how those states are connected together. States within c(si |sstart ) πi = P the HMM fall into two categories: major and minor states. s∈S c(s|sstart ) For person information extraction there are 4 major states which constitute the four person attributes. 13 minor states 4.2.3 Smoothing are defined in order to provide clues to the HMM and en- When estimating the parameters of the HMM from labelled hance the process of deriving the state sequence. Of those training data it is likely that certain state-to-state transi- 13 minor states there are 2 pre-major states (pre email and tions, or omissions whilst in a given state are not observed. pre www ), 10 separator states (e.g. between email and name) The trained HMM, when applied to test data, may find pre- and 1 after state which contains the symbols omitted at the viously unknown paths, or symbols omitted in states which end of the window. Omissions made within the minor states have previously not been witnessed. The model must be offer clues as to the order of the state sequence and what able to deal with such possibilities by smoothing the tran- information is to follow. For instance using the omission of sition and omission probabilities to cope with unseen ob- the token <a would indicate that a proceeding field might servations and transitions within the training data. One contain an email address or a web address. There is also such smoothing technique is known as Naive Smoothing [7] a single start state in which the HMM begins. The start . Naive Smoothing functions by setting zero probabilities in probability vector (π) contains the transition probabilities A and B to a very low constant of 10−7 . A and B are esti- P of moving from this state to another given state. Similar mated using normalised values such that |A| j=1 aij = 1 and P|B| k=1 biσk = 1, therefore when smoothing the zero prob- The success of our triplification technique depends on its abilities in both A and B the non-zero probabilities must ability to extract the maximum amount of person and publi- also be adjusted to ensure that that the distributions hold. cation information whilst ensuring that the extracted legacy n Therefore all non-zero probabilities have the value of m 10−7 data is accurate and contains no errors - as this could be subtracted from the current value where n is number of zero- detrimental to the linking of this data into the Web of Linked probability events and m is the number of non-zero proba- Data. Therefore we evaluate our triplification approach us- bility events. ing the evaluation measures precision and recall defined as precision = |A ∩ B|/|B| and recall = |A ∩ B|/|A| where A Another smoothing method using Additive Smoothing (Laplace denotes the set of relevant tokens, and B denotes the set Smoothing) described in [6] increments all zero counts when of retrieved tokens. 
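Written out in full, the transition, omission and start-probability estimates described in Section 4.2.2, the normalisation they satisfy, and the F-measure used in the evaluation are:

\[
a_{ij} = \frac{c(s_j \mid s_i)}{\sum_{s \in S} c(s \mid s_i)}, \qquad
b_{i\sigma} = \frac{c(\sigma \mid s_i)}{\sum_{\sigma' \in \Sigma} c(\sigma' \mid s_i)}, \qquad
\pi_i = \frac{c(s_i \mid s_{\mathrm{start}})}{\sum_{s \in S} c(s \mid s_{\mathrm{start}})},
\]
with \(\sum_{j} a_{ij} = 1\) and \(\sum_{\sigma} b_{i\sigma} = 1\) maintained after smoothing, and
\[
F_1 = \frac{2 \times \mathit{precision} \times \mathit{recall}}{\mathit{precision} + \mathit{recall}}.
\]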
Precision measures the proportion of building A and B by 1. This ensures that the respective tokens which were labelled correctly. Recall measures the transitions and omissions are then assigned a low probabil- proportion of correct tokens which were found. These mea- ity which is non-zero. It is worth noting that the use of sures gauge the accuracy of the labels and the ability of smoothing is only applicable if the training data does not the technique to find person information within the HTML sufficiently cover possible transitions which are likely to ap- document - as lower levels of recall indicate that person in- pear and observations which are present in test data. formation is missed. Evaluation is therefore performed on a per token basis for person information extraction within 4.3 Vocabulary Dimension Reduction a given HTML document and per token basis within the One of the parameters of a HMM is its vocabulary of sym- publication feed for publication information extraction by bols - where the term symbol is used to refer to a given assessing the accuracy of the HMM in labelling tokens with observation i.e. word, token, etc. This vocabulary contains their respective major state labels and the ability of the tech- all the possible observations or omissions which might be nique to detect context windows. F-measure (referred to in found within an input sequence. In [18] the vocabulary is the results as F1) provides the harmonic mean of precision compiled from a large corpus of words which make up all and recall as follows: the possible symbols that could be observed. This works well where the dictionary is a finite size, however in the case of HTML markup leads to new combinations. To solve f − measure = 2×precison×recall precision+recall this problem we control the dimension of the vocabulary that only a fixed number of symbols are used. Dimension control is performed using transformation functions as fol- The evaluation dataset was compiled by crawling the De- lows: a given input - i.e. the context window - is tokenized, partment of Computer Science web site3 - in a similar vein to each token is then transformed into its respective symbols work by [3]. All internal pages within the web site were col- using transformation functions. The transformed input se- lected, totaling 12,000 HTML documents, of this collection quence of symbols is then used to derive the correct state 3,500 documents were found to contain person information. sequence. The vocabulary contains 16 symbols, where each Context windows were derived for each of these documents symbol has a transformation function which matches the to- and 200 randomly selected context windows were used as ken to a given symbol. For instance the symbol First Name training data for the HMM. Each window is already tok- (FN) is used to identify a person’s name. Symbols are also enized, however for training each token is labeled with the used for different HTML tags such as an opening tag (e.g. state in which it appears. The test data was also compiled <a). Web data is noisy and contains a large amount of varia- by randomly selecting 40 URLs from the dataset and their tion in content form. Controlling the vocabulary of symbols respective context windows, therefore totalling 203 context allows previously unseen tokens to be handled appropriately. windows. A gold standard was then created for these win- dows by manually labelling the tokens within their respec- 4.4 Deriving Transition Paths tive states. 
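Assuming the labelled windows are available as sequences of (symbol, state) pairs, the parameter estimation and naive smoothing of Sections 4.2.2 and 4.2.3 and the Viterbi decoding of Section 4.4 reduce to the following sketch. It is indicative only; the function names and data layout are ours and not those of the implementation described here.

# Estimate A, B and pi from labelled windows (lists of (symbol, state) pairs),
# apply naive smoothing, and decode new windows with the Viterbi algorithm.
from collections import defaultdict

EPS = 1e-7  # naive-smoothing constant (10^-7)

def normalise(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

def smooth(dist, support):
    # Naive smoothing: unseen events get EPS; the added mass is subtracted
    # evenly from the seen events so the distribution still sums to one.
    unseen = [x for x in support if dist.get(x, 0.0) == 0.0]
    seen = [x for x in support if dist.get(x, 0.0) > 0.0]
    if not unseen or not seen:
        return {x: dist.get(x, 0.0) for x in support}
    penalty = len(unseen) * EPS / len(seen)
    out = {x: dist[x] - penalty for x in seen}
    out.update({x: EPS for x in unseen})
    return out

def estimate_parameters(windows, states, symbols):
    trans = {s: defaultdict(float) for s in states}   # counts for A
    emit = {s: defaultdict(float) for s in states}    # counts for B
    start = defaultdict(float)                        # counts for pi
    for window in windows:
        start[window[0][1]] += 1
        for symbol, state in window:
            emit[state][symbol] += 1
        for (_, prev), (_, curr) in zip(window, window[1:]):
            trans[prev][curr] += 1
    A = {s: smooth(normalise(trans[s]), states) for s in states}
    B = {s: smooth(normalise(emit[s]), symbols) for s in states}
    pi = smooth(normalise(start), states)
    return A, B, pi

def viterbi(symbols_seq, A, B, pi, states):
    # Most probable state path for an unlabelled window (log probabilities
    # would be used in practice to avoid underflow).
    V = [{s: pi.get(s, 0.0) * B[s].get(symbols_seq[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(symbols_seq)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                ((V[t - 1][r] * A[r].get(s, 0.0) * B[s].get(symbols_seq[t], 0.0), r)
                 for r in states), key=lambda pr: pr[0])
            V[t][s], back[t][s] = prob, prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(symbols_seq) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))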
Each URL was also manually analysed to find Given a tokenized context window which has been converted context windows which were missed by the context window into symbols, the Viterbi algorithm [5] is then used to cal- derivation algorithms, these were then added to the gold culate the most probable state path through the window. standard. We performed the same setup for publications This path is found using A and B: given the sequence of ob- by generating 200 tokenised context windows - using con- servations the path is returned composed of the maximum tent from <description> elements in the publication RSS likelihood estimates of moving from one state to another and feed - and labelling each of the tokens in each window with then omitting a given symbol. The Viterbi algorithm uses its respective states for training and choosing another 200 the learnt HMM, and its estimated parameters, as back- windows randomly for testing. ground knowledge of known transitions and omissions and assesses the input sequence to find clues as to the order of 4.5.1 Results states. This allows consistencies in the layout and presen- As the results from Table 1 and Table 2 show Naive Smooth- tation of person information to be utilised to extract in- ing achieves, on average, higher f-measure levels with respect formation for future tasks. For instance, it is common for to the alternative smoothing method. Additive Smoothing a person to hyperlink their name with their web address, yields poorer scores, particularly for labelling web addresses learning such patterns allows for future similar information and locations. Both smoothing techniques perform poorly extraction tasks to be recognised and the correct informa- in terms of recall when extracting location information. In tion extracted. terms of publication information extraction the results are 4.5 Evaluation 3 http://www.dcs.shef.ac.uk For publications we model extracted information using the Table 1: Accuracy levels of extracting person infor- Bibtex ontology4 by creating an instance of bib:Entry for mation using Hidden Markov Models with different each publication instance. We use a temporary URI for smoothing methods each publication instance by taking the publication names- Naive Smoothing Additive Smoothing pace http://data.dcs.shef.ac.uk/paper/ and appending Attribute P R F1 P R F1 an incremented integer for each each new publication. We Name 0.903 0.875 0.889 0.928 0.703 0.8 then assign the relevant attributes to the instance using Email 1 0.867 0.928 0.578 0.688 0.628 concepts from the Bibtex ontology. For the title we use WWW 0.849 0.833 0.841 0.714 0.714 0.714 Location 0.888 0.444 0.592 0.421 0.211 0.281 bib:title, for the year we use bib:hasYear and for the book Average 0.910 0.754 0.825 0.66 0.579 0.616 title we use bib:hasBookTitle. For each paper author we cre- ate a blank node typed as an instance of foaf:Person and assign the author name to the instance using foaf:name and associate this instance with the publication instance using Table 2: Accuracy levels of extracting publication foaf:maker. Referring back to the example from the begin- information using Hidden Markov Models with dif- ning of this section, the RSS feed provided by the publication ferent smoothing methods base contained publication information - containing all four attributes - within a single <description> element. 
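For illustration, instances of this kind could be minted with an RDF library such as rdflib. The following sketch is indicative only: rdflib itself, the attribute values shown, and the reading of the hashed email property as foaf:mbox_sha1sum are assumptions on our part rather than part of the implementation.

# Sketch: minting foaf:Person and bib:Entry instances from extracted
# attributes with rdflib (assumed toolkit); URIs follow the temporary-URI
# scheme of Section 4.6, and the attribute values are placeholders.
import hashlib
from rdflib import BNode, Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
BIB = Namespace("http://zeitkunst.org/bibtex/0.1/bibtex.owl#")
PERSON = Namespace("http://data.dcs.shef.ac.uk/person/")
PAPER = Namespace("http://data.dcs.shef.ac.uk/paper/")

def add_person(g, index, name, email, homepage, source_page):
    person = PERSON[str(index)]            # person namespace + incremented integer
    g.add((person, RDF.type, FOAF.Person))
    g.add((person, FOAF.name, Literal(name)))
    # hashed email address; foaf:mbox_sha1sum (sha1 of the mailto: form) is
    # assumed here as the concrete property
    sha1 = hashlib.sha1(("mailto:" + email).encode("utf-8")).hexdigest()
    g.add((person, FOAF.mbox_sha1sum, Literal(sha1)))
    g.add((person, FOAF.homepage, URIRef(homepage)))
    g.add((URIRef(source_page), FOAF.topic, person))
    return person

def add_paper(g, index, title, year, book_title, author_names):
    paper = PAPER[str(index)]
    g.add((paper, RDF.type, BIB.Entry))
    g.add((paper, BIB.title, Literal(title)))
    g.add((paper, BIB.hasYear, Literal(year)))
    g.add((paper, BIB.hasBookTitle, Literal(book_title)))
    for name in author_names:
        author = BNode()
        g.add((author, RDF.type, FOAF.Person))
        g.add((author, FOAF.name, Literal(name)))
        g.add((paper, FOAF.maker, author))
    return paper

g = Graph()
g.bind("foaf", FOAF)
g.bind("bib", BIB)
add_person(g, 12025, "Matthew Rowe", "m.rowe@dcs.shef.ac.uk",
           "http://www.dcs.shef.ac.uk/~mrowe/",
           "http://www.dcs.shef.ac.uk/~mrowe/publications.html")
print(g.serialize(format="turtle"))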
This Naive Smoothing Additive Smoothing legacy data, once extracted and converted into triples, is Attribute P R F1 P R F1 Title 0.941 0.698 0.801 0.901 0.589 0.712 provided as follows (again using Notation 3 syntax): Year 1 0.716 0.835 1.000 0.678 0.808 Author 0.952 0.717 0.818 0.934 0.687 0.792 Book Title 0.982 0.652 0.783 0.956 0.500 0.657 <http://data.dcs.shef.ac.uk/paper/239> Average 0.969 0.696 0.810 0.948 0.614 0.745 rdf:type bib:Entry ; bib:title "Interlinking Distributed Social Graphs." ; bib:hasYear "2009" ; bib:hasBookTitle "Proceedings of Linked Data on the Web Workshop, WWW , Madrid, Spain." ; similar to the performance when applying HMMs for per- foaf:maker _:a1 . son information extraction. Naive smoothing yields higher _:a1 foaf:name "Matthew Rowe" f-measure scores overall and almost perfect precision - indi- cating that the extracted information rarely contains mis- takes. However particularly for the paper title and the book 5. COREFERENCE RESOLUTION title several tokens are missed leading to incomplete titles. Following conversion of the DCS web site and publication This is something which must be addressed in future work as database we are provided with an RDF dataset containing the error will scale up to become detrimental to data quality. 17896 foaf:Person instances and 1088 bib:Entry instances. Using this dataset we must discover coreferring instances 4.6 Building RDF Models from Legacy Data such as equivalent people appearing in separate web pages Using HMMs together with Naive Smoothing we build an and identify publications which people have published. This RDF dataset describing all instances of people and publi- stage in the approach starts the process of compiling the cations within the department. This dataset provides the linked dataset which will be deployed for consumption. There- source dataset from which we build our linked dataset for fore we perform coreference resolution to identify equivalent deployment. We apply the above techniques to the en- instances in the dataset and fuse data together - this will tire dataset collected from the DCS web site in order to provide rich instance descriptions when a resource is looked build metadata models describing person information found up in our linked dataset. within each web document. We also apply the technique to build RDF models describing publications within the de- partment. In each case we use temporary URIs to provide 5.1 Building Research Groups unique RDF instances constructed from the extracted legacy Our produced linked dataset is intended to contain informa- data. We use a namespace to identify the RDF instance as tion about research groups and their members and publica- denoting a person http://data.dcs.shef.ac.uk/person/ tions. Therefore we generate an instance of foaf:Group for and append an incremented integer to form a new URI each research group and assign the group a minted URI using for a given person. For each person found within a given the group namespace http://data.dcs.shef.ac.uk/group HTML document we create instances of foaf:Person and and appending an abbreviation of the group name (e.g. nlp assign their name to the instance using foaf:name, hashed for the Natural Language Processing Group). We then as- emailed address using foaf:sha1 sum and homepage using sign a name to the group using foaf:name and the URL of foaf:homepage. We associate the person instance to the the group web page using foaf:workplaceHomepage. 
This web page within the department’s web site on which the produces the following: instance appeared using foaf:topic. An example instance of foaf:Person is as follows (using Notation 3 syntax). <http://data.dcs.shef.ac.uk/group/oak> rdf:type foaf:Group ; foaf:name "Organisations, Information and Knowledge <http://data.dcs.shef.ac.uk/person/12025> Group" ; rdf:type foaf:Person ; foaf:workplaceHomepage <http://oak.dcs.shef.ac.uk> foaf:name "Matthew Rowe" . <http://www.dcs.shef.ac.uk/~mrowe/publications.html> 4 foaf:topic <http://data.dcs.shef.ac.uk/person/12025> http://zeitkunst.org/bibtex/0.1/bibtex.owl# Once we have constructed all of the group instances we then CONSTRUCT { query our source dataset for all the people who appear on ?x owl:sameAs ?y . ?x foaf:page ?p each of the group personnel pages. This provides us with the } members of the DCS whose information is going to the com- WHERE { piles and deployed as linked data. This step in the approach <http://oak.dcs.shef.ac.uk/people> foaf:topic ?x . <http://oak.dcs.shef.ac.uk/people> foaf:topic ?y . acts as seeding the forthcoming coreference resolution pro- ?p foaf:topic ?z . cesses by compiling a set of members. It worth noting how- ?p foaf:topic ?u . ever that in doing so we are only considering a subset of ?x foaf:name ?n . ?y foaf:name ?m . the entire collection of foaf:Person instances. We plan to ?z foaf:name ?n . analyse this data in future work, however for now we are ?u foaf:name ?m . concerned with producing linked data describing the DCS. FILTER (<http://oak.dcs.shef.ac.uk/people> != ?p) } 5.2 Person Disambiguation We are provided with a set of people who are members of Using the above rules identifies web pages within the dcs the DCS, who either work or study there. We perform per- which cite the group members and their equivalent instances son disambiguation to identify other instances of foaf:Person from those pages. New instances of foaf:Person are con- in separate web documents which are in fact the same peo- structed for each member of the research groups within the ple as the DCS members. Our first person disambiguation department. For each group member we take the instances method uses Instance Smushing [13] to discover equivalent of foaf:Person and assign the information from the instance instances. This technique works by matching resources as- description to the new foaf:Person instance. This fuses the sociated with disparate RDF instances where the resources data from separate instances to provide a richer description. are associated with the instances using properties which are Also for each group member we assign each page where an defined as owl:inverseFunctionalProperty. An example of equivalent instance was found and relate this page to the instance smushing is the identification of equivalent person new foaf:Person instance using foaf:page. When the instance instances using the email address of the person. In essence is dereferenced this will provide links to all the web pages Instance Smushing uses the declarative characteristics of which cites the person. For each group member we create such properties to detect coreference. We smush instances a new minted URI according to ”Cool URIs for the Seman- of foaf:Person which appear on research group personnel tic Web”5 . 
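Because the coreference rules in this section are plain SPARQL CONSTRUCT queries, they can be executed with any SPARQL engine. As an indicative sketch (not our deployment), the foaf:homepage smushing rule can be run with rdflib over the source dataset and the inferred triples merged back in as follows; the dataset file name is a placeholder.

# Run a smushing-style CONSTRUCT rule with rdflib and merge the inferred
# owl:sameAs and foaf:page triples back into the source graph.
from rdflib import Graph

RULE = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
CONSTRUCT {
  ?x owl:sameAs ?y .
  ?x foaf:page ?p .
}
WHERE {
  <http://oak.dcs.shef.ac.uk/people> foaf:topic ?x .
  ?x foaf:homepage ?h .
  ?p foaf:topic ?y .
  ?y foaf:homepage ?h .
  FILTER (?p != <http://oak.dcs.shef.ac.uk/people>)
}
"""

g = Graph()
g.parse("dcs_source_dataset.ttl", format="turtle")  # placeholder file name
result = g.query(RULE)
inferred = result.graph  # CONSTRUCT results are exposed as a graph
for triple in inferred:
    g.add(triple)
print(len(inferred), "coreference triples inferred")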
We use the same person namespace as for the pages using the following SPARQL rule for foaf:homepage temporary URIs but with the person name as it appears on property: the group personnel page (with titles removed) and append this to the namespace to produce a URI for the DCS mem- ber (e.g. <http://data.dcs.shef.ac.uk/person/Matthew- PREFIX foaf:<http://xmlns.com/foaf/0.1/> PREFIX owl:<http://www.w3.org/2002/07/owl#> Rowe>). CONSTRUCT { ?x owl:sameAs ?y . ?x foaf:page ?p 5.3 Assigning People to Publications } Our linked dataset now contains instances of foaf:Group and WHERE { foaf:Person describing research groups and their members. <http://oak.dcs.shef.ac.uk/people> foaf:topic ?x . ?x foaf:homepage ?h . We must now identify publications which have been writ- ?p foaf:topic ?y ten by the group members. We implement a basic strategy ?y foaf:homepage ?h of name matching using an abbreviated form of the names FILTER (<http://oak.dcs.shef.ac.uk/people> != ?p) } of the group members. For instance for the name ”Matthew Rowe” we break down the name into several citation formats: ”M Rowe”, ”Rowe M”, ”M. Rowe”. The publication database The triple within the CONSTRUCT clause infers an owl:sameAs has no single strategy for naming and therefore several dif- relation between a member of the oak group and another ferent formats must be accounted for. It is worth noting the person instance on a separate web page, and infers that the imprecision such a strategy would lead to if it was applied page cites the group member - expressed using foaf:page. when interlinking data. However in this context it is appli- cable to use such a technique given the localised context of Our second person disambiguation technique employs per- the publication database - as it only stores publications by son co-occurence to identify coreferring instances of foaf:Person. members of the department. Using the above example we We assume that if a group member appears on a web page formulate queries based on several name abbreviations in or- with a coworker then that page will refer to them - this is a der to match a group member with the publications he/she basic intuition used throughout person disambiguation ap- has written. An example rule is as follows: proaches. Therefore we define a SPARQL rule to infer the same triples as the previous rule but this time modifying the PREFIX foaf:<http://xmlns.com/foaf/0.1/> graph pattern within the WHERE clause to match the name CONSTRUCT { of a member of the OAK group - listed on the group’s per- <http://data.dcs.shef.ac.uk/person/Matthew-Rowe> sonnel page - and the name of a colleague on a separate foaf:made ?p } page. WHERE { ?p rdf:type bib:Entry . 5 PREFIX foaf:<http://xmlns.com/foaf/0.1/> http://www.w3.org/TR/2007/WD-cooluris- PREFIX owl:<http://www.w3.org/2002/07/owl#> 20071217/#cooluris ?p foaf:maker ?x . ?group foaf:member ?q . ?x foaf:name ?n ?group foaf:member ?p . FILTER regex(?n, "M.*Rowe", "i") ?q foaf:name ?n . } ?p foaf:name ?c . GRAPH <http://www4.wiwiss.fu-berlin.de/dblp/> { ?paper dc:creator ?x . This rule finds an instance of bib:Entry which has an author ?x foaf:name ?n . whose name matches the above regular expression. The in- ?paper dc:creator ?y . ferred triple then constructs a relation between the group ?y foaf:name ?c . } member and the publication using foaf:made - indicating FILTER (?p != ?q) that the paper was produced by the person. 
For each paper } that is found to have been authored by a group member we place the paper and its description within the linked dataset. We maintain the same URI as before (containing the paper For each group within the linked dataset the above SPARQL namespace and the increment of the paper count). rule gathers all the group members and checks their names against the networked graph for publications where those Figure 3 shows a snippet of the compiled dataset. By enrich- people have worked together. The URI of the paper which ing data with formal semantics - where the data is leveraged matches the query is then assigned to the group members from heterogeneous sources - we are provided with a rich in using the foaf:made relation. The authors of the paper interpretation of legacy data. This allows SPARQL queries within the DBLP dataset are also detected as referring to to be performed over the dataset in order to extract knowl- the group members and are associated to those foaf:Person edge - this was previously limited without a large amount instances using owl:sameAs. Using such a query produces of manual processing. For instance we can ask for all the the following relations. groups have have worked together on papers and what were the papers called. <http://data.dcs.shef.ac.uk/person/Fabio-Ciravegna> owl:sameAs 6. LINKING TO THE WEB OF LINKED DATA <http://www4.wiwiss.fu-berlin.de/dblp/resource/person/169384> ; foaf:made At this stage in our approach we have extracted legacy <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/icml data as triples and have built an interlinked dataset describ- /IresonCCFKL05> ing people within the DCS, their publications and the re- foaf:made <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/ijcai search groups they are members of. This dataset must now /BrewsterCW01> be linked into the Web of Data to provide relations with equivalent resources and related information in distributed datasets. The advantage of this - from the perspective of In order to expose linked data we have deployed our dataset members of the DCS - is that once equivalent person in- using static RDF files according to Recipe 1 from ”How to stances are found within external bibliography databases Publish Linked Data”7 and Recipe 2 for slash namespaces then all the papers written by that person, and do not ap- from the ”Best Practices for Publishing RDF vocabularies”8 . pear on the DCS publication database, will be provided by This serves our purpose as URIs are dereferenceable in our looking up the URI of the DCS member. published data and will allow this deployment to be up- graded to more advanced setups, such as Drupal, in the near According to [8] author disambiguation is one of the com- future without the URIs returning a 404 response. mon problems faced by the linked data community. In cer- tain cases, wrongly created owl:sameAs links result in the incorrect collection of publications being returned when an 7. CONCLUSIONS This paper has presented an approach currently in use to author URI is looked up. For now we implement a conserva- convert legacy data to linked data. The paper places em- tion strategy to link members of the DCS with publications phasis on the first stage of the process triplification of legacy which they have authored which are contained within exter- data as this has been the most thoroughly investigated por- nal datasets. We use a similar person co-occurence strategy tion of the work. 
We believe that the results from the evalu- as when detecting equivalent foaf:Person instances previ- ation demonstrate the effectiveness of using trained Hidden ously. We assume that a person will author a paper with the Markov Models to extract legacy data from HTML docu- same people that they work with. We construct a SPARQL ments and RSS feeds. Although we have used such tech- rules which uses the notion of a networked graph [12] to niques to extract person information, the approach could be query the DBLP linked dataset6 . The rule works as follows: applied to other domains in which legacy data is locked away within HTML documents and devoid of machine-processable PREFIX foaf:<http://xmlns.com/foaf/0.1/> markup. In such cases the HMMs would be trained for the PREFIX dc:<http://purl.org/dc/terms/> specific information which is to be extracted. The triplifica- PREFIX owl:<http://www.w3.org/2002/07/owl#> tion and coreference resolution stages of the approach have CONSTRUCT { ?q foaf:made ?paper . provided a stable testbed on which we plan to explore sta- ?p foaf:made ?paper . tistical methods for interlinking our dataset into the Web of ?q owl:sameAs ?x . Linked Data. Our future work will investigate such methods ?p owl:sameAs ?y } in order to contribute to the Linked Data community. Once WHERE { 7 http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/ 6 8 http://www4.wiwiss.fu-berlin.de/dblp/ http://www.w3.org/TR/swbp-vocab-pub/ Figure 3: A snippet of the interlinked dataset following coreference resolution we have linked our linked dataset to additional datasets then People Search Evaluation Workshop (WePS 2009), we plan also provide VoiD descriptions [1] of those links to 18th WWW Conference, 2009. enable easier consumption of the data. At present our top [10] K. Möller, T. Heath, S. Handschuh, and J. Domingue. level components in the produced dataset are the research Recipes for semantic web dog food - the eswc and iswc groups. We plan to use this project as the blueprint for metadata projects. In 6th International and 2nd Asian producing linked data from all departments and faculties Semantic Web Conference (ISWC2007+ASWC2007), in the university, described using the Academic Institution pages 795–808, November 2007. Internal Structure Ontology9 . [11] E. Prud’hommeaux and A. Seaborne. SPARQL Query Language for RDF. Technical report, W3C, 2006. 8. REFERENCES [12] S. Schenk and S. Staab. Networked graphs: a [1] K. Alexander, R. Cyganiak, M. Hausenblas, and declarative mechanism for sparql rules, sparql views J. Zhao. Describing Linked Datasets - On the Design and rdf data integration on the web. In WWW ’08: and Usage of voiD, the ’Vocabulary of Interlinked Proceeding of the 17th international conference on Datasets’. In WWW 2009 Workshop: Linked Data on World Wide Web, pages 585–594, New York, NY, the Web (LDOW2009), Madrid, Spain, 2009. USA, 2008. ACM. [2] S. Auer, S. Dietzold, J. Lehmann, S. Hellmann, and [13] L. Shi, D. Berrueta, S. Fernandez, L. Polo, and D. Aumüller. Triplify light-weight linked data S. Ferna?dez. Smushing rdf instances: are alice and publication from relational databases. In 18th bob the same open source developer? In ISWC2008 International World Wide Web Conference workshop on Personal Identification and (WWW2009), April 2009. Collaborations: Knowledge Mediation and Extraction [3] F. Ciravegna, S. Chapman, A. Dingli, and Y. Wilks. (PICKME 2008), October 2008. Learning to harvest information for the semantic web. [14] D. Thamvijit, H. Chanlekha, C. 
Sirigayon, In Proceedings of the 1st European Semantic Web T. Permpool, and A. Kawtrakul. Person information Symposium (ESWS-2004), May 2004. extraction from the web. In 6th Symposium on 6th [4] P. Coetzee, T. Heath, and E. Motta. Sparqplug: Symposium of Natural Language Processing, 2005. Generating linked data from legacy html, sparql and [15] X. Wan, J. Gao, M. Li, and B. Ding. Person resolution the dom. In Linked Data on the Web (LDOW2008), in person search results: Webhawk. In CIKM ’05: 2008. Proceedings of the 14th ACM international conference [5] G. D. Forney. The viterbi algorithm. Proceedings of on Information and knowledge management, pages the IEEE, 61(3):268–278, 1973. 163–170, New York, NY, USA, 2005. ACM. [6] D. Freitag and A. K. Mccallum. Information [16] K. Watanabe, D. Bollegala, Y. Matsuo, and extraction with hmms and shrinkage. In In Proceedings M. Ishizuka. A two-step approach to extracting of the AAAI-99 Workshop on Machine Learning for attributes for people on the web. In 2nd Web People Information Extraction, pages 31–36, 1999. Search Evaluation Workshop (WePS 2009), 18th [7] E. Hetzner. A simple method for citation metadata WWW Conference, 2009. extraction using hidden markov models. In JCDL ’08: [17] B. Zhou, W. Liu, Y. Yang, W. Wang, and M. Zhang. Proceedings of the 8th ACM/IEEE-CS joint conference Effective metadata extraction from irregularly on Digital libraries, pages 280–284, New York, NY, structured web content. Technical report, HP USA, 2008. ACM. Laboratories, 2008. [8] A. Jaffri, H. Glaser, and I. Millard. Uri [18] J. Zou, D. Le, and G. R. Thoma. Structure and disambiguation in the context of linked data. In content analysis for html medical articles: a hidden Linked Data on the Web (LDOW2008), 2008. markov model approach. In DocEng ’07: Proceedings [9] M. Lan, Y. Z. Zhang, Y. Lu, J. Su, and C. L. Tan. of the 2007 ACM symposium on Document Which who are they? people attribute extraction and engineering, pages 199–201, New York, NY, USA, disambiguation in web search results. In 2nd Web 2007. ACM. 9 http://vocab.org/aiiso/schema#