Data.dcs: Converting Legacy Data into Linked Data∗

Matthew Rowe
OAK Group, Department of Computer Science, University of Sheffield
Regent Court, 211 Portobello Street, S1 4DP Sheffield, United Kingdom
m.rowe@dcs.shef.ac.uk

ABSTRACT
Data.dcs is a project intended to produce Linked Data describing the University of Sheffield's Department of Computer Science. At present the department's web site contains important legacy data describing people, publications and research groups. This data is distributed and is provided in heterogeneous formats (e.g. HTML documents, RSS feeds), making it hard for machines to make sense of such data and query it. This paper presents an approach to convert such legacy data from its current form into a machine-readable representation which is linked into the Web of Linked Data. The approach describes the triplification of legacy data, coreference resolution and interlinking with external linked datasets.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: General

General Terms
Linked Data

Keywords
Linked Data, Triplification, Coreference Resolution

1. INTRODUCTION
Recent work has addressed the issue of producing linked data from data sources conforming to well-structured relational databases [2]. In such cases data already follows a logical schema, making the creation of linked data a case of schema mapping and data transformation. The majority of the Web, however, does not conform to such a rigid representation; instead the heterogeneous structures and formats which it exhibits make it hard for machines to parse and interpret such data. This therefore makes the process of producing linked data limited.

In this paper we use the case of the University of Sheffield's Department of Computer Science (DCS). The DCS web site contains information about people - such as their name, email address, web address and location - research groups and publications. The department provides a publication database, located separately from the main site, on which DCS members manually upload their papers. Each member of the department is responsible for their own personal web page; this has led to the formatting and presentation of legacy data varying greatly between pages, where some pages contain RDFa and others are plain HTML documents with the bare minimum of markup. This impacts greatly on the usability of the site in general and slows the process by which information can be acquired. For instance, finding all the publications which two or more research groups have worked on in the past year would take a large amount of filtering and data processing. Furthermore the publication database is rarely updated to reflect publications by the department and its members.

This use case presents a clear motivation for generating a richer representation of legacy data describing the DCS. We define legacy data as data which is present in proprietary formats and which describes important information about the department - i.e. publications. Leveraging legacy data from the HTML documents which make up the DCS web site and converting this data into a machine-readable form using formal semantics would link together related information. It would link people with their publications, research groups with their members and allow co-authors of research papers to be found.
Furthermore by linking the dataset into exhibits makes it hard for machines to parse and interpret the Web of Linked Data would allow additional information ∗ The research leading to these results has received funding to be inferred such as the conferences which members of the from the EU project WeKnowIt10 (ICT-215453). DCS have attended and provide up-to-date publications list- ∗Copyright is held by the author/owner(s). ings - thereby avoiding the current slow update process by LDOW2010, April 27, 2010, Raleigh, USA. linking to popular bibliographic databases such as DBLP. In this paper we document our current efforts to convert this legacy data to linked data. We present our approach to pursue this goal which is comprised of three stages: first we perform triplification of legacy data found within the DCS - by extracting person information from HTML documents and publication information from the current bibliography system. Second we perform coreference resolution and inter- linking of the produced triples - thereby linking people with their publications and fusing data within separate HTML clues to regions within the documents from which person documents together. Third we connect our produced dataset information should be extracted. Once regions of extrac- to distributed linked datasets in order to provide additional tion have been identified then extraction patterns are used information to agents and humans browsing the dataset. to extract relevant information based on its proximity in the document. An effort to extract personal information (name, We have structured the paper as follows: section 2 describes email, homepage, telephone number) from within web pages related work in the field of producing linked data from legacy has been presented in [3] using a system called ”Armadillo”. data and discusses similar efforts to our problem setting ex- A lexicon of seed person names is compiled from several plored within the information extraction community. Sec- repositories which are then used to guide the information tion 3 presents a brief overview of our approach and the extraction process. Heuristics are used to extract person in- pipeline of the architecture which is employed. Section 4 de- formation surrounding a name which appears within a given scribes the triplication process which generates triples from web page. legacy data within HTML documents and the publication database. Section 5 presents the SPARQL rules we em- Work by [18] has explored the application of Hidden Markov ployed to discover coreferring entities. Section 6 describes Models to extract medical citations from a citation reposi- our preliminary method for weaving our dataset into the tory by inputting a sequence of tokens and then outputting linked data cloud. Section 7 finishes the paper with the con- the relevant labels for those tokens based on the HMM’s clusions which we have learnt from this work and our plans predicted states: Title, Author, Affiliation, Abstract and for future work. Reference. Prior to applying the HMMs, windows within HTML documents are derived known as component zones, 2. RELATED WORK or context windows, these zones within the HTML document are considered for analysis in order to extract information Recent efforts to construct linked data from legacy data from. Similar work by [7] has applied HMMs to the task include Sparqplug [4] where linked data models are con- of extracting citation information. 
Work within the field structed based on Document Object Model (DOM) struc- of attribute extraction has placed emphasis on the need to tures of HTML documents. The DOM is parsed into an extract information describing a given person from within RDF model which then permits SPARQL [11] queries to web pages. For instance [9] uses extraction patterns (i.e. be processed over the model and relevant information re- regular expressions) defined for different person attributes turned. Although this work is novel in its approach to se- to match content within HTML documents. An approach mantifying web documents, the approach is limited by its by [16] to extract person attributes from HTML documents lack of rich metadata descriptions attributed to elements first identifies a list of candidate attributes within a given within the DOM. Existing work by [2] presents an approach web page using hand crafted regular expressions - these are to expose linked data from relational databases by creating related to different individuals. All HTML markup is then lightweight mapping vocabularies. The effect is such that filtered out leaving the textual content of the documents. data which previously corresponded to a bespoke schema is Attributes which appear closest to a given person name are provided as RDF according to common ontological concepts. then assigned to that name. Metadata generation - so called triplification - is discussed extensively in [10] in order to generate metadata describing conferences, their proceedings, attendees and organisations 3. CONVERTING LEGACY DATA INTO participating. Due to the wide variation in the provided LINKED DATA data formats - i.e. excel spreadsheets, table documents - In order to convert legacy data into linked data we have im- metadata was generated by hand. Despite this such work plemented a pipeline approach. Figure 1 shows the overview provides a blue print for generating metadata by describing of this approach which is divided into three stages: the process in detail and the challenges faced. The challenges faced when converting legacy data devoid of • Triplification: the approach begins by taking as input metadata and semantic markup into a machine-processable an RSS feed describing the publications by DCS mem- form involves exposing such legacy data and then construct- bers and the DCS web site. Context windows are iden- ing metadata models describing the data. In the case of tified within the RSS feed - where each context win- the DCS web site our goal is to generate metadata describ- dow contains information about a single publication ing members of the department, therefore we must extract - and in the HTML documents - where each context this legacy data to enable the relevant metadata descrip- window contains information about a single person. tions to be built. Work within the field of information ex- Information is extracted from these context windows traction provides similar scenarios to the problems which and is then converted into triples, describing instances we face, For instance extraction of person information from of people and publications within the department. within HTML documents has been addressed in [14] by seg- menting HTML documents into components based on the • Coreference Resolution: SPARQL queries are processed Document DOM of the web pages. Person information is over the entire graph to discover coreferring entities: then extracted using induced wrappers from labelled per- e.g. the same people appearing in different web pages. 
sonal pages. [15] uses manually created name patterns to match person names within a web page and then, using a • Linking to the Web of Linked Data: the Web of Linked context window surrounding the match, extract contextu- Data Cloud is queried for coferring entities and related ally relevant information surrounding the name. The DOM information resources, and links are created from the of HTML documents is utilised in work by [17] to provide produced dataset. Figure 1: Three staged approach to convert legacy data to linked data Each of the stages of the approach contains various steps and element. We must identify such context windows within a processes which are essential to the production of a linked HTML document to enable the correct information to be dataset. We will now present each of these stages in greater extracted. To address this problem we rely on the markup detail, beginning with the triplication of legacy data. used within HTML documents to segment disjoint content. For instance in many web pages layout elements such as 4. TRIPLIFICATION OF LEGACY DATA
elements are used to contain information about a sin- The DCS web site contains listings of members of the depart- gle entity. Another
element is then used to contain ment: staff, researchers and students, and their associated information about another entity. Using such elements pro- information (name, email address, web address) provided vides the necessary means through which context windows within HTML documents. Such documents lack metadata can be identified - through the use of layout elements within descriptions which limits the applicability of automated pro- a DOM - and information extraction techniques can be ap- cesses to parse and interpret the data. Therefore we require plied to leverage the legacy data. We now explain how we some method to leverage legacy data which can then be con- generate context windows from HTML documents. verted into triples to allow machine-processing, for instance by associating a person with his/her name, email address, 4.1 Generating Context Windows etc. For publications we are confronted with a slightly differ- To derive a set of context windows from a given HTML doc- ent problem. We are provided with an RSS feed1 containing ument, we first tidy the HTML document into a parseable the publications within the department, this feed should be form using Apache Maven’s HTML Parser2 . HTML is often well structured with declarative elements for each attribute messy and contains poorly structured markup where HTML of a publication (i.e. title, authors, year, etc). Instead we tags are opened and not closed. This reduces its ability to be are returned the following: parsed where such techniques require a well-formed DOM. Once tidied the DOM is used as input to Algorithm 1 as follows: first a list of name patterns is loaded and applied to Interlinking Distributed Social Graphs http://publications.dcs.shef.ac.uk/show.php?record=4161 the DOM model, for each pattern the list of DOM elements which that pattern matches are collected (line 5). The pat- Proceedings of Linked Data on the Web Workshop, WWW 2009, Madrid, Spain. (2009). Madrid, Madrid, Spain.
tect the appearance of a person name within a given body
of text. Each
of the collected DOM elements is then verified as having not Mon, 07 Dec 2009 17:03:27 +0000 Sarah Duffy <s.duffy@dcs.shef.ac.uk> been been processed before (line 6) - as different name pat- terns may match the same person name at the same position in the document. The trigger string is extracted from the In the above XML the element contains the title element (line 8) noting the person’s name that was matched of the paper, however other paper attributes are not placed using the name pattern. The parent node type of the DOM within suitable elements - i.e. using <author> element for element (e) is then assessed to see if it is a hyperlink: it is the author of the paper. Instead all the data which de- common for a person name to appear within a HTML docu- scribes the paper is stored within the <descrption> element. ment as a hyperlinked element. If it is hyperlinked then the A technique is required which is able to extract informa- grandparent of the element is considered as a possible area tion from the <description> element which corresponds to from which the context window can be gathered. However the relevant attributes of the paper, for instance by extract- should the parent node of the element (e) not be hyperlinked ing ”Interlinking Distributed Social Graphs” for the title at- (line 12) then the parent is then passed onto the domManip tribute. function for assessment together with the trigger string. Unlike publications however, extracting person information Algorithm 2 (domManip) takes the trigger node and a node from HTML documents requires the derivation of a context from within the DOM and manipulates the DOM structure windows which contain person attributes - this akin to being to derive a suitable DOM element from which the context provided with the content within the above <description> window should be derived. First the node type is checked 1 2 http://pubs.dcs.shef.ac.uk http://htmlparser.sourceforge.net/ Algorithm 1 cwFind(dom) : Given the DOM of a HTML Algorithm 3 extractWin(trig,content) : Given a trigger document, returns a set of context windows string and a DOM element’s content, extracts the window Input: dom from the trigger onwards Output: Set of context windows C Input: trig and content 1: N =person name patterns Output: window 2: C = ∅ 1: maps = ∅ 3: visited = ∅ 2: N = person name patterns 4: for each n ∈ N do 3: remove(content, trig) 5: E = getElements(dom, n) 4: for each n ∈ N do 6: for each e ∈ E do 5: if match(content, n) then 7: if e.startIndex ∈/ visited then 6: maps = maps∪<n.startMatchIndex, n> 8: trig = extract(e, n) 7: end if 9: if e.parent.type ==< a > then 8: end for 10: c = domManip(e.parent.content, e.parent.parent) 9: if |maps| > 0 then 11: C = C ∪c 10: order(maps) 12: else 11: <i, n> = maps1 13: c = domManip(trig, e.parent) 12: return trig + content.substring(0, i) 14: C = C ∪c 13: else 15: end if 14: return content 16: visited = visited ∪ e.startIndex 15: end if 17: end if 18: end for 19: end for 20: return C the content string where the pattern match starts. Once all patterns have been applied to the content string, the map- pings, if there are any, are ordered by their start matching Algorithm 2 domManip(trig,node) : Given a trigger string points within the content string. 
The first mapping is then and the node which contains the trigger, derives the suitable chosen from the ordered set of mappings, given that this DOM element to extract the window from Input: trig and node provides the nearest point to the start of the content string Output: window where a person name appears. The content string is then 1: if node.type ==< td > then removed of the content following the earlier match (line 12), 2: window = extractWin(trig, node.parent.content.substring(trig)) the trigger string is appended back on to the start and this 3: else if node.type ==< style > then is returned as the context window. Should no mappings be 4: domManip(trig, node.parent) 5: else found (line 13) then the content string is returned as the 6: window = extractWin(trig, node.content.substring(trig)) context window. 7: end if The derived context window from extractWin feeds back to cwFind and populates the set of context window set for the to see if it is a <td> element - denoting that the trigger given HTML document. These algorithms for context win- appeared within a table in the HTML document (line 1). dow derivation provide a conservative strategy to identify If this is the case then the trigger string is passed to ex- areas of a HTML document from which person information tractWin together within the parent of the <td> element: can be extracted. It is conservative in the sense that it does the <tr> element which <td> is a child of. If the node is a not look above certain DOM element types (i.e. <div>) in- style element (<b>,<h1>,<span>, etc) (line 3) then dom- stead it relies on the logical segmentation of the document to Manip is recursively called using the trigger and the parent provide the necessary features which can be utilised to iden- node. Such elements control the presentation and styling tify context windows. Applying this approach to a HTML of a HTML document but do not control or segment the document containing the following markup would by trig- layout like <div>, <li> or <td> elements do. If neither of gered by the person name Matthew Rowe, the algorithms the above cases are true, and the node is a layout element would traverse two nodes up from the element containing via the process of elimation (line 5) then the content of the the trigger string until a <div> element is found. The con- node and the trigger string is passed on to the window ex- text window is then returned as the substring of the con- tractor. It is worth noting that the content of the nodes tent within the <div> element from the trigger string - the which is passed onto extractWin contains HTML markup person name Matthew Rowe onwards - until the end of the along with textual content. Unlike exisiting work within the <div> element’s content, returning the following: attribute extraction state of the art, markup is maintained as it provides clues which can aid the process of information <div> extraction (i.e. HTML tags acting as delimiters between Matthew Rowe</h4> person attributes). <img src="../images/people/matt.jpg"/> <p class="position">Ph.D. Student</p> <ul> Algorithm 3 (extractWin) takes the trigger string (i.e. the <li> person name) and HTML content string and derives the con- <a href="http://www.dcs.shef.ac.uk/ mrowe/"> text window. First a mapping set is initialised and the list of http://www.dcs.shef.ac.uk/ mrowe/ person name patterns is loaded (lines 1-2). 
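A rough sketch of the same idea is given below for illustration only. It is an assumption on our part: it uses Python with BeautifulSoup rather than the HTML Parser library named above, and the name pattern and layout-tag list are placeholders. The sketch finds each name-pattern match, climbs the DOM to the nearest layout element and slices the window from the trigger up to the next name match.

# Simplified sketch of Algorithms 1-3: find name matches, climb to the
# nearest layout element, cut the window from the trigger to the next name.
# BeautifulSoup, the name pattern and the layout-tag set are illustrative
# assumptions, not the implementation described in the paper.
import re
from bs4 import BeautifulSoup

NAME_PATTERN = re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+")  # placeholder name pattern
LAYOUT_TAGS = {"div", "td", "tr", "li"}                # elements that segment layout

def context_windows(html):
    """Return (trigger, window_markup) pairs for a tidied HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    windows = []
    for text_node in soup.find_all(string=NAME_PATTERN):
        trigger = NAME_PATTERN.search(str(text_node)).group(0)
        # climb past hyperlinks and style elements (<a>, <b>, <span>, ...)
        # until a layout element is reached, as domManip does
        node = text_node.parent
        while node is not None and node.name not in LAYOUT_TAGS:
            node = node.parent
        if node is None:
            continue
        markup = str(node)
        start = markup.find(trigger)
        if start == -1:
            continue
        # keep the markup after the trigger, truncated at the next person name
        rest = markup[start + len(trigger):]
        nxt = NAME_PATTERN.search(rest)
        window = trigger + (rest[:nxt.start()] if nxt else rest)
        windows.append((trigger, window))
    return windows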
The preamble of </a> </li> the HTML content string contains the trigger string, there- <li> fore this is removed to enable the name patterns to match <a href="mailto:m.rowe@dcs.shef.ac.uk"> the remaining content. Each pattern is applied to the con- M.Rowe@dcs.shef.ac.uk </a> tent string (line 4), if a match occurs (line 5) then the name </li> pattern is added to the mapping along with the index within </ul> <p>Researching Identity Disambiguation and Web 2.0</p> </div> 4.2 Extracting Legacy Data using Hidden Markov Models Figure 2: Topology of HMM for publication infor- Given a set of context windows derived from a HTML docu- mation extraction ment person information must now be extracted from the windows. Person information consists of four attributes: name, email, web page and location. The appearance and or- to the person topology, the topology of the HMM for pub- der in which these attributes appear in the context window lication information extraction also contains 4 major states can vary (e.g. (name,email,www ) or (email,name,location). corresponding to the four publication attributes. 3 Minor Context windows for publications are also provided using states are used to separate the major states and a single the content from all of the <description> elements within after state is included. Figure 2 shows the topology of the the RSS publication feed. Publication information also con- HMM used for publication information extraction. sists of four attributes: title, author, year and book. We use the bookTitle attribute to define where the publication ap- 4.2.2 Parameter Estimations pears, this could be a thesis - in which case it would be the Once the states have been decided for the information at- university - or a journal paper - in which case it would be tributes the remaining parameters of the HMM must be esti- the name of the journal publisher. mated. We must train the HMM to detect what transitions are more likely than others and to calculate the probabil- In order to extract both person and publication information ity of omitting a given symbol whilst in a given state. The from their relevant context windows Hidden Markov Models transition probability matrix A is built from labelled train- (HMMs) are used. HMMs provide a suitable solution to this ing data by counting the number of times a given state si problem setting by taking as input a sequence of observa- transits to state sj . This count is then normalised by the tions (e.g. tokens within a context window) and outputting total number of transitions from state si . Formally A is the most likely sequence of states where each state corre- populated as follows: sponds to a piece of information to be extracted. HMMs use Markov chains to work out the likelihood of moving from one state to another (si → sj ) and outputting symbol (σ when c(sj |si ) aij = P s∈S c(s|si ) in state sj ). A HMM is described as hidden in that it is given a known sequence observations with hidden states, it must therefore label these hidden states which correspond to Similar to A, the omission probability matrix B is built from person or publication attributes which are to be extracted. labelled training data. Counts are made of how many times A HMM consists of a set of States; S = {s1 , s2 , ..., sm }, a given symbol is observed in a given state, this count is a vocabulary of symbols; Σ = {σ1 , σ2 , ..., σn }, a transition then normalised by the total number of symbols omitted in probability matrix; A (where aij = P (sj |si )), an omission that state. 
B is therefore defined as follows: probability matrix; B (where biσ = P (σ|si )) and a start probability vector (where π where πi = P (si |sstart )). These σn |c(si ) parameters must be built, or estimated, from known infor- biσn = P σ∈Σ c(σ|si ) mation, essentially training the HMM from previous context windows to allow information to be extracted from future context windows. The start probability vector is built from the training data by counting how many times a given state is started in. This is then normalised by the total number of start states 4.2.1 HMM States observed: The topology of the HMM defines what states are to be used and how those states are connected together. States within c(si |sstart ) πi = P the HMM fall into two categories: major and minor states. s∈S c(s|sstart ) For person information extraction there are 4 major states which constitute the four person attributes. 13 minor states 4.2.3 Smoothing are defined in order to provide clues to the HMM and en- When estimating the parameters of the HMM from labelled hance the process of deriving the state sequence. Of those training data it is likely that certain state-to-state transi- 13 minor states there are 2 pre-major states (pre email and tions, or omissions whilst in a given state are not observed. pre www ), 10 separator states (e.g. between email and name) The trained HMM, when applied to test data, may find pre- and 1 after state which contains the symbols omitted at the viously unknown paths, or symbols omitted in states which end of the window. Omissions made within the minor states have previously not been witnessed. The model must be offer clues as to the order of the state sequence and what able to deal with such possibilities by smoothing the tran- information is to follow. For instance using the omission of sition and omission probabilities to cope with unseen ob- the token <a would indicate that a proceeding field might servations and transitions within the training data. One contain an email address or a web address. There is also such smoothing technique is known as Naive Smoothing [7] a single start state in which the HMM begins. The start . Naive Smoothing functions by setting zero probabilities in probability vector (π) contains the transition probabilities A and B to a very low constant of 10−7 . A and B are esti- P of moving from this state to another given state. Similar mated using normalised values such that |A| j=1 aij = 1 and P|B| k=1 biσk = 1, therefore when smoothing the zero prob- The success of our triplification technique depends on its abilities in both A and B the non-zero probabilities must ability to extract the maximum amount of person and publi- also be adjusted to ensure that that the distributions hold. cation information whilst ensuring that the extracted legacy n Therefore all non-zero probabilities have the value of m 10−7 data is accurate and contains no errors - as this could be subtracted from the current value where n is number of zero- detrimental to the linking of this data into the Web of Linked probability events and m is the number of non-zero proba- Data. Therefore we evaluate our triplification approach us- bility events. ing the evaluation measures precision and recall defined as precision = |A ∩ B|/|B| and recall = |A ∩ B|/|A| where A Another smoothing method using Additive Smoothing (Laplace denotes the set of relevant tokens, and B denotes the set Smoothing) described in [6] increments all zero counts when of retrieved tokens. 
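Written out in full, the transition, omission and start-probability estimates described in Section 4.2.2, the normalisation they satisfy, and the F-measure used in the evaluation are:

\[
a_{ij} = \frac{c(s_j \mid s_i)}{\sum_{s \in S} c(s \mid s_i)}, \qquad
b_{i\sigma} = \frac{c(\sigma \mid s_i)}{\sum_{\sigma' \in \Sigma} c(\sigma' \mid s_i)}, \qquad
\pi_i = \frac{c(s_i \mid s_{\mathrm{start}})}{\sum_{s \in S} c(s \mid s_{\mathrm{start}})},
\]
with \(\sum_{j} a_{ij} = 1\) and \(\sum_{\sigma} b_{i\sigma} = 1\) maintained after smoothing, and
\[
F_1 = \frac{2 \times \mathit{precision} \times \mathit{recall}}{\mathit{precision} + \mathit{recall}}.
\]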
Precision measures the proportion of building A and B by 1. This ensures that the respective tokens which were labelled correctly. Recall measures the transitions and omissions are then assigned a low probabil- proportion of correct tokens which were found. These mea- ity which is non-zero. It is worth noting that the use of sures gauge the accuracy of the labels and the ability of smoothing is only applicable if the training data does not the technique to find person information within the HTML sufficiently cover possible transitions which are likely to ap- document - as lower levels of recall indicate that person in- pear and observations which are present in test data. formation is missed. Evaluation is therefore performed on a per token basis for person information extraction within 4.3 Vocabulary Dimension Reduction a given HTML document and per token basis within the One of the parameters of a HMM is its vocabulary of sym- publication feed for publication information extraction by bols - where the term symbol is used to refer to a given assessing the accuracy of the HMM in labelling tokens with observation i.e. word, token, etc. This vocabulary contains their respective major state labels and the ability of the tech- all the possible observations or omissions which might be nique to detect context windows. F-measure (referred to in found within an input sequence. In [18] the vocabulary is the results as F1) provides the harmonic mean of precision compiled from a large corpus of words which make up all and recall as follows: the possible symbols that could be observed. This works well where the dictionary is a finite size, however in the case of HTML markup leads to new combinations. To solve f − measure = 2×precison×recall precision+recall this problem we control the dimension of the vocabulary that only a fixed number of symbols are used. Dimension control is performed using transformation functions as fol- The evaluation dataset was compiled by crawling the De- lows: a given input - i.e. the context window - is tokenized, partment of Computer Science web site3 - in a similar vein to each token is then transformed into its respective symbols work by [3]. All internal pages within the web site were col- using transformation functions. The transformed input se- lected, totaling 12,000 HTML documents, of this collection quence of symbols is then used to derive the correct state 3,500 documents were found to contain person information. sequence. The vocabulary contains 16 symbols, where each Context windows were derived for each of these documents symbol has a transformation function which matches the to- and 200 randomly selected context windows were used as ken to a given symbol. For instance the symbol First Name training data for the HMM. Each window is already tok- (FN) is used to identify a person’s name. Symbols are also enized, however for training each token is labeled with the used for different HTML tags such as an opening tag (e.g. state in which it appears. The test data was also compiled <a). Web data is noisy and contains a large amount of varia- by randomly selecting 40 URLs from the dataset and their tion in content form. Controlling the vocabulary of symbols respective context windows, therefore totalling 203 context allows previously unseen tokens to be handled appropriately. windows. A gold standard was then created for these win- dows by manually labelling the tokens within their respec- 4.4 Deriving Transition Paths tive states. 
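Assuming the labelled windows are available as sequences of (symbol, state) pairs, the parameter estimation and naive smoothing of Sections 4.2.2 and 4.2.3 and the Viterbi decoding of Section 4.4 reduce to the following sketch. It is indicative only; the function names and data layout are ours and not those of the implementation described here.

# Estimate A, B and pi from labelled windows (lists of (symbol, state) pairs),
# apply naive smoothing, and decode new windows with the Viterbi algorithm.
from collections import defaultdict

EPS = 1e-7  # naive-smoothing constant (10^-7)

def normalise(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

def smooth(dist, support):
    # Naive smoothing: unseen events get EPS; the added mass is subtracted
    # evenly from the seen events so the distribution still sums to one.
    unseen = [x for x in support if dist.get(x, 0.0) == 0.0]
    seen = [x for x in support if dist.get(x, 0.0) > 0.0]
    if not unseen or not seen:
        return {x: dist.get(x, 0.0) for x in support}
    penalty = len(unseen) * EPS / len(seen)
    out = {x: dist[x] - penalty for x in seen}
    out.update({x: EPS for x in unseen})
    return out

def estimate_parameters(windows, states, symbols):
    trans = {s: defaultdict(float) for s in states}   # counts for A
    emit = {s: defaultdict(float) for s in states}    # counts for B
    start = defaultdict(float)                        # counts for pi
    for window in windows:
        start[window[0][1]] += 1
        for symbol, state in window:
            emit[state][symbol] += 1
        for (_, prev), (_, curr) in zip(window, window[1:]):
            trans[prev][curr] += 1
    A = {s: smooth(normalise(trans[s]), states) for s in states}
    B = {s: smooth(normalise(emit[s]), symbols) for s in states}
    pi = smooth(normalise(start), states)
    return A, B, pi

def viterbi(symbols_seq, A, B, pi, states):
    # Most probable state path for an unlabelled window (log probabilities
    # would be used in practice to avoid underflow).
    V = [{s: pi.get(s, 0.0) * B[s].get(symbols_seq[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(symbols_seq)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                ((V[t - 1][r] * A[r].get(s, 0.0) * B[s].get(symbols_seq[t], 0.0), r)
                 for r in states), key=lambda pr: pr[0])
            V[t][s], back[t][s] = prob, prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(symbols_seq) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))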
Each URL was also manually analysed to find Given a tokenized context window which has been converted context windows which were missed by the context window into symbols, the Viterbi algorithm [5] is then used to cal- derivation algorithms, these were then added to the gold culate the most probable state path through the window. standard. We performed the same setup for publications This path is found using A and B: given the sequence of ob- by generating 200 tokenised context windows - using con- servations the path is returned composed of the maximum tent from <description> elements in the publication RSS likelihood estimates of moving from one state to another and feed - and labelling each of the tokens in each window with then omitting a given symbol. The Viterbi algorithm uses its respective states for training and choosing another 200 the learnt HMM, and its estimated parameters, as back- windows randomly for testing. ground knowledge of known transitions and omissions and assesses the input sequence to find clues as to the order of 4.5.1 Results states. This allows consistencies in the layout and presen- As the results from Table 1 and Table 2 show Naive Smooth- tation of person information to be utilised to extract in- ing achieves, on average, higher f-measure levels with respect formation for future tasks. For instance, it is common for to the alternative smoothing method. Additive Smoothing a person to hyperlink their name with their web address, yields poorer scores, particularly for labelling web addresses learning such patterns allows for future similar information and locations. Both smoothing techniques perform poorly extraction tasks to be recognised and the correct informa- in terms of recall when extracting location information. In tion extracted. terms of publication information extraction the results are 4.5 Evaluation 3 http://www.dcs.shef.ac.uk For publications we model extracted information using the Table 1: Accuracy levels of extracting person infor- Bibtex ontology4 by creating an instance of bib:Entry for mation using Hidden Markov Models with different each publication instance. We use a temporary URI for smoothing methods each publication instance by taking the publication names- Naive Smoothing Additive Smoothing pace http://data.dcs.shef.ac.uk/paper/ and appending Attribute P R F1 P R F1 an incremented integer for each each new publication. We Name 0.903 0.875 0.889 0.928 0.703 0.8 then assign the relevant attributes to the instance using Email 1 0.867 0.928 0.578 0.688 0.628 concepts from the Bibtex ontology. For the title we use WWW 0.849 0.833 0.841 0.714 0.714 0.714 Location 0.888 0.444 0.592 0.421 0.211 0.281 bib:title, for the year we use bib:hasYear and for the book Average 0.910 0.754 0.825 0.66 0.579 0.616 title we use bib:hasBookTitle. For each paper author we cre- ate a blank node typed as an instance of foaf:Person and assign the author name to the instance using foaf:name and associate this instance with the publication instance using Table 2: Accuracy levels of extracting publication foaf:maker. Referring back to the example from the begin- information using Hidden Markov Models with dif- ning of this section, the RSS feed provided by the publication ferent smoothing methods base contained publication information - containing all four attributes - within a single <description> element. 
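For illustration, instances of this kind could be minted with an RDF library such as rdflib. The following sketch is indicative only: rdflib itself, the attribute values shown, and the reading of the hashed email property as foaf:mbox_sha1sum are assumptions on our part rather than part of the implementation.

# Sketch: minting foaf:Person and bib:Entry instances from extracted
# attributes with rdflib (assumed toolkit); URIs follow the temporary-URI
# scheme of Section 4.6, and the attribute values are placeholders.
import hashlib
from rdflib import BNode, Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
BIB = Namespace("http://zeitkunst.org/bibtex/0.1/bibtex.owl#")
PERSON = Namespace("http://data.dcs.shef.ac.uk/person/")
PAPER = Namespace("http://data.dcs.shef.ac.uk/paper/")

def add_person(g, index, name, email, homepage, source_page):
    person = PERSON[str(index)]            # person namespace + incremented integer
    g.add((person, RDF.type, FOAF.Person))
    g.add((person, FOAF.name, Literal(name)))
    # hashed email address; foaf:mbox_sha1sum (sha1 of the mailto: form) is
    # assumed here as the concrete property
    sha1 = hashlib.sha1(("mailto:" + email).encode("utf-8")).hexdigest()
    g.add((person, FOAF.mbox_sha1sum, Literal(sha1)))
    g.add((person, FOAF.homepage, URIRef(homepage)))
    g.add((URIRef(source_page), FOAF.topic, person))
    return person

def add_paper(g, index, title, year, book_title, author_names):
    paper = PAPER[str(index)]
    g.add((paper, RDF.type, BIB.Entry))
    g.add((paper, BIB.title, Literal(title)))
    g.add((paper, BIB.hasYear, Literal(year)))
    g.add((paper, BIB.hasBookTitle, Literal(book_title)))
    for name in author_names:
        author = BNode()
        g.add((author, RDF.type, FOAF.Person))
        g.add((author, FOAF.name, Literal(name)))
        g.add((paper, FOAF.maker, author))
    return paper

g = Graph()
g.bind("foaf", FOAF)
g.bind("bib", BIB)
add_person(g, 12025, "Matthew Rowe", "m.rowe@dcs.shef.ac.uk",
           "http://www.dcs.shef.ac.uk/~mrowe/",
           "http://www.dcs.shef.ac.uk/~mrowe/publications.html")
print(g.serialize(format="turtle"))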
This Naive Smoothing Additive Smoothing legacy data, once extracted and converted into triples, is Attribute P R F1 P R F1 Title 0.941 0.698 0.801 0.901 0.589 0.712 provided as follows (again using Notation 3 syntax): Year 1 0.716 0.835 1.000 0.678 0.808 Author 0.952 0.717 0.818 0.934 0.687 0.792 Book Title 0.982 0.652 0.783 0.956 0.500 0.657 <http://data.dcs.shef.ac.uk/paper/239> Average 0.969 0.696 0.810 0.948 0.614 0.745 rdf:type bib:Entry ; bib:title "Interlinking Distributed Social Graphs." ; bib:hasYear "2009" ; bib:hasBookTitle "Proceedings of Linked Data on the Web Workshop, WWW , Madrid, Spain." ; similar to the performance when applying HMMs for per- foaf:maker _:a1 . son information extraction. Naive smoothing yields higher _:a1 foaf:name "Matthew Rowe" f-measure scores overall and almost perfect precision - indi- cating that the extracted information rarely contains mis- takes. However particularly for the paper title and the book 5. COREFERENCE RESOLUTION title several tokens are missed leading to incomplete titles. Following conversion of the DCS web site and publication This is something which must be addressed in future work as database we are provided with an RDF dataset containing the error will scale up to become detrimental to data quality. 17896 foaf:Person instances and 1088 bib:Entry instances. Using this dataset we must discover coreferring instances 4.6 Building RDF Models from Legacy Data such as equivalent people appearing in separate web pages Using HMMs together with Naive Smoothing we build an and identify publications which people have published. This RDF dataset describing all instances of people and publi- stage in the approach starts the process of compiling the cations within the department. This dataset provides the linked dataset which will be deployed for consumption. There- source dataset from which we build our linked dataset for fore we perform coreference resolution to identify equivalent deployment. We apply the above techniques to the en- instances in the dataset and fuse data together - this will tire dataset collected from the DCS web site in order to provide rich instance descriptions when a resource is looked build metadata models describing person information found up in our linked dataset. within each web document. We also apply the technique to build RDF models describing publications within the de- partment. In each case we use temporary URIs to provide 5.1 Building Research Groups unique RDF instances constructed from the extracted legacy Our produced linked dataset is intended to contain informa- data. We use a namespace to identify the RDF instance as tion about research groups and their members and publica- denoting a person http://data.dcs.shef.ac.uk/person/ tions. Therefore we generate an instance of foaf:Group for and append an incremented integer to form a new URI each research group and assign the group a minted URI using for a given person. For each person found within a given the group namespace http://data.dcs.shef.ac.uk/group HTML document we create instances of foaf:Person and and appending an abbreviation of the group name (e.g. nlp assign their name to the instance using foaf:name, hashed for the Natural Language Processing Group). We then as- emailed address using foaf:sha1 sum and homepage using sign a name to the group using foaf:name and the URL of foaf:homepage. We associate the person instance to the the group web page using foaf:workplaceHomepage. 
This web page within the department’s web site on which the produces the following: instance appeared using foaf:topic. An example instance of foaf:Person is as follows (using Notation 3 syntax). <http://data.dcs.shef.ac.uk/group/oak> rdf:type foaf:Group ; foaf:name "Organisations, Information and Knowledge <http://data.dcs.shef.ac.uk/person/12025> Group" ; rdf:type foaf:Person ; foaf:workplaceHomepage <http://oak.dcs.shef.ac.uk> foaf:name "Matthew Rowe" . <http://www.dcs.shef.ac.uk/~mrowe/publications.html> 4 foaf:topic <http://data.dcs.shef.ac.uk/person/12025> http://zeitkunst.org/bibtex/0.1/bibtex.owl# Once we have constructed all of the group instances we then CONSTRUCT { query our source dataset for all the people who appear on ?x owl:sameAs ?y . ?x foaf:page ?p each of the group personnel pages. This provides us with the } members of the DCS whose information is going to the com- WHERE { piles and deployed as linked data. This step in the approach <http://oak.dcs.shef.ac.uk/people> foaf:topic ?x . <http://oak.dcs.shef.ac.uk/people> foaf:topic ?y . acts as seeding the forthcoming coreference resolution pro- ?p foaf:topic ?z . cesses by compiling a set of members. It worth noting how- ?p foaf:topic ?u . ever that in doing so we are only considering a subset of ?x foaf:name ?n . ?y foaf:name ?m . the entire collection of foaf:Person instances. We plan to ?z foaf:name ?n . analyse this data in future work, however for now we are ?u foaf:name ?m . concerned with producing linked data describing the DCS. FILTER (<http://oak.dcs.shef.ac.uk/people> != ?p) } 5.2 Person Disambiguation We are provided with a set of people who are members of Using the above rules identifies web pages within the dcs the DCS, who either work or study there. We perform per- which cite the group members and their equivalent instances son disambiguation to identify other instances of foaf:Person from those pages. New instances of foaf:Person are con- in separate web documents which are in fact the same peo- structed for each member of the research groups within the ple as the DCS members. Our first person disambiguation department. For each group member we take the instances method uses Instance Smushing [13] to discover equivalent of foaf:Person and assign the information from the instance instances. This technique works by matching resources as- description to the new foaf:Person instance. This fuses the sociated with disparate RDF instances where the resources data from separate instances to provide a richer description. are associated with the instances using properties which are Also for each group member we assign each page where an defined as owl:inverseFunctionalProperty. An example of equivalent instance was found and relate this page to the instance smushing is the identification of equivalent person new foaf:Person instance using foaf:page. When the instance instances using the email address of the person. In essence is dereferenced this will provide links to all the web pages Instance Smushing uses the declarative characteristics of which cites the person. For each group member we create such properties to detect coreference. We smush instances a new minted URI according to ”Cool URIs for the Seman- of foaf:Person which appear on research group personnel tic Web”5 . 
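Because the coreference rules in this section are plain SPARQL CONSTRUCT queries, they can be executed with any SPARQL engine. As an indicative sketch (not our deployment), the foaf:homepage smushing rule can be run with rdflib over the source dataset and the inferred triples merged back in as follows; the dataset file name is a placeholder.

# Run a smushing-style CONSTRUCT rule with rdflib and merge the inferred
# owl:sameAs and foaf:page triples back into the source graph.
from rdflib import Graph

RULE = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
CONSTRUCT {
  ?x owl:sameAs ?y .
  ?x foaf:page ?p .
}
WHERE {
  <http://oak.dcs.shef.ac.uk/people> foaf:topic ?x .
  ?x foaf:homepage ?h .
  ?p foaf:topic ?y .
  ?y foaf:homepage ?h .
  FILTER (?p != <http://oak.dcs.shef.ac.uk/people>)
}
"""

g = Graph()
g.parse("dcs_source_dataset.ttl", format="turtle")  # placeholder file name
result = g.query(RULE)
inferred = result.graph  # CONSTRUCT results are exposed as a graph
for triple in inferred:
    g.add(triple)
print(len(inferred), "coreference triples inferred")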
We use the same person namespace as for the pages using the following SPARQL rule for foaf:homepage temporary URIs but with the person name as it appears on property: the group personnel page (with titles removed) and append this to the namespace to produce a URI for the DCS mem- ber (e.g. <http://data.dcs.shef.ac.uk/person/Matthew- PREFIX foaf:<http://xmlns.com/foaf/0.1/> PREFIX owl:<http://www.w3.org/2002/07/owl#> Rowe>). CONSTRUCT { ?x owl:sameAs ?y . ?x foaf:page ?p 5.3 Assigning People to Publications } Our linked dataset now contains instances of foaf:Group and WHERE { foaf:Person describing research groups and their members. <http://oak.dcs.shef.ac.uk/people> foaf:topic ?x . ?x foaf:homepage ?h . We must now identify publications which have been writ- ?p foaf:topic ?y ten by the group members. We implement a basic strategy ?y foaf:homepage ?h of name matching using an abbreviated form of the names FILTER (<http://oak.dcs.shef.ac.uk/people> != ?p) } of the group members. For instance for the name ”Matthew Rowe” we break down the name into several citation formats: ”M Rowe”, ”Rowe M”, ”M. Rowe”. The publication database The triple within the CONSTRUCT clause infers an owl:sameAs has no single strategy for naming and therefore several dif- relation between a member of the oak group and another ferent formats must be accounted for. It is worth noting the person instance on a separate web page, and infers that the imprecision such a strategy would lead to if it was applied page cites the group member - expressed using foaf:page. when interlinking data. However in this context it is appli- cable to use such a technique given the localised context of Our second person disambiguation technique employs per- the publication database - as it only stores publications by son co-occurence to identify coreferring instances of foaf:Person. members of the department. Using the above example we We assume that if a group member appears on a web page formulate queries based on several name abbreviations in or- with a coworker then that page will refer to them - this is a der to match a group member with the publications he/she basic intuition used throughout person disambiguation ap- has written. An example rule is as follows: proaches. Therefore we define a SPARQL rule to infer the same triples as the previous rule but this time modifying the PREFIX foaf:<http://xmlns.com/foaf/0.1/> graph pattern within the WHERE clause to match the name CONSTRUCT { of a member of the OAK group - listed on the group’s per- <http://data.dcs.shef.ac.uk/person/Matthew-Rowe> sonnel page - and the name of a colleague on a separate foaf:made ?p } page. WHERE { ?p rdf:type bib:Entry . 5 PREFIX foaf:<http://xmlns.com/foaf/0.1/> http://www.w3.org/TR/2007/WD-cooluris- PREFIX owl:<http://www.w3.org/2002/07/owl#> 20071217/#cooluris ?p foaf:maker ?x . ?group foaf:member ?q . ?x foaf:name ?n ?group foaf:member ?p . FILTER regex(?n, "M.*Rowe", "i") ?q foaf:name ?n . } ?p foaf:name ?c . GRAPH <http://www4.wiwiss.fu-berlin.de/dblp/> { ?paper dc:creator ?x . This rule finds an instance of bib:Entry which has an author ?x foaf:name ?n . whose name matches the above regular expression. The in- ?paper dc:creator ?y . ferred triple then constructs a relation between the group ?y foaf:name ?c . } member and the publication using foaf:made - indicating FILTER (?p != ?q) that the paper was produced by the person. 
For each paper } that is found to have been authored by a group member we place the paper and its description within the linked dataset. We maintain the same URI as before (containing the paper For each group within the linked dataset the above SPARQL namespace and the increment of the paper count). rule gathers all the group members and checks their names against the networked graph for publications where those Figure 3 shows a snippet of the compiled dataset. By enrich- people have worked together. The URI of the paper which ing data with formal semantics - where the data is leveraged matches the query is then assigned to the group members from heterogeneous sources - we are provided with a rich in using the foaf:made relation. The authors of the paper interpretation of legacy data. This allows SPARQL queries within the DBLP dataset are also detected as referring to to be performed over the dataset in order to extract knowl- the group members and are associated to those foaf:Person edge - this was previously limited without a large amount instances using owl:sameAs. Using such a query produces of manual processing. For instance we can ask for all the the following relations. groups have have worked together on papers and what were the papers called. <http://data.dcs.shef.ac.uk/person/Fabio-Ciravegna> owl:sameAs 6. LINKING TO THE WEB OF LINKED DATA <http://www4.wiwiss.fu-berlin.de/dblp/resource/person/169384> ; foaf:made At this stage in our approach we have extracted legacy <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/icml data as triples and have built an interlinked dataset describ- /IresonCCFKL05> ing people within the DCS, their publications and the re- foaf:made <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/ijcai search groups they are members of. This dataset must now /BrewsterCW01> be linked into the Web of Data to provide relations with equivalent resources and related information in distributed datasets. The advantage of this - from the perspective of In order to expose linked data we have deployed our dataset members of the DCS - is that once equivalent person in- using static RDF files according to Recipe 1 from ”How to stances are found within external bibliography databases Publish Linked Data”7 and Recipe 2 for slash namespaces then all the papers written by that person, and do not ap- from the ”Best Practices for Publishing RDF vocabularies”8 . pear on the DCS publication database, will be provided by This serves our purpose as URIs are dereferenceable in our looking up the URI of the DCS member. published data and will allow this deployment to be up- graded to more advanced setups, such as Drupal, in the near According to [8] author disambiguation is one of the com- future without the URIs returning a 404 response. mon problems faced by the linked data community. In cer- tain cases, wrongly created owl:sameAs links result in the incorrect collection of publications being returned when an 7. CONCLUSIONS This paper has presented an approach currently in use to author URI is looked up. For now we implement a conserva- convert legacy data to linked data. The paper places em- tion strategy to link members of the DCS with publications phasis on the first stage of the process triplification of legacy which they have authored which are contained within exter- data as this has been the most thoroughly investigated por- nal datasets. We use a similar person co-occurence strategy tion of the work. 
We believe that the results from the evalu- as when detecting equivalent foaf:Person instances previ- ation demonstrate the effectiveness of using trained Hidden ously. We assume that a person will author a paper with the Markov Models to extract legacy data from HTML docu- same people that they work with. We construct a SPARQL ments and RSS feeds. Although we have used such tech- rules which uses the notion of a networked graph [12] to niques to extract person information, the approach could be query the DBLP linked dataset6 . The rule works as follows: applied to other domains in which legacy data is locked away within HTML documents and devoid of machine-processable PREFIX foaf:<http://xmlns.com/foaf/0.1/> markup. In such cases the HMMs would be trained for the PREFIX dc:<http://purl.org/dc/terms/> specific information which is to be extracted. The triplifica- PREFIX owl:<http://www.w3.org/2002/07/owl#> tion and coreference resolution stages of the approach have CONSTRUCT { ?q foaf:made ?paper . provided a stable testbed on which we plan to explore sta- ?p foaf:made ?paper . tistical methods for interlinking our dataset into the Web of ?q owl:sameAs ?x . Linked Data. Our future work will investigate such methods ?p owl:sameAs ?y } in order to contribute to the Linked Data community. Once WHERE { 7 http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/ 6 8 http://www4.wiwiss.fu-berlin.de/dblp/ http://www.w3.org/TR/swbp-vocab-pub/ Figure 3: A snippet of the interlinked dataset following coreference resolution we have linked our linked dataset to additional datasets then People Search Evaluation Workshop (WePS 2009), we plan also provide VoiD descriptions [1] of those links to 18th WWW Conference, 2009. enable easier consumption of the data. At present our top [10] K. Möller, T. Heath, S. Handschuh, and J. Domingue. level components in the produced dataset are the research Recipes for semantic web dog food - the eswc and iswc groups. We plan to use this project as the blueprint for metadata projects. In 6th International and 2nd Asian producing linked data from all departments and faculties Semantic Web Conference (ISWC2007+ASWC2007), in the university, described using the Academic Institution pages 795–808, November 2007. Internal Structure Ontology9 . [11] E. Prud’hommeaux and A. Seaborne. SPARQL Query Language for RDF. Technical report, W3C, 2006. 8. REFERENCES [12] S. Schenk and S. Staab. Networked graphs: a [1] K. Alexander, R. Cyganiak, M. Hausenblas, and declarative mechanism for sparql rules, sparql views J. Zhao. Describing Linked Datasets - On the Design and rdf data integration on the web. In WWW ’08: and Usage of voiD, the ’Vocabulary of Interlinked Proceeding of the 17th international conference on Datasets’. In WWW 2009 Workshop: Linked Data on World Wide Web, pages 585–594, New York, NY, the Web (LDOW2009), Madrid, Spain, 2009. USA, 2008. ACM. [2] S. Auer, S. Dietzold, J. Lehmann, S. Hellmann, and [13] L. Shi, D. Berrueta, S. Fernandez, L. Polo, and D. Aumüller. Triplify light-weight linked data S. Ferna?dez. Smushing rdf instances: are alice and publication from relational databases. In 18th bob the same open source developer? In ISWC2008 International World Wide Web Conference workshop on Personal Identification and (WWW2009), April 2009. Collaborations: Knowledge Mediation and Extraction [3] F. Ciravegna, S. Chapman, A. Dingli, and Y. Wilks. (PICKME 2008), October 2008. Learning to harvest information for the semantic web. [14] D. Thamvijit, H. Chanlekha, C. 
Sirigayon, In Proceedings of the 1st European Semantic Web T. Permpool, and A. Kawtrakul. Person information Symposium (ESWS-2004), May 2004. extraction from the web. In 6th Symposium on 6th [4] P. Coetzee, T. Heath, and E. Motta. Sparqplug: Symposium of Natural Language Processing, 2005. Generating linked data from legacy html, sparql and [15] X. Wan, J. Gao, M. Li, and B. Ding. Person resolution the dom. In Linked Data on the Web (LDOW2008), in person search results: Webhawk. In CIKM ’05: 2008. Proceedings of the 14th ACM international conference [5] G. D. Forney. The viterbi algorithm. Proceedings of on Information and knowledge management, pages the IEEE, 61(3):268–278, 1973. 163–170, New York, NY, USA, 2005. ACM. [6] D. Freitag and A. K. Mccallum. Information [16] K. Watanabe, D. Bollegala, Y. Matsuo, and extraction with hmms and shrinkage. In In Proceedings M. Ishizuka. A two-step approach to extracting of the AAAI-99 Workshop on Machine Learning for attributes for people on the web. In 2nd Web People Information Extraction, pages 31–36, 1999. Search Evaluation Workshop (WePS 2009), 18th [7] E. Hetzner. A simple method for citation metadata WWW Conference, 2009. extraction using hidden markov models. In JCDL ’08: [17] B. Zhou, W. Liu, Y. Yang, W. Wang, and M. Zhang. Proceedings of the 8th ACM/IEEE-CS joint conference Effective metadata extraction from irregularly on Digital libraries, pages 280–284, New York, NY, structured web content. Technical report, HP USA, 2008. ACM. Laboratories, 2008. [8] A. Jaffri, H. Glaser, and I. Millard. Uri [18] J. Zou, D. Le, and G. R. Thoma. Structure and disambiguation in the context of linked data. In content analysis for html medical articles: a hidden Linked Data on the Web (LDOW2008), 2008. markov model approach. In DocEng ’07: Proceedings [9] M. Lan, Y. Z. Zhang, Y. Lu, J. Su, and C. L. Tan. of the 2007 ACM symposium on Document Which who are they? people attribute extraction and engineering, pages 199–201, New York, NY, USA, disambiguation in web search results. In 2nd Web 2007. ACM. 9 http://vocab.org/aiiso/schema#