Data.dcs: Converting Legacy Data into Linked Data∗

Matthew Rowe
OAK Group, Department of Computer Science, University of Sheffield
Regent Court, 211 Portobello Street, S1 4DP, Sheffield, United Kingdom
m.rowe@dcs.shef.ac.uk

∗The research leading to these results has received funding from the EU project WeKnowIt (ICT-215453).
Copyright is held by the author/owner(s). LDOW2010, April 27, 2010, Raleigh, USA.

ABSTRACT
Data.dcs is a project intended to produce Linked Data describing the University of Sheffield's Department of Computer Science. At present the department's web site contains important legacy data describing people, publications and research groups. This data is distributed and is provided in heterogeneous formats (e.g. HTML documents, RSS feeds), making it hard for machines to make sense of such data and query it. This paper presents an approach to convert such legacy data from its current form into a machine-readable representation which is linked into the Web of Linked Data. The approach comprises the triplification of legacy data, coreference resolution and interlinking with external linked datasets.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: General

General Terms
Linked Data

Keywords
Linked Data, Triplification, Coreference Resolution

1. INTRODUCTION
Recent work has addressed the issue of producing linked data from data sources conforming to well-structured relational databases [2]. In such cases the data already follows a logical schema, making the creation of linked data a case of schema mapping and data transformation. The majority of the Web, however, does not conform to such a rigid representation; instead, the heterogeneous structures and formats it exhibits make it hard for machines to parse and interpret such data. This limits the process of producing linked data.

In this paper we use the case of the University of Sheffield's Department of Computer Science (DCS). The DCS web site contains information about people - such as their name, email address, web address and location - research groups, and publications. The department provides a publication database, located separately from the main site, to which DCS members manually upload their papers. Each member of the department is responsible for their own personal web page; this has led the formatting and presentation of legacy data to vary greatly between pages, where some pages contain RDFa and others are plain HTML documents with the bare minimum of markup. This greatly impairs the usability of the site and slows the process by which information can be acquired. For instance, finding all the publications which two or more research groups have worked on in the past year would take a large amount of filtering and data processing. Furthermore, the publication database is rarely updated to reflect publications by the department and its members.
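To make the cost of such questions concrete: once the data is available as RDF, the cross-group publication query above reduces to a few lines of SPARQL. The following is an illustrative sketch only - the foaf: and dcterms: properties stand in for whatever vocabulary the final dataset adopts, and are not prescribed by this paper:

    PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>

    # Publications from the past year whose co-authors span two
    # different research groups.
    SELECT DISTINCT ?publication
    WHERE {
      ?publication foaf:maker ?author1, ?author2 ;
                   dcterms:issued ?date .
      ?group1 foaf:member ?author1 .
      ?group2 foaf:member ?author2 .
      FILTER (?group1 != ?group2)
      FILTER (?date >= "2009-04-27"^^xsd:date)
    }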
This use case presents a clear motivation for generating a richer representation of legacy data describing the DCS. We define legacy data as data which is present in proprietary formats and which describes important information about the department - e.g. publications. Leveraging legacy data from the HTML documents which make up the DCS web site, and converting this data into a machine-readable form using formal semantics, would link together related information: it would link people with their publications, research groups with their members, and allow co-authors of research papers to be found. Furthermore, linking the dataset into the Web of Linked Data would allow additional information to be inferred - such as the conferences which members of the DCS have attended - and would provide up-to-date publication listings by linking to popular bibliographic databases such as DBLP, thereby avoiding the current slow update process.

In this paper we document our current efforts to convert this legacy data to linked data. Our approach comprises three stages: first, we perform triplification of the legacy data found within the DCS, extracting person information from HTML documents and publication information from the current bibliography system; second, we perform coreference resolution and interlinking over the produced triples, thereby linking people with their publications and fusing data from separate HTML documents together; third, we connect our produced dataset to distributed linked datasets in order to provide additional information to agents and humans browsing the dataset.

We have structured the paper as follows: section 2 describes related work in the field of producing linked data from legacy data and discusses efforts similar to our problem setting explored within the information extraction community. Section 3 presents a brief overview of our approach and the pipeline architecture which is employed. Section 4 describes the triplification process which generates triples from legacy data within HTML documents and the publication database. Section 5 presents the SPARQL rules we employed to discover coreferring entities. Section 6 describes our preliminary method for weaving our dataset into the linked data cloud. Section 7 finishes the paper with the conclusions we have drawn from this work and our plans for future work.

2. RELATED WORK
Recent efforts to construct linked data from legacy data include Sparqplug [4], where linked data models are constructed based on the Document Object Model (DOM) structures of HTML documents. The DOM is parsed into an RDF model, which then permits SPARQL [11] queries to be processed over the model and relevant information returned. Although this work is novel in its approach to semantifying web documents, it is limited by the lack of rich metadata descriptions attributed to elements within the DOM. Existing work by [2] presents an approach to expose linked data from relational databases by creating lightweight mapping vocabularies. The effect is that data which previously corresponded to a bespoke schema is provided as RDF according to common ontological concepts. Metadata generation - so-called triplification - is discussed extensively in [10], where metadata is generated describing conferences, their proceedings, attendees and participating organisations. Due to the wide variation in the provided data formats - e.g. Excel spreadsheets, table documents - the metadata was generated by hand. Despite this, such work provides a blueprint for generating metadata by describing the process in detail and the challenges faced.

The challenge faced when converting legacy data devoid of metadata and semantic markup into a machine-processable form involves exposing such legacy data and then constructing metadata models describing the data. In the case of the DCS web site our goal is to generate metadata describing members of the department; therefore we must extract this legacy data to enable the relevant metadata descriptions to be built. Work within the field of information extraction provides scenarios similar to the problems we face. For instance, extraction of person information from within HTML documents has been addressed in [14] by segmenting HTML documents into components based on the DOM of the web pages; person information is then extracted using wrappers induced from labelled personal pages. [15] uses manually created name patterns to match person names within a web page and then, using a context window surrounding the match, extracts contextually relevant information surrounding the name. The DOM of HTML documents is utilised in work by [17] to provide clues to regions within the documents from which person information should be extracted. Once regions of extraction have been identified, extraction patterns are used to extract relevant information based on its proximity in the document. An effort to extract personal information (name, email, homepage, telephone number) from within web pages has been presented in [3] using a system called "Armadillo". A lexicon of seed person names is compiled from several repositories and then used to guide the information extraction process; heuristics are used to extract person information surrounding a name which appears within a given web page.
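A minimal sketch may help to fix ideas about this family of heuristic, pattern-based extraction. The patterns and the fixed-size proximity window below are simplified placeholders; the cited systems rely on much richer hand-crafted pattern sets and seed lexicons:

    import re

    # Simplified extraction patterns for person attributes; real systems
    # use far richer hand-crafted pattern sets.
    PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
        "phone": re.compile(r"\+?\d[\d\s()-]{6,}\d"),
    }

    def extract_attributes(text, name, window=200):
        """Assign to a person the attributes appearing closest to their
        name - the proximity heuristic described above."""
        attributes = {}
        pos = text.find(name)
        if pos == -1:
            return attributes
        # Only content within a fixed-size window around the name is used.
        context = text[max(0, pos - window): pos + len(name) + window]
        for label, pattern in PATTERNS.items():
            match = pattern.search(context)
            if match:
                attributes[label] = match.group()
        return attributes

    page = "Matthew Rowe, Ph.D. Student. Email: m.rowe@dcs.shef.ac.uk"
    print(extract_attributes(page, "Matthew Rowe"))
    # -> {'email': 'm.rowe@dcs.shef.ac.uk'}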
Work by [18] has explored the application of Hidden Markov Models (HMMs) to extract medical citations from a citation repository, taking a sequence of tokens as input and outputting the relevant labels for those tokens based on the HMM's predicted states: Title, Author, Affiliation, Abstract and Reference. Prior to applying the HMMs, windows within the HTML documents - known as component zones, or context windows - are derived; these zones are the portions of the document considered for analysis when extracting information. Similar work by [7] has applied HMMs to the task of extracting citation information.

Work within the field of attribute extraction has placed emphasis on the need to extract information describing a given person from within web pages. For instance, [9] uses extraction patterns (i.e. regular expressions) defined for different person attributes to match content within HTML documents. An approach by [16] to extract person attributes from HTML documents first identifies a list of candidate attributes within a given web page using hand-crafted regular expressions, where these candidates may relate to different individuals. All HTML markup is then filtered out, leaving the textual content of the documents, and attributes which appear closest to a given person name are assigned to that name.

3. CONVERTING LEGACY DATA INTO LINKED DATA

Figure 1: Three staged approach to convert legacy data to linked data

In order to convert legacy data into linked data we have implemented a pipeline approach. Figure 1 shows an overview of this approach, which is divided into three stages:

• Triplification: the approach begins by taking as input an RSS feed describing publications by DCS members, together with the DCS web site. Context windows are identified within the RSS feed - where each context window contains information about a single publication - and within the HTML documents - where each context window contains information about a single person. Information is extracted from these context windows and converted into triples describing instances of people and publications within the department.

• Coreference Resolution: SPARQL queries are processed over the entire graph to discover coreferring entities, e.g. the same person appearing in different web pages (these first two stages are sketched in code after this list).

• Linking to the Web of Linked Data: the Web of Linked Data is queried for coreferring entities and related information resources, and links are created from the produced dataset.
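The sketch below illustrates the first two stages end to end on a toy example: triples are minted for two mentions of the same person extracted from different context windows, and a SPARQL query then discovers that they corefer. The namespace, the use of FOAF and the shared-mailbox rule are illustrative assumptions; the rules actually employed are presented in section 5:

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import FOAF, RDF

    # Hypothetical namespace for minted resources; not the project's URIs.
    DCS = Namespace("http://data.dcs.example.org/")

    g = Graph()

    # Stage 1 - Triplification: each extracted context window yields
    # triples describing one person instance.
    for uri, name, mbox in [
        (DCS["person/1"], "Matthew Rowe", "mailto:m.rowe@dcs.shef.ac.uk"),
        (DCS["person/2"], "M. Rowe", "mailto:m.rowe@dcs.shef.ac.uk"),
    ]:
        g.add((uri, RDF.type, FOAF.Person))
        g.add((uri, FOAF.name, Literal(name)))
        g.add((uri, FOAF.mbox, URIRef(mbox)))

    # Stage 2 - Coreference Resolution: a SPARQL rule marks two person
    # instances as coreferring when they share a mailbox.
    COREF_RULE = """
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT DISTINCT ?a ?b WHERE {
        ?a a foaf:Person ; foaf:mbox ?m .
        ?b a foaf:Person ; foaf:mbox ?m .
        FILTER (?a != ?b)
    }
    """
    for a, b in g.query(COREF_RULE):
        print(f"{a} owl:sameAs {b}")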
Each of the stages of the approach contains various steps and processes which are essential to the production of a linked dataset. We will now present each of these stages in greater detail, beginning with the triplification of legacy data.

4. TRIPLIFICATION OF LEGACY DATA
We must identify such context windows within an HTML document to enable the correct information to be extracted. To address this problem we rely on the markup used within HTML documents to segment disjoint content.
[Truncated in the source: the paragraph continues "For instance in many web pages layout elements such as ...", followed by an example excerpt from a personal page reading "Ph.D. Student" and "Researching Identity Disambiguation and Web 2.0".]
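To illustrate the markup-based segmentation just described: a sketch assuming a staff-list page in which each table cell describes one person. BeautifulSoup and the page content (apart from the fragment quoted above) are illustrative choices, not the paper's implementation:

    from bs4 import BeautifulSoup  # assumed helper library

    # A toy staff-list page; only the first entry's wording comes from
    # the fragment above, the second entry is invented for the example.
    STAFF_PAGE = """
    <table>
      <tr><td><a href="/~mrowe">Matthew Rowe</a><br/>
          Ph.D. Student<br/>
          Researching Identity Disambiguation and Web 2.0</td></tr>
      <tr><td><a href="/~jbloggs">Joe Bloggs</a><br/>
          Lecturer</td></tr>
    </table>
    """

    def context_windows(html):
        """Segment a page into context windows using its layout markup:
        here each table cell is assumed to describe a single person."""
        soup = BeautifulSoup(html, "html.parser")
        return [cell.get_text(" ", strip=True)
                for cell in soup.find_all("td")]

    for window in context_windows(STAFF_PAGE):
        print(window)
    # Matthew Rowe Ph.D. Student Researching Identity Disambiguation and Web 2.0
    # Joe Bloggs Lecturer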