<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Toward the Semantic Web - An Approach to Reverse Engineering of Relational Databases to Ontologies</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Irina</forename><surname>Astrova</surname></persName>
							<email>irinaastrova@yahoo.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Tallinn University of Technology</orgName>
								<address>
									<addrLine>Ehitajate tee 5</addrLine>
									<postCode>19086</postCode>
									<settlement>Tallinn</settlement>
									<country key="EE">Estonia</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Toward the Semantic Web - An Approach to Reverse Engineering of Relational Databases to Ontologies</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">C2C5680A495395C0F1FF49491382A134</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T11:14+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We propose a novel approach to reverse engineering of relational databases to ontologies. Our approach incorporates two main sources of semantics: HTML pages and a relational schema. As a result, (1) only minimal information about a relational database is required to build an ontology; and (2) the ontology is no longer "impaired" by bad database design or by optimization and de-normalization of the relational schema. Our approach can be used for migrating HTML pages (especially those that are dynamically generated from a relational database) to the ontology-based Semantic Web. The main reason for this migration is to make the relational database information that is available on the Web machine-processable.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>One of the main driving forces for the Semantic Web has always been the expression, on the Web, of the vast amount of relational database information in a way that can be processed by machines <ref type="bibr" target="#b0">[1]</ref>. Indeed, most information on the Web is not machine-processable, because it is often represented in HTML (Hypertext Markup Language) <ref type="bibr" target="#b1">[2]</ref>. This language describes what the information looks like, not what it is. In order for machines to process the information, it must be represented in an ontology language, e.g. Frame Logic (F-Logic) <ref type="bibr" target="#b2">[3]</ref>, and linked to ontologies. An ontology can be used for annotating HTML pages with semantics.</p><p>Manual or semi-automatic semantic annotation <ref type="bibr" target="#b3">[4]</ref> is time-consuming, subjective and error-prone. It is even impossible on the scale of the Web, which contains billions of pages. Most pages do not even exist until they are dynamically generated from relational databases when HTML forms are submitted.</p><p>An alternative to semantic annotation is automatic or semi-automatic reverse engineering of relational databases to ontologies <ref type="bibr" target="#b4">[5]</ref>. However, because of the novelty of that area, there are few approaches that consider an ontology as the target for reverse engineering. The majority of the work has been done on extracting a conceptual schema, such as an entity-relationship model, from relational databases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Common Problems of Reverse Engineering</head><p>At first glance, it seems easy to reverse engineer a relational database to an ontology: just map each relation to a class, each attribute in the relation to an attribute in the class, each tuple to an instance, and each constraint to an axiom <ref type="bibr" target="#b5">[6]</ref>. This provides simple and fully automatic (i.e. without user interaction) reverse engineering. So why would we not want to do this? The easy approach ignores common problems of reverse engineering, e.g.:</p><p>− Optimization and de-normalization: A relational schema is often optimized and de-normalized for performance reasons <ref type="bibr" target="#b6">[7]</ref>.</p><p>− Unrealistic assumptions: Many organizations believe in keeping all their data in third normal form. However, every database designer has war stories about finding the entire relational schema in first normal form instead of third normal form <ref type="bibr" target="#b6">[7]</ref>.</p><p>− Bad database design: A relational schema is often badly designed, because the design may be done by novice or untrained database designers who are not familiar with database theory and database methodology <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9]</ref>.</p><p>− Non-translated constructs: Since a relational schema does not support all constructs of a conceptual schema, some of the semantics captured in the conceptual schema (e.g. inheritance) will necessarily be lost when translating the schema from conceptual to relational. Indeed, this translation usually results in "semantic degradation" of the schema, which becomes simpler, less complete, less understandable, and less expressive <ref type="bibr" target="#b9">[10]</ref>.</p><p>− Implicit semantics: Semantics may be not in a relational schema, but rather in data or even in the heads of users who query a relational database <ref type="bibr" target="#b9">[10]</ref>.</p><p>− Meaningless names: Relations and attributes in a relational schema are often assigned names that are a maze of cryptic abbreviations; e.g. YRTREBUT, B_423_SPD or FRED <ref type="bibr" target="#b6">[7]</ref>. It is difficult or even impossible to deduce the meaning (i.e. semantics) of data from such names.</p></div>
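<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the "easy approach" concrete, the following is a minimal sketch of such a one-to-one mapping. The function name, the dictionary layout, and the attribute names YR and MK are hypothetical and chosen only for illustration; the relation name B_423_SPD is one of the cryptic names quoted above.</p>
```python
# Naive one-to-one mapping: relation to class, attribute to attribute,
# tuple to instance, constraint to axiom. The dictionary layout and the
# attribute names are hypothetical, for illustration only.

def naive_reverse_engineer(relations):
    """Map each relation directly to a class, ignoring schema quality."""
    ontology = {"classes": {}, "instances": [], "axioms": []}
    for name, relation in relations.items():
        ontology["classes"][name] = list(relation["attributes"])
        for row in relation["tuples"]:
            values = dict(zip(relation["attributes"], row))
            ontology["instances"].append({"class": name, "values": values})
        for constraint in relation["constraints"]:
            ontology["axioms"].append((name, constraint))
    return ontology


# A cryptically named relation yields an equally cryptic class.
schema = {
    "B_423_SPD": {
        "attributes": ["YR", "MK"],
        "tuples": [(2002, "Ford")],
        "constraints": ["NotNull(YR)"],
    }
}
result = naive_reverse_engineer(schema)
```
<p>The sketch runs without user interaction, but every weakness of the relational schema (cryptic names, de-normalization, lost inheritance) is carried over unchanged into the resulting classes.</p></div>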
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Related Work</head><p>Existing approaches to reverse engineering of relational databases to ontologies fall roughly into one of three categories:</p><p>− Approaches based on an analysis of relational schema: E.g. Stojanovic et al.'s approach <ref type="bibr" target="#b4">[5]</ref> provides a set of rules for mapping constructs in the relational database (i.e. relations, attributes, tuples, and constraints) to semantically equivalent constructs in the ontology (i.e. classes, attributes, instances, and axioms). These rules are based on an analysis of relations, attributes, primary and foreign keys, and inclusion dependencies.</p><p>− Approaches based on an analysis of data: E.g. Astrova's approach <ref type="bibr" target="#b10">[11]</ref> builds an ontology based on an analysis of relational schema. However, since a relational schema often captures little explicit semantics <ref type="bibr" target="#b11">[12]</ref>, this approach also analyzes data in the relational database.</p><p>− Approaches based on an analysis of user queries: E.g. Kashyap's approach <ref type="bibr" target="#b12">[13]</ref> builds an ontology based on an analysis of relational schema; the ontology is then refined by user queries. However, this approach does not create axioms that are part of the ontology.</p><p>Not all of the common problems of reverse engineering can be solved using the existing approaches. In particular, the existing approaches can be limited in that they require more input information than it is possible to provide in practice and make unrealistic assumptions about the input. E.g. they typically assume that a relational schema is in third normal form.</p><p>The search for a solution leads us to a novel approach where HTML pages are analyzed. So far, this analysis has been focused on generation of wrappers; see e.g. <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b16">17]</ref>. A wrapper is a program that extracts the relational database information from HTML pages. There are wrappers that are based on ontologies; see e.g. <ref type="bibr" target="#b17">[18]</ref>.</p><p>Wrappers have the main advantage of reconstructing (a part of) a relational database "hidden" behind HTML forms, when a relational schema is unknown. The downside of this advantage is that any changes to structures of HTML pages (e.g. adding or deleting fields in the pages) can break the wrappers and thus the ontologies they are based on. HTML pages are volatile by nature, meaning that they are often redesigned <ref type="bibr" target="#b18">[19]</ref>, typically more than twice a year <ref type="bibr" target="#b19">[20]</ref>.</p><p>The biggest problem of wrappers is that they rely on structures of HTML pages to extract semantics, thus throwing away all the advantages of analyzing a relational schema; e.g.:</p><p>− An analysis of HTML pages often leads to a "brittle" ontology. Since a relational schema is more stable than HTML pages, its analysis guarantees that the ontology need not be rebuilt every time the pages change their structures.</p><p>− HTML pages represent views of the relational database; i.e. different ways of viewing the relational database information on the Web. Thus, some semantics may be not in the pages, but rather in the relational schema.</p><p>Apart from these, the relational schema is a formal explicit agreement between database designers and users about data and its meaning in an organization. Thus, the relational schema provides an important source of semantics to be extracted into an ontology <ref type="bibr" target="#b24">[25]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Our Approach</head><p>As an attempt to solve the common problems of reverse engineering, we propose a novel approach. Our approach is based on an analysis of HTML pages. There are two main reasons for this analysis. One is that a relational schema often captures little explicit semantics <ref type="bibr" target="#b11">[12]</ref>, while a conceptual schema is usually unavailable or out-of-date <ref type="bibr" target="#b9">[10]</ref>. Another reason for analyzing HTML pages is to benefit from their user-friendliness, which has two consequences:</p><p>− HTML pages partially represent a logical structure of the relational database, rather than its physical structure (i.e. a relational schema). Indeed, they often provide a user-friendly interface to the relational database. Behind this interface, a relational schema can be badly designed, optimized, and de-normalized.</p><p>− Table and field names in HTML pages are often more explicit and more meaningful than the corresponding relation and attribute names in a relational schema.</p><p>Given the reasons for analyzing HTML pages, let us now consider more precisely what our approach is and then illustrate it by example. Suppose we go to the website http://www.bobhowardhonda.com and search for information about a used vehicle, e.g. a Ford Mustang. Since such information is stored in a relational database, we fill out the HTML form in Figure <ref type="figure" target="#fig_0">1</ref> (located in the upper frame of the page) and submit it. After the form is submitted, search results are returned in the HTML page in Figure <ref type="figure" target="#fig_0">1</ref>. This page is dynamically generated from a relational database and contains specifications of the Ford Mustang and its features.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Extracting Form Model Schema</head><p>The first step of our approach is extracting a form model schema <ref type="bibr" target="#b9">[10]</ref>. This schema was originally proposed to extract an entity-relationship model from database forms. Basically, the form model schema contains:</p><p>− Form field: This is an aggregation of a name and the entry associated with it<ref type="foot" target="#foot_0">1</ref>. A name is pre-displayed and serves as a clue to what will be entered by users or displayed by HTML pages. An entry is the actual data; it roughly corresponds to an attribute in the relational schema. We use the term linked attribute for such an entry to distinguish it from other entries that are computed or simply unlinked with the relational schema.</p><p>− Structural unit: This is a logical group of closely related form fields. It roughly corresponds to a relation in the relational schema.</p><p>− Relationship: This is a connection between structural units that relates one structural unit to another (or back to itself). There are two kinds of relationship: association and inheritance.</p><p>− Constraint: This is a rule that defines what data is valid for a given linked attribute. A cardinality constraint specifies for an association relationship the number of instances that a structural unit can participate in.</p><p>− Underlying source: This is a physical structure of the relational database (i.e. a relational schema) that defines relations and attributes with their data types.</p><p>− Form type: This is a collection of empty form fields.</p><p>− Form template: This is a particular representation of a form type. Each form template has a layout (i.e. its graphical representation) and a title that provides its general description.</p><p>− Form instance: This is an occurrence of a form type, when its template is filled in with the actual data. E.g. Figure <ref type="figure" target="#fig_0">1</ref> is an instance of the form type.</p><p>− Hierarchical tree: This is a hierarchical structure of a form instance. There are two kinds of hierarchical tree: structured data tree and content tree <ref type="bibr" target="#b20">[21]</ref>. A structured data tree captures, for an HTML page, the hierarchy of HTML tags and data contents (i.e. the syntactic hierarchy). A content tree is the same, except that HTML tags are deleted. Thus, it captures only the hierarchy of data contents (i.e. the intended hierarchy).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1">Analysis of HTML Pages Structures and Relational Schema</head><p>Extracting a form model schema consists in an analysis of HTML pages (especially their structures) and a relational schema to identify constructs of the form model schema and to assign those constructs names using wrapper generation techniques <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b17">18]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1.1">Identifying Form Instances</head><p>HTML pages typically contain advertisements and navigational menus that can be viewed as "noisy" data <ref type="bibr" target="#b16">[17]</ref>. Thus, given an HTML page, the first task is to identify a data-rich section (i.e. a form instance). We can do this in three ways.</p><p>First, we can compare HTML pages for overlaps in structure <ref type="bibr" target="#b16">[17]</ref>. The implication is that all pages from a given website will organize their content in a similar way, regarding the location of advertisements and navigational menus.</p><p>Second, we can examine HTML code for block tags such as &lt;frame&gt; <ref type="bibr" target="#b17">[18]</ref>. A difficulty is that HTML pages often consist of multiple frames.</p><p>Third, we can search through all frames in the page to find the largest one <ref type="bibr" target="#b13">[14]</ref>. This approach typically implies that a frame that takes up the largest display area will be the most interesting to users. E.g. from Figure <ref type="figure" target="#fig_0">1</ref> we would identify that Vehicle Detail represents all data that is the subject of interest to users.</p></div>
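<div xmlns="http://www.tei-c.org/ns/1.0"><p>The third way (choosing the frame with the largest display area) can be sketched as follows. This is only an illustration under stated assumptions: the frame dimensions and the frame names other than Vehicle Detail are invented, and a real implementation would have to obtain the rendered sizes from a browser or layout engine.</p>
```python
def largest_frame(frames):
    """Return the frame with the largest display area (width times height)."""
    return max(frames, key=lambda frame: frame["width"] * frame["height"])


# Hypothetical frame sizes in pixels; only "Vehicle Detail" is named in the text.
frames = [
    {"name": "Search", "width": 800, "height": 120},
    {"name": "Vehicle Detail", "width": 800, "height": 480},
    {"name": "Menu", "width": 160, "height": 480},
]
form_instance = largest_frame(frames)
```
<p>With these assumed sizes, the frame named Vehicle Detail is selected as the form instance.</p></div>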
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1.2">Identifying Structural Units</head><p>We can take three basic approaches to this. First, we can examine HTML code for block tags such as &lt;table&gt;, &lt;ul&gt; and &lt;ol&gt; <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18]</ref>. The implication is that structural units will be represented by tables or lists in HTML pages. The biggest problem with this approach is that it relies on a physical structure of HTML pages. Thus, it fails if the pages change their structures frequently. There are many other situations where the approach does not work either, such as errors in the code and misuse of the block tags <ref type="bibr" target="#b13">[14]</ref>. E.g. not only is &lt;table&gt; used for representing a relation in the relational schema, but it is also the primary method for grouping data in HTML pages <ref type="bibr" target="#b14">[15]</ref>. The data is often grouped just to make it easier for users to view.</p><p>Second, we can use visual cues to determine a logical structure of HTML pages, that is, the real meaning of the pages as they are understood by users <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b15">16]</ref>. E.g. the users might consider the year, make, model, price, mileage, …, and vin in Figure <ref type="figure" target="#fig_0">1</ref> as a whole group, just because they all are specifications.</p><p>Third, we can look for structural units in relations of the relational schema (i.e. the underlying source). E.g. from Figure <ref type="figure" target="#fig_0">1</ref> we would identify two structural units: Vehicle and Feature. One contains specifications for a used vehicle (Year, Make, Model, Price, Mileage, …, and VIN), while the other lists the vehicle features (Air Conditioning, Passenger Side Air Bag, …, and Tilt Steering Wheel).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1.3">Identifying Linked Attributes</head><p>We can take three basic approaches to this. First, we can examine HTML code for block tags such as &lt;thead&gt; and &lt;th&gt; <ref type="bibr" target="#b17">[18]</ref>. Again, this approach works as long as the code is well designed, correct, and stable. Moreover, the approach is viable only if fields in HTML pages are separated with the block tags; it does not work for merged data. E.g. the year, make, and model in Figure <ref type="figure" target="#fig_0">1</ref> are all merged data, meaning that they are combined in a single text string: "2002 Ford Mustang".</p><p>Second, we can use visual cues <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b16">17]</ref>. This approach typically implies that there will be some separators (e.g. blank areas) that help users split the merged data. E.g. the year, make, and model in Figure <ref type="figure" target="#fig_0">1</ref> are space-separated. Sometimes we can also use data formats as visual cues to understand the meaning of data. E.g. the price in Figure <ref type="figure" target="#fig_0">1</ref> is also indicated with a dollar sign (i.e. "$").</p><p>Third, we can look for linked attributes in attributes of the relational schema. This is because a given HTML page may contain only a part of the total attributes in the relational schema.</p><p>Looking at a form model schema in Figure <ref type="figure">2</ref>, we can see that each structural unit is defined by a set of linked attributes. E.g. the structural unit Vehicle contains linked attribute definitions for year, make, model, price, mileage, …, and vin; while the structural unit Feature has a linked attribute name.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 2.</head><label>2</label><figDesc>Figure 2. Summary of reverse engineering (panels: Form model schema; Ontology)</figDesc></figure>
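<div xmlns="http://www.tei-c.org/ns/1.0"><p>The use of separators and data formats as visual cues might look like the sketch below. It assumes that the merged entries are separated by single runs of white space and that the order of the linked attributes is already known; both helper names and the sample price string are hypothetical.</p>
```python
def split_merged(text, names):
    """Split a space-separated string into named entries, one per name."""
    # The last name receives the remainder, so trailing words stay together.
    parts = text.split(None, len(names) - 1)
    return dict(zip(names, parts))


def looks_like_price(text):
    """Treat a leading dollar sign as a visual cue for a price."""
    return text.strip().startswith("$")


entries = split_merged("2002 Ford Mustang", ["year", "make", "model"])
```
<p>The sketch would mis-split a make that itself contains a space, which is one reason why the relational schema is consulted as a further source.</p></div>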
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1.4">Identifying Relationships</head><p>We can take two basic approaches to this. First, we can look for relationships (usually many-to-many) in relations of the relational schema, then look for relationships (one-to-one and one-to-many) in foreign keys. A difficulty is that there are always relations with unknown foreign keys <ref type="bibr" target="#b7">[8]</ref>.</p><p>Second, since the relational database information typically does not reside on a single HTML page, we can try to find relationships in hyperlinks. E.g. from Figure <ref type="figure" target="#fig_0">1</ref> we would identify an association relationship between the structural units Vehicle and Feature: a used vehicle has features. The implication is that related structural units will appear on the same page.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1.5">Naming Structural Units</head><p>Structural units can be given names of the corresponding relations in the relational schema. But it is generally less confusing to users if the names are more meaningful. Looking back at the form model schema in Figure <ref type="figure">2</ref>, notice the adaptation of the name Feature to the structural unit. This can better convey the meaning of data than the original relation name Detail would.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1.6">Naming Linked Attributes</head><p>There are three basic approaches to this. One is to give linked attributes names of the corresponding attributes in the relational schema.</p><p>Since field names in HTML pages are often more explicit and more meaningful than the corresponding attribute names in the relational schema, another approach is to give linked attributes the field names. A difficulty is that the field names are not always encoded in HTML pages. E.g. the photo, year, make, and model in Figure <ref type="figure" target="#fig_0">1</ref> are given no names at all. However, missing names can be found in HTML forms. Since the forms are often used for querying a relational database, they provide a sketch of (part of) a relational schema <ref type="bibr" target="#b16">[17]</ref>. Assuming that a given website will do its best to return the most relevant data to users, search criteria submitted through an HTML form are likely to re-appear in the returned HTML pages. E.g. in the HTML form in Figure <ref type="figure" target="#fig_0">1</ref>, we could enter "Ford" and "Mustang" for fields Make and Model, respectively. Search results for the form will be returned in the HTML page in Figure <ref type="figure" target="#fig_0">1</ref>. This page contains details of a used vehicle that matches the search criteria; i.e. Ford Mustang. Therefore, linked attributes corresponding to the fields, with "Ford" and "Mustang" re-appearing in their entries, could be named make and model, respectively.</p><p>Yet another approach is to give linked attributes data type names <ref type="bibr" target="#b16">[17]</ref>. E.g. a linked attribute represented by the photo in Figure <ref type="figure" target="#fig_0">1</ref> might be named image.</p></div>
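<div xmlns="http://www.tei-c.org/ns/1.0"><p>The idea of recovering missing names from search criteria that re-appear in the returned page can be sketched as follows. The function name and the exact-match comparison (case-insensitive, after trimming white space) are assumptions made for illustration, not part of the approach as published.</p>
```python
def name_by_search_criteria(criteria, entries):
    """Name each unnamed entry after the form field whose value it repeats."""
    names = {}
    for field, value in criteria.items():
        for entry in entries:
            # Assumption: an entry matches when it equals the submitted value.
            if entry.strip().lower() == value.strip().lower():
                names[entry] = field.lower()
    return names


criteria = {"Make": "Ford", "Model": "Mustang"}
unnamed_entries = ["2002", "Ford", "Mustang"]
names = name_by_search_criteria(criteria, unnamed_entries)
```
<p>Entries that match no search criterion, such as the year here, remain unnamed and have to be named by one of the other two approaches.</p></div>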
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1.7">Naming Relationships</head><p>Relationships can be given names that are either names of the corresponding relations (usually for many-to-many relationships) or foreign key names (for one-to-one or one-to-many relationships). Again, users can give more meaningful names to the relationships.</p><p>The end result for the first step is the form model schema in Figure <ref type="figure">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.2">Data Analysis</head><p>In addition to the structures of HTML pages, we also analyze data in the pages to identify constraints. A data analysis includes a strategy of learning by examples, borrowed from machine learning techniques <ref type="bibr" target="#b21">[22,</ref><ref type="bibr" target="#b22">23]</ref>. In particular, it is performed as a sequence of learning tasks from the relational database. Each task is defined by: (1) task relevant data (e.g. data contained in the pages), (2) problem background knowledge (e.g. application domain knowledge), and (3) expected representation of results of learning tasks (e.g. first order predicate logic). The results of learning tasks are related to a current state of the relational database. They will be generalized into knowledge about all states through an induction process <ref type="bibr" target="#b9">[10]</ref>. This process combines the semantics extracted from the pages with the application domain knowledge that is provided by users (i.e. the user "head knowledge"). Such knowledge controls the learning tasks to come to the best inductive conclusion, the conclusion that will be consistent with all states of the relational database. E.g. from Figure <ref type="figure" target="#fig_0">1</ref> we would identify a constraint NotNull on the linked attribute mileage. This contains non-null values for any used vehicle.</p></div>
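<div xmlns="http://www.tei-c.org/ns/1.0"><p>A much simplified sketch of one such learning task is given below: a NotNull constraint is proposed for every linked attribute that is never empty in the observed pages, and the proposal is kept only if the user confirms it. The function name, the confirm callback, and the sample values are hypothetical; the cited learning techniques are considerably richer than this.</p>
```python
def induce_not_null(instances, confirm):
    """Propose NotNull for attributes never empty in the examples, then ask the user."""
    attributes = sorted({key for instance in instances for key in instance})
    constraints = []
    for attribute in attributes:
        observed = all(
            instance.get(attribute) not in (None, "") for instance in instances
        )
        # The user's domain knowledge decides whether the observation generalizes.
        if observed and confirm(attribute):
            constraints.append(("NotNull", attribute))
    return constraints


# Hypothetical data taken from two returned pages.
pages = [
    {"mileage": "45,000", "color": "Red"},
    {"mileage": "12,300", "color": None},
]
# Here the "user" accepts every proposal; a real tool would prompt instead.
constraints = induce_not_null(pages, confirm=lambda attribute: True)
```
<p>The confirmation step stands in for the user "head knowledge": an attribute that happens to be filled in on every sampled page is not necessarily mandatory in all states of the relational database.</p></div>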
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.3">Integration</head><p>There are typically several HTML pages (of different structures) for any given website. Thus, their analysis will produce several form model schemata. These will be merged into a single one through an integration process <ref type="bibr" target="#b9">[10]</ref>. This process proceeds as follows. First, the schemata are compared for overlaps in structure. This means looking for structural units and relationships with similar names, then looking for similar structures within structural units and relationships. Second, the schemata are compared for overlaps in meaning. This means looking for structural units that correspond to the same real-world objects but have different names. Third, naming conflicts (i.e. synonyms and homonyms) are resolved. Conflicts can also arise from different constraints on the linked attributes and different cardinality constraints on the relationships. By performing these tasks, the integration process makes the schemata consistent with one another and brings them together into a single one that makes sense for all HTML pages from a given website.</p></div>
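<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the merging step only, the sketch below unites the linked attributes of structural units that share a name and resolves synonyms through a user-supplied table. The function name, the schema layout, and the two sample pages are assumptions; the comparison for overlaps in structure and meaning, homonyms, and constraint conflicts are not handled.</p>
```python
def integrate(schemata, synonyms):
    """Merge per-page schemata into one, mapping synonyms to a canonical name."""
    merged = {}
    for schema in schemata:
        for unit, attributes in schema.items():
            canonical = synonyms.get(unit, unit)
            target = merged.setdefault(canonical, [])
            for attribute in attributes:
                if attribute not in target:
                    target.append(attribute)
    return merged


# Two hypothetical form model schemata extracted from different pages.
page_a = {"Vehicle": ["year", "make", "model"], "Detail": ["name"]}
page_b = {"Vehicle": ["make", "price"], "Feature": ["name"]}
merged = integrate([page_a, page_b], synonyms={"Detail": "Feature"})
```
<p>Here the units Detail and Feature are treated as synonyms, so the merged schema contains a single unit Feature alongside the combined unit Vehicle.</p></div>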
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Schema Transformation</head><p>The second step of our approach is transforming the form model schema into an ontology (i.e. "schema transformation"). Basically, this means replacing constructs of the form model schema with constructs of the ontology using mapping rules <ref type="bibr" target="#b23">[24]</ref>.</p><p>The ontology is formulated in F-Logic. This language has an object-oriented syntax. It provides support for classes, attributes with domain and range definitions, inheritance hierarchies of classes and attributes, and axioms that can be used for further characterizing relationships between instances.</p><p>Continuing the example, consider again the form model schema in Figure <ref type="figure">2</ref>. Here schema transformation is straightforward. First, we create a class for each structural unit in the form model schema. E.g. we create two classes: Vehicle and Feature. Within each class, we create an attribute for each linked attribute in the structural unit. E.g. for the class Vehicle, we add attributes year, make, model, price, mileage, …, and vin. We also add an attribute features. This associates the two classes. Finally, we create an axiom for each constraint (except cardinalities) in the form model schema. E.g. we add an axiom NotNull to the ontology.</p><p>The end result for the second step is the ontology in Figure <ref type="figure">2</ref>. The ontology is nearing completion. But there are still instances to create. These instances will populate a knowledge base, whose schema is defined by the ontology <ref type="bibr" target="#b11">[12]</ref>.</p></div>
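<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three transformation rules of the example can be sketched as follows. The ontology is represented here as a plain Python dictionary rather than in F-Logic syntax; the function name, the dictionary layout, and the default range "string" are assumptions, and only the attributes that the text names explicitly are listed.</p>
```python
def transform(form_model):
    """Map units to classes, linked attributes to attributes, constraints to axioms."""
    ontology = {"classes": {}, "axioms": []}
    for unit, attributes in form_model["units"].items():
        # Assumption: every linked attribute gets the range "string".
        ontology["classes"][unit] = {name: "string" for name in attributes}
    for source, name, target in form_model["associations"]:
        # An association becomes an attribute whose range is the target class.
        ontology["classes"][source][name] = target
    for kind, unit, attribute in form_model["constraints"]:
        if kind != "Cardinality":
            ontology["axioms"].append((kind, unit, attribute))
    return ontology


form_model = {
    "units": {
        "Vehicle": ["year", "make", "model", "price", "mileage", "vin"],
        "Feature": ["name"],
    },
    "associations": [("Vehicle", "features", "Feature")],
    "constraints": [("NotNull", "Vehicle", "mileage")],
}
ontology = transform(form_model)
```
<p>Serializing the resulting dictionary into F-Logic would be a separate step that is not shown here.</p></div>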
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Data Migration</head><p>The third step of our approach is creating instances from data contained in HTML pages (i.e. "data migration"). Basically, this means assigning values to the attributes in the ontology using wrapper generation techniques <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18]</ref>.</p><p>Continuing the example, consider again an HTML page in Figure <ref type="figure" target="#fig_0">1</ref>. Here data migration is easy for the attributes year, make, model, price, mileage, …, and vin in the class Vehicle. However, we meet with a difficulty when trying to find a value for the attribute features that corresponds to the list of features in Figure <ref type="figure" target="#fig_0">1</ref>. We overcome this difficulty by creating an instance for each feature in Figure <ref type="figure" target="#fig_0">1</ref> and assigning it to the attribute features.</p><p>The end result for the third step is the ontology in Figure <ref type="figure">2</ref>.</p></div>
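<div xmlns="http://www.tei-c.org/ns/1.0"><p>The handling of the attribute features can be illustrated with the sketch below, which creates one Feature instance per listed feature and assigns the resulting list to the Vehicle instance. The function name and the page layout are hypothetical, and the page is assumed to have been parsed already by a wrapper.</p>
```python
def migrate(page):
    """Create one Feature instance per listed feature and attach them to the Vehicle."""
    features = [{"class": "Feature", "name": name} for name in page["features"]]
    vehicle = {"class": "Vehicle", "features": features}
    vehicle.update(page["specifications"])
    return vehicle


# Hypothetical wrapper output for the page in Figure 1 (abridged).
page = {
    "specifications": {"year": "2002", "make": "Ford", "model": "Mustang"},
    "features": ["Air Conditioning", "Passenger Side Air Bag", "Tilt Steering Wheel"],
}
vehicle = migrate(page)
```
<p>Modelling each feature as its own instance, rather than as a single text value, keeps the association between the classes Vehicle and Feature explicit in the knowledge base.</p></div>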
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>We have proposed a novel approach to reverse engineering of relational databases to ontologies. Our approach is based on the idea that semantics of data in a relational database can be extracted by analyzing HTML pages. These semantics are supplemented with those captured in the relational schema to build an ontology. There are three important advantages of our approach:</p><p>− It requires minimal information about a relational database. This is important because the complete knowledge of the relational database is usually unavailable <ref type="bibr" target="#b8">[9]</ref>. E.g. there are always relations with unknown primary keys <ref type="bibr" target="#b7">[8]</ref>.</p><p>− It makes no assumptions about a relational schema, which can be badly designed, optimized, and de-normalized. This is important because even database experts may occasionally break the rules of good database design <ref type="bibr" target="#b7">[8]</ref>, and many database designers improve performance by optimizing and de-normalizing the relational schema <ref type="bibr" target="#b6">[7]</ref>.</p><p>− It appeals to users, who likely understand HTML pages better than a relational schema. This is important because reverse engineering cannot be completely automated <ref type="bibr" target="#b4">[5]</ref>. There are always situations where user interaction is necessary.</p><p>These advantages come in large part from an analysis of HTML pages. But this analysis has costs. One is the difficulty in automation <ref type="bibr" target="#b17">[18]</ref>. This is because HTML pages are designed for use by (human) users only. E.g. data in the pages can be embedded in natural language text or hidden within graphical presentation primitives <ref type="bibr" target="#b18">[19]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Future Work</head><p>In the future, our approach can be used for migrating HTML pages (especially those that are dynamically generated from a relational database) to the ontology-based Semantic Web. The main reason for this migration is to make the relational database information that is available on the Web machine-processable <ref type="bibr" target="#b4">[5]</ref>.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 .</head><label>1</label><figDesc>Figure 1. HTML page</figDesc><graphic coords="4,135.42,236.64,320.46,310.14" type="bitmap" /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">However, we can also identify a form field with no name for its entry; e.g. the photo, year, make, and model in Figure 1.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgement</head><p>This research is partly sponsored by the Estonian Science Foundation (ESF) under grant no. 5766.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Relational Databases on the Semantic Web</title>
		<author>
			<persName><forename type="first">T</forename><surname>Berners-Lee</surname></persName>
		</author>
		<ptr target="http://www.w3.org/DesignIssues/RDB-RDF.html" />
		<imprint>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Raggett</surname></persName>
		</author>
		<ptr target="http://www.w3.org/TR/html401/" />
		<title level="m">HTML 4.01 Specification</title>
		<imprint>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Logical Foundations of Object-oriented and Frame-based Languages</title>
		<author>
			<persName><forename type="first">M</forename><surname>Kifer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lausen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the ACM</title>
		<imprint>
			<biblScope unit="volume">42</biblScope>
			<biblScope unit="page" from="741" to="843" />
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">From Manual to Semi-automatic Semantic Annotation: About Ontology-based Text Annotation Tools</title>
		<author>
			<persName><forename type="first">M</forename><surname>Erdmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Maedche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schnurr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Staab</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Linköping Electronic Articles in Computer and Information Science Journal (ETAI)</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">2</biblScope>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Migrating Data-intensive Web Sites into the Semantic Web</title>
		<author>
			<persName><forename type="first">L</forename><surname>Stojanovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Stojanovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Volz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th ACM Symposium on Applied Computing (SAC)</title>
				<meeting>the 17th ACM Symposium on Applied Computing (SAC)</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="1100" to="1107" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Importing Relational Databases into the Semantic Web</title>
		<author>
			<persName><forename type="first">G</forename><surname>Dogan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Islamaj</surname></persName>
		</author>
		<ptr target="http://www.mindswap.org/webai/2002/fall/Importing_20Relational_20Databases_20into_20the_20Semantic_20Web.html" />
		<imprint>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Database Design for Smarties: Using UML for Data Modeling</title>
		<author>
			<persName><forename type="first">R</forename><surname>Muller</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1999">1999</date>
			<publisher>Morgan Kaufmann</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">An Approach for Reverse Engineering of Relational Databases</title>
		<author>
			<persName><forename type="first">W</forename><surname>Premerlani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Blaha</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="42" to="49" />
			<date type="published" when="1994">1994</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Database Design Recovery</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hainaut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Henrard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Roland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Englebert</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 8th Conference on Advanced Information Systems Engineering (CAiSE), LNCS</title>
				<meeting>the 8th Conference on Advanced Information Systems Engineering (CAiSE), LNCS</meeting>
		<imprint>
			<date type="published" when="1996">1996</date>
			<biblScope unit="volume">1080</biblScope>
			<biblScope unit="page" from="272" to="300" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Extracting Entity-Relationship Schemas from Relational Databases: A Form-driven Approach</title>
		<author>
			<persName><forename type="first">N</forename><surname>Mfourga</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 4th Working Conference on Reverse Engineering (WCRE)</title>
				<meeting>the 4th Working Conference on Reverse Engineering (WCRE)</meeting>
		<imprint>
			<date type="published" when="1997">1997</date>
			<biblScope unit="page" from="184" to="193" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Reverse Engineering of Relational Databases to Ontologies</title>
		<author>
			<persName><forename type="first">I</forename><surname>Astrova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 1st European Semantic Web Symposium (ESWS)</title>
				<meeting>the 1st European Semantic Web Symposium (ESWS)</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="volume">3053</biblScope>
			<biblScope unit="page" from="327" to="341" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Ontology Evolution: Not the Same as Schema Evolution</title>
		<author>
			<persName><forename type="first">N</forename><surname>Noy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Klein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Knowledge and Information Systems</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">5</biblScope>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Design and Creation of Ontologies for Environmental Information Retrieval</title>
		<author>
			<persName><forename type="first">V</forename><surname>Kashyap</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 12th Workshop on Knowledge Acquisition, Modeling and Management (KAW)</title>
				<meeting>the 12th Workshop on Knowledge Acquisition, Modeling and Management (KAW)</meeting>
		<imprint>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">HTML Page Analysis Based on Visual Cues</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 6th International Conference on Document Analysis and Recognition (ICDAR)</title>
				<meeting>the 6th International Conference on Document Analysis and Recognition (ICDAR)</meeting>
		<imprint>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="859" to="864" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">A Machine Learning Based Approach for Table Detection on the Web</title>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 11th International Conference on World Wide Web (WWW)</title>
				<meeting>the 11th International Conference on World Wide Web (WWW)</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="242" to="250" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Visual Based Content Understanding towards Web Adaptation</title>
		<author>
			<persName><forename type="first">X.-D</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-Y</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G.-L</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2nd International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems (AH)</title>
				<meeting>the 2nd International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems (AH)</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="29" to="31" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Data Extraction and Label Assignment for Web Databases</title>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Lochovsky</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 12th International Conference on World Wide Web (WWW)</title>
				<meeting>the 12th International Conference on World Wide Web (WWW)</meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="187" to="196" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Toward Semantic Understanding -An Approach Based on Information Extraction</title>
		<author>
			<persName><forename type="first">D</forename><surname>Embley</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 15th Australasian Database Conference (ADC)</title>
				<meeting>the 15th Australasian Database Conference (ADC)</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="3" to="12" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Database Techniques for the World Wide Web: A Survey</title>
		<author>
			<persName><forename type="first">D</forename><surname>Florescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mendelzon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM SIGMOD Record</title>
		<imprint>
			<biblScope unit="volume">27</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="59" to="74" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Knoblock</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kambhampati</surname></persName>
		</author>
		<ptr target="http://rakaposhi.eas.asu.edu/aaai-i3-tut-all.pdf" />
		<title level="m">Information Integration on the Web</title>
				<imprint>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Extracting Structures of HTML Documents</title>
		<author>
			<persName><forename type="first">S</forename><surname>Lim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 12th International Conference on Information Networking (ICOIN)</title>
				<meeting>the 12th International Conference on Information Networking (ICOIN)</meeting>
		<imprint>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="420" to="426" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Learning the Behavior of Dynamical Systems from Examples</title>
		<author>
			<persName><forename type="first">J</forename><surname>Paredis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 6th International Workshop on Machine Learning (ICML)</title>
				<meeting>the 6th International Workshop on Machine Learning (ICML)</meeting>
		<imprint>
			<date type="published" when="1989">1989</date>
			<biblScope unit="page" from="137" to="140" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">A Theory and Methodology of Inductive Learning</title>
		<author>
			<persName><forename type="first">R</forename><surname>Michalski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Learning: An Artificial Intelligence Approach</title>
				<imprint>
			<date type="published" when="1983">1983</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="83" to="134" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Reverse Engineering of Relational Databases to Ontologies: An Approach Based on an Analysis of HTML Forms</title>
		<author>
			<persName><forename type="first">I</forename><surname>Astrova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stantic</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop W6 on Knowledge Discovery and Ontologies (KDO), 15th European Conference on Machine Learning (ECML), 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD)</title>
				<editor>
			<persName><forename type="first">P</forename><surname>Buitelaar</surname></persName>
		</editor>
		<meeting>the Workshop W6 on Knowledge Discovery and Ontologies (KDO), 15th European Conference on Machine Learning (ECML), 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD)</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="73" to="78" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Ontologies and Databases: More than a Fleeting Resemblance</title>
		<author>
			<persName><forename type="first">R</forename><surname>Meersman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Workshop on Open Enterprise Solutions: Systems, Experiences, and Organizations (OES/SEO)</title>
				<editor>
			<persName><forename type="first">A</forename><surname>D'Atri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Missikoff</surname></persName>
		</editor>
		<meeting>the International Workshop on Open Enterprise Solutions: Systems, Experiences, and Organizations (OES/SEO)</meeting>
		<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
